statwiki - User contributions [US]

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:55:43Z

A2prasad: /* References */

== Introduction ==
'''Domain Adaptation''' [https://en.wikipedia.org/wiki/Domain_adaptation] is a problem in machine learning which involves taking a model that has been trained on a source domain and applying it to a different (but related) target domain. '''Unsupervised domain adaptation''' refers to the situation in which the source data is labeled, while the target data is (predominantly) unlabeled. The problem at hand is then to find ways to generalize what is learned on the source domain to the target domain. In the age of deep networks this problem has become particularly salient, because vast amounts of labeled training data are needed to reap the benefits of deep learning. Manually generating labeled data is often prohibitively expensive, and in the absence of such data networks rarely perform well. Circumventing this shortage of data typically means gathering "off-the-shelf" data sets, which are tangentially related and contain labels, and then building models in these domains. The fundamental issue that unsupervised domain adaptation attempts to address is overcoming the inherent shift in distribution across the domains, without the ability to observe this shift directly.

This paper proposes a method for unsupervised domain adaptation which relies on three key components:
# A kernel-based penalty to ensure that the abstract representations generated by the network's hidden layers are similar between the source and the target data;
# An entropy-based penalty on the target classifier, which exploits the entropy minimization principle; and
# An appended residual network structure, which allows the source and target classifiers to differ by a (learned) residual function, thus relaxing the shared classifier assumption which is traditionally made.

This method outperforms state-of-the-art techniques on common benchmark datasets, and is flexible enough to be applied in most feed-forward neural networks.

[[File:Source-and-Target-Domain-Office-31-Backpack.png|thumb|right|The Office-31 Dataset Images for Backpack. Shows the variation in the source and target domains to motivate why these methods are important.]]
=== Working Example (Office-31) ===
To assist in understanding the methods, it is helpful to keep a tangible example of the problem in mind. The Domain Adaptation Project [https://people.eecs.berkeley.edu/~jhoffman/domainadapt/] provides data sets which are tailored to the problem of unsupervised domain adaptation. One of these data sets (which is later used in the experiments of this paper) has images which are labeled based on the Amazon product page for the various items. There are then corresponding pictures taken either by webcams or digital SLR cameras. The goal of unsupervised domain adaptation on this data set would be to take any of the three image sources as the source domain, and transfer a classifier to one of the other domains; see the example images to understand the differences.

One can imagine that, while it is likely easy to scrape labeled images from Amazon, it is likely far more difficult to collect labeled images from webcam or DSLR pictures directly. The ultimate goal of this method would be to train a model to recognize a picture of a backpack taken with a webcam, based on images of backpacks scraped from Amazon (or similar tasks).

== Related Work ==
Broadly speaking, domain adaptation mitigates the need for manual labeling of data in areas such as machine learning, computer vision, and natural language processing. The general goal of domain adaptation is to reduce the discrepancy in probability distributions between the source and target domains.

Research into the use of Deep Neural Networks for the purpose of domain adaptation has suggested that, while networks learn abstract feature representations which can reduce the discrepancy across domains, it is not possible to wholly remove it [http://www.icml-2011.org/papers/342_icmlpaper.pdf], [https://arxiv.org/pdf/1412.3474.pdf]. Further work has been done to design networks which adapt traditional deep nets (typically CNNs) to specifically address the problems posed by domain adaptation. However, these methods address only the issue of feature adaptation [https://arxiv.org/pdf/1502.02791.pdf], [https://arxiv.org/pdf/1409.7495.pdf], [https://people.eecs.berkeley.edu/~jhoffman/papers/Tzeng_ICCV2015.pdf]. That is, they all assume that the target and source classifiers are shared between domains.

The authors drew particular motivation from He et al. [https://arxiv.org/abs/1512.03385] with the proposed structure of residual networks. Combining the insights from the ResNet architecture, in addition to previous work that had leveraged classifier adaptation (in the context where some target data is labeled) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.8224&rep=rep1&type=pdf], [http://www.machinelearning.org/archive/icml2009/papers/445.pdf], [http://ieeexplore.ieee.org/document/5539870/] the authors develop their proposed network.

== Residual Transfer Networks ==
Generally, in an unsupervised domain adaptation problem, we are dealing with a set $\mathcal{D}_s$ (called the source domain) which is defined by $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$. That is the set of all labeled input-output pairs in our source data set. We denote the number of source elements by $n_s$. There is a corresponding set $\mathcal{D}_t = \{(x_i^t)\}_{i=1}^{n_t}$ (the target domain), consisting of unlabeled input values. There are $n_t$ such values.
[[File:RTN-Structure.png|thumb|left|upright|The overarching structure of the RTN. Consists of an existing network, to which a bottleneck, MMD block, and residual block is appended.]]
We can think of $\mathcal{D}_s$ as being sampled from some underlying distribution $p$, and $\mathcal{D}_t$ as being sampled from $q$. Generally we have that $p \neq q$, partially motivating the need for domain adaptation methods.

We can consider the classifiers $f_s(\underline{x})$ and $f_t(\underline{x})$, for the source domain and target domain respectively. It is possible to learn $f_s$ based on the sample $\mathcal{D}_s$. Under the '''shared classifier assumption''' it would be the case that $f_s(\underline{x}) = f_t(\underline{x})$, and thus learning the source classifier is enough. This method relaxes this assumption, assuming that in general $f_s \neq f_t$, and attempting to learn both.

The example network extends deep convolutional networks (in this case AlexNet [http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf]) to '''Residual Transfer Networks''', the mechanics of which are outlined below. Recall that, if $L(\cdot, \cdot)$ is taken to be the cross-entropy loss function, then a CNN trained on the source domain $\mathcal{D}_s$ learns $f_s$ by minimizing the empirical error:

<center>
<math display="block">
\min_{f_s} \frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)
</math>
</center>

In a standard implementation, the CNN optimizes over the above loss. This will be the starting point for the RTN.
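As a minimal sketch of the quantity being minimized, the empirical error is just the average cross-entropy over the labeled source examples. The function names and the toy probabilities below are illustrative assumptions, not values from the paper.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy loss L(f_s(x), y) for one example, given class probabilities."""
    return -math.log(probs[label])

def empirical_error(all_probs, labels):
    """Average cross-entropy over the n_s labeled source examples."""
    losses = [cross_entropy(p, y) for p, y in zip(all_probs, labels)]
    return sum(losses) / len(losses)

# Hypothetical softmax outputs for three source images over two classes.
source_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
source_labels = [0, 1, 0]
error = empirical_error(source_probs, source_labels)
```

Training then amounts to adjusting the network so that this average decreases.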

=== Structural Overview ===
The model proposed in this paper extends existing CNNs and alters the loss function that is optimized. While each of these components is discussed in depth below, the overarching architecture involves four components:

# An existing deep model. While this can be any model, in theory, the authors leverage AlexNet in practice.
# A bottleneck layer, used to reduce the dimensionality of the learned abstract feature space, directly after the existing network.
# An MMD block, with the expressed intention of feature adaptation.
# A residual block, with the expressed intention of classifier adaptation.

This structure is then optimized against a loss function which combines the standard cross-entropy penalty with MMD and target entropy penalties, yielding the proposed Residual Transfer Network (RTN) structure.

=== Feature Adaptation ===
Feature adaptation refers to the process by which the features learned to represent the source domain are made applicable to the target domain. Broadly speaking, a CNN works to generate abstract feature representations of the distribution that the inputs are sampled from. It has been found that using these deep features can reduce, but not remove, cross-domain distribution discrepancy, hence the need for feature adaptation. It is important to note that the features learned by a CNN transition from general to specific as the network gets deeper. In this light, the discrepancy between the feature representations of the source and the target will grow with depth in a convolutional net. As such, a technique for forcing these distributions to be similar is needed.

In particular, the authors of this paper impose a bottleneck layer (call it $fc_b$) which is included after the final convolutional layer of AlexNet. This dense layer is connected to an additional dense layer $fc_c$ (which will serve as the target classification layer). They then compute the tensor product between the activations of the two layers, performing "lossless multi-layer feature fusion". That is, for the source domain they define $z_i^s \overset{\underset{\mathrm{def}}{}}{=} x_i^{s,fc_b}\otimes x_i^{s,fc_c}$ and for the target domain, $z_i^t \overset{\underset{\mathrm{def}}{}}{=} x_i^{t,fc_b}\otimes x_i^{t,fc_c}$. The authors then perform feature adaptation by means of the Maximum Mean Discrepancy between the source and target domains, computed on these fusion features.
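The fusion step can be sketched as an outer product of the two activation vectors, flattened into a single vector. The layer sizes and activation values below are made up for illustration; `fuse` is a hypothetical helper name.

```python
def fuse(fc_b, fc_c):
    """Tensor (outer) product of two activation vectors, flattened row by row."""
    return [b * c for b in fc_b for c in fc_c]

# Hypothetical activations: a 3-unit bottleneck and a 2-class classifier layer.
x_fc_b = [0.5, -1.0, 2.0]
x_fc_c = [0.7, 0.3]
z = fuse(x_fc_b, x_fc_c)  # vectorized fusion feature of length 3 * 2
```

Every pairwise product of a bottleneck unit and a classifier unit appears in `z`, which is why the fusion retains the information in both layers.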

[[File:RTN-MMD-Block.png|right|thumb|The Maximum Mean Discrepancy Block (MMD) included in the RTN. The outputs of $fc_b$ and $fc_c$ are fused through a tensor product, and then passed through the MMD penalty, ensuring distributional similarity.]]

==== Maximum Mean Discrepancy ====
The Maximum Mean Discrepancy (MMD) is a kernel method that involves mapping to a Reproducing Kernel Hilbert Space (RKHS) [https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space]. Denote by $\mathcal{H}_K$ the RKHS with a characteristic kernel $K$. We then define the '''mean embedding''' of a distribution $p$ in $\mathcal{H}_K$ to be the unique element $\mu_K(p)$ such that $\mathbf{E}_{x\sim p}f(x) = \langle f, \mu_K(p)\rangle_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. Now, if we take $\phi: \mathcal{X} \to \mathcal{H}_K$ to be the feature map associated with $K$, then we can define the MMD between two distributions $p$ and $q$ as follows:

<center>
<math display="block">
d_K(p, q) \overset{\underset{\mathrm{def}}{}}{=} ||\mathbf{E}_{x^s\sim p}(\phi(x^s)) - \mathbf{E}_{x^t\sim q}(\phi(x^t))||_{\mathcal{H}_K}
</math>
</center>

Effectively, the MMD will compute the self-similarity of $p$ and $q$, and subtract twice the cross-similarity between the distributions: $\widehat{\text{MMD}}^2 = \text{mean}(K_{pp}) + \text{mean}(K_{qq}) - 2\times\text{mean}(K_{pq})$. From here we can infer that $p$ and $q$ are equivalent distributions if and only if the $\text{MMD} = 0$. If we then wish to force two distributions to be similar, this becomes a minimization problem over the MMD.
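As a sketch of where this identity comes from (assuming $\phi$ is the feature map of $K$, so that $K(x, x') = \langle \phi(x), \phi(x')\rangle_{\mathcal{H}_K}$, and writing $\mu_p$ and $\mu_q$ as shorthand for the two mean embeddings), expanding the squared norm gives:

<center>
<math display="block">
\begin{aligned}
\text{MMD}^2(p, q) &= ||\mu_p - \mu_q||_{\mathcal{H}_K}^2 \\
&= \langle \mu_p, \mu_p\rangle_{\mathcal{H}_K} + \langle \mu_q, \mu_q\rangle_{\mathcal{H}_K} - 2\langle \mu_p, \mu_q\rangle_{\mathcal{H}_K} \\
&= \mathbf{E}_{x, x'\sim p}K(x, x') + \mathbf{E}_{y, y'\sim q}K(y, y') - 2\,\mathbf{E}_{x\sim p,\, y\sim q}K(x, y)
\end{aligned}
</math>
</center>

Replacing each expectation with a sample average yields the three kernel means in the estimate.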

Two important notes:
# The RKHS, and as such MMD, depend on the choice of the kernel;
# Computing the MMD efficiently requires an unbiased estimate of the MMD (as outlined in [https://arxiv.org/pdf/1502.02791.pdf]).

==== MMD for Feature Adaptation in the RTN ====
The authors wish to minimize the MMD between the fusion features (outlined above) derived from the source and target domains. Concretely, this amounts to forcing the distribution of the abstract representation of the source domain $\mathcal{D}_s$ to be similar to the distribution of the abstract representation of the target domain $\mathcal{D}_t$. Performing this optimization over the fused features of $fc_b$ and $fc_c$ forces each of those layers towards similar distributions across the two domains.

Practically this involves an additional penalty function given by the following:

<center>
<math display="block">
D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t) = \sum_{i,j=1}^{n_s} \frac{k(z_i^s, z_j^s)}{n_s^2} + \sum_{i,j=1}^{n_t} \frac{k(z_i^t, z_j^t)}{n_t^2} - 2\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \frac{k(z_i^s, z_j^t)}{n_sn_t}
</math>
</center>

Where the characteristic kernel $k(z, z')$ is the Gaussian kernel, defined on the vectorization of tensors, with bandwidth parameter $b$. That is: $k(z, z') = \exp(-||vec(z) - vec(z')||^2/b)$.
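A minimal sketch of such a penalty, assuming the fusion features have already been vectorized into plain lists, follows the "self-similarity minus twice cross-similarity" form of the MMD estimate given earlier. The function names, the bandwidth of 1.0, and the toy points are illustrative assumptions.

```python
import math

def gaussian_kernel(z1, z2, bandwidth):
    """k(z, z') = exp(-||z - z'||^2 / b) on already-vectorized features."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-sq_dist / bandwidth)

def mean_kernel(xs, ys, bandwidth):
    """Average of k(x, y) over all pairs with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_penalty(zs_source, zs_target, bandwidth=1.0):
    """Within-domain similarities minus twice the cross-domain similarity."""
    return (mean_kernel(zs_source, zs_source, bandwidth)
            + mean_kernel(zs_target, zs_target, bandwidth)
            - 2.0 * mean_kernel(zs_source, zs_target, bandwidth))

# Toy features: identical samples versus a clearly shifted target sample.
source = [[0.0, 0.0], [1.0, 0.0]]
shifted = [[3.0, 3.0], [4.0, 3.0]]
same = mmd_penalty(source, source)
far = mmd_penalty(source, shifted)
```

The penalty vanishes when the two samples coincide and grows as the target features drift away from the source features, which is the behaviour the training objective exploits.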

=== Classifier Adaptation ===
In traditional unsupervised domain adaptation a '''shared-classifier assumption''' is made. In essence, if $f_s(x)$ represents the classifier on the source domain, and $f_t(x)$ represents the classifier on the target domain, then this assumption simply states that $f_s = f_t$. While this may seem to be a reasonable assumption at first glance, it is problematic largely because it is incredibly difficult to check. If it could be readily confirmed that the source and target classifiers could be shared, then the problem of domain adaptation would be largely trivialized. Instead, the authors relax this assumption slightly. They postulate that, instead of being equivalent, the source and target classifiers differ by some perturbation function $\Delta f$. Specifically, they assume that $f_S(x) = f_T(x) + \Delta f(x)$, where $f_S$ and $f_T$ are the source and target classifiers before activation, and $\Delta f(x)$ is a residual function.

The authors then suggest using residual blocks, as popularized by the ResNet framework [https://arxiv.org/pdf/1512.03385.pdf], to learn this residual function.

[[File:Residual-Block-vs-DNN.png|thumb|left|A comparison of a standard Deep Neural Network block which is designed to fit a function H(x) compared to a residual block which fits H(x) as the sum of the input, x, and a learned residual function, F(x).]]
==== Residual Networks Framework ====
A (Deep) Residual Network, as proposed initially in ResNet, employs residual blocks to assist in the learning process; these blocks were a key component in being able to train extraordinarily deep networks. A Residual Network is constructed largely in the same manner as a standard neural network, with one key difference: the inclusion of residual blocks, sets of layers which aim to estimate a residual function in place of estimating the function itself.

That is, if we wish to use a DNN to estimate some function $h(x)$, a residual block will decompose this to $h(x) = F(x) + x$. The layers are then used to learn $F(x)$, and after the layers which aim to learn this residual function, the input $x$ is recombined through element-wise addition, to form $h(x) = F(x) + x$. This was initially proposed as a way to allow deeper networks to be trained effectively, but has since been used in novel contexts.
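The decomposition $h(x) = F(x) + x$ can be sketched in a few lines. This is a toy block with two dense layers and a ReLU standing in for $F$; the helper names and the all-zero weights are illustrative assumptions, not the ResNet configuration.

```python
def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, w1, b1, w2, b2):
    """h(x) = F(x) + x, where F is two dense layers with a ReLU in between."""
    f_x = dense(relu(dense(x, w1, b1)), w2, b2)
    return [f + xi for f, xi in zip(f_x, x)]

# With all-zero weights F(x) = 0, so the block reduces to the identity map.
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -2.0]
h = residual_block(x, zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

The identity case illustrates why residual blocks are easy to optimize: the layers only need to learn a deviation from passing the input through unchanged.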

==== Residual Blocks in the RTN ====
[[File:RTN-Residual-Block.png|thumb|right|The Structure of the Residual Block in the RTN framework. The block relies on two additional dense layers following the target classifier in an attempt to learn the residual difference between the source and target classifiers.]] The authors leverage residual blocks for the purpose of classifier adaptation. Operating under the assumption that the source and target classifiers differ by an arbitrary perturbation function, $\Delta f(x)$, the authors add an additional set of densely connected layers which the source data will flow through. In particular, the authors take the $fc_c$ layer above as the desired target classifier. For the source data an additional set of layers ($fc-1$ and $fc-2$) is added following $fc_c$, connected as a residual block. The output of the classifier layer is then added back to the output of the residual block in order to form the source classifier.

It is necessary to note that in this case the non-activated (i.e. pre-softmax) output of $fc_c$ is passed to the element-wise addition, the result of which is passed through the activation layer, yielding the source prediction. In the provided diagram, $f_S(x)$ represents the non-activated output from the additive layer in the residual block; $f_T(x)$ represents the non-activated output from the target classifier; and $fc-1$/$fc-2$ are used to learn the perturbation function $\Delta f(x)$.
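The data flow just described can be sketched as follows. Here `delta_f` stands in for the $fc-1$/$fc-2$ layers, and `small_shift` is a hypothetical residual function chosen only to make the example concrete; neither the names nor the numbers come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rtn_predictions(f_t_logits, delta_f):
    """Return (source, target) class probabilities from the pre-softmax fc_c output."""
    residual = delta_f(f_t_logits)  # plays the role of fc-1 / fc-2
    f_s_logits = [t + r for t, r in zip(f_t_logits, residual)]  # f_S = f_T + delta_f
    return softmax(f_s_logits), softmax(f_t_logits)

def small_shift(logits):
    """Hypothetical residual function that perturbs the logits slightly."""
    return [0.1 * v for v in logits]

source_probs, target_probs = rtn_predictions([2.0, 0.5, -1.0], small_shift)
```

Both predictions share the same $fc_c$ output; only the source branch passes through the extra residual layers before the softmax.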

==== Entropy Minimization ====
In addition to the residual blocks, the authors make use of the '''entropy minimization principle''' [http://www.iro.umontreal.ca/~lisa/pointeurs/semi-supervised-entropy-nips2004.pdf] to further refine the classifier adaptation. In particular, by minimizing the entropy of the target classifier (or more correctly, the entropy of the class conditional distribution $f_j^t(x_i^t) = p(y_i^t = j \mid x_i^t; f_t)$), low-density separation between the classes is encouraged. '''Low-Density Separation''' is a concept used predominantly in semi-supervised learning, which in essence tries to draw class decision boundaries in regions where there are few data points (labeled or unlabeled). The above paper leverages an entropy regularization scheme to achieve low-density separation; this is adapted here to the case of unsupervised domain adaptation.

In practice this amounts to adding a further penalty based on the entropy of the class conditional distribution. In particular, if $H(\cdot)$ is defined to be the entropy function, such that $H(f_t(x_i^t)) = - \sum_{j=1}^c f_j^t(x_i^t)\log f_j^t(x_i^t)$, where $c$ is the number of classes and $f_j^t(x_i^t)$ represents the probability of predicting class $j$ for point $x_i^t$, then over the target domain $\mathcal{D}_t$ we define the entropy penalty to be:

<center>
<math display="block">
\frac{1}{n_t} \sum_{i=1}^{n_t} H(f_t(x_i^t))
</math>
</center>

The authors hypothesize that the combination of residual learning and the entropy penalty will enable effective classifier adaptation.
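The entropy penalty can be sketched directly from its definition. The prediction values below are invented to contrast confident and uncertain target predictions.

```python
import math

def entropy(probs):
    """H(f_t(x)) = -sum_j p_j log p_j, treating 0 * log 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_penalty(target_probs):
    """Average prediction entropy over the n_t unlabeled target examples."""
    return sum(entropy(p) for p in target_probs) / len(target_probs)

# Hypothetical target predictions over three classes.
confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]]
uncertain = [[1 / 3, 1 / 3, 1 / 3], [0.4, 0.3, 0.3]]
low = entropy_penalty(confident)
high = entropy_penalty(uncertain)
```

Confident predictions incur a smaller penalty than near-uniform ones, so minimizing this term pushes target points away from the decision boundaries.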

=== Residual Transfer Network ===
The combination of the MMD loss introduced in feature adaptation, the residual block introduced in classifier adaptation, and the application of the entropy minimization principle culminates in the Residual Transfer Network proposed by the authors. The model is optimized according to the following loss function, which combines the standard cross-entropy, the MMD penalty, and the entropy penalty:

<center>
<math display="block">
\min_{f_s = f_t + \Delta f} \underbrace{\left(\frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)\right)}_{\text{Typical Cross-Entropy}} + \underbrace{\frac{\gamma}{n_t}\left(\sum_{i=1}^{n_t} H(f_t(x_i^t)) \right)}_{\text{Target Entropy Minimization}} + \underbrace{\lambda\left(D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t)\right)}_{\text{MMD Penalty}}
</math>
</center>

Here $\gamma$ and $\lambda$ are tradeoff parameters which weight the entropy penalty and the MMD penalty, respectively.
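Once the three terms have been computed for a mini-batch, the objective is a weighted sum. In the sketch below the function name and the three input values are placeholders, and the default weights of 0.3 are the values reported in the experimental set-up.

```python
def rtn_objective(source_cross_entropy, target_entropy, mmd, gamma=0.3, lam=0.3):
    """Source cross-entropy + gamma * mean target entropy + lambda * MMD penalty."""
    return source_cross_entropy + gamma * target_entropy + lam * mmd

# Placeholder values for the three terms on one mini-batch.
loss = rtn_objective(source_cross_entropy=0.40, target_entropy=0.50, mmd=0.20)
```

Setting `gamma` or `lam` to zero switches off the corresponding penalty, which is one way to picture the reduced variants considered in the ablation study.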

The full network, which is trained subject to the above optimization problem, thus takes on the following structure.

[[File:rtn-full-paper-structure.png||center|alt=The Structure of the RTN]]

== Experiments ==

=== Set-up ===
The performance of the RTN was evaluated on two key data sets in the area of unsupervised domain adaptation: Office-31 (discussed in the introduction) and Office-Caltech (maintained by the same project group). Office-31 comprises images of 31 different objects from 3 sources: Amazon ('''A'''), Webcam ('''W'''), and DSLR ('''D'''). Office-Caltech is derived by considering the 10 classes common to both Office-31 and the Caltech-256 ('''C''') data set, thus providing further adaptation possibilities. This provides 6 transfer tasks on the 31 classes of Office-31 ($\{(A,W), (A,D), (W,A), (W,D), (D,A), (D,W)\}$) and 12 transfer tasks on the 10 classes of Office-Caltech ($\{(A,W), (A,D), (A,C), (W,A), (W,D), (W,C), (D,A), (D,W), (D,C), (C,A), (C,W), (C,D)\}$).
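The task counts follow from taking every ordered (source, target) pair of distinct domains, which can be checked with a short enumeration. The variable names are illustrative.

```python
from itertools import permutations

# Each transfer task is an ordered pair (source domain, target domain).
office31_tasks = list(permutations(["A", "W", "D"], 2))
office_caltech_tasks = list(permutations(["A", "W", "D", "C"], 2))
total_tasks = len(office31_tasks) + len(office_caltech_tasks)
```

Three domains give 3 × 2 ordered pairs and four domains give 4 × 3, for 18 tasks in total.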

The authors then compare the results on the 18 different adaptation tasks against 6 other models. In order to determine the efficacy of the various contributions outlined in the paper they perform an ablation study, evaluating variants of the RTN. Specifically, they consider the RTN with only the MMD module ('''RTN (mmd)'''), the RTN with the MMD module and the entropy minimization ('''RTN (mmd+ent)'''), and the complete RTN ('''RTN (mmd+ent+res)'''). The experiments leverage all the labeled training data and compute accuracy across all unlabeled domain data. The parameters of the model (i.e. $\gamma$, and $\lambda$) are fixed based on a single validation point on the transfer task $\mathbf{A}\to\mathbf{W}$. These parameters are then maintained across all transfer tasks.

As for specification details, the authors use mini-batch SGD, with momentum $0.9$, and with the learning rate adjusted based on $\eta_p = \frac{\eta_0}{(1 + \alpha p)^\beta}$, where $p$ indicates the portion of training completed (linear from $0$ to $1$), $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$, which was optimized for low error on the source. The MMD and entropy parameters, set as above, were maintained at $\lambda = 0.3$ and $\gamma = 0.3$.
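The learning-rate schedule can be written as a one-line function of training progress. The function name is an illustrative choice; the default constants are the ones quoted above.

```python
def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_p = eta0 / (1 + alpha * p) ** beta, for p in [0, 1]."""
    return eta0 / (1.0 + alpha * p) ** beta

# Learning rate at the start, the midpoint, and the end of training.
schedule = [learning_rate(p) for p in (0.0, 0.5, 1.0)]
```

The rate starts at $\eta_0$ and decays monotonically, ending at roughly one sixth of its initial value.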

=== Results ===
[[File:table-1-results.PNG|thumb|right|Results from the Office-31 Experiment]][[File:table-2-results.PNG|thumb|right|Results from the Office-Caltech Experiment]]
In aggregate, the network outperformed all comparison methods, across all transfer tasks. Broadly speaking, the network saw the largest increases in accuracy on the hard transfer tasks (for instance $\mathbf{A} \to \mathbf{C}$), where the discrepancy between the source and target domains is large. The authors take this to mean that the proposed model learns "more adaptive classifiers and transferable features for safer domain adaptation." They further indicate that standard deep learning techniques (i.e. just AlexNet) perform similarly to standard shallow techniques (TCA and GFK). Deep-transfer methods which focus on feature adaptation perform significantly better than the standard methods. The proposed RTN, which adds considerations for classifier adaptation, performs even better.

In addition, the ablation study found a number of interesting results:
# The RTN (mmd) outperforms DAN, which is founded on a similar method, but contains multiple MMD penalties (one for each layer instead of on a bottleneck), and is as such less computationally efficient;
# The addition of the entropy penalty [RTN (mmd+ent)] provides significant marginal benefit over the previous RTN (mmd);
# The full RTN [RTN (mmd+ent+res)] performs the best of all variants, but diminishing returns are seen relative to the addition of the entropy penalty.

Overall the authors claim that the RTN (mmd+ent+res) is now regarded as state-of-the-art for unsupervised domain adaptation.

=== Discussion ===
[[File:t-sne-embeddings.png|thumb|left|t-SNE Embeddings Comparing the Performance of DAN and RTN]]
[[File:mean-sd-layer-outputs.png|thumb|right|The Mean and Standard Deviations of the outputs from the Source Classifier, Target Classifier, and Residual Functions. As expected, the residual function provides a small, but non-zero, contribution.]]
[[File:gamma-tradeoff.png|thumb|left|The accuracy of tests by varying the parameter $\gamma$. We first see an increase in accuracy up to an ideal point, before having the accuracy fall again.]]
[[File:classifier-shift.png|thumb|right|The corresponding weights of the classifier layers, if trained on the labeled source and target data, exhibiting the differences which exist between the two classifiers in an ideal state. ]]

==== Visualizing Predictions (Versus DAN) ====
DAN uses a similar method for feature adaptation but neglects any attempt at classifier adaptation (i.e. it makes the shared-classifier assumption). In order to demonstrate that this leads to worse performance, the authors provide images showing the t-SNE embeddings by DAN and RTN on the transfer task $\mathbf{A} \to \mathbf{W}$. The images show that, under DAN, the target categories are not well discriminated by the source classifier, suggesting a violation of the shared-classifier assumption. Conversely, the target classifier for the RTN exhibits better discrimination.

==== Layer Responses and Classifier Shift ====
The authors further examine the mean and standard deviation of the outputs of $f_S(x)$, $f_T(x)$ and $\Delta f(x)$ to assess the relative contributions of the different components. As expected, $\Delta f(x)$ provides a small (though non-zero) contribution to the learned source classifier. This lends some support to the idea of residual learning on the classifiers.

In addition, the authors train classifiers on the source and target data, with labels present, and compare the realized weights. This is used to test how different the ideal weights are on separate classifiers. The results suggest that there is, in fact, a discrepancy between the classifiers, further motivating the use of tactics to avoid the shared-classifier assumption.

==== Parameter Sensitivity ====
Lastly, the authors test the sensitivity of these results against the parameter $\gamma$. They run this test on $\mathbf{A}\to\mathbf{W}$ in addition to $\mathbf{C}\to\mathbf{W}$, varying the parameter from $0.01$ to $1.0$. They find that, on both tasks, the increase of the parameter initially improves accuracy, before seeing a drop-off.

== Conclusion ==
This paper presented a novel approach to unsupervised domain adaptation which relaxed assumptions made by previous models with regard to the shared nature of classifiers. Like previous models this proposed network leverages feature adaptation by matching the distributions of features across the domains. In addition, using a residual network and entropy minimization tactic, the target classifier is allowed to differ from the source classifier. In particular, this approach allows for easy integration into existing networks, and can be implemented with any standard deep learning software.

For follow-up considerations, the authors propose looking for adaptations which may be useful in the semi-supervised domain adaptation problem.

== Critique ==
While the paper presents a clear approach, which empirically attains strong results on the desired tasks, I question the benefit of the residual block that is employed. The results of the ablation study seem to suggest that the majority of the benefit can be derived from the MMD and entropy penalties; the residual block appears to add marginal, perhaps insignificant, contributions to the outcome. That said, the use of an MMD loss is not novel, and the entropy loss is less well documented and less thoroughly explored. Perhaps a different set of ablations would have indicated that the three parts are, indeed, equally effective (and that the diminishing returns stem from stacking the three methods), but as presented, I question the utility of the final structure versus a less complicated, less novel approach.

==References==
# https://en.wikipedia.org/wiki/Domain_adaptation
# https://people.eecs.berkeley.edu/~jhoffman/domainadapt/
# Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification: A deep learning approach." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.
# Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint arXiv:1412.3474 (2014).
# Long, Mingsheng, et al. "Learning transferable features with deep adaptation networks." International Conference on Machine Learning. 2015.
# Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised domain adaptation by backpropagation." International Conference on Machine Learning. 2015.
# Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." Proceedings of the IEEE International Conference on Computer Vision. 2015.
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
# Yang, Jun, Rong Yan, and Alexander G. Hauptmann. "Cross-domain video concept detection using adaptive svms." Proceedings of the 15th ACM international conference on Multimedia. ACM, 2007.
# Duan, Lixin, et al. "Domain adaptation from multiple sources via auxiliary classifiers." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
# Duan, Lixin, et al. "Visual event recognition in videos by learning from web data." IEEE Transactions on Pattern Analysis and Machine Intelligence 34.9 (2012): 1667-1680.
# http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf
# https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space
# Long, Mingsheng, et al. "Learning transferable features with deep adaptation networks." International Conference on Machine Learning. 2015.
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
# Grandvalet, Yves, and Yoshua Bengio. "Semi-supervised learning by entropy minimization." Advances in neural information processing systems. 2005.

Expert review from the NIPS community can be found in https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/99.html.

Implementation Example: https://github.com/thuml/Xlearn

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:53:18Z

A2prasad: /* References */

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:52:41Z

A2prasad: /* References */

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:51:13Z

A2prasad: /* References */

== Introduction ==
'''Domain Adaptation''' [https://en.wikipedia.org/wiki/Domain_adaptation]is a problem in machine learning which involves taking a model which has been trained on a source domain, and applying this to a different (but related) target domain. '''Unsupervised domain adaptation''' refers to the situation in which the source data is labelled, while the target data is (predominantly) unlabeled. The problem at hand is then finding ways to generalize the learning on the source domain to the target domain. In the age of deep networks this problem has become particularly salient due to the need for vast amounts of labeled training data, in order to reap the benefits of deep learning. Manual generation of labeled data is often prohibitive, and in absence of such data networks are rarely performant. The attempt to circumvent this drought of data typically necessitates the gathering of "off-the-shelf" data sets, which are tangentially related and contain labels, and then building models in these domains. The fundamental issue that unsupervised domain adaptation attempts to address is overcoming the inherent shift in distribution across the domains, without the ability to observe this shift directly.

This paper proposes a method for unsupervised domain adaptation which relies on three key components:
# A kernel-based penalty to ensure that the abstract representations generated by the networks hidden layers are similar between the source and the target data;
# An entropy based penalty on the target classifier, which exploits the entropy minimization principle; and
# A residual network structure is appended, which allows the source and target classifiers to differ by a (learned) residual function, thus relaxing the shared classifier assumption which is traditionally made.

This method outperforms state-of-the-art techniques on common benchmark datasets, and is flexible enough to be applied in most feed-forward neural networks.

[[File:Source-and-Target-Domain-Office-31-Backpack.png|thumb|right|The Office-31 Dataset Images for Backpack. Shows the variation in the source and target domains to motivate why these methods are important.]]
=== Working Example (Office-31) ===
In order to assist in the understanding of the methods, it is helpful to have a tangible sense of the problem front of mind. The Domain Adaptation Project [https://people.eecs.berkeley.edu/~jhoffman/domainadapt/] provides data sets which are tailored to the problem of unsupervised domain adaptation. One of these data sets (which is later used in the experiments of this paper) has images which are labeled based on the Amazon product page for the various items. There are then corresponding pictures taken either by webcams or digital SLR cameras. The goal of unsupervised domain adaptation on this data set would be to take any of the three image sources as the source domain, and transfer a classifier to the other domain; see the example images to understand the differences.

One can imagine that, while it is likely easy to scrape labeled images from Amazon, it is likely far more difficult to collect labeled images from webcam or DSLR pictures directly. The ultimate goal of this method would be to train a model to recognize a picture of a backpack taken with a webcam, based on images of backpacks scraped from Amazon (or similar tasks).

== Related Work ==
Broadly speaking, domain adaptation mitigates the need for manual labeling of data in areas such as machine learning, computer vision, and natural language processing. The general goal of domain adaptation is to reduce the discrepancy in probability distributions between the source and target domains.

Research into the use of Deep Neural Networks for the purpose of domain adaptation has suggested that, while networks learn abstract feature representations which can reduce the discrepancy across domains, it is not possible to wholly remove it [http://www.icml-2011.org/papers/342_icmlpaper.pdf], [https://arxiv.org/pdf/1412.3474.pdf]. Further work has been done to design networks which adapt traditional deep nets (typically CNNs) to specifically address the problems posed by domain adaptation; however, these methods address only the issue of feature adaptation [https://arxiv.org/pdf/1502.02791.pdf], [https://arxiv.org/pdf/1409.7495.pdf], [https://people.eecs.berkeley.edu/~jhoffman/papers/Tzeng_ICCV2015.pdf]. That is, they all assume that the target and source classifiers are shared between domains.

The authors drew particular motivation from He et al. [https://arxiv.org/abs/1512.03385] with the proposed structure of residual networks. Combining the insights from the ResNet architecture with previous work that had leveraged classifier adaptation (in the context where some target data is labeled) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.8224&rep=rep1&type=pdf], [http://www.machinelearning.org/archive/icml2009/papers/445.pdf], [http://ieeexplore.ieee.org/document/5539870/], the authors develop their proposed network.

== Residual Transfer Networks ==
Generally, in an unsupervised domain adaptation problem, we are dealing with a set $\mathcal{D}_s$ (called the source domain) which is defined by $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$. That is, $\mathcal{D}_s$ is the set of all labeled input-output pairs in our source data set, and $n_s$ denotes the number of source elements. There is a corresponding set $\mathcal{D}_t = \{(x_i^t)\}_{i=1}^{n_t}$ (the target domain), consisting of $n_t$ unlabeled input values.
[[File:RTN-Structure.png|thumb|left|upright|The overarching structure of the RTN. Consists of an existing network, to which a bottleneck, MMD block, and residual block is appended.]]
We can think of $\mathcal{D}_s$ as being sampled from some underlying distribution $p$, and $\mathcal{D}_t$ as being sampled from $q$. Generally we have that $p \neq q$, partially motivating the need for domain adaptation methods.

We can consider the classifiers $f_s(\underline{x})$ and $f_t(\underline{x})$, for the source domain and target domain respectively. It is possible to learn $f_s$ based on the sample $\mathcal{D}_s$. Under the '''shared classifier assumption''' it would be the case that $f_s(\underline{x}) = f_t(\underline{x})$, and thus learning the source classifier is enough. This method relaxes this assumption, assuming that in general $f_s \neq f_t$, and attempting to learn both.

The example network extends deep convolutional networks (in this case AlexNet [http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf]) to '''Residual Transfer Networks''', the mechanics of which are outlined below. Recall that, if $L(\cdot, \cdot)$ is taken to be the cross-entropy loss function, then the empirical error of a CNN on the source domain $\mathcal{D}_s$ is given by:

<center>
<math display="block">
\min_{f_s} \frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)
</math>
</center>

In a standard implementation, the CNN optimizes over the above loss. This will be the starting point for the RTN.
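As a minimal sketch of the empirical source loss above, the following NumPy snippet computes the mean cross-entropy from pre-softmax outputs. The function names and the toy logits are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def source_cross_entropy(logits, labels):
    """Mean cross-entropy (1/n_s) * sum_i L(f_s(x_i^s), y_i^s).

    logits: (n_s, c) array of pre-softmax source outputs (hypothetical values).
    labels: (n_s,) array of integer class labels.
    """
    probs = softmax(logits)
    n_s = logits.shape[0]
    # Select the predicted probability of the true class for each sample.
    true_class_probs = probs[np.arange(n_s), labels]
    return float(-np.mean(np.log(true_class_probs)))

# Toy example: two source samples, three classes.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
loss = source_cross_entropy(logits, labels)
```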

=== Structural Overview ===
The model proposed in this paper extends existing CNNs and alters the loss function that is optimized. While each of these components is discussed in depth below, the overarching architecture consists of four components:

# An existing deep model. While this can be any model, in theory, the authors leverage AlexNet in practice.
# A bottleneck layer, used to reduce the dimensionality of the learned abstract feature space, directly after the existing network.
# An MMD block, with the expressed intention of feature adaptation.
# A residual block, with the expressed intention of classifier adaptation.

This structure is then optimized against a loss function which combines the standard cross-entropy penalty with MMD and target entropy penalties, yielding the proposed Residual Transfer Network (RTN) structure.

=== Feature Adaptation ===
Feature adaptation refers to the process by which the features learned to represent the source domain are made applicable to the target domain. Broadly speaking, a CNN works to generate abstract feature representations of the distribution that the inputs are sampled from. It has been found that using these deep features can reduce, but not remove, cross-domain distribution discrepancy, hence the need for feature adaptation. It is important to note that the features learned by CNNs transition from general to specific as the network gets deeper. In this light, the discrepancy between the feature representations of the source and the target will grow with depth in a convolutional net. As such, a technique for forcing these distributions to be similar is needed.

In particular, the authors of this paper impose a bottleneck layer (call it $fc_b$) which is included after the final convolutional layer of AlexNet. This dense layer is connected to an additional dense layer $fc_c$ (which will serve as the target classification layer). They then compute the tensor product between the activations of the two layers, performing "lossless multi-layer feature fusion". That is, for the source domain they define $z_i^s \overset{\underset{\mathrm{def}}{}}{=} x_i^{s,fc_b}\otimes x_i^{s,fc_c}$ and for the target domain, $z_i^t \overset{\underset{\mathrm{def}}{}}{=} x_i^{t,fc_b}\otimes x_i^{t,fc_c}$. The authors then perform feature adaptation by means of the Maximum Mean Discrepancy, between the source and target domains, on these fusion features.
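The fusion step can be sketched as a per-sample outer product of the two layers' activations, flattened into a single vector. This is an illustrative NumPy sketch; the function name <code>fuse_features</code> and the layer widths are assumptions.

```python
import numpy as np

def fuse_features(fc_b, fc_c):
    """Per-sample tensor (outer) product of two activation matrices.

    fc_b: (n, d_b) bottleneck activations.
    fc_c: (n, d_c) classifier-layer activations.
    Returns an (n, d_b * d_c) array whose i-th row is vec(fc_b[i] (x) fc_c[i]).
    """
    fused = np.einsum("nb,nc->nbc", fc_b, fc_c)
    return fused.reshape(fc_b.shape[0], -1)

# Toy example with hypothetical sizes: 5 samples, d_b = 4, d_c = 3.
rng = np.random.default_rng(0)
fc_b = rng.normal(size=(5, 4))
fc_c = rng.normal(size=(5, 3))
z = fuse_features(fc_b, fc_c)  # shape (5, 12)
```

Because every pairwise product of activations is kept, the fused dimension grows as $d_b \times d_c$, which is one reason a low-dimensional bottleneck is useful before fusion.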

[[File:RTN-MMD-Block.png|right|thumb|The Maximum Mean Discrepancy Block (MMD) included in the RTN. The outputs of $fc_b$ and $fc_c$ are fused through a tensor product, and then passed through the MMD penalty, ensuring distributional similarity.]]

==== Maximum Mean Discrepancy ====
The Maximum Mean Discrepancy (MMD) is a kernel method that involves mapping distributions into a Reproducing Kernel Hilbert Space (RKHS) [https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space]. Denote by $\mathcal{H}_K$ the RKHS with a characteristic kernel $K$. We then define the '''mean embedding''' of a distribution $p$ in $\mathcal{H}_K$ to be the unique element $\mu_K(p)$ such that $\mathbf{E}_{x\sim p}f(x) = \langle f, \mu_K(p)\rangle_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. Now, if we take $\phi: \mathcal{X} \to \mathcal{H}_K$ to be the feature map associated with $K$, then we can define the MMD between two distributions $p$ and $q$ as follows:

<center>
<math display="block">
d_k(p, q) \overset{\underset{\mathrm{def}}{}}{=} ||\mathbf{E}_{x^s\sim p}(\phi(x^s)) - \mathbf{E}_{x^t\sim q}(\phi(x^t))||_{\mathcal{H}_K}
</math>
</center>

Effectively, the (squared) MMD computes the self-similarity of $p$ and of $q$, and subtracts twice the cross-similarity between the distributions: $\widehat{\text{MMD}}^2 = \text{mean}(K_{pp}) + \text{mean}(K_{qq}) - 2\times\text{mean}(K_{pq})$. Because the kernel is characteristic, $p$ and $q$ are the same distribution if and only if $\text{MMD} = 0$. If we then wish to force two distributions to be similar, this becomes a minimization problem over the MMD.
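The decomposition into kernel means follows from expanding the squared norm. As a short derivation, assume that the feature map satisfies $K(x, x') = \langle \phi(x), \phi(x')\rangle_{\mathcal{H}_K}$, and write $x, x'$ for independent draws from $p$ and $y, y'$ for independent draws from $q$ (this notation is introduced here for illustration only):

<center>
<math display="block">
\begin{aligned}
d_k^2(p, q) &= \langle \mathbf{E}_{x}\phi(x) - \mathbf{E}_{y}\phi(y),\ \mathbf{E}_{x'}\phi(x') - \mathbf{E}_{y'}\phi(y') \rangle_{\mathcal{H}_K} \\
&= \mathbf{E}_{x,x'}\langle\phi(x), \phi(x')\rangle_{\mathcal{H}_K} + \mathbf{E}_{y,y'}\langle\phi(y), \phi(y')\rangle_{\mathcal{H}_K} - 2\,\mathbf{E}_{x,y}\langle\phi(x), \phi(y)\rangle_{\mathcal{H}_K} \\
&= \mathbf{E}_{x,x'}K(x, x') + \mathbf{E}_{y,y'}K(y, y') - 2\,\mathbf{E}_{x,y}K(x, y)
\end{aligned}
</math>
</center>

Replacing each expectation with an average over the available samples gives an empirical estimate that only requires kernel evaluations, never the (possibly infinite-dimensional) feature map itself.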

Two important notes:
# The RKHS, and as such MMD, depend on the choice of the kernel;
# Computing the MMD efficiently requires an unbiased estimate of the MMD (as outlined [https://arxiv.org/pdf/1502.02791.pdf]).

==== MMD for Feature Adaptation in the RTN ====
The authors wish to minimize the MMD between the fusion features, outlined above, derived from the source and target domains. Concretely, this amounts to forcing the distribution of the abstract representation of the source domain $\mathcal{D}_s$ to be similar to the distribution of the abstract representation of the target domain $\mathcal{D}_t$. Performing this optimization over the fused features of $fc_b$ and $fc_c$ forces each of those layers towards similar distributions across the two domains.

Practically this involves an additional penalty function given by the following:

<center>
<math display="block">
D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t) = \sum_{i,j=1}^{n_s} \frac{k(z_i^s, z_j^s)}{n_s^2} + \sum_{i,j=1}^{n_t} \frac{k(z_i^t, z_j^t)}{n_t^2} - 2\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \frac{k(z_i^s, z_j^t)}{n_sn_t}
</math>
</center>

where the characteristic kernel $k(z, z')$ is the Gaussian kernel, defined on the vectorization of tensors, with bandwidth parameter $b$. That is: $k(z, z') = \exp(-||\mathrm{vec}(z) - \mathrm{vec}(z')||^2/b)$.
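A minimal NumPy sketch of this penalty is given below, using the Gaussian kernel with bandwidth $b$ and the cross-similarity term subtracted twice, as in the MMD estimate described earlier. The function names, the default bandwidth, and the toy data are illustrative assumptions; this simple version forms the full kernel matrices and is not the efficient estimator used in practice.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """k(z, z') = exp(-||z - z'||^2 / bandwidth) for all pairs of rows."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / bandwidth)

def mmd_penalty(z_s, z_t, bandwidth=1.0):
    """Empirical squared MMD between (vectorized) source and target features."""
    k_ss = gaussian_kernel(z_s, z_s, bandwidth)  # source self-similarity
    k_tt = gaussian_kernel(z_t, z_t, bandwidth)  # target self-similarity
    k_st = gaussian_kernel(z_s, z_t, bandwidth)  # cross-similarity
    # .mean() divides by n_s^2, n_t^2 and n_s * n_t respectively.
    return float(k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean())

# Toy example: identical samples versus a shifted copy.
rng = np.random.default_rng(0)
z_s = rng.normal(size=(50, 4))
same = mmd_penalty(z_s, z_s)
shifted = mmd_penalty(z_s, z_s + 3.0)
```

In this toy example <code>same</code> is zero up to round-off, while <code>shifted</code> is strictly positive, which is the behaviour the penalty relies on.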

=== Classifier Adaptation ===
Traditional unsupervised domain adaptation makes a '''shared-classifier assumption'''. In essence, if $f_s(x)$ represents the classifier on the source domain, and $f_t(x)$ represents the classifier on the target domain, then this assumption simply states that $f_s = f_t$. While this may seem to be a reasonable assumption at first glance, it is problematic largely because it is incredibly difficult to check. If it could be readily confirmed that the source and target classifiers could be shared, then the problem of domain adaptation would be largely trivialized. Instead, the authors here relax this assumption slightly. They postulate that instead of being equivalent, the source and target classifiers differ by some perturbation function $\Delta f$. The general idea is to assume that $f_S(x) = f_T(x) + \Delta f(x)$, where $f_S$ and $f_T$ correspond to the pre-activation outputs of the source and target classifiers, and $\Delta f(x)$ is some residual function.

The authors then suggest using residual blocks, as popularized by the ResNet framework [https://arxiv.org/pdf/1512.03385.pdf], to learn this residual function.

[[File:Residual-Block-vs-DNN.png|thumb|left|A comparison of a standard Deep Neural Network block which is designed to fit a function H(x) compared to a residual block which fits H(x) as the sum of the input, x, and a learned residual function, F(X).]]
==== Residual Networks Framework ====
A (Deep) Residual Network, as proposed initially in ResNet, employs residual blocks to assist in the learning process; these blocks were a key component in being able to train extraordinarily deep networks. A Residual Network is constructed largely in the same manner as a standard neural network, with one key difference, namely the inclusion of residual blocks: sets of layers which aim to estimate a residual function in place of estimating the function itself.

That is, if we wish to use a DNN to estimate some function $h(x)$, a residual block will decompose this as $h(x) = F(x) + x$. The layers in the block are then used to learn the residual function $F(x)$, after which the input $x$ is recombined through element-wise addition to form $h(x) = F(x) + x$. This was initially proposed as a way to allow deeper networks to be trained effectively, but has since been used in novel contexts.
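The following is a minimal NumPy sketch of a residual block's forward pass. It assumes two dense layers with a ReLU between them for $F(x)$; the layer count, sizes, and names are illustrative choices rather than the configuration of any particular network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, b1, w2, b2):
    """Forward pass h(x) = F(x) + x, with F a two-layer dense map.

    x: (n, d) inputs; w1: (d, m); w2: (m, d), so F(x) has the same shape as x.
    """
    hidden = relu(x @ w1 + b1)
    residual = hidden @ w2 + b2  # F(x)
    return residual + x          # skip connection via element-wise addition

# With all-zero weights F(x) = 0, so the block reduces to the identity map.
x = np.arange(6.0).reshape(2, 3)
h = residual_block(x, np.zeros((3, 4)), np.zeros(4), np.zeros((4, 3)), np.zeros(3))
```

The identity behaviour at zero weights illustrates why such blocks are easy to optimize: the layers only have to learn the deviation from the input.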

==== Residual Blocks in the RTN ====
[[File:RTN-Residual-Block.png|thumb|right|The Structure of the Residual Block in the RTN framework. The block relies on two additional dense layers following the target classifier in an attempt to learn the residual difference between the source and target classifiers.]] The authors leverage residual blocks for the purpose of classifier adaptation. Operating under the assumption that the source and target classifiers differ by an arbitrary perturbation function, $\Delta f(x)$, the authors add an additional set of densely connected layers through which the source data will flow. In particular, the authors take the $fc_c$ layer above as the desired target classifier. For the source data, an additional set of layers ($fc-1$ and $fc-2$) is added following $fc_c$, connected as a residual block. The output of the classifier layer is then added back to the output of the residual block in order to form the source classifier.

It is necessary to note that in this case the output of $fc_c$ is passed to the element-wise addition non-activated (i.e. before the softmax activation); the result of the addition is then passed through the activation layer, yielding the source prediction. In the provided diagram, $f_S(x)$ represents the non-activated output from the additive layer in the residual block; $f_T(x)$ represents the non-activated output from the target classifier; and $fc-1$/$fc-2$ are used to learn the perturbation function $\Delta f(x)$.
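The data flow through the classifier and residual layers can be sketched as follows. This is an illustrative NumPy sketch, assuming a ReLU between the two residual layers and arbitrary layer widths; the function and argument names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def rtn_classifier_outputs(features, w_c, b_c, w1, b1, w2, b2):
    """Return the pre-softmax outputs (f_S, f_T, delta_f).

    features: (n, d) bottleneck activations fed into fc_c.
    w_c, b_c: weights of the target classifier layer fc_c.
    w1, b1, w2, b2: weights of the two residual layers (fc-1 and fc-2).
    """
    f_t = features @ w_c + b_c               # target classifier, non-activated
    delta_f = relu(f_t @ w1 + b1) @ w2 + b2  # learned perturbation
    f_s = f_t + delta_f                      # source classifier, non-activated
    return f_s, f_t, delta_f

# Toy example: 4 samples, 8 features, 3 classes, residual width 5.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))
w_c, b_c = rng.normal(size=(8, 3)), np.zeros(3)
w1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
w2, b2 = rng.normal(size=(5, 3)), np.zeros(3)
f_s, f_t, delta_f = rtn_classifier_outputs(features, w_c, b_c, w1, b1, w2, b2)
```

Source samples would be scored with a softmax over <code>f_s</code>, and target samples with a softmax over <code>f_t</code>.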

==== Entropy Minimization ====
In addition to the residual blocks, the authors make use of the '''entropy minimization principle''' [http://www.iro.umontreal.ca/~lisa/pointeurs/semi-supervised-entropy-nips2004.pdf] to further refine the classifier adaptation. In particular, by minimizing the entropy of the target classifier (or more correctly, the entropy of the class conditional distribution $f_j^t(x_i^t) = p(y_i^t = j \mid x_i^t; f_t)$), low-density separation between the classes is encouraged. '''Low-Density Separation''' is a concept used predominantly in semi-supervised learning, which in essence tries to draw class decision boundaries in regions where there are few data points (labeled or unlabeled). The above paper leverages an entropy regularization scheme to achieve low-density separation; this is adapted here to the case of unsupervised domain adaptation.

In practice this amounts to adding a further penalty based on the entropy of the class conditional distribution. In particular, if $H(\cdot)$ is defined to be the entropy function, such that $H(f_t(x_i^t)) = - \sum_{j=1}^c f_j^t(x_i^t)\log f_j^t(x_i^t)$, where $c$ is the number of classes and $f_j^t(x_i^t)$ represents the probability of predicting class $j$ for point $x_i^t$, then over the target domain $\mathcal{D}_t$ we define the entropy penalty to be:

<center>
<math display="block">
\frac{1}{n_t} \sum_{i=1}^{n_t} H(f_t(x_i^t))
</math>
</center>

The authors hypothesize that the combination of residual learning and the entropy penalty will enable effective classifier adaptation.
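As a minimal sketch, the entropy penalty can be computed from the target class probabilities as below. The small constant <code>eps</code> is an assumption added here to avoid taking the logarithm of zero; the function name and toy inputs are illustrative.

```python
import numpy as np

def target_entropy_penalty(probs, eps=1e-12):
    """Mean entropy (1/n_t) * sum_i H(f_t(x_i^t)).

    probs: (n_t, c) array whose rows are predicted class probabilities.
    eps: small constant guarding against log(0) (an implementation assumption).
    """
    entropies = -np.sum(probs * np.log(probs + eps), axis=1)
    return float(entropies.mean())

# Confident predictions give a low penalty; uniform predictions give log(c).
confident = np.array([[1.0, 0.0, 0.0, 0.0]])
uniform = np.full((1, 4), 0.25)
low = target_entropy_penalty(confident)
high = target_entropy_penalty(uniform)
```

Minimizing this quantity pushes each target prediction towards a single class, which is what moves the decision boundary away from dense regions of target data.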

=== Residual Transfer Network ===
The combination of the MMD loss introduced in feature adaptation, the residual block introduced in classifier adaptation, and the application of the entropy minimization principle culminates in the Residual Transfer Network proposed by the authors. The model is optimized according to the following loss function, which combines the standard cross-entropy, MMD penalty, and entropy penalty:

<center>
<math display="block">
\min_{f_s = f_t + \Delta f} \underbrace{\left(\frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)\right)}_{\text{Typical Cross-Entropy}} + \underbrace{\frac{\gamma}{n_t}\left(\sum_{i=1}^{n_t} H(f_t(x_i^t)) \right)}_{\text{Target Entropy Minimization}} + \underbrace{\lambda\left(D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t)\right)}_{\text{MMD Penalty}}
</math>
</center>

where $\gamma$ and $\lambda$ are tradeoff parameters that weight the entropy penalty and the MMD penalty, respectively, against the cross-entropy loss.
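Once the three terms have been computed, the objective is simply their weighted sum. The sketch below assumes the individual terms are already available as scalars; the function name is hypothetical, and the defaults of $0.3$ are the values reported in the experiments.

```python
def rtn_objective(cross_entropy, target_entropy, mmd, gamma=0.3, lam=0.3):
    """Weighted sum of the three terms in the RTN loss.

    cross_entropy: mean source cross-entropy.
    target_entropy: mean entropy of the target predictions.
    mmd: MMD penalty between source and target fusion features.
    gamma, lam: tradeoff parameters for the entropy and MMD penalties.
    """
    return cross_entropy + gamma * target_entropy + lam * mmd

# Hypothetical term values, for illustration only.
total = rtn_objective(cross_entropy=1.0, target_entropy=0.5, mmd=0.2)
```

Setting <code>gamma=0</code> or <code>lam=0</code> switches off the corresponding penalty.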

The full network, which is trained subject to the above optimization problem, thus takes on the following structure.

[[File:rtn-full-paper-structure.png||center|alt=The Structure of the RTN]]

== Experiments ==

=== Set-up ===
The performance of RTN was evaluated on two key data sets in the area of unsupervised domain adaptation: Office-31 (discussed in the introduction) and Office-Caltech (maintained by the same project group). Office-31 comprises images of 31 different objects from 3 sources: Amazon ('''A'''), Webcam ('''W'''), and DSLR ('''D'''). Office-Caltech is derived by considering the 10 classes common to both the Office-31 and the Caltech ('''C''') data sets, thus providing further adaptation possibilities. This provides 6 transfer tasks on the 31 classes of Office-31 ($\{(A,W), (A,D), (W,A), (W,D), (D,A), (D,W)\}$) and 12 transfer tasks on the 10 classes of Office-Caltech ($\{(A,W), (A,D), (A,C), (W,A), (W,D), (W,C), (D,A), (D,W), (D,C), (C,A), (C,W), (C,D)\}$).
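The task counts follow from taking every ordered (source, target) pair of distinct domains, which can be checked with a short Python sketch (the variable names are illustrative):

```python
from itertools import permutations

office31_domains = ["A", "W", "D"]
office_caltech_domains = ["A", "W", "D", "C"]

# Each transfer task is an ordered pair of two different domains.
office31_tasks = list(permutations(office31_domains, 2))
office_caltech_tasks = list(permutations(office_caltech_domains, 2))

n_tasks = len(office31_tasks) + len(office_caltech_tasks)
```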

The authors then compare the results on the 18 different adaptation tasks against 6 other models. In order to determine the efficacy of the various contributions outlined in the paper, they perform an ablation study, evaluating variants of the RTN. Specifically, they consider the RTN with only the MMD module ('''RTN (mmd)'''), the RTN with the MMD module and the entropy minimization ('''RTN (mmd+ent)'''), and the complete RTN ('''RTN (mmd+ent+res)'''). The experiments leverage all the labeled source data for training and compute accuracy across all unlabeled target domain data. The parameters of the model (i.e. $\gamma$ and $\lambda$) are fixed based on a single validation point on the transfer task $\mathbf{A}\to\mathbf{W}$. These parameters are then maintained across all transfer tasks.

As for specification details, the authors use mini-batch SGD with momentum $0.9$, and with the learning rate adjusted according to $\eta_p = \frac{\eta_0}{(1 + \alpha p)^\beta}$, where $p$ indicates the portion of training completed (increasing linearly from $0$ to $1$), $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$; this schedule was optimized for low error on the source. The MMD and entropy parameters, set as above, were maintained at $\lambda = 0.3$ and $\gamma = 0.3$.
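The learning rate schedule can be written as a one-line function. This is a minimal sketch using the constants quoted above; the function name is an illustrative choice.

```python
def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_p = eta0 / (1 + alpha * p) ** beta.

    p: fraction of training completed, between 0 and 1.
    """
    return eta0 / (1.0 + alpha * p) ** beta

# Learning rate at the start, midpoint, and end of training.
schedule = [learning_rate(p) for p in (0.0, 0.5, 1.0)]
```

The rate starts at $\eta_0$ and decays monotonically as training progresses.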

=== Results ===
[[File:table-1-results.PNG|thumb|right|Results from the Office-31 Experiment]][[File:table-2-results.PNG|thumb|right|Results from the Office-Caltech Experiment]]
In aggregate, the network outperformed all comparison methods, across all transfer tasks. Broadly speaking, the network saw the largest increases in accuracy on the hard transfer tasks (for instance $\mathbf{A} \to \mathbf{C}$), where the discrepancy between the source and target domains is large. The authors take this to mean that the proposed model learns "more adaptive classifiers and transferable features for safer domain adaptation." They further indicate that standard deep learning techniques (i.e. just AlexNet) perform similarly to standard shallow techniques (TCA and GFK). Deep-transfer methods which focus on feature adaptation perform significantly better than the standard methods. The proposed RTN, which adds considerations for classifier adaptation, performs even better.

In addition, the ablation study found a number of interesting results:
# The RTN (mmd) outperforms DAN, which is founded on a similar method, but contains multiple MMD penalties (one for each layer instead of on a bottleneck), and is as such less computationally efficient;
# The addition of the entropy penalty [RTN (mmd+ent)] provides significant marginal benefit over the previous RTN (mmd);
# The full RTN [RTN (mmd+ent+res)] performs the best of all variants, but diminishing returns are seen relative to the addition of the entropy penalty.

Overall, the authors claim that the RTN (mmd+ent+res) achieves state-of-the-art performance for unsupervised domain adaptation.

=== Discussion ===
[[File:t-sne-embeddings.png|thumb|left|t-SNE Embeddings Comparing the Performance of DAN and RTN]]
[[File:mean-sd-layer-outputs.png|thumb|right|The Mean and Standard Deviations of the outputs from the Source Classifier, Target Classifier, and Residual Functions. As expected, the residual function provides a small, but non-zero, contribution.]]
[[File:gamma-tradeoff.png|thumb|left|The accuracy of tests by varying the parameter $\gamma$. We first see an increase in accuracy up to an ideal point, before having the accuracy fall again.]]
[[File:classifier-shift.png|thumb|right|The corresponding weights of the classifier layers, if trained on the labeled source and target data, exhibiting the differences which exist between the two classifiers in an ideal state. ]]

==== Visualizing Predictions (Versus DAN) ====
DAN uses a similar method for feature adaptation but neglects any attempt at classifier adaptation (i.e. it makes the shared-classifier assumption). In order to demonstrate that this leads to worse performance, the authors provide images showing the t-SNE embeddings produced by DAN and RTN on the transfer task $\mathbf{A} \to \mathbf{W}$. The images show that, for DAN, the target categories are not well discriminated by the source classifier, suggesting a violation of the shared-classifier assumption. Conversely, the target classifier for the RTN exhibits better discrimination.

==== Layer Responses and Classifier Shift ====
The authors further examine the mean and standard deviation of the outputs of $f_S(x)$, $f_T(x)$ and $\Delta f(x)$ to assess the relative contributions of the different components. As expected, $\Delta f(x)$ provides a small (though non-zero) contribution to the learned source classifier. This lends some support to the idea of residual learning on the classifiers.

In addition, the authors train classifiers on the source and target data, with labels present, and compare the realized weights. This is used to test how different the ideal weights are on separate classifiers. The results suggest that there is, in fact, a discrepancy between the classifiers, further motivating the use of tactics to avoid the shared-classifier assumption.

==== Parameter Sensitivity ====
Lastly, the authors test the sensitivity of these results against the parameter $\gamma$. They run this test on $\mathbf{A}\to\mathbf{W}$ in addition to $\mathbf{C}\to\mathbf{W}$, varying the parameter from $0.01$ to $1.0$. They find that, on both tasks, the increase of the parameter initially improves accuracy, before seeing a drop-off.

== Conclusion ==
This paper presented a novel approach to unsupervised domain adaptation which relaxed assumptions made by previous models with regard to the shared nature of classifiers. Like previous models this proposed network leverages feature adaptation by matching the distributions of features across the domains. In addition, using a residual network and entropy minimization tactic, the target classifier is allowed to differ from the source classifier. In particular, this approach allows for easy integration into existing networks, and can be implemented with any standard deep learning software.

For follow-up considerations, the authors propose looking for adaptations which may be useful in the semi-supervised domain adaptation problem.

== Critique ==
While the paper presents a clear approach, which empirically attains great results on the desired tasks, I question the benefit of the residual block that is employed. The results of the ablation study seem to suggest that the majority of the benefits can be derived from using the MMD and entropy penalties. The residual block appears to add marginal, perhaps insignificant, contributions to the outcome. Despite this, the use of MMD loss is not novel, and the entropy loss is less well documented and less thoroughly explored. Perhaps a different set of ablations would have indicated that the three parts are, indeed, equally effective (and that the diminishing returns stem from stacking the three methods), but as it is presented, I question the utility of the final structure versus a less complicated, less novel approach.

==References==
# https://en.wikipedia.org/wiki/Domain_adaptation
# https://people.eecs.berkeley.edu/~jhoffman/domainadapt/
# Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification: A deep learning approach." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.
# Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint arXiv:1412.3474 (2014).
# Long, Mingsheng, et al. "Learning transferable features with deep adaptation networks." International Conference on Machine Learning. 2015.
# Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised domain adaptation by backpropagation." International Conference on Machine Learning. 2015.
# Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." Proceedings of the IEEE International Conference on Computer Vision. 2015.
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
# Yang, Jun, Rong Yan, and Alexander G. Hauptmann. "Cross-domain video concept detection using adaptive svms." Proceedings of the 15th ACM international conference on Multimedia. ACM, 2007.
# Duan, Lixin, et al. "Domain adaptation from multiple sources via auxiliary classifiers." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
# Duan, Lixin, et al. "Visual event recognition in videos by learning from web data." IEEE Transactions on Pattern Analysis and Machine Intelligence 34.9 (2012): 1667-1680.
# http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf
# https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
# Grandvalet, Yves, and Yoshua Bengio. "Semi-supervised learning by entropy minimization." Advances in neural information processing systems. 2005.

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:51:00Z

A2prasad: /* References */

== Introduction ==
'''Domain Adaptation''' [https://en.wikipedia.org/wiki/Domain_adaptation]is a problem in machine learning which involves taking a model which has been trained on a source domain, and applying this to a different (but related) target domain. '''Unsupervised domain adaptation''' refers to the situation in which the source data is labelled, while the target data is (predominantly) unlabeled. The problem at hand is then finding ways to generalize the learning on the source domain to the target domain. In the age of deep networks this problem has become particularly salient due to the need for vast amounts of labeled training data, in order to reap the benefits of deep learning. Manual generation of labeled data is often prohibitive, and in absence of such data networks are rarely performant. The attempt to circumvent this drought of data typically necessitates the gathering of "off-the-shelf" data sets, which are tangentially related and contain labels, and then building models in these domains. The fundamental issue that unsupervised domain adaptation attempts to address is overcoming the inherent shift in distribution across the domains, without the ability to observe this shift directly.

This paper proposes a method for unsupervised domain adaptation which relies on three key components:
# A kernel-based penalty to ensure that the abstract representations generated by the networks hidden layers are similar between the source and the target data;
# An entropy based penalty on the target classifier, which exploits the entropy minimization principle; and
# A residual network structure is appended, which allows the source and target classifiers to differ by a (learned) residual function, thus relaxing the shared classifier assumption which is traditionally made.

This method outperforms state-of-the-art techniques on common benchmark datasets, and is flexible enough to be applied in most feed-forward neural networks.

[[File:Source-and-Target-Domain-Office-31-Backpack.png|thumb|right|The Office-31 Dataset Images for Backpack. Shows the variation in the source and target domains to motivate why these methods are important.]]
=== Working Example (Office-31) ===
In order to assist in the understanding of the methods, it is helpful to have a tangible sense of the problem front of mind. The Domain Adaptation Project [https://people.eecs.berkeley.edu/~jhoffman/domainadapt/] provides data sets which are tailored to the problem of unsupervised domain adaptation. One of these data sets (which is later used in the experiments of this paper) has images which are labeled based on the Amazon product page for the various items. There are then corresponding pictures taken either by webcams or digital SLR cameras. The goal of unsupervised domain adaptation on this data set would be to take any of the three image sources as the source domain, and transfer a classifier to the other domain; see the example images to understand the differences.

One can imagine that, while it is likely easy to scrape labeled images from Amazon, it is likely far more difficult to collect labeled images from webcam or DSLR pictures directly. The ultimate goal of this method would be to train a model to recognize a picture of a backpack taken with a webcam, based on images of backpacks scraped from Amazon (or similar tasks).

== Related Work ==
Broadly speaking, the problem of domain adaptation mitigates manual labeling of data in areas such as machine learning, computer vision, and natural language processing. The general goal of domain adaptation is to reduce the discrepancy in probability distributions between the source and target domains.

Research into the use of Deep Neural Networks for the purpose of domain adaptation has suggested that, while networks learn abstract feature representations which can reduce the discrepancy across domains, it is not possible to wholly remove it [http://www.icml-2011.org/papers/342_icmlpaper.pdf], [https://arxiv.org/pdf/1412.3474.pdf]. Further work has been done to design networks which adapt traditional deep nets (typically CNNs) to specifically address the problems posed by domain adaptation, these methods all only address the issue of feature adaptation [https://arxiv.org/pdf/1502.02791.pdf], [https://arxiv.org/pdf/1409.7495.pdf], [https://people.eecs.berkeley.edu/~jhoffman/papers/Tzeng_ICCV2015.pdf]. That is, they all assume that the target and source classifiers are shared between domains.

The authors drew particular motivation from He et al. [https://arxiv.org/abs/1512.03385] with the proposed structure of residual networks. Combining the insights from the ResNet architecture, in addition to previous work that had leveraged classifier adaptation (in the context where some target data is labeled) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.8224&rep=rep1&type=pdf], [http://www.machinelearning.org/archive/icml2009/papers/445.pdf], [http://ieeexplore.ieee.org/document/5539870/] the authors develop their proposed network.

== Residual Transfer Networks ==
Generally, in an unsupervised domain adaptation problem, we are dealing with a set $\mathcal{D}_s$ (called the source domain) which is defined by $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$. That is the set of all labeled input-output pairs in our source data set. We denote the number of source elements by $n_s$. There is a corresponding set $\mathcal{D}_t = \{(x_i^t)\}_{i=1}^{n_t}$ (the target domain), consisting of unlabeled input values. There are $n_t$ such values.
[[File:RTN-Structure.png|thumb|left|upright|The overarching structure of the RTN. Consists of an existing network, to which a bottleneck, MMD block, and residual block is appended.]]
We can think of $\mathcal{D}_s$ as being sampled from some underlying distribution $p$, and $\mathcal{D}_t$ as being sampled from $q$. Generally we have that $p \neq q$, partially motivating the need for domain adaptation methods.

We can consider the classifiers $f_s(\underline{x})$ and $f_t(\underline{x})$, for the source domain and target domain respectively. It is possible to learn $f_s$ based on the sample $\mathcal{D}_s$. Under the '''shared classifier assumption''' it would be the case that $f_s(\underline{x}) = f_t(\underline{x})$, and thus learning the source classifier is enough. This method relaxes this assumption, assuming that in general $f_s \neq f_t$, and attempting to learn both.

The example network extends deep convolutional networks (in this case AlexNet [http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf]) to '''Residual Transfer Networks''', the mechanics of which are outlined below. Recall that, if $L(\cdot, \cdot)$ is taken to be the cross-entropy loss function, then the empirical error of a CNN on the source domain $\mathcal{D}_s$ is given by:

<center>
<math display="block">
\min_{f_s} \frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)
</math>
</center>

In a standard implementation, the CNN optimizes over the above loss. This will be the starting point for the RTN.
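
As a minimal sketch of this starting point, the empirical source loss can be computed from the network's pre-softmax outputs as below. The function names and the use of NumPy are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def source_cross_entropy(logits, labels):
    """Mean cross-entropy (1/n_s) * sum_i L(f_s(x_i^s), y_i^s).

    logits: (n_s, c) array of pre-softmax source outputs (hypothetical input).
    labels: (n_s,) array of integer class labels y_i^s.
    """
    probs = softmax(logits)
    n_s = logits.shape[0]
    # Probability assigned to the true class of each source example.
    true_class_probs = probs[np.arange(n_s), labels]
    return -np.mean(np.log(true_class_probs))
```

With uninformative (all-zero) logits over $c$ classes this loss equals $\log c$, which is a convenient sanity check.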

=== Structural Overview ===
The model proposed in this paper extends existing CNNs and alters the loss function being optimized. While each of these components is discussed in depth below, the overarching architecture involves four components:

# An existing deep model. While this can be any model, in theory, the authors leverage AlexNet in practice.
# A bottleneck layer, used to reduce the dimensionality of the learned abstract feature space, directly after the existing network.
# An MMD block, with the expressed intention of feature adaptation.
# A residual block, with the expressed intention of classifier adaptation.

This structure is then optimized against a loss function which combines the standard cross-entropy penalty with MMD and target entropy penalties, yielding the proposed Residual Transfer Network (RTN) structure.

=== Feature Adaptation ===
Feature adaptation refers to the process by which the features learned to represent the source domain are made applicable to the target domain. Broadly speaking, a CNN works to generate abstract feature representations of the distribution that the inputs are sampled from. It has been found that using these deep features can reduce, but not remove, cross-domain distribution discrepancy, hence the need for feature adaptation. It is important to note that the features learned by a CNN transition from general to specific as the network gets deeper. In this light, the discrepancy between the feature representations of the source and the target grows with depth in a convolutional net. As such, a technique for forcing these distributions to be similar is needed.

In particular, the authors of this paper impose a bottleneck layer (call it $fc_b$) which is included after the final convolutional layer of AlexNet. This dense layer is connected to an additional dense layer $fc_c$ (which will serve as the target classification layer). They then compute the tensor product between the activations of the two layers, performing "lossless multi-layer feature fusion". That is, for the source domain they define $z_i^s \overset{\underset{\mathrm{def}}{}}{=} x_i^{s,fc_b}\otimes x_i^{s,fc_c}$ and for the target domain, $z_i^t \overset{\underset{\mathrm{def}}{}}{=} x_i^{t,fc_b}\otimes x_i^{t,fc_c}$. The authors then perform feature adaptation by minimizing the Maximum Mean Discrepancy between the source and target domains on these fusion features.
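
The fusion step can be sketched as a per-example outer product of the two activation vectors, flattened so that a kernel can later be applied to it. The function name and array shapes below are assumptions made for illustration.

```python
import numpy as np

def fuse_features(act_b, act_c):
    """Per-example tensor (outer) product of two layers' activations.

    act_b: (n, d_b) activations of the bottleneck layer fc_b.
    act_c: (n, d_c) activations of the classifier layer fc_c.
    Returns an (n, d_b * d_c) array: row i is vec(act_b[i] (x) act_c[i]).
    """
    # einsum builds an (n, d_b, d_c) stack of outer products.
    fused = np.einsum("nb,nc->nbc", act_b, act_c)
    # Vectorize each outer product in row-major order.
    return fused.reshape(act_b.shape[0], -1)
```

The same function would be applied to source activations to obtain $z_i^s$ and to target activations to obtain $z_i^t$.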

[[File:RTN-MMD-Block.png|right|thumb|The Maximum Mean Discrepancy Block (MMD) included in the RTN. The outputs of $fc_b$ and $fc_c$ are fused through a tensor product, and then passed through the MMD penalty, ensuring distributional similarity.]]

==== Maximum Mean Discrepancy ====
The Maximum Mean Discrepancy (MMD) is a kernel method that involves mapping to a Reproducing Kernel Hilbert Space (RKHS) [https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space]. Denote the RKHS $\mathcal{H}_K$ with a characteristic kernel $K$. We then define the '''mean embedding''' of a distribution $p$ in $\mathcal{H}_K$ to be the unique element $\mu_K(p)$ such that $\mathbf{E}_{x\sim p}f(x) = \langle f, \mu_K(p)\rangle_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. Now, if we take $\phi: \mathcal{X} \to \mathcal{H}_K$ to be the feature map associated with $K$, then we can define the MMD between two distributions $p$ and $q$ as follows:

<center>
<math display="block">
d_K(p, q) \overset{\underset{\mathrm{def}}{}}{=} \left\| \mathbf{E}_{x^s\sim p}[\phi(x^s)] - \mathbf{E}_{x^t\sim q}[\phi(x^t)] \right\|_{\mathcal{H}_K}
</math>
</center>

In practice, the squared MMD is estimated from samples by taking the average self-similarity within the samples of $p$ and of $q$, and subtracting twice the average cross-similarity between them: $\widehat{\text{MMD}}^2 = \text{mean}(K_{pp}) + \text{mean}(K_{qq}) - 2\times\text{mean}(K_{pq})$. Because the kernel is characteristic, $p$ and $q$ are the same distribution if and only if $\text{MMD} = 0$. If we then wish to force two distributions to be similar, this becomes a minimization problem over the MMD.

Two important notes:
# The RKHS, and as such MMD, depend on the choice of the kernel;
# Computing the MMD efficiently requires an unbiased estimate of the MMD (as outlined [https://arxiv.org/pdf/1502.02791.pdf]).
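
The sample estimate quoted above can be motivated by a short expansion. As a sketch, assume $\phi$ is the canonical feature map of $K$, so that $\langle \phi(x), \phi(x')\rangle_{\mathcal{H}_K} = K(x, x')$, and let $x^s, x'^s$ be independent draws from $p$ and $x^t, x'^t$ independent draws from $q$. Expanding the squared norm then gives:

<center>
<math display="block">
\left\| \mathbf{E}_{x^s\sim p}[\phi(x^s)] - \mathbf{E}_{x^t\sim q}[\phi(x^t)] \right\|_{\mathcal{H}_K}^2 = \mathbf{E}_{x^s, x'^s \sim p}[K(x^s, x'^s)] + \mathbf{E}_{x^t, x'^t \sim q}[K(x^t, x'^t)] - 2\,\mathbf{E}_{x^s \sim p,\, x^t \sim q}[K(x^s, x^t)]
</math>
</center>

Replacing each expectation with an average over the observed samples yields the two self-similarity terms and the doubled cross-similarity term.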

==== MMD for Feature Adaptation in the RTN ====
The authors wish to minimize the MMD between the fusion features, outlined above, derived from the source and target domains. Concretely, this amounts to forcing the distribution of the abstract representation of the source domain $\mathcal{D}_s$ to be similar to the distribution of the abstract representation of the target domain $\mathcal{D}_t$. Performing this optimization over the fused features of $fc_b$ and $fc_c$ forces each of those layers towards similar distributions across the two domains.

Practically this involves an additional penalty function given by the following:

<center>
<math display="block">
D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t) = \sum_{i,j=1}^{n_s} \frac{k(z_i^s, z_j^s)}{n_s^2} + \sum_{i,j=1}^{n_t} \frac{k(z_i^t, z_j^t)}{n_t^2} - 2\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \frac{k(z_i^s, z_j^t)}{n_sn_t}
</math>
</center>

Where the characteristic kernel $k(z, z')$ is the Gaussian kernel, defined on the vectorization of tensors, with bandwidth parameter $b$. That is: $k(z, z') = \exp(-||vec(z) - vec(z')||^2/b)$.
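
A minimal NumPy sketch of this penalty is given below. It follows the estimator form stated in the Maximum Mean Discrepancy section, in which the cross-similarity term is subtracted twice. The function names, the default bandwidth, and the assumption that the fusion features have already been vectorized into rows are illustrative choices, not details from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Pairwise k(z, z') = exp(-||z - z'||^2 / bandwidth).

    a: (n, d) and b: (m, d) arrays of vectorized features.
    Returns an (n, m) kernel matrix.
    """
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Guard against tiny negative values caused by round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / bandwidth)

def mmd_penalty(z_s, z_t, bandwidth=1.0):
    """Squared-MMD estimate between source and target fusion features."""
    k_ss = gaussian_kernel(z_s, z_s, bandwidth)  # source self-similarity
    k_tt = gaussian_kernel(z_t, z_t, bandwidth)  # target self-similarity
    k_st = gaussian_kernel(z_s, z_t, bandwidth)  # cross-similarity
    # Each .mean() divides by n_s^2, n_t^2, or n_s * n_t respectively.
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()
```

The penalty is zero when the two samples coincide and grows as the samples move apart, which is the behaviour the minimization relies on.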

=== Classifier Adaptation ===
Traditional unsupervised domain adaptation makes a '''shared-classifier assumption'''. In essence, if $f_s(x)$ represents the classifier on the source domain, and $f_t(x)$ represents the classifier on the target domain, then this assumption simply states that $f_s = f_t$. While this may seem to be a reasonable assumption at first glance, it is problematic largely because it is incredibly difficult to check. If it could be readily confirmed that the source and target classifiers could be shared, then the problem of domain adaptation would be largely trivialized. Instead, the authors here relax this assumption slightly. They postulate that instead of being equivalent, the source and target classifiers differ by some perturbation function $\Delta f$. The general idea is to assume that $f_S(x) = f_T(x) + \Delta f(x)$, where $f_S$ and $f_T$ correspond to the source and target classifiers, pre-activation, and $\Delta f(x)$ is some residual function.

The authors then suggest using residual blocks, as popularized by the ResNet framework [https://arxiv.org/pdf/1512.03385.pdf], to learn this residual function.

[[File:Residual-Block-vs-DNN.png|thumb|left|A comparison of a standard Deep Neural Network block, which is designed to fit a function H(x), with a residual block, which fits H(x) as the sum of the input, x, and a learned residual function, F(x).]]
==== Residual Networks Framework ====
A (Deep) Residual Network, as proposed initially in ResNet, employs residual blocks to assist in the learning process; these blocks were a key component in being able to train extraordinarily deep networks. The Residual Network is composed largely in the same manner as a standard neural network, with one key difference: the inclusion of residual blocks, which are sets of layers that aim to estimate a residual function in place of estimating the function itself.

That is, if we wish to use a DNN to estimate some function $h(x)$, a residual block will decompose this to $h(x) = F(x) + x$. The layers are then used to learn $F(x)$, and after the layers which aim to learn this residual function, the input $x$ is recombined through element-wise addition to form $h(x) = F(x) + x$. This was initially proposed as a way to allow deeper networks to be trained effectively, but has since been used in novel contexts.

==== Residual Blocks in the RTN ====
[[File:RTN-Residual-Block.png|thumb|right|The Structure of the Residual Block in the RTN framework. The block relies on two additional dense layers following the target classifier in an attempt to learn the residual difference between the source and target classifiers.]] The authors leverage residual blocks for the purpose of classifier adaptation. Operating under the assumption that the source and target classifiers differ by an arbitrary perturbation function, $\Delta f(x)$, the authors add an additional set of densely connected layers through which the source data will flow. In particular, the authors take the $fc_c$ layer above as the desired target classifier. For the source data, an additional set of layers ($fc-1$ and $fc-2$) is added following $fc_c$, connected as a residual block. The output of the classifier layer is then added back to the output of the residual block in order to form the source classifier.

It is necessary to note that in this case $fc_c$ passes its non-activated (i.e. pre-softmax) output to the element-wise addition, the result of which is passed through the activation layer, yielding the source prediction. In the provided diagram, $f_S(x)$ represents the non-activated output from the additive layer in the residual block; $f_T(x)$ represents the non-activated output from the target classifier; and $fc-1$/$fc-2$ are used to learn the perturbation function $\Delta f(x)$.
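
The forward pass through this block can be sketched as follows. The weight names, the omission of bias terms, and the ReLU placed between $fc-1$ and $fc-2$ are simplifying assumptions made for illustration; they are not specified in the text above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def rtn_classifier_outputs(features, w_c, w_1, w_2):
    """Source and target predictions from shared bottleneck features.

    features: (n, d_b) bottleneck activations.
    w_c: (d_b, c) weights of fc_c; w_1: (c, h) weights of fc-1;
    w_2: (h, c) weights of fc-2 (all hypothetical, biases omitted).
    """
    f_t = features @ w_c             # target classifier, pre-softmax
    delta_f = relu(f_t @ w_1) @ w_2  # residual function learned by fc-1/fc-2
    f_s = f_t + delta_f              # source classifier, pre-softmax
    return softmax(f_s), softmax(f_t)
```

When the residual layers output zero, the source and target predictions coincide, which recovers the shared-classifier assumption as a special case.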

==== Entropy Minimization ====
In addition to the residual blocks, the authors make use of the '''entropy minimization principle''' [http://www.iro.umontreal.ca/~lisa/pointeurs/semi-supervised-entropy-nips2004.pdf] to further refine the classifier adaptation. In particular, by minimizing the entropy of the target classifier (or more correctly, the entropy of the class conditional distribution $f_j^t(x_i^t) = p(y_i^t = j \mid x_i^t; f_t)$), low-density separation between the classes is encouraged. '''Low-Density Separation''' is a concept used predominantly in semi-supervised learning, which in essence tries to draw class decision boundaries in regions where there are few data points (labeled or unlabeled). The above paper leverages an entropy regularization scheme to achieve the low-density separation goal; this is adapted here to the case of unsupervised domain adaptation.

In practice this amounts to adding a further penalty based on the entropy of the class conditional distribution. In particular, if $H(\cdot)$ is defined to be the entropy function, such that $H(f_t(x_i^t)) = - \sum_{j=1}^c f_j^t(x_i^t)\log f_j^t(x_i^t)$, where $c$ is the number of classes and $f_j^t(x_i^t)$ represents the probability of predicting class $j$ for point $x_i^t$, then over the target domain $\mathcal{D}_t$ we define the entropy penalty to be:

<center>
<math display="block">
\frac{1}{n_t} \sum_{i=1}^{n_t} H(f_t(x_i^t))
</math>
</center>
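
A minimal sketch of this penalty, assuming the target predictions are already available as rows of class probabilities, is shown below. The function name and the small constant `eps` (added only to avoid taking the logarithm of zero) are illustrative assumptions.

```python
import numpy as np

def target_entropy_penalty(target_probs, eps=1e-12):
    """Mean entropy (1/n_t) * sum_i H(f_t(x_i^t)).

    target_probs: (n_t, c) array whose rows are the predicted class
    probabilities f_j^t(x_i^t) and sum to one.
    """
    # Entropy of each target example's predicted distribution.
    per_example = -np.sum(target_probs * np.log(target_probs + eps), axis=1)
    return per_example.mean()
```

The penalty is largest for uniform predictions (where it equals $\log c$) and approaches zero for confident, near one-hot predictions, so minimizing it pushes target predictions away from the decision boundaries.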

The authors hypothesize that the combination of residual learning and the entropy penalty will enable effective classifier adaptation.

=== Residual Transfer Network ===
The combination of the MMD loss introduced in feature adaptation, the residual block introduced in classifier adaptation, and the application of the entropy minimization principle culminates in the Residual Transfer Network proposed by the authors. The model is optimized according to the following loss function, which combines the standard cross-entropy, MMD penalty, and entropy penalty:

<center>
<math display="block">
\min_{f_s = f_t + \Delta f} \underbrace{\left(\frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)\right)}_{\text{Typical Cross-Entropy}} + \underbrace{\frac{\gamma}{n_t}\left(\sum_{i=1}^{n_t} H(f_t(x_i^t)) \right)}_{\text{Target Entropy Minimization}} + \underbrace{\lambda\left(D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t)\right)}_{\text{MMD Penalty}}
</math>
</center>

Where we take $\gamma$ and $\lambda$ to be the tradeoff parameters for the entropy penalty and the MMD penalty, respectively.

The full network, which is trained subject to the above optimization problem, thus takes on the following structure.

[[File:rtn-full-paper-structure.png||center|alt=The Structure of the RTN]]

== Experiments ==

=== Set-up ===
The performance of RTN was evaluated on two key data sets in the area of unsupervised domain adaptation: Office-31 (discussed in the introduction) and Office-Caltech (maintained by the same project group). Office-31 comprises images of 31 different objects from 3 sources: Amazon ('''A'''), Webcam ('''W'''), and DSLR ('''D'''). Office-Caltech is derived by considering the 10 classes common to both the Office-31 and the Caltech ('''C''') data sets, thus providing further adaptation possibilities. This provides 6 transfer tasks on the 31 classes of Office-31 ($\{(A,W), (A,D), (W,A), (W,D), (D,A), (D,W)\}$) and 12 transfer tasks on the 10 classes of Office-Caltech ($\{(A,W), (A,D), (A,C), (W,A), (W,D), (W,C), (D,A), (D,W), (D,C), (C,A), (C,W), (C,D)\}$).
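
The task counts follow from taking every ordered (source, target) pair of distinct domains. As a small illustration, with the single-letter domain codes used above and a hypothetical helper name:

```python
from itertools import permutations

office31_domains = ["A", "W", "D"]
office_caltech_domains = ["A", "W", "D", "C"]

def transfer_tasks(domains):
    # Every ordered (source, target) pair of distinct domains.
    return list(permutations(domains, 2))

print(len(transfer_tasks(office31_domains)))        # 3 * 2 ordered pairs
print(len(transfer_tasks(office_caltech_domains)))  # 4 * 3 ordered pairs
```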

The authors then compare the results on the 18 different adaptation tasks against 6 other models. In order to determine the efficacy of the various contributions outlined in the paper, they perform an ablation study, evaluating variants of the RTN. Specifically, they consider the RTN with only the MMD module ('''RTN (mmd)'''), the RTN with the MMD module and the entropy minimization ('''RTN (mmd+ent)'''), and the complete RTN ('''RTN (mmd+ent+res)'''). The experiments leverage all the labeled source data and compute accuracy across all the unlabeled target data. The parameters of the model (i.e. $\gamma$ and $\lambda$) are fixed based on a single validation point on the transfer task $\mathbf{A}\to\mathbf{W}$. These parameters are then maintained across all transfer tasks.

As for specification details, the authors use mini-batch SGD with momentum $0.9$, and with the learning rate adjusted based on $\eta_p = \frac{\eta_0}{(1 + \alpha p)^\beta}$, where $p$ indicates the portion of training completed (linear from $0$ to $1$), $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$, a schedule which was optimized for low error on the source. The MMD and entropy parameters, set as above, were maintained at $\lambda = 0.3$ and $\gamma = 0.3$.
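
The learning-rate schedule can be written directly from the formula above. The function name is an illustrative assumption; the default values are those quoted in the text.

```python
def learning_rate(p, eta_0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_p = eta_0 / (1 + alpha * p) ** beta.

    p is the fraction of training completed, running from 0 to 1.
    """
    return eta_0 / (1.0 + alpha * p) ** beta

# The rate starts at eta_0 and decays monotonically as training progresses.
for p in (0.0, 0.25, 0.5, 1.0):
    print(p, learning_rate(p))
```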

=== Results ===
[[File:table-1-results.PNG|thumb|right|Results from the Office-31 Experiment]][[File:table-2-results.PNG|thumb|right|Results from the Office-Caltech Experiment]]
In aggregate, the network outperformed all comparison methods, across all transfer tasks. Broadly speaking, the network saw the largest increases in accuracy on the hard transfer tasks (for instance $\mathbf{A} \to \mathbf{C}$), where the discrepancy between the source and target domains is large. The authors take this to mean that the proposed model learns "more adaptive classifiers and transferable features for safer domain adaptation." They further indicate that standard deep learning techniques (i.e. just AlexNet) perform similarly to standard shallow techniques (TCA and GFK). Deep-transfer methods which focus on feature adaptation perform significantly better than the standard methods. The proposed RTN, which adds further considerations for classifier adaptation, performs even better.

In addition, the ablation study found a number of interesting results:
# The RTN (mmd) outperforms DAN, which is founded on a similar method, but contains multiple MMD penalties (one for each layer instead of on a bottleneck), and is as such less computationally efficient;
# The addition of the entropy penalty [RTN (mmd+ent)] provides significant marginal benefit over the previous RTN (mmd);
# The full RTN [RTN (mmd+ent+res)] performs the best of all variants, but diminishing returns are seen relative to the addition of the entropy penalty.

Overall, the authors claim that the RTN (mmd+ent+res) achieves state-of-the-art results for unsupervised domain adaptation.

=== Discussion ===
[[File:t-sne-embeddings.png|thumb|left|t-SNE Embeddings Comparing the Performance of DAN and RTN]]
[[File:mean-sd-layer-outputs.png|thumb|right|The Mean and Standard Deviations of the outputs from the Source Classifier, Target Classifier, and Residual Functions. As expected, the residual function provides a small, but non-zero, contribution.]]
[[File:gamma-tradeoff.png|thumb|left|The accuracy of tests by varying the parameter $\gamma$. We first see an increase in accuracy up to an ideal point, before having the accuracy fall again.]]
[[File:classifier-shift.png|thumb|right|The corresponding weights of the classifier layers, if trained on the labeled source and target data, exhibiting the differences which exist between the two classifiers in an ideal state. ]]

==== Visualizing Predictions (Versus DAN) ====
DAN uses a similar method for feature adaptation but neglects any attempt at classifier adaptation (i.e. it makes the shared-classifier assumption). In order to demonstrate that this leads to worse performance, the authors provide images showing the t-SNE embeddings produced by DAN and RTN on the transfer task $\mathbf{A} \to \mathbf{W}$. For DAN, the images show that the target categories are not well discriminated by the source classifier, suggesting a violation of the shared-classifier assumption. Conversely, the target classifier for the RTN exhibits better discrimination.

==== Layer Responses and Classifier Shift ====
The authors further consider the mean and standard deviation of the outputs of $f_S(x)$, $f_T(x)$ and $\Delta f(x)$ to consider the relative contributions of the different components. As expected, $\Delta f(x)$ provides a small (though non-zero) contribution to the learned source classifier. This provides some merit to the idea of residual learning on the classifiers.

In addition, the authors train classifiers on the source and target data, with labels present, and compare the realized weights. This is used to test how different the ideal weights are on separate classifiers. The results suggest that there is, in fact, a discrepancy between the classifiers, further motivating the use of tactics to avoid the shared-classifier assumption.

==== Parameter Sensitivity ====
Lastly, the authors test the sensitivity of these results against the parameter $\gamma$. They run this test on $\mathbf{A}\to\mathbf{W}$ in addition to $\mathbf{C}\to\mathbf{W}$, varying the parameter from $0.01$ to $1.0$. They find that, on both tasks, the increase of the parameter initially improves accuracy, before seeing a drop-off.

== Conclusion ==
This paper presented a novel approach to unsupervised domain adaptation which relaxed assumptions made by previous models with regard to the shared nature of classifiers. Like previous models this proposed network leverages feature adaptation by matching the distributions of features across the domains. In addition, using a residual network and entropy minimization tactic, the target classifier is allowed to differ from the source classifier. In particular, this approach allows for easy integration into existing networks, and can be implemented with any standard deep learning software.

For follow-up considerations, the authors propose looking for adaptations which may be useful in the semi-supervised domain adaptation problem.

== Critique ==
While the paper presents a clear approach, which empirically attains great results on the desired tasks, I question the benefit of the residual block that is employed. The results of the ablation study seem to suggest that the majority of the benefits can be derived from using the MMD and entropy penalties. The residual block appears to add marginal, perhaps insignificant, contributions to the outcome. At the same time, the use of MMD loss is not novel, and the entropy loss is less well documented and less thoroughly explored. Perhaps a different set of ablations would have indicated that the three parts are, indeed, equally effective (and that the diminishing returns stem from stacking the three methods), but as it is presented, I question the utility of the final structure versus a less complicated, less novel approach.

== References ==
# https://en.wikipedia.org/wiki/Domain_adaptation
# https://people.eecs.berkeley.edu/~jhoffman/domainadapt/
# Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification: A deep learning approach." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.
# Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint arXiv:1412.3474 (2014).
# Long, Mingsheng, et al. "Learning transferable features with deep adaptation networks." International Conference on Machine Learning. 2015.
# Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised domain adaptation by backpropagation." International Conference on Machine Learning. 2015.
# Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." Proceedings of the IEEE International Conference on Computer Vision. 2015.
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
# Yang, Jun, Rong Yan, and Alexander G. Hauptmann. "Cross-domain video concept detection using adaptive svms." Proceedings of the 15th ACM international conference on Multimedia. ACM, 2007.
# Duan, Lixin, et al. "Domain adaptation from multiple sources via auxiliary classifiers." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
# Duan, Lixin, et al. "Visual event recognition in videos by learning from web data." IEEE Transactions on Pattern Analysis and Machine Intelligence 34.9 (2012): 1667-1680.
# http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf
# https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
# Grandvalet, Yves, and Yoshua Bengio. "Semi-supervised learning by entropy minimization." Advances in neural information processing systems. 2005.

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:38:53Z

A2prasad: /* References */

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:38:41Z

A2prasad: /* References */

== Introduction ==
'''Domain Adaptation''' [https://en.wikipedia.org/wiki/Domain_adaptation]is a problem in machine learning which involves taking a model which has been trained on a source domain, and applying this to a different (but related) target domain. '''Unsupervised domain adaptation''' refers to the situation in which the source data is labelled, while the target data is (predominantly) unlabeled. The problem at hand is then finding ways to generalize the learning on the source domain to the target domain. In the age of deep networks this problem has become particularly salient due to the need for vast amounts of labeled training data, in order to reap the benefits of deep learning. Manual generation of labeled data is often prohibitive, and in absence of such data networks are rarely performant. The attempt to circumvent this drought of data typically necessitates the gathering of "off-the-shelf" data sets, which are tangentially related and contain labels, and then building models in these domains. The fundamental issue that unsupervised domain adaptation attempts to address is overcoming the inherent shift in distribution across the domains, without the ability to observe this shift directly.

This paper proposes a method for unsupervised domain adaptation which relies on three key components:
# A kernel-based penalty to ensure that the abstract representations generated by the networks hidden layers are similar between the source and the target data;
# An entropy based penalty on the target classifier, which exploits the entropy minimization principle; and
# A residual network structure is appended, which allows the source and target classifiers to differ by a (learned) residual function, thus relaxing the shared classifier assumption which is traditionally made.

This method outperforms state-of-the-art techniques on common benchmark datasets, and is flexible enough to be applied in most feed-forward neural networks.

[[File:Source-and-Target-Domain-Office-31-Backpack.png|thumb|right|The Office-31 Dataset Images for Backpack. Shows the variation in the source and target domains to motivate why these methods are important.]]
=== Working Example (Office-31) ===
In order to assist in the understanding of the methods, it is helpful to have a tangible sense of the problem front of mind. The Domain Adaptation Project [https://people.eecs.berkeley.edu/~jhoffman/domainadapt/] provides data sets which are tailored to the problem of unsupervised domain adaptation. One of these data sets (which is later used in the experiments of this paper) has images which are labeled based on the Amazon product page for the various items. There are then corresponding pictures taken either by webcams or digital SLR cameras. The goal of unsupervised domain adaptation on this data set would be to take any of the three image sources as the source domain, and transfer a classifier to the other domain; see the example images to understand the differences.

One can imagine that, while it is likely easy to scrape labeled images from Amazon, it is likely far more difficult to collect labeled images from webcam or DSLR pictures directly. The ultimate goal of this method would be to train a model to recognize a picture of a backpack taken with a webcam, based on images of backpacks scraped from Amazon (or similar tasks).

== Related Work ==
Broadly speaking, the problem of domain adaptation mitigates manual labeling of data in areas such as machine learning, computer vision, and natural language processing. The general goal of domain adaptation is to reduce the discrepancy in probability distributions between the source and target domains.

Research into the use of Deep Neural Networks for the purpose of domain adaptation has suggested that, while networks learn abstract feature representations which can reduce the discrepancy across domains, it is not possible to wholly remove it [http://www.icml-2011.org/papers/342_icmlpaper.pdf], [https://arxiv.org/pdf/1412.3474.pdf]. Further work has been done to design networks which adapt traditional deep nets (typically CNNs) to specifically address the problems posed by domain adaptation, these methods all only address the issue of feature adaptation [https://arxiv.org/pdf/1502.02791.pdf], [https://arxiv.org/pdf/1409.7495.pdf], [https://people.eecs.berkeley.edu/~jhoffman/papers/Tzeng_ICCV2015.pdf]. That is, they all assume that the target and source classifiers are shared between domains.

The authors drew particular motivation from He et al. [https://arxiv.org/abs/1512.03385] with the proposed structure of residual networks. Combining the insights from the ResNet architecture, in addition to previous work that had leveraged classifier adaptation (in the context where some target data is labeled) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.8224&rep=rep1&type=pdf], [http://www.machinelearning.org/archive/icml2009/papers/445.pdf], [http://ieeexplore.ieee.org/document/5539870/] the authors develop their proposed network.

== Residual Transfer Networks ==
Generally, in an unsupervised domain adaptation problem, we are dealing with a set $\mathcal{D}_s$ (called the source domain) which is defined by $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$. That is the set of all labeled input-output pairs in our source data set. We denote the number of source elements by $n_s$. There is a corresponding set $\mathcal{D}_t = \{(x_i^t)\}_{i=1}^{n_t}$ (the target domain), consisting of unlabeled input values. There are $n_t$ such values.
[[File:RTN-Structure.png|thumb|left|upright|The overarching structure of the RTN. Consists of an existing network, to which a bottleneck, MMD block, and residual block is appended.]]
We can think of $\mathcal{D}_s$ as being sampled from some underlying distribution $p$, and $\mathcal{D}_t$ as being sampled from $q$. Generally we have that $p \neq q$, partially motivating the need for domain adaptation methods.

We can consider the classifiers $f_s(\underline{x})$ and $f_t(\underline{x})$, for the source domain and target domain respectively. It is possible to learn $f_s$ based on the sample $\mathcal{D}_s$. Under the '''shared classifier assumption''' it would be the case that $f_s(\underline{x}) = f_t(\underline{x})$, and thus learning the source classifier is enough. This method relaxes this assumption, assuming that in general $f_s \neq f_t$, and attempting to learn both.

The example network extends deep convolutional networks (in this case AlexNet [http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf]) to '''Residual Transfer Networks''', the mechanics of which are outlined below. Recall that, if $L(\cdot, \cdot)$ is taken to be the cross-entropy loss function, then the empirical error of a CNN on the source domain $\mathcal{D}_s$ is given by:

<center>
<math display="block">
\min_{f_s} \frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)
</math>
</center>

In a standard implementation, the CNN optimizes over the above loss. This will be the starting point for the RTN.

=== Structural Overview ===
The model proposed in this paper extends existing CNN's and alters the loss function that is optimized over. While each of these components is discussed in depth below, the overarching architecture involves four components:

# An existing deep model. While this can be any model, in theory, the authors leverage AlexNet in practice.
# A bottleneck layer, used to reduce the dimensionality of the learned abstract feature space, directly after the existing network.
# An MMD block, with the expressed intention of feature adaptation.
# A residual block, with the expressed intention of classifier adaptation.

This structure is then optimized against a loss function which combines the standard cross-entropy penalty with MMD and target entropy penalties, yielding the proposed Residual Transfer Network (RTN) structure.

=== Feature Adaptation ===
Feature adaptation refers to the process in which the features which are learned to represent the source domain are made applicable to the target domain. Broadly speaking a CNN works to generate abstract feature representations of the distribution that the inputs are sampled from. It has been found that using these deep features can reduce, but not remove, cross-domain distribution discrepancy, hence the need for feature adaptation. It is important to note that CNN's transfer from general to specific features as the network gets deeper. In this light, the discrepancy between the feature representation of the source and the target will grow through a deeper convolutional net. As such a technique for forcing these distributions to be similar is needed.

In particular the authors of this paper impose a bottleneck layer (call it $fc_b$) which is included after the final convolutional layer of AlexNet. This dense layer is connected to an additional dense layer $fc_c$, (which will serve as the target classification layer). They then compute the tensor product between the activations of the layers, performing "lossless multi-layer feature fusion". That is for the source domain they define $z_i^s \overset{\underset{\mathrm{def}}{}}{=} x_i^{s,fc_b}\otimes x_i^{s,fc_c}$ and for the target domain, $z_i^t \overset{\underset{\mathrm{def}}{}}{=} x_i^{t,fc_b}\otimes x_i^{t,fc_c}$. The authors then employ feature adaptation by means of Maximum Mean Discrepancy, between the source and target domains, on these fusion features.

[[File:RTN-MMD-Block.png|right|thumb|The Maximum Mean Discrepancy Block (MMD) included in the RTN. The outputs of $fc_b$ and $fc_c$ are fused through a tensor product, and then passed through the MMD penalty, ensuring distributional similarity.]]

==== Maximum Mean Discrepancy ====
The Maximum Mean Discrepancy (MMD) is a kernel method which involves mapping to a Reproducing Kernel Hilbert Space (RKHS) [https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space]. Denote the RKHS $\mathcal{H}_K$ with a characteristic kernel $K$. We then define the '''mean embedding''' of a distribution $p$ in $\mathcal{H}_K$ to be the unique element $\mu_K(p)$ such that $\mathbf{E}_{x\sim p}f(x) = \langle f, \mu_K(p)\rangle_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. Now, if we take $\phi: \mathcal{X} \to \mathcal{H}_K$ to be the feature map associated with $K$, then we can define the MMD between two distributions $p$ and $q$ as follows:

<center>
<math display="block">
d_K(p, q) \overset{\underset{\mathrm{def}}{}}{=} ||\mathbf{E}_{x^s\sim p}(\phi(x^s)) - \mathbf{E}_{x^t\sim q}(\phi(x^t))||_{\mathcal{H}_K}
</math>
</center>

Effectively, the MMD will compute the self-similarity of $p$ and $q$, and subtract twice the cross-similarity between the distributions: $\widehat{\text{MMD}}^2 = \text{mean}(K_{pp}) + \text{mean}(K_{qq}) - 2\times\text{mean}(K_{pq})$. Because the kernel is characteristic, $p$ and $q$ are equivalent distributions if and only if the $\text{MMD} = 0$. If we then wish to force two distributions to be similar, this becomes a minimization problem over the MMD.

Two important notes:
# The RKHS, and as such MMD, depend on the choice of the kernel;
# Computing the MMD efficiently requires an unbiased estimate of the MMD (as outlined [https://arxiv.org/pdf/1502.02791.pdf]).

==== MMD for Feature Adaptation in the RTN ====
The authors wish to minimize the MMD between the fusion features outlined above, derived from the source and target domains. Concretely, this amounts to forcing the distribution of the abstract representation of the source domain $\mathcal{D}_s$ to be similar to the distribution of the abstract representation of the target domain $\mathcal{D}_t$. Performing this optimization over the fused features of $fc_b$ and $fc_c$ forces each of those layers towards similar distributions across the two domains.

Practically this involves an additional penalty function given by the following:

<center>
<math display="block">
D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t) = \sum_{i,j=1}^{n_s} \frac{k(z_i^s, z_j^s)}{n_s^2} + \sum_{i,j=1}^{n_t} \frac{k(z_i^t, z_j^t)}{n_t^2} - 2\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \frac{k(z_i^s, z_j^t)}{n_sn_t}
</math>
</center>

Where the characteristic kernel $k(z, z')$ is the Gaussian kernel, defined on the vectorization of tensors, with bandwidth parameter $b$. That is: $k(z, z') = \exp(-||vec(z) - vec(z')||^2/b)$.
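
The penalty can be sketched in a few lines of plain Python. This follows the estimate $\widehat{\text{MMD}}^2 = \text{mean}(K_{pp}) + \text{mean}(K_{qq}) - 2\times\text{mean}(K_{pq})$ described earlier, with the Gaussian kernel applied to already-vectorized fusion features. The function names, the default bandwidth, and the toy data are assumptions made for illustration. This is the simple quadratic-time estimate, not the authors' implementation.

```python
import math

def gaussian_kernel(z1, z2, b=1.0):
    """k(z, z') = exp(-||z - z'||^2 / b) on vectorized features."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(z1, z2))
    return math.exp(-sq_dist / b)

def mean_kernel(A, B, b=1.0):
    """Average kernel value over all pairs (one element from A, one from B)."""
    total = sum(gaussian_kernel(x, y, b) for x in A for y in B)
    return total / (len(A) * len(B))

def mmd_penalty(Zs, Zt, b=1.0):
    """Self-similarity of each sample minus twice the cross-similarity."""
    return (mean_kernel(Zs, Zs, b)
            + mean_kernel(Zt, Zt, b)
            - 2.0 * mean_kernel(Zs, Zt, b))

# Toy fused features (assumed values): two well-separated clusters
Zs = [[0.0, 0.0], [1.0, 0.0]]
Zt = [[3.0, 3.0], [4.0, 3.0]]
print(mmd_penalty(Zs, Zs))       # identical samples give zero
print(mmd_penalty(Zs, Zt) > 0)   # separated samples give a positive penalty
```

In a network, this quantity would be computed per mini-batch and differentiated with respect to the layer activations. The sketch only shows the value being penalized.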

=== Classifier Adaptation ===
In traditional unsupervised domain adaptation a '''shared-classifier assumption''' is made. In essence, if $f_S(x)$ represents the classifier on the source domain, and $f_T(x)$ represents the classifier on the target domain, then this assumption simply states that $f_S = f_T$. While this may seem to be a reasonable assumption at first glance, it is problematic largely because it is incredibly difficult to check. If it could be readily confirmed that the source and target classifiers could be shared, then the problem of domain adaptation would be largely trivialized. Instead, the authors here relax this assumption slightly. They postulate that, instead of being equivalent, the source and target classifiers differ by some perturbation function $\Delta f$. Concretely, they assume that $f_S(x) = f_T(x) + \Delta f(x)$, where $f_S$ and $f_T$ correspond to the source and target classifiers, pre-activation, and $\Delta f(x)$ is some residual function.

The authors then suggest using residual blocks, as popularized by the ResNet framework [https://arxiv.org/pdf/1512.03385.pdf], to learn this residual function.

[[File:Residual-Block-vs-DNN.png|thumb|left|A comparison of a standard Deep Neural Network block which is designed to fit a function H(x) compared to a residual block which fits H(x) as the sum of the input, x, and a learned residual function, F(X).]]
==== Residual Networks Framework ====
A (Deep) Residual Network, as proposed initially in ResNet, employs residual blocks to assist in the learning process; these blocks were a key component in being able to train extraordinarily deep networks. A Residual Network is constructed largely in the same manner as a standard neural network, with one key difference, namely the inclusion of residual blocks: sets of layers which aim to estimate a residual function in place of estimating the function itself.

That is, if we wish to use a DNN to estimate some function $h(x)$, a residual block will decompose this to $h(x) = F(x) + x$. The layers are then used to learn $F(x)$, and after the layers which aim to learn this residual function, the input $x$ is recombined through element-wise addition, to form $h(x) = F(x) + x$. This was initially proposed as a manner to allow for deeper networks to be effectively trained, but has since been used in novel contexts.
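
The decomposition $h(x) = F(x) + x$ can be illustrated with a short sketch. Here <code>F</code> stands in for the learned layers. The toy residual function below is an assumption chosen only to make the example concrete.

```python
def residual_block(x, F):
    """Compute h(x) = F(x) + x, recombining the input by element-wise addition."""
    fx = F(x)
    return [f_i + x_i for f_i, x_i in zip(fx, x)]

# A stand-in for the learned residual function (assumed for illustration)
def small_residual(x):
    return [0.1 * v for v in x]

h = residual_block([1.0, 2.0], small_residual)

# If the layers learn F(x) = 0, the block reduces to the identity map
identity = residual_block([1.0, 2.0], lambda x: [0.0 for _ in x])
```

The identity case shows why the block is easy to optimize: the layers only need to learn the deviation from the input, not the full mapping.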

==== Residual Blocks in the RTN ====
[[File:RTN-Residual-Block.png|thumb|right|The Structure of the Residual Block in the RTN framework. The block relies on two additional dense layers following the target classifier in an attempt to learn the residual difference between the source and target classifiers.]] The authors leverage residual blocks for the purpose of classifier adaptation. Operating under the assumption that the source and target classifiers differ by an arbitrary perturbation function, $\Delta f(x)$, the authors add an additional set of densely connected layers which the source data will flow through. In particular, the authors take the $fc_c$ layer above as the desired target classifier. For the source data an additional set of layers ($fc-1$ and $fc-2$) is added following $fc_c$, connected as a residual block. The output of the classifier layer is then added back to the output of the residual block in order to form the source classifier.

It is necessary to note that in this case $fc_c$ passes its non-activated (i.e. pre-softmax) output to the element-wise addition, the result of which is passed through the activation layer, yielding the source prediction. In the provided diagram, $f_S(x)$ represents the non-activated output from the additive layer in the residual block; $f_T(x)$ represents the non-activated output from the target classifier; and $fc-1$/$fc-2$ are used to learn the perturbation function $\Delta f(x)$.
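
A rough sketch of the two classifier heads follows, under stated assumptions: the weights are plain nested lists, a ReLU sits between $fc-1$ and $fc-2$, and the helper names (<code>dense</code>, <code>relu</code>, <code>softmax</code>, <code>rtn_heads</code>) are hypothetical. It is meant to show where the pre-softmax outputs are added, not to reproduce the authors' code.

```python
import math

def softmax(v):
    m = max(v)                                  # subtract max for stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, W, bias):
    """Fully connected layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + b0
            for row, b0 in zip(W, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def rtn_heads(f_t, W1, b1, W2, b2):
    """f_t is the pre-softmax output of fc_c (the target classifier).

    Returns (source_probs, target_probs), where the source logits are
    f_S = f_T + delta_f and delta_f is computed by fc-1 and fc-2.
    """
    delta_f = dense(relu(dense(f_t, W1, b1)), W2, b2)
    f_s = [t + d for t, d in zip(f_t, delta_f)]
    return softmax(f_s), softmax(f_t)

# Toy logits and weights (assumed); zero weights in fc-2 give delta_f = 0
f_t = [2.0, 0.5, -1.0]
W1 = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
b1 = [0.0, 0.0]
W2 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b2 = [0.0, 0.0, 0.0]
src, tgt = rtn_heads(f_t, W1, b1, W2, b2)
```

When the residual layers output zero, the source and target predictions coincide, which recovers the shared-classifier assumption as a special case.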

==== Entropy Minimization ====
In addition to the residual blocks, the authors make use of the '''entropy minimization principle''' [http://www.iro.umontreal.ca/~lisa/pointeurs/semi-supervised-entropy-nips2004.pdf] to further refine the classifier adaptation. In particular, by minimizing the entropy of the target classifier (or more correctly, the entropy of the class conditional distribution $f_j^t(x_i^t) = p(y_i^t = j \mid x_i^t; f_t)$), low-density separation between the classes is encouraged. '''Low-Density Separation''' is a concept used predominantly in semi-supervised learning, which in essence tries to draw class decision boundaries in regions where there are few data points (labeled or unlabeled). The above paper leverages an entropy regularization scheme to achieve the low-density separation goal; this is adapted here to the case of unsupervised domain adaptation.

In practice this amounts to adding a further penalty based on the entropy of the class conditional distribution. In particular, if $H(\cdot)$ is defined to be the entropy function, such that $H(f_t(x_i^t)) = - \sum_{j=1}^c f_j^t(x_i^t)\log f_j^t(x_i^t)$, where $c$ is the number of classes and $f_j^t(x_i^t)$ represents the probability of predicting class $j$ for point $x_i^t$, then over the target domain $\mathcal{D}_t$ we define the entropy penalty to be:

<center>
<math display="block">
\frac{1}{n_t} \sum_{i=1}^{n_t} H(f_t(x_i^t))
</math>
</center>
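
As a small illustration, the entropy penalty can be computed from the predicted class probabilities on a batch of target points. The function names and the small <code>eps</code> constant (added to avoid taking the logarithm of zero) are assumptions of this sketch.

```python
import math

def entropy(p, eps=1e-12):
    """H(p) = -sum_j p_j log p_j for one predicted class distribution."""
    return -sum(p_j * math.log(p_j + eps) for p_j in p)

def entropy_penalty(target_probs):
    """Average entropy of the target predictions over the batch."""
    return sum(entropy(p) for p in target_probs) / len(target_probs)

# Confident predictions are penalized less than uncertain ones
confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]]
uncertain = [[0.34, 0.33, 0.33], [0.33, 0.34, 0.33]]
print(entropy_penalty(confident) < entropy_penalty(uncertain))  # True
```

Minimizing this term pushes target predictions away from the uniform distribution, which is how decision boundaries are moved out of dense regions of unlabeled data.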

The authors hypothesize that the combination of residual learning and the entropy penalty will enable effective classifier adaptation.

=== Residual Transfer Network ===
The combination of the MMD loss introduced in feature adaptation, the residual block introduced in classifier adaptation, and the application of the entropy minimization principle culminates in the Residual Transfer Network proposed by the authors. The model is optimized according to the following loss function, which combines the standard cross-entropy, the MMD penalty, and the entropy penalty:

<center>
<math display="block">
\min_{f_s = f_t + \Delta f} \underbrace{\left(\frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)\right)}_{\text{Typical Cross-Entropy}} + \underbrace{\frac{\gamma}{n_t}\left(\sum_{i=1}^{n_t} H(f_t(x_i^t)) \right)}_{\text{Target Entropy Minimization}} + \underbrace{\lambda\left(D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t)\right)}_{\text{MMD Penalty}}
</math>
</center>

Where we take $\gamma$ and $\lambda$ to be the tradeoff parameters for the entropy penalty and the MMD penalty, respectively.
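
To make the weighting concrete, the following sketch combines the three terms for one batch. It assumes the softmax outputs of the source and target classifiers and a precomputed MMD value are already available. The function name <code>rtn_objective</code> and its argument names are hypothetical.

```python
import math

def rtn_objective(source_probs, source_labels, target_probs, mmd,
                  gamma, lam):
    """Cross-entropy on the source + gamma * target entropy + lam * MMD."""
    n_s = len(source_labels)
    n_t = len(target_probs)
    # Standard cross-entropy of the source classifier on labeled source data
    cross_entropy = -sum(math.log(p[y])
                         for p, y in zip(source_probs, source_labels)) / n_s
    # Average entropy of the target classifier on unlabeled target data
    target_entropy = -sum(p_j * math.log(p_j)
                          for p in target_probs for p_j in p if p_j > 0) / n_t
    return cross_entropy + gamma * target_entropy + lam * mmd

loss = rtn_objective(
    source_probs=[[0.5, 0.5]], source_labels=[0],
    target_probs=[[0.5, 0.5]], mmd=0.0,
    gamma=1.0, lam=0.0,
)
```

Setting $\gamma = 0$ and $\lambda = 0$ recovers the plain source cross-entropy, so the two parameters control how strongly the unlabeled target data influences training.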

The full network, which is trained subject to the above optimization problem, thus takes on the following structure.

[[File:rtn-full-paper-structure.png|center|alt=The Structure of the RTN]]

== Experiments ==

=== Set-up ===
The performance of RTN was evaluated on two key data sets in the area of unsupervised domain adaptation: Office-31 (discussed in the introduction) and Office-Caltech (maintained by the same project group). Office-31 comprises images of 31 different objects from 3 sources: Amazon ('''A'''), Webcam ('''W'''), and DSLR ('''D'''). Office-Caltech is derived by considering the 10 classes common to both the Office-31 and the Caltech ('''C''') data sets, thus providing further adaptation possibilities. This provides 6 transfer tasks on the 31 classes of Office-31 ($\{(A,W), (A,D), (W,A), (W,D), (D,A), (D,W)\}$) and 12 transfer tasks on the 10 classes of Office-Caltech ($\{(A,W), (A,D), (A,C), (W,A), (W,D), (W,C), (D,A), (D,W), (D,C), (C,A), (C,W), (C,D)\}$).
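
The task counts follow from taking every ordered (source, target) pair of distinct domains, which can be checked with a short sketch (the function name is illustrative):

```python
from itertools import permutations

def transfer_tasks(domains):
    """All ordered (source, target) pairs of distinct domains."""
    return list(permutations(domains, 2))

office31 = transfer_tasks(["A", "W", "D"])
office_caltech = transfer_tasks(["A", "W", "D", "C"])
print(len(office31), len(office_caltech))  # 6 12
```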

The authors then compare the results on the 18 different adaptation tasks against 6 other models. In order to determine the efficacy of the various contributions outlined in the paper they perform an ablation study, evaluating variants of the RTN. Specifically, they consider the RTN with only the MMD module ('''RTN (mmd)'''), the RTN with the MMD module and the entropy minimization ('''RTN (mmd+ent)'''), and the complete RTN ('''RTN (mmd+ent+res)'''). The experiments leverage all the labeled training data and compute accuracy across all unlabeled domain data. The parameters of the model (i.e. $\gamma$, and $\lambda$) are fixed based on a single validation point on the transfer task $\mathbf{A}\to\mathbf{W}$. These parameters are then maintained across all transfer tasks.

As for specification details, the authors use mini-batch SGD, with momentum $0.9$, and with the learning rate adjusted based on $\eta_p = \frac{\eta_0}{(1 + \alpha p)^\beta}$, where $p$ indicates the portion of training completed (linear from $0$ to $1$), $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$, which was optimized for low error on the source. The MMD and entropy parameters, set as above, were maintained at $\lambda = 0.3$ and $\gamma = 0.3$.
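
The learning rate schedule is simple enough to write out directly. The sketch below uses the constants quoted above as defaults; the function name is an assumption.

```python
def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """eta_p = eta0 / (1 + alpha * p)^beta, with p in [0, 1]."""
    return eta0 / (1.0 + alpha * p) ** beta

# The rate starts at eta0 and decays monotonically as training progresses
for p in (0.0, 0.25, 0.5, 1.0):
    print(p, learning_rate(p))
```

With these constants the rate falls to roughly a sixth of its initial value by the end of training.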

=== Results ===
[[File:table-1-results.PNG|thumb|right|Results from the Office-31 Experiment]][[File:table-2-results.PNG|thumb|right|Results from the Office-Caltech Experiment]]
In aggregate, the network outperformed all comparison methods, across all transfer tasks. Broadly speaking, the network saw the largest increases in accuracy on the hard transfer tasks (for instance $\mathbf{A} \to \mathbf{C}$), where the discrepancy between the source and target domains is large. The authors take this to mean that the proposed model learns "more adaptive classifiers and transferable features for safer domain adaptation." They further indicate that standard deep learning techniques (i.e. just AlexNet) perform similarly to standard shallow techniques (TCA and GFK). Deep-transfer methods which focus on feature adaptation perform significantly better than the standard methods. The proposed RTN, which adds in additional considerations for classifier adaptation, performs even better.

In addition, the ablation study found a number of interesting results:
# The RTN (mmd) outperforms DAN, which is founded on a similar method, but contains multiple MMD penalties (one for each layer instead of on a bottleneck), and is as such less computationally efficient;
# The addition of the entropy penalty [RTN (mmd+ent)] provides significant marginal benefit over the previous RTN (mmd);
# The full RTN [RTN (mmd+ent+res)] performs the best of all variants, but diminishing returns are seen over the addition of the entropy penalty.

Overall, the authors claim that the RTN (mmd+ent+res) achieves state-of-the-art performance for unsupervised domain adaptation.

=== Discussion ===
[[File:t-sne-embeddings.png|thumb|left|t-SNE Embeddings Comparing the Performance of DAN and RTN]]
[[File:mean-sd-layer-outputs.png|thumb|right|The Mean and Standard Deviations of the outputs from the Source Classifier, Target Classifier, and Residual Functions. As expected, the residual function provides a small, but non-zero, contribution.]]
[[File:gamma-tradeoff.png|thumb|left|The accuracy of tests by varying the parameter $\gamma$. We first see an increase in accuracy up to an ideal point, before having the accuracy fall again.]]
[[File:classifier-shift.png|thumb|right|The corresponding weights of the classifier layers, if trained on the labeled source and target data, exhibiting the differences which exist between the two classifiers in an ideal state. ]]

==== Visualizing Predictions (Versus DAN) ====
DAN uses a similar method for feature adaptation but neglects any attempt at classifier adaptation (i.e. it makes the shared-classifier assumption). In order to demonstrate that this leads to worse performance, the authors provide images showing the t-SNE embeddings by DAN and RTN on the transfer task $\mathbf{A} \to \mathbf{W}$. The images show that, under DAN, the target categories are not well discriminated by the source classifier, suggesting a violation of the shared-classifier assumption. Conversely, the target classifier for the RTN exhibits better discrimination.

==== Layer Responses and Classifier Shift ====
The authors further consider the mean and standard deviation of the outputs of $f_S(x)$, $f_T(x)$ and $\Delta f(x)$ to consider the relative contributions of the different components. As expected, $\Delta f(x)$ provides a small (though non-zero) contribution to the learned source classifier. This provides some merit to the idea of residual learning on the classifiers.

In addition, the authors train classifiers on the source and target data, with labels present, and compare the realized weights. This is used to test how different the ideal weights are on separate classifiers. The results suggest that there is, in fact, a discrepancy between the classifiers, further motivating the use of tactics to avoid the shared-classifier assumption.

==== Parameter Sensitivity ====
Lastly, the authors test the sensitivity of these results against the parameter $\gamma$. They run this test on $\mathbf{A}\to\mathbf{W}$ in addition to $\mathbf{C}\to\mathbf{W}$, varying the parameter from $0.01$ to $1.0$. They find that, on both tasks, the increase of the parameter initially improves accuracy, before seeing a drop-off.

== Conclusion ==
This paper presented a novel approach to unsupervised domain adaptation which relaxed assumptions made by previous models with regard to the shared nature of classifiers. Like previous models this proposed network leverages feature adaptation by matching the distributions of features across the domains. In addition, using a residual network and entropy minimization tactic, the target classifier is allowed to differ from the source classifier. In particular, this approach allows for easy integration into existing networks, and can be implemented with any standard deep learning software.

For follow-up considerations, the authors propose looking for adaptations which may be useful in the semi-supervised domain adaptation problem.

== Critique ==
While the paper presents a clear approach, which empirically attains great results on the desired tasks, I question the benefit of the residual block that is employed. The results of the ablation study seem to suggest that the majority of the benefits can be derived from using the MMD and entropy penalties. The residual block appears to add marginal, perhaps insignificant, contributions to the outcome. That said, the use of MMD loss is not novel, and the entropy loss is less well documented and less thoroughly explored. Perhaps a different set of ablations would have indicated that the three parts are, indeed, equally effective (and that the diminishing returns stem from stacking the three methods), but as it is presented, I question the utility of the final structure versus a less complicated, less novel approach.

==References==
# https://en.wikipedia.org/wiki/Domain_adaptation
# https://people.eecs.berkeley.edu/~jhoffman/domainadapt/

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:37:39Z

A2prasad: /* References */

== Introduction ==
'''Domain Adaptation''' [https://en.wikipedia.org/wiki/Domain_adaptation] is a problem in machine learning which involves taking a model which has been trained on a source domain, and applying this to a different (but related) target domain. '''Unsupervised domain adaptation''' refers to the situation in which the source data is labeled, while the target data is (predominantly) unlabeled. The problem at hand is then finding ways to generalize the learning on the source domain to the target domain. In the age of deep networks this problem has become particularly salient due to the need for vast amounts of labeled training data, in order to reap the benefits of deep learning. Manual generation of labeled data is often prohibitive, and in the absence of such data, networks are rarely performant. The attempt to circumvent this drought of data typically necessitates the gathering of "off-the-shelf" data sets, which are tangentially related and contain labels, and then building models in these domains. The fundamental issue that unsupervised domain adaptation attempts to address is overcoming the inherent shift in distribution across the domains, without the ability to observe this shift directly.

This paper proposes a method for unsupervised domain adaptation which relies on three key components:
# A kernel-based penalty to ensure that the abstract representations generated by the network's hidden layers are similar between the source and the target data;
# An entropy-based penalty on the target classifier, which exploits the entropy minimization principle; and
# An appended residual network structure, which allows the source and target classifiers to differ by a (learned) residual function, thus relaxing the shared classifier assumption which is traditionally made.

This method outperforms state-of-the-art techniques on common benchmark datasets, and is flexible enough to be applied in most feed-forward neural networks.

[[File:Source-and-Target-Domain-Office-31-Backpack.png|thumb|right|The Office-31 Dataset Images for Backpack. Shows the variation in the source and target domains to motivate why these methods are important.]]
=== Working Example (Office-31) ===
In order to assist in the understanding of the methods, it is helpful to have a tangible sense of the problem front of mind. The Domain Adaptation Project [https://people.eecs.berkeley.edu/~jhoffman/domainadapt/] provides data sets which are tailored to the problem of unsupervised domain adaptation. One of these data sets (which is later used in the experiments of this paper) has images which are labeled based on the Amazon product page for the various items. There are then corresponding pictures taken either by webcams or digital SLR cameras. The goal of unsupervised domain adaptation on this data set would be to take any of the three image sources as the source domain, and transfer a classifier to the other domain; see the example images to understand the differences.

One can imagine that, while it is likely easy to scrape labeled images from Amazon, it is likely far more difficult to collect labeled images from webcam or DSLR pictures directly. The ultimate goal of this method would be to train a model to recognize a picture of a backpack taken with a webcam, based on images of backpacks scraped from Amazon (or similar tasks).

== Related Work ==
Broadly speaking, domain adaptation mitigates the need for manual labeling of data in areas such as machine learning, computer vision, and natural language processing. The general goal of domain adaptation is to reduce the discrepancy in probability distributions between the source and target domains.

Research into the use of Deep Neural Networks for the purpose of domain adaptation has suggested that, while networks learn abstract feature representations which can reduce the discrepancy across domains, it is not possible to wholly remove it [http://www.icml-2011.org/papers/342_icmlpaper.pdf], [https://arxiv.org/pdf/1412.3474.pdf]. Further work has been done to design networks which adapt traditional deep nets (typically CNNs) to specifically address the problems posed by domain adaptation; however, these methods all only address the issue of feature adaptation [https://arxiv.org/pdf/1502.02791.pdf], [https://arxiv.org/pdf/1409.7495.pdf], [https://people.eecs.berkeley.edu/~jhoffman/papers/Tzeng_ICCV2015.pdf]. That is, they all assume that the target and source classifiers are shared between domains.

The authors drew particular motivation from He et al. [https://arxiv.org/abs/1512.03385] with the proposed structure of residual networks. Combining the insights from the ResNet architecture, in addition to previous work that had leveraged classifier adaptation (in the context where some target data is labeled) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.8224&rep=rep1&type=pdf], [http://www.machinelearning.org/archive/icml2009/papers/445.pdf], [http://ieeexplore.ieee.org/document/5539870/] the authors develop their proposed network.

== Residual Transfer Networks ==
Generally, in an unsupervised domain adaptation problem, we are dealing with a set $\mathcal{D}_s$ (called the source domain) which is defined by $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$. That is the set of all labeled input-output pairs in our source data set. We denote the number of source elements by $n_s$. There is a corresponding set $\mathcal{D}_t = \{(x_i^t)\}_{i=1}^{n_t}$ (the target domain), consisting of unlabeled input values. There are $n_t$ such values.
[[File:RTN-Structure.png|thumb|left|upright|The overarching structure of the RTN. Consists of an existing network, to which a bottleneck, MMD block, and residual block are appended.]]
We can think of $\mathcal{D}_s$ as being sampled from some underlying distribution $p$, and $\mathcal{D}_t$ as being sampled from $q$. Generally we have that $p \neq q$, partially motivating the need for domain adaptation methods.

We can consider the classifiers $f_s(\underline{x})$ and $f_t(\underline{x})$, for the source domain and target domain respectively. It is possible to learn $f_s$ based on the sample $\mathcal{D}_s$. Under the '''shared classifier assumption''' it would be the case that $f_s(\underline{x}) = f_t(\underline{x})$, and thus learning the source classifier is enough. This method relaxes this assumption, assuming that in general $f_s \neq f_t$, and attempting to learn both.

The example network extends deep convolutional networks (in this case AlexNet [http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf]) to '''Residual Transfer Networks''', the mechanics of which are outlined below. Recall that, if $L(\cdot, \cdot)$ is taken to be the cross-entropy loss function, then the empirical error of a CNN on the source domain $\mathcal{D}_s$ is given by:

<center>
<math display="block">
\min_{f_s} \frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)
</math>
</center>

In a standard implementation, the CNN optimizes over the above loss. This will be the starting point for the RTN.

=== Structural Overview ===
The model proposed in this paper extends existing CNNs and alters the loss function that is optimized over. While each of these components is discussed in depth below, the overarching architecture involves four components:

# An existing deep model. While this can be any model, in theory, the authors leverage AlexNet in practice.
# A bottleneck layer, used to reduce the dimensionality of the learned abstract feature space, directly after the existing network.
# An MMD block, with the expressed intention of feature adaptation.
# A residual block, with the expressed intention of classifier adaptation.

This structure is then optimized against a loss function which combines the standard cross-entropy penalty with MMD and target entropy penalties, yielding the proposed Residual Transfer Network (RTN) structure.

=== Feature Adaptation ===
Feature adaptation refers to the process by which the features learned to represent the source domain are made applicable to the target domain. Broadly speaking, a CNN works to generate abstract feature representations of the distribution that the inputs are sampled from. It has been found that using these deep features can reduce, but not remove, cross-domain distribution discrepancy, hence the need for feature adaptation. It is important to note that CNNs transition from general to specific features as the network gets deeper. In this light, the discrepancy between the feature representations of the source and the target will grow in the deeper layers of a convolutional net. As such, a technique for forcing these distributions to be similar is needed.

In particular, the authors of this paper impose a bottleneck layer (call it $fc_b$) which is included after the final convolutional layer of AlexNet. This dense layer is connected to an additional dense layer $fc_c$ (which will serve as the target classification layer). They then compute the tensor product between the activations of the two layers, performing "lossless multi-layer feature fusion". That is, for the source domain they define $z_i^s \overset{\underset{\mathrm{def}}{}}{=} x_i^{s,fc_b}\otimes x_i^{s,fc_c}$ and for the target domain, $z_i^t \overset{\underset{\mathrm{def}}{}}{=} x_i^{t,fc_b}\otimes x_i^{t,fc_c}$. The authors then perform feature adaptation by minimizing the Maximum Mean Discrepancy between the source and target domains on these fusion features.

[[File:RTN-MMD-Block.png|right|thumb|The Maximum Mean Discrepancy Block (MMD) included in the RTN. The outputs of $fc_b$ and $fc_c$ are fused through a tensor product, and then passed through the MMD penalty, ensuring distributional similarity.]]

==== Maximum Mean Discrepancy ====
The Maximum Mean Discrepancy (MMD) is a kernel method which involves mapping to a Reproducing Kernel Hilbert Space (RKHS) [https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space]. Denote the RKHS $\mathcal{H}_K$ with a characteristic kernel $K$. We then define the '''mean embedding''' of a distribution $p$ in $\mathcal{H}_K$ to be the unique element $\mu_K(p)$ such that $\mathbf{E}_{x\sim p}f(x) = \langle f, \mu_K(p)\rangle_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. Now, if we take $\phi: \mathcal{X} \to \mathcal{H}_K$ to be the feature map associated with $K$, then we can define the MMD between two distributions $p$ and $q$ as follows:

<center>
<math display="block">
d_K(p, q) \overset{\underset{\mathrm{def}}{}}{=} ||\mathbf{E}_{x^s\sim p}(\phi(x^s)) - \mathbf{E}_{x^t\sim q}(\phi(x^t))||_{\mathcal{H}_K}
</math>
</center>

Effectively, the MMD will compute the self-similarity of $p$ and $q$, and subtract twice the cross-similarity between the distributions: $\widehat{\text{MMD}}^2 = \text{mean}(K_{pp}) + \text{mean}(K_{qq}) - 2\times\text{mean}(K_{pq})$. Because the kernel is characteristic, $p$ and $q$ are equivalent distributions if and only if the $\text{MMD} = 0$. If we then wish to force two distributions to be similar, this becomes a minimization problem over the MMD.

Two important notes:
# The RKHS, and as such MMD, depend on the choice of the kernel;
# Computing the MMD efficiently requires an unbiased estimate of the MMD (as outlined [https://arxiv.org/pdf/1502.02791.pdf]).

==== MMD for Feature Adaptation in the RTN ====
The authors wish to minimize the MMD between the fusion features outlined above, derived from the source and target domains. Concretely, this amounts to forcing the distribution of the abstract representation of the source domain $\mathcal{D}_s$ to be similar to the distribution of the abstract representation of the target domain $\mathcal{D}_t$. Performing this optimization over the fused features of $fc_b$ and $fc_c$ forces each of those layers towards similar distributions across the two domains.

Practically this involves an additional penalty function given by the following:

<center>
<math display="block">
D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t) = \sum_{i,j=1}^{n_s} \frac{k(z_i^s, z_j^s)}{n_s^2} + \sum_{i,j=1}^{n_t} \frac{k(z_i^t, z_j^t)}{n_t^2} - 2\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \frac{k(z_i^s, z_j^t)}{n_sn_t}
</math>
</center>

Where the characteristic kernel $k(z, z')$ is the Gaussian kernel, defined on the vectorization of tensors, with bandwidth parameter $b$. That is: $k(z, z') = \exp(-||vec(z) - vec(z')||^2/b)$.

=== Classifier Adaptation ===
In traditional unsupervised domain adaptation a '''shared-classifier assumption''' is made. In essence, if $f_S(x)$ represents the classifier on the source domain, and $f_T(x)$ represents the classifier on the target domain, then this assumption simply states that $f_S = f_T$. While this may seem to be a reasonable assumption at first glance, it is problematic largely because it is incredibly difficult to check. If it could be readily confirmed that the source and target classifiers could be shared, then the problem of domain adaptation would be largely trivialized. Instead, the authors here relax this assumption slightly. They postulate that, instead of being equivalent, the source and target classifiers differ by some perturbation function $\Delta f$. Concretely, they assume that $f_S(x) = f_T(x) + \Delta f(x)$, where $f_S$ and $f_T$ correspond to the source and target classifiers, pre-activation, and $\Delta f(x)$ is some residual function.

The authors then suggest using residual blocks, as popularized by the ResNet framework [https://arxiv.org/pdf/1512.03385.pdf], to learn this residual function.

[[File:Residual-Block-vs-DNN.png|thumb|left|A comparison of a standard Deep Neural Network block which is designed to fit a function H(x) compared to a residual block which fits H(x) as the sum of the input, x, and a learned residual function, F(X).]]
==== Residual Networks Framework ====
A (Deep) Residual Network, as proposed initially in ResNet, employs residual blocks to assist in the learning process; these blocks were a key component in being able to train extraordinarily deep networks. A Residual Network is constructed largely in the same manner as a standard neural network, with one key difference, namely the inclusion of residual blocks: sets of layers which aim to estimate a residual function in place of estimating the function itself.

That is, if we wish to use a DNN to estimate some function $h(x)$, a residual block will decompose this to $h(x) = F(x) + x$. The layers are then used to learn $F(x)$, and after the layers which aim to learn this residual function, the input $x$ is recombined through element-wise addition, to form $h(x) = F(x) + x$. This was initially proposed as a manner to allow for deeper networks to be effectively trained, but has since been used in novel contexts.

==== Residual Blocks in the RTN ====
[[File:RTN-Residual-Block.png|thumb|right|The Structure of the Residual Block in the RTN framework. The block relies on two additional dense layers following the target classifier in an attempt to learn the residual difference between the source and target classifiers.]] The authors leverage residual blocks for the purpose of classifier adaptation. Operating under the assumption that the source and target classifiers differ by an arbitrary perturbation function, $\Delta f(x)$, the authors add an additional set of densely connected layers through which the source data will flow. In particular, the authors take the $fc_c$ layer above as the desired target classifier. For the source data an additional set of layers ($fc-1$ and $fc-2$) is added following $fc_c$, connected as a residual block. The output of the classifier layer is then added back to the output of the residual block in order to form the source classifier.

It is necessary to note that in this case $fc_c$ passes its non-activated (i.e. pre-softmax) output to the element-wise addition, the result of which is passed through the activation layer, yielding the source prediction. In the provided diagram, $f_S(x)$ represents the non-activated output from the additive layer in the residual block; $f_T(x)$ represents the non-activated output from the target classifier; and $fc-1$/$fc-2$ are used to learn the perturbation function $\Delta f(x)$.
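The wiring described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes, the ReLU inside the residual branch, and the names `rtn_classifier_head` and `params` are assumptions made for the example, not details confirmed by the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rtn_classifier_head(features, params):
    """Return (source_probs, target_probs) for a batch of bottleneck features."""
    # fc_c: pre-softmax target classifier output f_T(x)
    f_t = features @ params["W_c"] + params["b_c"]
    # fc-1 followed by a ReLU (the nonlinearity is an assumption of this sketch)
    hidden = np.maximum(0.0, f_t @ params["W_1"] + params["b_1"])
    # fc-2: the learned residual Delta f
    delta_f = hidden @ params["W_2"] + params["b_2"]
    # element-wise addition gives the pre-softmax source output f_S(x)
    f_s = f_t + delta_f
    return softmax(f_s), softmax(f_t)

rng = np.random.default_rng(0)
n_features, n_classes, n_hidden = 16, 5, 8   # hypothetical sizes
params = {
    "W_c": rng.normal(scale=0.1, size=(n_features, n_classes)),
    "b_c": np.zeros(n_classes),
    "W_1": rng.normal(scale=0.1, size=(n_classes, n_hidden)),
    "b_1": np.zeros(n_hidden),
    "W_2": np.zeros((n_hidden, n_classes)),  # zero residual: f_S equals f_T
    "b_2": np.zeros(n_classes),
}
features = rng.normal(size=(3, n_features))
source_probs, target_probs = rtn_classifier_head(features, params)
```

With the residual weights set to zero, as in the example, the source and target predictions coincide, which recovers the shared-classifier assumption as a special case.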

==== Entropy Minimization ====
In addition to the residual blocks, the authors make use of the '''entropy minimization principle''' [http://www.iro.umontreal.ca/~lisa/pointeurs/semi-supervised-entropy-nips2004.pdf] to further refine the classifier adaptation. In particular, by minimizing the entropy of the target classifier (or more correctly, the entropy of the class conditional distribution $f_j^t(x_i^t) = p(y_i^t = j \mid x_i^t; f_t)$), low-density separation between the classes is encouraged. '''Low-Density Separation''' is a concept used predominantly in semi-supervised learning, which in essence tries to draw class decision boundaries in regions where there are few data points (labeled or unlabeled). The above paper leverages an entropy regularization scheme to achieve low-density separation; this is adapted here to the case of unsupervised domain adaptation.

In practice this amounts to adding a further penalty based on the entropy of the class conditional distribution. In particular, if $H(\cdot)$ is defined to be the entropy function, such that $H(f_t(x_i^t)) = - \sum_{j=1}^c f_j^t(x_i^t)\log f_j^t(x_i^t)$, where $c$ is the number of classes and $f_j^t(x_i^t)$ represents the probability of predicting class $j$ for point $x_i^t$, then over the target domain $\mathcal{D}_t$ we define the entropy penalty to be:

<center>
<math display="block">
\frac{1}{n_t} \sum_{i=1}^{n_t} H(f_t(x_i^t))
</math>
</center>
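As a small illustration of this penalty, the following NumPy sketch computes the mean entropy of the softmax outputs for a batch of target logits. The function names and the small constant `eps` (added inside the logarithm for numerical safety) are assumptions of the example.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def entropy_penalty(target_logits, eps=1e-12):
    """Mean entropy H(f_t(x)) over a batch of target examples."""
    probs = softmax(target_logits)                        # f_j^t(x_i^t)
    per_example = -(probs * np.log(probs + eps)).sum(axis=1)
    return per_example.mean()                             # average over n_t points

uniform = entropy_penalty(np.zeros((3, 4)))               # maximally uncertain
confident = entropy_penalty(np.array([[100.0, 0.0, 0.0, 0.0]]))
```

Uniform predictions give the largest possible penalty ($\log c$ for $c$ classes), while confident predictions give a penalty near zero, which is why minimizing this term pushes decision boundaries away from the target data.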

The authors hypothesize that the combination of residual learning and the entropy penalty will enable effective classifier adaptation.

=== Residual Transfer Network ===
The combination of the MMD loss introduced in feature adaptation, the residual block introduced in classifier adaptation, and the application of the entropy minimization principle culminates in the Residual Transfer Network proposed by the authors. The model is optimized according to the following loss function, which combines the standard cross-entropy, the MMD penalty, and the entropy penalty:

<center>
<math display="block">
\min_{f_s = f_t + \Delta f} \underbrace{\left(\frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)\right)}_{\text{Typical Cross-Entropy}} + \underbrace{\frac{\gamma}{n_t}\left(\sum_{i=1}^{n_t} H(f_t(x_i^t)) \right)}_{\text{Target Entropy Minimization}} + \underbrace{\lambda\left(D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t)\right)}_{\text{MMD Penalty}}
</math>
</center>

Here $\gamma$ and $\lambda$ are tradeoff parameters weighting the entropy penalty and the MMD penalty, respectively.
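To make the structure of this objective concrete, the sketch below evaluates the three terms for one mini-batch. It assumes the MMD value has already been computed elsewhere and is passed in as `mmd_value`; the function name `rtn_objective` and its argument names are hypothetical, and the default weights simply reuse the values reported in the experiments.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def rtn_objective(source_logits, source_labels, target_logits, mmd_value,
                  gamma=0.3, lam=0.3):
    """Cross-entropy on source + gamma * target entropy + lam * MMD penalty."""
    # Cross-entropy of the source classifier f_s on labelled source data
    log_p_s = log_softmax(source_logits)
    cross_entropy = -log_p_s[np.arange(len(source_labels)), source_labels].mean()
    # Entropy of the target classifier f_t on unlabelled target data
    log_p_t = log_softmax(target_logits)
    target_entropy = -(np.exp(log_p_t) * log_p_t).sum(axis=1).mean()
    # mmd_value is assumed to be precomputed from source and target features
    return cross_entropy + gamma * target_entropy + lam * mmd_value

value = rtn_objective(np.zeros((2, 4)), [0, 1], np.zeros((3, 4)), mmd_value=0.5)
```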

The full network, which is trained subject to the above optimization problem, thus takes on the following structure.

[[File:rtn-full-paper-structure.png|center|alt=The Structure of the RTN]]

== Experiments ==

=== Set-up ===
The performance of RTN was evaluated on two key data sets in the area of unsupervised domain adaptation: Office-31 (discussed in the introduction) and Office-Caltech (maintained by the same project group). Office-31 comprises images of 31 different objects from 3 sources: Amazon ('''A'''), Webcam ('''W'''), and DSLR ('''D'''). Office-Caltech is derived by considering the 10 classes common to both the Office-31 and the Caltech ('''C''') data sets, thus providing further adaptation possibilities. This provides 6 transfer tasks on the 31 classes of Office-31 ($\{(A,W), (A,D), (W,A), (W,D), (D,A), (D,W)\}$) and 12 transfer tasks on the 10 classes of Office-Caltech ($\{(A,W), (A,D), (A,C), (W,A), (W,D), (W,C), (D,A), (D,W), (D,C), (C,A), (C,W), (C,D)\}$).
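The task counts above follow from taking every ordered (source, target) pair of distinct domains. A short Python check, using only the domain labels given in the text:

```python
from itertools import permutations

office31_domains = ["A", "W", "D"]
office_caltech_domains = ["A", "W", "D", "C"]

# Every ordered pair of distinct domains is one transfer task (source, target)
office31_tasks = list(permutations(office31_domains, 2))
office_caltech_tasks = list(permutations(office_caltech_domains, 2))

total_tasks = len(office31_tasks) + len(office_caltech_tasks)
```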

The authors then compare the results on the 18 different adaptation tasks against 6 other models. In order to determine the efficacy of the various contributions outlined in the paper they perform an ablation study, evaluating variants of the RTN. Specifically, they consider the RTN with only the MMD module ('''RTN (mmd)'''), the RTN with the MMD module and the entropy minimization ('''RTN (mmd+ent)'''), and the complete RTN ('''RTN (mmd+ent+res)'''). The experiments leverage all the labeled training data and compute accuracy across all unlabeled domain data. The parameters of the model (i.e. $\gamma$, and $\lambda$) are fixed based on a single validation point on the transfer task $\mathbf{A}\to\mathbf{W}$. These parameters are then maintained across all transfer tasks.

As for specification details, the authors use mini-batch SGD with momentum $0.9$, and with the learning rate adjusted based on $\eta_p = \frac{\eta_0}{(1 + \alpha p)^\beta}$, where $p$ indicates the portion of training completed (linear from $0$ to $1$), $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$, which was optimized for low error on the source. The MMD and entropy parameters, set as above, were maintained at $\lambda = 0.3$ and $\gamma = 0.3$.
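The learning-rate schedule can be written directly from the formula. The sketch below uses the stated constants as defaults; the function name `learning_rate` is chosen for the example.

```python
def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_p, where p in [0, 1] is training progress."""
    return eta0 / (1.0 + alpha * p) ** beta

# Learning rate at the start, midpoint, and end of training
schedule = [learning_rate(p) for p in (0.0, 0.5, 1.0)]
```

The rate starts at $\eta_0$ when $p = 0$ and decays monotonically as training progresses.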

=== Results ===
[[File:table-1-results.PNG|thumb|right|Results from the Office-31 Experiment]][[File:table-2-results.PNG|thumb|right|Results from the Office-Caltech Experiment]]
In aggregate, the network outperformed all comparison methods across all transfer tasks. Broadly speaking, the network saw the largest increases in accuracy on the hard transfer tasks (for instance $\mathbf{A} \to \mathbf{C}$), where the discrepancy between the source and target domains is large. The authors take this to mean that the proposed model learns "more adaptive classifiers and transferable features for safer domain adaptation." They further indicate that standard deep learning techniques (i.e. just AlexNet) perform similarly to standard shallow techniques (TCA and GFK). Deep-transfer methods which focus on feature adaptation perform significantly better than the standard methods. The proposed RTN, which adds further considerations for classifier adaptation, performs even better.

In addition, the ablation study found a number of interesting results:
# The RTN (mmd) outperforms DAN, which is founded on a similar method, but contains multiple MMD penalties (one for each layer instead of on a bottleneck), and is as such less computationally efficient;
# The addition of the entropy penalty [RTN (mmd+ent)] provides significant marginal benefit over the previous RTN (mmd);
# The full RTN [RTN (mmd+ent+res)] performs the best of all variants, but diminishing returns are seen relative to the addition of the entropy penalty.

Overall the authors claim that the RTN (mmd+ent+res) is now regarded as state-of-the-art for unsupervised domain adaptation.

=== Discussion ===
[[File:t-sne-embeddings.png|thumb|left|t-SNE Embeddings Comparing the Performance of DAN and RTN]]
[[File:mean-sd-layer-outputs.png|thumb|right|The Mean and Standard Deviations of the outputs from the Source Classifier, Target Classifier, and Residual Functions. As expected, the residual function provides a small, but non-zero, contribution.]]
[[File:gamma-tradeoff.png|thumb|left|The accuracy of tests by varying the parameter $\gamma$. We first see an increase in accuracy up to an ideal point, before having the accuracy fall again.]]
[[File:classifier-shift.png|thumb|right|The corresponding weights of the classifier layers, if trained on the labeled source and target data, exhibiting the differences which exist between the two classifiers in an ideal state. ]]

==== Visualizing Predictions (Versus DAN) ====
DAN uses a similar method for feature adaptation but neglects any attempt at classifier adaptation (i.e. it makes the shared-classifier assumption). In order to demonstrate that this leads to worse performance, the authors provide images showing the t-SNE embeddings produced by DAN and RTN on the transfer task $\mathbf{A} \to \mathbf{W}$. The images show that, under DAN, the target categories are not well discriminated by the source classifier, suggesting a violation of the shared-classifier assumption. Conversely, the target classifier for the RTN exhibits better discrimination.

==== Layer Responses and Classifier Shift ====
The authors further consider the mean and standard deviation of the outputs of $f_S(x)$, $f_T(x)$ and $\Delta f(x)$ to consider the relative contributions of the different components. As expected, $\Delta f(x)$ provides a small (though non-zero) contribution to the learned source classifier. This provides some merit to the idea of residual learning on the classifiers.

In addition, the authors train classifiers on the source and target data, with labels present, and compare the realized weights. This is used to test how different the ideal weights are on separate classifiers. The results suggest that there is, in fact, a discrepancy between the classifiers, further motivating the use of tactics to avoid the shared-classifier assumption.

==== Parameter Sensitivity ====
Lastly, the authors test the sensitivity of these results against the parameter $\gamma$. They run this test on $\mathbf{A}\to\mathbf{W}$ in addition to $\mathbf{C}\to\mathbf{W}$, varying the parameter from $0.01$ to $1.0$. They find that, on both tasks, the increase of the parameter initially improves accuracy, before seeing a drop-off.

== Conclusion ==
This paper presented a novel approach to unsupervised domain adaptation which relaxed assumptions made by previous models with regard to the shared nature of classifiers. Like previous models this proposed network leverages feature adaptation by matching the distributions of features across the domains. In addition, using a residual network and entropy minimization tactic, the target classifier is allowed to differ from the source classifier. In particular, this approach allows for easy integration into existing networks, and can be implemented with any standard deep learning software.

For follow-up considerations, the authors propose looking for adaptations which may be useful in the semi-supervised domain adaptation problem.

== Critique ==
While the paper presents a clear approach, which empirically attains great results on the desired tasks, I question the benefit of the residual block that is employed. The results of the ablation study seem to suggest that the majority of the benefits can be derived from using the MMD and entropy penalties. The residual block appears to add marginal, perhaps insignificant, contributions to the outcome. Despite this, the use of MMD loss is not novel, and the entropy loss is less well documented and less thoroughly explored. Perhaps a different set of ablations would have indicated that the three parts are, indeed, equally effective (and that the diminishing returns stem from stacking the three methods), but as it is presented, I question the utility of the final structure versus a less complicated, less novel approach.

==References==
1. https://en.wikipedia.org/wiki/Domain_adaptation
2.

Unsupervised Domain Adaptation with Residual Transfer Networks

2017-11-20T06:37:16Z

A2prasad:

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

2017-11-20T05:21:38Z

A2prasad: /* References */

='''Introduction & Background'''=
Learning quickly is a hallmark of human intelligence, whether it involves recognizing objects from a few examples or quickly learning new skills after just minutes of experience. In this work, we propose a meta-learning algorithm that is general and model-agnostic, in the sense that it can be directly applied to any learning problem and model that is trained with a gradient descent procedure. Our focus is on deep neural network models, but we illustrate how our approach can easily handle different architectures and different problem settings, including classification, regression, and policy gradient reinforcement learning, with minimal modification. Unlike prior meta-learning methods that learn an update function or learning rule (Schmidhuber, 1987; Bengio et al., 1992; Andrychowicz et al., 2016; Ravi & Larochelle, 2017), this algorithm does not expand the number of learned parameters nor place constraints on the model architecture (e.g. by requiring a recurrent model (Santoro et al., 2016) or a Siamese network (Koch, 2015)), and it can be readily combined with fully connected, convolutional, or recurrent neural networks. It can also be used with a variety of loss functions, including differentiable supervised losses and nondifferentiable reinforcement learning objectives.

The primary contribution of this work is a simple model and task-agnostic algorithm for meta-learning that trains a model’s parameters such that a small number of gradient updates will lead to fast learning on a new task. The paper shows the effectiveness of the proposed algorithm in different domains, including classification, regression, and reinforcement learning problems.

='''Model-Agnostic Meta Learning (MAML)'''=
The goal of the proposed model is rapid adaptation. This setting is usually formalized as few-shot learning.

=== Problem set-up ===
The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few datapoints and training iterations. To do so, the model is trained during a meta-learning phase on a set of tasks, such that it can then be adapted to a new task using only a small number of parameter updates. In effect, the meta-learning problem treats entire tasks as training examples.

Let us consider a model denoted by $f$, that maps the observation $\mathbf{x}$ into the output variable $a$. During meta-learning, the model is trained to be able to adapt to a large or infinite number of tasks.

Let us consider a generic notion of task as below. Each task $\mathcal{T} = \{\mathcal{L}(\mathbf{x}_1,a_1,\mathbf{x}_2,a_2,..., \mathbf{x}_H,a_H), q(\mathbf{x}_1),q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t),H \}$ consists of a loss function $\mathcal{L}$, a distribution over initial observations $q(\mathbf{x}_1)$, a transition distribution $q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$, and an episode length $H$. In i.i.d. supervised learning problems, the length $H = 1$. The model may generate samples of length $H$ by choosing an output $a_t$ at each time $t$. The cost $\mathcal{L}$ provides task-specific feedback, which is defined based on the nature of the problem.

A distribution over tasks is denoted by $p(\mathcal{T})$. In the K-shot learning setting, the model is trained to learn a new task $\mathcal{T}_i$ drawn from $p(\mathcal{T})$ from only K samples drawn from $q_i$ and feedback $\mathcal{L}_{\mathcal{T}_i}$ generated by $\mathcal{T}_i$. During meta-training, a task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$, the model is trained with K samples and feedback from the corresponding loss $\mathcal{L}_{\mathcal{T}_i}$ from $\mathcal{T}_i$, and then tested on new samples from $\mathcal{T}_i$. The model $f$ is then improved by considering how the test error on new data from $q_i$ changes with respect to the parameters. In effect, the test error on sampled tasks $\mathcal{T}_i$ serves as the training error of the meta-learning process. At the end of meta-training, new tasks are sampled from $p(\mathcal{T})$, and meta-performance is measured by the model’s performance after learning from K samples.

=== MAML Algorithm ===
[[File:model.png|200px|right|thumb|Figure 1: Diagram of the MAML algorithm]]
The paper proposes a method that can learn the parameters of any standard model via meta-learning in such a way as to prepare that model for fast adaptation. The intuition behind this approach is that some internal representations are more transferable than others. Since the model will be fine-tuned using a gradient-based learning rule on a new task, we will aim to learn a model in such a way that this gradient-based learning rule can make rapid progress on new tasks drawn from $p(\mathcal{T})$, without overfitting. In effect, we will aim to find model parameters that are sensitive to changes in the task, such that small changes in the parameters will produce large improvements on the loss function of any task drawn from $p(\mathcal{T})$ (see Fig 1).

Note that there is no assumption about the form of the model. The only assumption is that it is parameterized by a vector of parameters $\theta$, and that the loss is smooth so that the parameters can be learned using gradient-based techniques. Formally, let us assume that the model is denoted by $f_{\theta}$. When adapting to a new task $\mathcal{T}_i$, the model’s parameters $\theta$ become $\theta_i'$. In our method, the updated parameter vector $\theta_i'$ is computed using one or more gradient descent updates on task $\mathcal{T}_i$. For example, when using one gradient update:

$$
\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta}).
$$

Here $\alpha$ is the learning rate used for each task and is treated as a hyperparameter. The authors consider a single update step for the rest of the paper, for the sake of simplicity.

The model parameters are trained by optimizing for the performance
of $f_{\theta_i'}$ with respect to $\theta$ across tasks sampled from $p(\mathcal{T})$. More concretely, the meta-objective is as follows:

$$
\min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'}) = \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})})
$$

Note that the meta-optimization is performed over the model parameters $\theta$, whereas the objective is computed using the updated model parameters $\theta_i'$. The method aims to optimize the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task.

Therefore the meta-learning across the tasks is performed via stochastic gradient descent (SGD), such that the model parameters $\theta$ are updated as:

$$
\theta \gets \theta - \beta \nabla_{\theta } \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'})
$$
where $\beta$ is the meta step size. An outline of the algorithm is shown in Algorithm 1.
[[File:ershad_alg1.png|500px|center|thumb]]

The MAML meta-gradient update involves a gradient through a gradient. Computationally, this requires an additional backward pass through f to compute Hessian-vector products, which is supported by standard deep learning libraries such as TensorFlow.
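To make the "gradient through a gradient" concrete, the following sketch works through the meta-update for a deliberately tiny model, $f_w(x) = wx$ with a mean-squared-error loss, where the second derivative can be written by hand. This is an illustrative assumption: the paper uses neural networks and automatic differentiation, and the function names, the support/query split, and the step sizes here are chosen only for the example.

```python
import numpy as np

def loss(w, x, y):
    return np.mean((w * x - y) ** 2)

def grad(w, x, y):
    return 2.0 * np.mean(x * (w * x - y))      # dL/dw

def hessian(x):
    return 2.0 * np.mean(x ** 2)               # d^2L/dw^2 (constant in w)

def meta_objective(w, tasks, alpha):
    """Sum of post-update losses on each task's held-out (query) points."""
    return sum(loss(w - alpha * grad(w, xs, ys), xq, yq)
               for xs, ys, xq, yq in tasks)

def maml_meta_gradient(w, tasks, alpha):
    total = 0.0
    for xs, ys, xq, yq in tasks:
        w_adapted = w - alpha * grad(w, xs, ys)            # inner update
        # Chain rule: d(w_adapted)/dw = 1 - alpha * Hessian of the support loss
        total += grad(w_adapted, xq, yq) * (1.0 - alpha * hessian(xs))
    return total

def maml_step(w, tasks, alpha=0.1, beta=0.01):
    return w - beta * maml_meta_gradient(w, tasks, alpha)  # meta update

rng = np.random.default_rng(0)
tasks = []
for slope in (0.5, 2.0, -1.0):                 # each task: y = slope * x
    xs, xq = rng.uniform(-1, 1, 5), rng.uniform(-1, 1, 5)
    tasks.append((xs, slope * xs, xq, slope * xq))

w0 = 0.3
w1 = maml_step(w0, tasks)
```

The factor `1.0 - alpha * hessian(xs)` is the second-order term; dropping it would give a first-order approximation of the meta-gradient rather than the full one.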

='''Different Types of MAML'''=
In this section the MAML algorithm is discussed for different supervised learning and reinforcement learning tasks. The differences between each of these tasks are in their loss function and the way the data is generated. In general, this method does not require additional model parameters nor using any additional meta-learner to learn the update of parameters. Compared to other approaches that tend to “learn to compare new examples in a learned metric space using e.g. Siamese networks or recurrence with attention mechanisms”, the proposed method can be generalized to any other problems including classification, regression and reinforcement learning.

=== Supervised Regression and Classification ===
Few-shot learning is well-studied in this field. For these two types of tasks the horizon $H$ is equal to 1, since the data points are generated i.i.d.

Although any common classification and regression objectives can be used as the loss, the paper uses the following losses for these two tasks.

Regression : For regression we use the mean-square error (MSE):

$$
\mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \parallel f_{\theta} (\mathbf{x}^{(j)}) - \mathbf{y}^{(j)} \parallel_2^2
$$

where $\mathbf{x}^{(j)}$ and $\mathbf{y}^{(j)}$ are an input/output pair sampled from task $\mathcal{T}_i$. In K-shot regression tasks, K input/output pairs are provided for learning for each task.

Classification: For classification we use the cross entropy loss:

$$
\mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = - \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \left[ \mathbf{y}^{(j)} \log f_{\theta}(\mathbf{x}^{(j)}) + (1-\mathbf{y}^{(j)}) \log (1-f_{\theta}(\mathbf{x}^{(j)})) \right]
$$

According to the conventional terminology, K-shot classification tasks use K input/output pairs from each class, for a total of $NK$ data points for N-way classification.

Given a distribution over tasks, these loss functions can be directly inserted into the equations in the previous section to perform meta-learning, as detailed in Algorithm 2.
[[File:ershad_alg2.png|500px|center|thumb]]
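The N-way, K-shot convention described above can be illustrated by sampling one support set. This is a minimal sketch under assumed names: `data_by_class` is a hypothetical dictionary from class name to a list of examples, and `sample_episode` is not a function from the authors' code.

```python
import random
from collections import Counter

def sample_episode(data_by_class, n_way, k_shot, rng):
    """Draw K examples from each of N randomly chosen classes."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support = []
    for label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], k_shot)
        # Labels are re-indexed 0..N-1 within the episode
        support.extend((example, label) for example in examples)
    return support

# Toy data: 8 classes with 20 placeholder examples each
data_by_class = {name: list(range(20)) for name in "abcdefgh"}
support = sample_episode(data_by_class, n_way=5, k_shot=3, rng=random.Random(0))
label_counts = Counter(label for _, label in support)
```

The support set has $NK$ entries, with exactly $K$ per class.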

=== Reinforcement Learning ===
In reinforcement learning (RL), the goal of few-shot meta learning is to enable an agent to quickly acquire a policy for a new test task using only a small amount of experience in the test setting. A new task might involve achieving a new goal or succeeding on a previously trained goal in a new environment. For example an agent may learn how to navigate mazes very quickly so that, when faced with a new maze, it can determine how to reliably reach the exit with only a few samples.

Each RL task $\mathcal{T}_i$ contains an initial state distribution $q_i(\mathbf{x}_1)$ and a transition distribution $q_i(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$, and the loss $\mathcal{L}_{\mathcal{T}_i}$ corresponds to the (negative) reward function $R$. The entire task is therefore a Markov decision process (MDP) with horizon $H$, where the learner is allowed to query a limited number of sample trajectories for few-shot learning. Any aspect of the MDP may change across tasks in $p(\mathcal{T})$. The model being learned, $f_{\theta}$, is a policy that maps from states $\mathbf{x}_t$ to a distribution over actions $a_t$ at each timestep $t \in \{1,...,H\}$. The loss for task $\mathcal{T}_i$ and model $f_{\theta}$ takes the form

$$
\mathcal{L}_{\mathcal{T}_i}(f_{\theta}) = -\mathbb{E}_{\mathbf{x}_t,a_t \sim f_{\theta},q_{\mathcal{T}_i}} \big [\sum_{t=1}^H R_i(\mathbf{x}_t,a_t)\big ]
$$

In K-shot reinforcement learning, K rollouts from $f_{\theta}$ and task $\mathcal{T}_i$, $(\mathbf{x}_1,a_1,...,\mathbf{x}_H)$, and the corresponding rewards $ R(\mathbf{x}_t,a_t)$, may be used for adaptation on a new task $\mathcal{T}_i$.
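In practice the expectation in this loss is not available in closed form, so one natural reading is to estimate it from the K sampled rollouts. The sketch below shows that Monte Carlo estimate only; the function name and the list-of-lists reward format are assumptions of the example, and it does not compute a policy gradient.

```python
def estimate_rl_loss(rollout_rewards):
    """Negative average return over K rollouts.

    rollout_rewards: list of K lists, each holding the per-step rewards
    R(x_t, a_t) of one rollout.
    """
    returns = [sum(rewards) for rewards in rollout_rewards]
    return -sum(returns) / len(returns)

# Two hypothetical rollouts of length 2
example_loss = estimate_rl_loss([[1.0, 2.0], [3.0, 4.0]])
```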

Since the expected reward is generally not differentiable due to unknown dynamics, we use policy gradient methods to estimate the gradient both for the model gradient update(s) and the meta-optimization. Since policy gradients are an on-policy algorithm, each additional gradient step during the adaptation of $f_{\theta}$ requires new samples from the current policy $f_{\theta_i'}$. We detail the algorithm in Algorithm 3.
[[File:ershad_alg3.png|500px|center|thumb]]

='''Experiments'''=

=== Regression ===
We start with a simple regression problem that illustrates the basic principles of MAML. Each task involves regressing from the input to the output of a sine wave, where the amplitude and phase of the sinusoid are varied between tasks. Thus, $p(\mathcal{T})$ is continuous, and the input and output both have a dimensionality of 1. During training and testing, datapoints are sampled uniformly. The loss is the mean-squared error between the prediction and true value. The regressor is a neural network model with 2 hidden layers of size 40 with ReLU nonlinearities. When training with MAML, we use one gradient update with K = 10 examples with a fixed step size 0.01, and use Adam as the metaoptimizer [2]. The baselines are likewise trained with Adam. To evaluate performance, we finetune a single meta-learned model on varying numbers of K examples, and compare performance to two baselines: (a) pretraining on all of the tasks, which entails training a network to regress to random sinusoid functions and then, at test-time, fine-tuning with gradient descent on the K provided points, using an automatically tuned step size, and (b) an oracle which receives the true amplitude and phase as input.

We evaluate performance by fine-tuning the model learned by MAML and the pretrained model on $K = \{ 5,10,20 \}$ datapoints. During fine-tuning, each gradient step is computed using the same $K$ datapoints. Results are shown in Fig 2.

[[File:ershad_results1.png|500px|center|thumb|Figure 2: Few-shot adaptation for the simple regression task. Left: Note that MAML is able to estimate parts of the curve where there are no datapoints, indicating that the model has learned about the periodic structure of sine waves. Right: Fine-tuning of a model pretrained on the same distribution of tasks without MAML, with a tuned step size. Due to the often contradictory outputs on the pre-training tasks, this model is unable to recover a suitable representation and fails to extrapolate from the small number of test-time samples.]]
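The sinusoid task distribution can be sketched as follows. The summary above does not state the sampling ranges, so the default ranges for amplitude, phase, and inputs below, as well as the exact form $y = A \sin(x - \phi)$, are assumptions of this example and should be checked against the original paper.

```python
import numpy as np

def sample_sinusoid_task(rng, amp_range=(0.1, 5.0), phase_range=(0.0, np.pi)):
    """Draw one task: the amplitude and phase of a sine wave (assumed ranges)."""
    amplitude = rng.uniform(*amp_range)
    phase = rng.uniform(*phase_range)
    return amplitude, phase

def sample_points(rng, amplitude, phase, k, x_range=(-5.0, 5.0)):
    """Draw K input/output pairs from the task, with inputs sampled uniformly."""
    x = rng.uniform(*x_range, size=(k, 1))
    y = amplitude * np.sin(x - phase)
    return x, y

rng = np.random.default_rng(0)
amplitude, phase = sample_sinusoid_task(rng)
x_support, y_support = sample_points(rng, amplitude, phase, k=10)
```

Each call to `sample_sinusoid_task` corresponds to drawing one task from $p(\mathcal{T})$, and `sample_points` provides the K examples used for adaptation.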

=== Classification ===

For classification evaluation, Omniglot and MiniImagenet datasets are used. The Omniglot dataset consists of 20 instances of 1623 characters from 50 different alphabets.

The experiment involves fast learning of N-way classification with 1 or 5 shots. The problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model’s ability to classify new instances within the N classes. For Omniglot, 1200 characters are selected randomly for training, irrespective of alphabet, and the remaining characters are used for testing. The Omniglot dataset is augmented with rotations by multiples of 90 degrees.

The model follows the same architecture as the embedding function used by matching networks (Vinyals et al., 2016), which has 4 modules, each with 3-by-3 convolutions and 64 filters, followed by batch normalization, a ReLU nonlinearity, and 2-by-2 max-pooling. The Omniglot images are downsampled to 28-by-28, so the dimensionality of the last hidden layer is 64. The last layer is fed into a softmax. For Omniglot, strided convolutions are used instead of max-pooling. For MiniImagenet, 32 filters per layer are used to reduce overfitting. In order to also provide a fair comparison against memory-augmented neural networks [3] and to test the flexibility of MAML, the results for a non-convolutional network are also provided.

[[File:ershad_results2.png|500px|center|thumb|Table 1: Few-shot classification on held-out Omniglot characters (top) and the MiniImagenet test set (bottom). MAML achieves results that are comparable to or outperform state-of-the-art convolutional and recurrent models. Siamese nets, matching nets, and the memory module approaches are all specific to classification, and are not directly applicable to regression or RL scenarios. The $\pm$ shows 95% confidence intervals over tasks. ]]

=== Reinforcement Learning ===
Several simulated continuous control environments are used for RL evaluation. In all of the domains, the MAML model is a neural network policy with two hidden layers of size 100 and ReLU activations. The gradient updates are computed using vanilla policy gradient, and trust-region policy optimization (TRPO) is used as the meta-optimizer.

In order to avoid computing third derivatives, finite differences are used to compute the Hessian-vector products for TRPO. For both learning and meta-learning updates, we use the standard linear feature baseline proposed by [4], which is fitted separately at each iteration for each sampled task in the batch.

Three baseline models are used for comparison:
* (a) pretraining one policy on all of the tasks and then fine-tuning;
* (b) training a policy from randomly initialized weights; and
* (c) an oracle policy which receives the parameters of the task as input, which for the tasks below corresponds to a goal position, goal direction, or goal velocity for the agent.

The baseline models of (a) and (b) are fine-tuned with gradient descent with a manually tuned step size.

2D Navigation: In the first meta-RL experiment, the authors study a set of tasks where a point agent must move to different goal positions in 2D, randomly chosen for each task within a unit square. The observation is the current 2D position, and actions correspond to velocity commands clipped to be in the range $[-0.1, 0.1]$. The reward is the negative squared distance to the goal, and episodes terminate when the agent is within $0.01$ of the goal or at the horizon of $H = 100$. The policy was trained with MAML to maximize performance after 1 policy gradient update using 20 trajectories. They compare adaptation to a new task with up to 4 gradient updates, each with 40 samples. Results are shown in Fig. 3.

[[File:ershad_results3.png|500px|center|thumb|Figure 3: Top: quantitative results from 2D navigation task, Bottom: qualitative comparison between model learned with MAML and with fine-tuning from a pretrained network ]]
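The 2D navigation task is simple enough to sketch as a single transition function. The clipping range, reward, termination threshold, and horizon follow the description above; treating the clipped action as a direct position increment, computing the reward at the new position, and the name `navigation_step` are assumptions of this example.

```python
import numpy as np

def navigation_step(position, action, goal, t,
                    horizon=100, max_speed=0.1, tol=0.01):
    """Advance the point agent by one step; return (position, reward, done)."""
    action = np.clip(np.asarray(action, dtype=float), -max_speed, max_speed)
    new_position = np.asarray(position, dtype=float) + action  # assumed dynamics
    distance = np.linalg.norm(new_position - np.asarray(goal, dtype=float))
    reward = -distance ** 2                     # negative squared distance to goal
    done = bool(distance < tol) or (t + 1 >= horizon)
    return new_position, reward, done

goal = np.array([0.5, 0.5])
position, reward, done = navigation_step([0.0, 0.0], [1.0, -1.0], goal, t=0)
```

In the example the requested action `[1.0, -1.0]` is clipped to `[0.1, -0.1]` before being applied, so the agent can only approach the goal gradually over the episode.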

Locomotion: To study how well MAML can scale to more complex deep RL problems, we also study adaptation on high-dimensional locomotion tasks with the MuJoCo simulator [5]. The tasks require two simulated robots, a planar cheetah and a 3D quadruped (the “ant”), to run in a particular direction or at a particular velocity. In the goal velocity experiments, the reward is the negative absolute value of the difference between the current velocity of the agent and a goal, which is chosen uniformly at random between 0 and 2 for the cheetah and between 0 and 3 for the ant. In the goal direction experiments, the reward is the magnitude of the velocity in either the forward or backward direction, chosen at random for each task in $p(\mathcal{T})$. The horizon is $H = 200$, with 20 rollouts per gradient step for all problems except the ant forward/backward task, which used 40 rollouts per step. The results in Figure 4 show that MAML learns a model that can quickly adapt its velocity and direction with even just a single gradient update, and continues to improve with more gradient steps. The results also show that, on these challenging tasks, the MAML initialization substantially outperforms random initialization and pretraining.
[[File:ershad_results4.png|500px|center|thumb|Figure 4: Reinforcement learning results for the half-cheetah and ant locomotion tasks, with the tasks shown on the far right. ]]

='''Conclusion'''=

The paper introduced a meta-learning method based on learning easily adaptable model parameters through gradient descent. The approach has a number of benefits. It is simple and does not introduce any learned parameters for meta-learning. It can be combined with any model representation that is amenable to gradient-based training, and any differentiable objective, including classification, regression, and reinforcement learning. Lastly, since our method merely produces a weight initialization, adaptation can be performed with any amount of data and any number of gradient steps, though it demonstrates state-of-the-art results on classification with only one or five examples per class. The authors also show that the method can adapt an RL agent using policy gradients and a very modest amount of experience.

='''References'''=
# Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.
# Lake, Brenden M, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua B. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.
# Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
# Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.
# Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.
# Videos of the learned policies can be found at https://sites.google.com/view/maml.

Implementation Example: https://github.com/cbfinn/maml

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

2017-11-20T05:14:15Z

A2prasad: /* References */

='''Introduction & Background'''=
Learning quickly is a hallmark of human intelligence, whether it involves recognizing objects from a few examples or quickly learning new skills after just minutes of experience. In this work, we propose a meta-learning algorithm that is general and model-agnostic, in the sense that it can be directly applied to any learning problem and model that is trained with a gradient descent procedure. Our focus is on deep neural network models, but we illustrate how our approach can easily handle different architectures and different problem settings, including classification, regression, and policy gradient reinforcement learning, with minimal modification. Unlike prior meta-learning methods that learn an update function or learning rule (Schmidhuber, 1987; Bengio et al., 1992; Andrychowicz et al., 2016; Ravi & Larochelle, 2017), this algorithm does not expand the number of learned parameters nor place constraints on the model architecture (e.g. by requiring a recurrent model (Santoro et al., 2016) or a Siamese network (Koch, 2015)), and it can be readily combined with fully connected, convolutional, or recurrent neural networks. It can also be used with a variety of loss functions, including differentiable supervised losses and nondifferentiable reinforcement learning objectives.

The primary contribution of this work is a simple model and task-agnostic algorithm for meta-learning that trains a model’s parameters such that a small number of gradient updates will lead to fast learning on a new task. The paper shows the effectiveness of the proposed algorithm in different domains, including classification, regression, and reinforcement learning problems.

='''Model-Agnostic Meta Learning (MAML)'''=
The goal of the proposed model is rapid adaptation. This setting is usually formalized as few-shot learning.

=== Problem set-up ===
The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few datapoints and training iterations. To do so, the model is trained during a meta-learning phase on a set of tasks, such that it can then be adapted to a new task using only a small number of parameter updates. In effect, the meta-learning problem treats entire tasks as training examples.

Let us consider a model denoted by $f$, that maps the observation $\mathbf{x}$ into the output variable $a$. During meta-learning, the model is trained to be able to adapt to a large or infinite number of tasks.

Let us consider a generic notion of a task. Each task $\mathcal{T} = \{\mathcal{L}(\mathbf{x}_1,a_1,\mathbf{x}_2,a_2,\ldots, \mathbf{x}_H,a_H), q(\mathbf{x}_1),q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t),H \}$ consists of a loss function $\mathcal{L}$, a distribution over initial observations $q(\mathbf{x}_1)$, a transition distribution $q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$, and an episode length $H$. In i.i.d. supervised learning problems, the length $H =1$. The model may generate samples of length $H$ by choosing an output $a_t$ at each time $t$. The loss $\mathcal{L}$ provides task-specific feedback, which is defined based on the nature of the problem.

A distribution over tasks is denoted by $p(\mathcal{T})$. In the K-shot learning setting, the model is trained to learn a new task $\mathcal{T}_i$ drawn from $p(\mathcal{T})$ from only K samples drawn from $q_i$ and feedback $\mathcal{L}_{\mathcal{T}_i}$ generated by $\mathcal{T}_i$. During meta-training, a task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$, the model is trained with K samples and feedback from the corresponding loss $\mathcal{L}_{\mathcal{T}_i}$ from $\mathcal{T}_i$, and then tested on new samples from $\mathcal{T}_i$. The model $f$ is then improved by considering how the test error on new data from $q_i$ changes with respect to the parameters. In effect, the test error on sampled tasks $\mathcal{T}_i$ serves as the training error of the meta-learning process. At the end of meta-training, new tasks are sampled from $p(\mathcal{T})$, and meta-performance is measured by the model’s performance after learning from K samples.

=== MAML Algorithm ===
[[File:model.png|200px|right|thumb|Figure 1: Diagram of the MAML algorithm]]
The paper proposes a method that can learn the parameters of any standard model via meta-learning in such a way as to prepare that model for fast adaptation. The intuition behind this approach is that some internal representations are more transferable than others. Since the model will be fine-tuned using a gradient-based learning rule on a new task, we will aim to learn a model in such a way that this gradient-based learning rule can make rapid progress on new tasks drawn from $p(\mathcal{T})$, without overfitting. In effect, we will aim to find model parameters that are sensitive to changes in the task, such that small changes in the parameters will produce large improvements on the loss function of any task drawn from $p(\mathcal{T})$ (see Figure 1).

Note that there is no assumption about the form of the model. The only assumption is that it is parameterized by a vector of parameters $\theta$, and that the loss is smooth so that the parameters can be learned using gradient-based techniques. Formally, let the model be denoted by $f_{\theta}$. When adapting to a new task $\mathcal{T}_i$, the model’s parameters $\theta$ become $\theta_i'$. In our method, the updated parameter vector $\theta_i'$ is computed using one or more gradient descent updates on task $\mathcal{T}_i$. For example, when using one gradient update:

$$
\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta}).
$$

Here $\alpha$ is the learning rate used for each task and is treated as a hyperparameter. For simplicity, the authors consider a single update step for the rest of the paper.

The model parameters are trained by optimizing for the performance
of $f_{\theta_i'}$ with respect to $\theta$ across tasks sampled from $p(\mathcal{T})$. More concretely, the meta-objective is as follows:

$$
\min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'}) = \min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})})
$$

Note that the meta-optimization is performed over the model parameters $\theta$, whereas the objective is computed using the updated model parameters $\theta_i'$. The method aims to optimize the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task.

Therefore the meta-learning across the tasks is performed via stochastic gradient descent (SGD), such that the model parameters $\theta$ are updated as:

$$
\theta \gets \theta - \beta \nabla_{\theta } \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'})
$$
where $\beta$ is the meta step size. An outline of the algorithm is shown in Algorithm 1.
[[File:ershad_alg1.png|500px|center|thumb]]

The MAML meta-gradient update involves a gradient through a gradient. Computationally, this requires an additional backward pass through f to compute Hessian-vector products, which is supported by standard deep learning libraries such as TensorFlow.
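The inner update and the meta-update above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a scalar linear model $f_{\theta}(x) = \theta x$ with a mean-squared-error loss, so that the gradient and the Hessian are available in closed form and the "gradient through a gradient" reduces to the factor $1 - \alpha \cdot \mbox{Hessian}$. The task distribution (random slopes), the step sizes, and all function names are illustrative assumptions.

```python
import random

def loss_grad_hess(theta, xs, ys):
    """MSE loss of f_theta(x) = theta * x, plus its first and second derivative in theta."""
    n = len(xs)
    loss = sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / n
    hess = sum(2 * x * x for x in xs) / n
    return loss, grad, hess

def maml_meta_gradient(theta, task, alpha):
    """d/d theta of L_test(theta'), where theta' = theta - alpha * grad L_train(theta)."""
    (xs_tr, ys_tr), (xs_te, ys_te) = task
    _, g_tr, h_tr = loss_grad_hess(theta, xs_tr, ys_tr)
    theta_prime = theta - alpha * g_tr              # inner (task-level) update
    _, g_te, _ = loss_grad_hess(theta_prime, xs_te, ys_te)
    # Chain rule: d theta' / d theta = 1 - alpha * Hessian of the training loss
    return g_te * (1.0 - alpha * h_tr)

def sample_task(rng, k=5):
    """Hypothetical task distribution p(T): regress y = slope * x from k points."""
    slope = rng.uniform(-2.0, 2.0)
    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return xs, [slope * x for x in xs]
    return draw(), draw()                           # (training split, test split)

def meta_train(steps=200, batch=8, alpha=0.1, beta=0.05, seed=0):
    rng = random.Random(seed)
    theta = 3.0                                     # arbitrary initial parameter
    for _ in range(steps):
        tasks = [sample_task(rng) for _ in range(batch)]
        meta_grad = sum(maml_meta_gradient(theta, t, alpha) for t in tasks)
        theta -= beta * meta_grad / batch           # averaged rather than summed over the batch
    return theta
```

For a deep network the Hessian is never formed explicitly; automatic differentiation computes the corresponding Hessian-vector product during the extra backward pass.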

='''Different Types of MAML'''=
In this section the MAML algorithm is discussed for different supervised learning and reinforcement learning tasks. These tasks differ in their loss functions and in the way the data is generated. In general, this method requires neither additional model parameters nor an additional meta-learner to learn the parameter updates. Compared to other approaches that tend to “learn to compare new examples in a learned metric space using e.g. Siamese networks or recurrence with attention mechanisms”, the proposed method can be generalized to other problems, including classification, regression and reinforcement learning.

=== Supervised Regression and Classification ===
Few-shot learning is well-studied in this field. For these two types of tasks the horizon $H$ is equal to 1, since the data points are generated i.i.d.

Although any common classification and regression objectives can be used as the loss, the paper uses the following losses for these two tasks.

Regression : For regression we use the mean-square error (MSE):

$$
\mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \parallel f_{\theta} (\mathbf{x}^{(j)}) - \mathbf{y}^{(j)}\parallel_2^2
$$

where $\mathbf{x}^{(j)}$ and $\mathbf{y}^{(j)}$ are an input/output pair sampled from task $\mathcal{T}_i$. In K-shot regression tasks, K input/output pairs are provided for learning for each task.

Classification: For classification we use the cross entropy loss:

$$
\mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = - \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \Big[ \mathbf{y}^{(j)} \log f_{\theta}(\mathbf{x}^{(j)}) + (1-\mathbf{y}^{(j)}) \log (1-f_{\theta}(\mathbf{x}^{(j)})) \Big]
$$

According to the conventional terminology, K-shot classification tasks use K input/output pairs from each class, for a total of $NK$ data points for N-way classification.

Given a distribution over tasks, these loss functions can be directly inserted into the equations in the previous section to perform meta-learning, as detailed in Algorithm 2.
[[File:ershad_alg2.png|500px|center|thumb]]
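As a small illustration of the two task losses, the sketch below computes them for scalar outputs. It is only an assumption-laden example: the outputs are taken to be scalars (so the squared norm is a squared difference), the cross-entropy is written with a leading minus sign so that lower is better, and the clamping constant <code>eps</code> and the function names are not from the paper.

```python
import math

def mse_loss(predictions, targets):
    """Sum of squared errors over the sampled (x, y) pairs of one task."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets))

def cross_entropy_loss(probabilities, labels, eps=1e-12):
    """Binary cross-entropy; probabilities are f_theta(x) in (0, 1), labels are 0 or 1."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0); eps is an assumption
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Either function can play the role of $\mathcal{L}_{\mathcal{T}_i}$ in both the inner update and the meta-update.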

=== Reinforcement Learning ===
In reinforcement learning (RL), the goal of few-shot meta learning is to enable an agent to quickly acquire a policy for a new test task using only a small amount of experience in the test setting. A new task might involve achieving a new goal or succeeding on a previously trained goal in a new environment. For example an agent may learn how to navigate mazes very quickly so that, when faced with a new maze, it can determine how to reliably reach the exit with only a few samples.

Each RL task $\mathcal{T}_i$ contains an initial state distribution $q_i(\mathbf{x}_1)$ and a transition distribution $q_i(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$, and the loss $\mathcal{L}_{\mathcal{T}_i}$ corresponds to the (negative) reward function $R$. The entire task is therefore a Markov decision process (MDP) with horizon $H$, where the learner is allowed to query a limited number of sample trajectories for few-shot learning. Any aspect of the MDP may change across tasks in $p(\mathcal{T})$. The model being learned, $f_{\theta}$, is a policy that maps from states $\mathbf{x}_t$ to a distribution over actions $a_t$ at each timestep $t \in \{1,...,H\}$. The loss for task $\mathcal{T}_i$ and model $f_{\theta}$ takes the form

$$
\mathcal{L}_{\mathcal{T}_i}(f_{\theta}) = -\mathbb{E}_{\mathbf{x}_t,a_t \sim f_{\theta},q_{\mathcal{T}_i}} \big [\sum_{t=1}^H R_i(\mathbf{x}_t,a_t)\big ]
$$

In K-shot reinforcement learning, K rollouts from $f_{\theta}$ and task $\mathcal{T}_i$, $(\mathbf{x}_1,a_1,...,\mathbf{x}_H)$, and the corresponding rewards $ R(\mathbf{x}_t,a_t)$, may be used for adaptation on a new task $\mathcal{T}_i$.

Since the expected reward is generally not differentiable due to unknown dynamics, we use policy gradient methods to estimate the gradient both for the model gradient update(s) and the meta-optimization. Since policy gradients are an on-policy algorithm, each additional gradient step during the adaptation of $f_{\theta}$ requires new samples from the current policy $f_{\theta_i'}$. The algorithm is detailed in Algorithm 3.
[[File:ershad_alg3.png|500px|center|thumb]]

='''Experiments'''=

=== Regression ===
We start with a simple regression problem that illustrates the basic principles of MAML. Each task involves regressing from the input to the output of a sine wave, where the amplitude and phase of the sinusoid are varied between tasks. Thus, $p(\mathcal{T})$ is continuous, and the input and output both have a dimensionality of 1. During training and testing, datapoints are sampled uniformly. The loss is the mean-squared error between the prediction and true value. The regressor is a neural network model with 2 hidden layers of size 40 with ReLU nonlinearities. When training with MAML, we use one gradient update with $K = 10$ examples with a fixed step size of 0.01, and use Adam as the meta-optimizer [2]. The baselines are likewise trained with Adam. To evaluate performance, we fine-tune a single meta-learned model on varying numbers of K examples, and compare performance to two baselines: (a) pretraining on all of the tasks, which entails training a network to regress to random sinusoid functions and then, at test time, fine-tuning with gradient descent on the K provided points, using an automatically tuned step size, and (b) an oracle which receives the true amplitude and phase as input.

We evaluate performance by fine-tuning the model learned by MAML and the pretrained model on $K \in \{ 5,10,20 \}$ datapoints. During fine-tuning, each gradient step is computed using the same $K$ datapoints. Results are shown in Fig. 2.

[[File:ershad_results1.png|500px|center|thumb|Figure 2: Few-shot adaptation for the simple regression task. Left: Note that MAML is able to estimate parts of the curve where there are no datapoints, indicating that the model has learned about the periodic structure of sine waves. Right: Fine-tuning of a model pretrained on the same distribution of tasks without MAML, with a tuned step size. Due to the often contradictory outputs on the pre-training tasks, this model is unable to recover a suitable representation and fails to extrapolate from the small number of test-time samples.]]

=== Classification ===

For classification evaluation, Omniglot and MiniImagenet datasets are used. The Omniglot dataset consists of 20 instances of 1623 characters from 50 different alphabets.

The experiment involves fast learning of N-way classification with 1 or 5 shots. The problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model’s ability to classify new instances within the N classes. For Omniglot, 1200 characters are selected randomly for training, irrespective of alphabet, and the remaining characters are used for testing. The Omniglot dataset is augmented with rotations by multiples of 90 degrees.
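The N-way, K-shot set-up described above can be sketched as an episode sampler. This is a hedged illustration only: the data layout (a dictionary from class name to a list of instances), the number of held-out query instances per class, and the function name are assumptions, not details from the paper.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task: a support set for adaptation and a query set for evaluation."""
    classes = rng.sample(sorted(data_by_class), n_way)       # pick N classes
    support, query = [], []
    for label, cls in enumerate(classes):                    # relabel classes as 0..N-1
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]      # K instances per class
        query += [(x, label) for x in items[k_shot:]]        # new instances of the same classes
    return support, query
```

With <code>n_way=5</code> and <code>k_shot=1</code>, the support set contains $NK = 5$ labelled examples, and the query set measures how well the adapted model classifies new instances of the same 5 classes.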

The model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules, each with 3-by-3 convolutions and 64 filters, followed by batch normalization, a ReLU nonlinearity, and 2-by-2 max-pooling. The Omniglot images are downsampled to 28-by-28, so the dimensionality of the last hidden layer is 64. The last layer is fed into a softmax. For Omniglot, strided convolutions are used instead of max-pooling. For MiniImagenet, 32 filters per layer are used to reduce overfitting. In order to also provide a fair comparison against memory-augmented neural networks [3] and to test the flexibility of MAML, the results for a non-convolutional network are also provided.

[[File:ershad_results2.png|500px|center|thumb|Table 1: Few-shot classification on held-out Omniglot characters (top) and the MiniImagenet test set (bottom). MAML achieves results that are comparable to or outperform state-of-the-art convolutional and recurrent models. Siamese nets, matching nets, and the memory module approaches are all specific to classification, and are not directly applicable to regression or RL scenarios. The $\pm$ shows 95% confidence intervals over tasks. ]]

=== Reinforcement Learning ===
Several simulated continuous control environments are used for RL evaluation. In all of the domains, the MAML model is a neural network policy with two hidden layers of size 100 and ReLU activations. The gradient updates are computed using vanilla policy gradient, and trust-region policy optimization (TRPO) is used as the meta-optimizer.

In order to avoid computing third derivatives, finite differences are used to compute the Hessian-vector products for TRPO. For both learning and meta-learning updates, we use the standard linear feature baseline proposed by [4], which is fitted separately at each iteration for each sampled task in the batch.

Three baseline models for the comparison are:
* (a) pretraining one policy on all of the tasks and then fine-tuning;
* (b) training a policy from randomly initialized weights; and
* (c) an oracle policy which receives the parameters of the task as input, which for the tasks below corresponds to a goal position, goal direction, or goal velocity for the agent.

The baseline models of (a) and (b) are fine-tuned with gradient descent with a manually tuned step size.

2D Navigation: In the first meta-RL experiment, the authors study a set of tasks where a point agent must move to different goal positions in 2D, randomly chosen for each task within a unit square. The observation is the current 2D position, and actions correspond to velocity commands clipped to be in the range $[-0.1, 0.1]$. The reward is the negative squared distance to the goal, and episodes terminate when the agent is within $0.01$ of the goal or at the horizon of $H = 100$. The policy was trained with MAML to maximize performance after 1 policy gradient update using 20 trajectories. They compare adaptation to a new task with up to 4 gradient updates, each with 40 samples. Results are shown in Fig. 3.

[[File:ershad_results3.png|500px|center|thumb|Figure 3: Top: quantitative results from 2D navigation task, Bottom: qualitative comparison between model learned with MAML and with fine-tuning from a pretrained network ]]

Locomotion: To study how well MAML can scale to more complex deep RL problems, we also study adaptation on high-dimensional locomotion tasks with the MuJoCo simulator [5]. The tasks require two simulated robots, a planar cheetah and a 3D quadruped (the “ant”), to run in a particular direction or at a particular velocity. In the goal velocity experiments, the reward is the negative absolute difference between the current velocity of the agent and a goal, which is chosen uniformly at random between 0 and 2 for the cheetah and between 0 and 3 for the ant. In the goal direction experiments, the reward is the magnitude of the velocity in either the forward or backward direction, chosen at random for each task in $p(\mathcal{T})$. The horizon is $H = 200$, with 20 rollouts per gradient step for all problems except the ant forward/backward task, which used 40 rollouts per step. The results in Figure 4 show that MAML learns a model that can quickly adapt its velocity and direction with even just a single gradient update, and continues to improve with more gradient steps. The results also show that, on these challenging tasks, the MAML initialization substantially outperforms random initialization and pretraining.
[[File:ershad_results4.png|500px|center|thumb|Figure 4: Reinforcement learning results for the half-cheetah and ant locomotion tasks, with the tasks shown on the far right. ]]

='''Conclusion'''=

The paper introduced a meta-learning method based on learning easily adaptable model parameters through gradient descent. The approach has a number of benefits. It is simple and does not introduce any learned parameters for meta-learning. It can be combined with any model representation that is amenable to gradient-based training, and any differentiable objective, including classification, regression, and reinforcement learning. Lastly, since our method merely produces a weight initialization, adaptation can be performed with any amount of data and any number of gradient steps, though it demonstrates state-of-the-art results on classification with only one or five examples per class. The authors also show that the method can adapt an RL agent using policy gradients and a very modest amount of experience.

='''References'''=
# Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.
# Lake, Brenden M, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua B. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.
# Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
# Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.
# Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.

Implementation Example: https://github.com/cbfinn/maml

Deep Exploration via Bootstrapped DQN

2017-11-20T02:45:53Z

A2prasad: /* Critique */

== Details ==

'''Title''': Deep Exploration via Bootstrapped DQN

'''Authors''': Ian Osband {1,2}, Charles Blundell {2}, Alexander Pritzel {2}, Benjamin Van Roy {1}

'''Organisations''':
# Stanford University
# Google DeepMind

'''Conference''': NIPS 2016

'''URL''': [https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn papers.nips.cc]

'''Online code sources'''
* [https://github.com/iassael/torch-bootstrapped-dqn github.com/iassael/torch-bootstrapped-dqn]

This summary contains background knowledge from Section 2-7 (except Section 5). Feel free to skip if you already know.

== Intro to Reinforcement Learning ==

In reinforcement learning, an agent interacts with an environment with the goal to maximize its long term reward. A common application of reinforcement learning is to the [https://en.wikipedia.org/wiki/Multi-armed_bandit multi armed bandit problem]. In a multi armed bandit problem, there is a gambler and there are $n$ slot machines, and the gambler can choose to play any specific slot machine at any time. All the slot machines have their own probability distributions by which they churn out rewards, but this is unknown to the gambler. So the question is, how can the gambler learn how to get the maximum long term reward?

There are two things the gambler can do at any instance: either he can try a new slot machine, or he can play the slot machine he has tried before (and he knows he will get some reward). However, even though trying a new slot machine feels like it would bring less reward to the gambler, it is possible that the gambler finds out a new slot machine that gives a better reward than the current best slot machine. This is the dilemma of '''exploration vs exploitation'''. Trying out a new slot machine is '''exploration''', while redoing the best move so far is '''exploiting''' the currently understood perception of the reward.

[[File:multiarmedbandit.jpg|thumb|Source: [https://blogs.mathworks.com/images/loren/2016/multiarmedbandit.jpg blogs.mathworks.com]]]

There are many strategies to approach this '''exploration-exploitation dilemma'''. Some [https://web.stanford.edu/class/msande338/lec9.pdf common strategies] for optimizing in an exploration-exploitation setting are Random Walk, Curiosity-Driven Exploration, and Thompson Sampling. A lot of these approaches are provably efficient, but assume that the state space is not very large. For instance, the approach called Curiosity-Driven Exploration aims to take actions that lead to immediate additional information. This requires the model to search “every possible cell in the grid” which is not desirable if state space is very large. Strategies for large state spaces often just either ignore exploration, or do something naive like $\epsilon$-greedy, where you exploit with $1-\epsilon$ probability and explore "randomly" in rest of the cases.
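For reference, the $\epsilon$-greedy rule mentioned above fits in a few lines. This is a minimal sketch under the assumption that the agent keeps one estimated value per action in a list; the function and argument names are illustrative.

```python
import random

def epsilon_greedy(value_estimates, epsilon, rng=random):
    """Explore uniformly at random with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(value_estimates))                        # explore
    return max(range(len(value_estimates)), key=value_estimates.__getitem__)  # exploit
```

Note that the exploration step ignores everything the agent has learned so far, which is why this strategy is considered naive for large state spaces.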

This paper tries to use a Thompson sampling like approach to make decisions.

== Thompson Sampling[[#References|[1]]] ==

In Thompson sampling, our goal is to reach a belief that resembles the truth. Let's consider a case of coin tosses (2-armed bandit). Suppose we want to be able to reach a satisfactory pdf for $\mathbb{P}_h$ (heads). Assuming that this is a Bernoulli bandit problem, i.e. the rewards are $0$ or $1$, we can start off with $\mathbb{P}_h^{(0)}=\beta(1,1)$. The $\beta(x,y)$ distribution is a very good choice for a possible pdf because it is the conjugate prior for Bernoulli rewards. Further, $\beta(1,1)$ is the uniform distribution $\mathrm{U}(0,1)$.

Now, at every iteration $t$, we observe the reward $R^{(t)}$ and try to make our belief close to the truth by doing a Bayesian computation. Assuming $p$ is the probability of getting a heads,

$$
\begin{align*}
\mathbb{P}(R|D) &\propto \mathbb{P}(D|R) \cdot \mathbb{P}(R) \\
\mathbb{P}_h^{(t+1)}&\propto \mbox{likelihood}\cdot\mbox{prior} \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \mathbb{P}_h^{(t)} \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \beta(x_t, y_t) \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot p^{x_t-1}(1-p)^{y_t-1} \\
&\propto p^{x_t+R^{(t)}-1}(1-p)^{y_t+(1-R^{(t)})-1} \\
&\propto \beta(x_t+R^{(t)}, y_t+1-R^{(t)})
\end{align*}
$$

[[File:thompson sampling coin example.png|thumb||||600px|Source: [https://www.quora.com/What-is-Thompson-sampling-in-laymans-terms Quora]]]

This means that with successive sampling, our belief can become better at approximating the truth. There are similar update rules if we use a non Bernoulli setting, say, Gaussian. In the Gaussian case, we start with $\mathbb{P}_h^{(0)}=\mathbb{N}(0,1)$ and given that $\mathbb{P}_h^{(t)}\propto\mathbb{N}(\mu, \sigma)$ it is possible to show that the update rule looks like

$$
\mathbb{P}_h^{(t+1)} \propto \mathbb{N}\bigg(\frac{t\mu+R^{(t)}}{t+1},\frac{\sigma}{\sigma+1}\bigg)
$$

=== How can we use this in reinforcement learning? ===

We can use this idea to decide when to explore and when to exploit. We start with an initial belief, choose an action, observe the reward and based on the kind of reward, we update our belief about what action to choose next.
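A minimal sketch of this loop for a Bernoulli bandit is given below. It assumes the standard conjugate update for a Bernoulli likelihood, in which an observed reward $R \in \{0,1\}$ adds $R$ to the first Beta parameter and $1-R$ to the second; the arm probabilities, the number of steps, and the function name are illustrative assumptions.

```python
import random

def thompson_bernoulli(true_probs, steps, seed=0):
    """Beta-Bernoulli Thompson sampling; returns the final Beta parameters [x, y] of each arm."""
    rng = random.Random(seed)
    params = [[1, 1] for _ in true_probs]          # Beta(1, 1) prior belief for every arm
    for _ in range(steps):
        # Draw one plausible success probability per arm from the current beliefs
        samples = [rng.betavariate(x, y) for x, y in params]
        arm = max(range(len(samples)), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        params[arm][0] += reward                   # x <- x + R
        params[arm][1] += 1 - reward               # y <- y + (1 - R)
    return params
```

Arms whose belief is still wide are occasionally sampled high and therefore explored, while arms that are confidently bad are rarely chosen, so exploration and exploitation are balanced without a separate exploration parameter.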

== Bootstrapping [[#References|[2,3]]] ==

This idea may be unfamiliar to some people, so I thought it would be a good idea to include this. In statistics, bootstrapping is a method to generate new samples from a given sample. Suppose that we have a given population, and we want to study a measure $\theta$. So, we just find $n$ sample points (sample $\{D_i\}_{i=1}^n$), calculate this measure $\hat{\theta}$ for these $n$ points, and make our inference.

If we later wish to find a better bound on $\hat{\theta}$, i.e. suppose we want to say that $\delta_1 \leq \hat{\theta} \leq \delta_2$ with a confidence of $c$, then we can use bootstrapping for this.

Using bootstrapping, we can create a new sample $\{D'_i\}_{i=1}^{n'}$ by '''randomly sampling $n'$ times from $D$, with replacement'''. So, if $D=\{1,2,3,4\}$, a $D'$ of size $n'=10$ could be $\{1,4,4,3,2,2,2,1,3,4\}$. We do this a sufficiently large number of times $k$, calculate $\hat{\theta}$ each time, and thus get a distribution $\{\hat{\theta}_i\}_{i=1}^k$. Now, we can choose the $100\cdot\frac{1-c}{2}$th and $100\cdot\frac{1+c}{2}$th percentiles of this distribution (let them be $\hat{\theta}_\alpha$ and $\hat{\theta}_\beta$ respectively) and say

$$\hat{\theta}_\alpha \leq \hat{\theta} \leq \hat{\theta}_\beta, \mbox{with confidence }c$$
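The procedure can be sketched as follows. This is an illustrative example rather than a reference implementation: it resamples with $n'=n$, uses the central percentiles $(1-c)/2$ and $(1+c)/2$ (one common convention for a confidence level $c$), and picks percentiles by simple index truncation; the default number of resamples and the function name are assumptions.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, k=2000, c=0.95, seed=0):
    """Percentile bootstrap interval for stat(data) at confidence level c."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n points with replacement k times and recompute the statistic each time
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(k))
    lo_idx = int(((1 - c) / 2) * (k - 1))
    hi_idx = int(((1 + c) / 2) * (k - 1))
    return estimates[lo_idx], estimates[hi_idx]
```

Calling <code>bootstrap_ci(sample)</code> returns a pair of bounds for the sample mean; passing <code>stat=statistics.median</code> gives the analogous interval for the median.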

== Why choose bootstrap and not dropout? ==

There is previous work[[#References|[4]]] that establishes dropout as a good way to train NNs on a posterior such that the trained NN works like a function approximator that is close to the actual posterior. But, there are several problems with the predictions of this trained NN. The figures below are from the appendix of this paper. The left image is the NN trained by the authors of this paper on a sample noisy distribution and the right image is from the accompanying web demo from [[#References|[4]]], where the authors of [[#References|[4]]] show that their NN converges around the mean with a good confidence.

[[File:dropout_results.png|thumb||center||700px|Source: this paper's appendix]]

According to the authors of this paper,
# Even though [[#References|[4]]] says that dropout converges around the mean, their experiment actually behaves weirdly around a reasonable point like $x=0.75$. They think that this happens because dropout only affects the region local to the original data.
# Samples from the NN trained on the original data do not look like a reasonable posterior (very spiky).
# The trained NN collapses to zero uncertainty at the data points from the original data.

== Q Learning and Deep Q Networks [[#References|[5]]] ==

At any point of time, our rewards dictate what our actions should be. Also, in general, we want good long term rewards. For example, if we are playing a first person shooter game, it is a good idea to go out of cover to kill an enemy, even if some health is lost. Similarly, in reinforcement learning, we want to maximize our long term reward. So if at each time $t$, the reward is $r_t$, then a naive way is to say we want to maximise

$$
R_t = \sum_{i=0}^{\infty}r_{t+i}
$$

However, this sum is unbounded and can diverge to $\infty$ in many cases. This is why we use a '''discounted reward'''.

$$
R_t = \sum_{i=0}^{\infty}\gamma^i r_{t+i}
$$

Here, we take $0\leq \gamma \lt 1$. This means that we value our current reward the most ($r_t$ has a coefficient of $\gamma^0=1$), but we also consider the future possible rewards. So if we had two choices: get $+4$ now and $0$ at all other timesteps, or get $-2$ now and then, starting $3$ timesteps later, $+2$ at each of $20$ consecutive timesteps, we choose the latter (with $\gamma=0.9$). This is because $(+4) < (-2)+0.9^3(2+0.9\cdot2+\cdots+0.9^{19}\cdot2)$.
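The comparison between the two choices can be checked numerically. The sketch below computes a discounted return for a finite list of rewards, where the first entry is the reward received now; the function name and the encoding of the two options as lists are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], where rewards[0] is the reward received now."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total   # Horner-style accumulation from the last reward backwards
    return total

option_a = [4.0]                                # +4 now, 0 at all other timesteps
option_b = [-2.0, 0.0, 0.0] + [2.0] * 20        # -2 now, then +2 for 20 timesteps starting 3 steps later
```

With $\gamma=0.9$, <code>discounted_return(option_b, 0.9)</code> exceeds <code>discounted_return(option_a, 0.9)</code>, which matches the inequality above.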

A '''policy''' $\pi: \mathbb{S} \rightarrow \mathbb{A}$ is just a function that tells us what action to take in a given state $s\in \mathbb{S}$. Our goal is to find the best policy $\pi^*$ that maximises the reward from a given state $s$. The '''value function''' of a policy $\pi$ at state $s$ (the state the agent is in at timestep $t$) is the expected return obtained by following $\pi$ from $s$, i.e. $V^\pi(s) = \mathbb{E}[R_t]$. The optimal value function is then simply

$$
V^*(s)=\displaystyle\max_{\pi}V^\pi(s)
$$

For convenience however, it is better to work with the '''Q function''' $Q: \mathbb{S}\times\mathbb{A} \rightarrow \mathbb{R}$. $Q$ is defined similarly as $V$. It is the expected return after taking an action $a$ in the given state $s$. So, $Q^\pi(s,a)=\mathbb{E}[R_t|s,a]$. The optimal $Q$ function is

$$
Q^*(s,a)=\displaystyle\max_{\pi}Q^\pi(s,a)
$$

Suppose that we know $Q^*$. Then, if we know that we are supposed to start at $s$ and take an action $a$ right now, what is the best course of action from the next time step? We just choose the optimal action $a'$ at the next state $s'$ that we reach. The optimal action $a'$ at state $s'$ is simply the argument $a_x$ that maximises our $Q^*(s',\cdot)$.

$$
a'=\displaystyle\arg\max_{a_x} Q^*(s',a_x)
$$

So, our best expected reward from $s$ taking action $a$ is $\mathbb{E}[r_t+\gamma\mathbb{E}[R_{t+1}]]$. This is known as the '''Bellman equation''':

$$
Q^*(s,a)=\mathbb{E}[r_t+\gamma \displaystyle\max_{a_x} Q^*(s',a_x)]
$$
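For a small, fully known MDP, the Bellman equation can be applied repeatedly as an update rule until the $Q$ values stop changing. The following is a minimal sketch, assuming a hypothetical two-state deterministic MDP (so the expectation drops out); the state names, rewards and function name are illustrative and are not taken from the paper.

```python
def q_value_iteration(transitions, gamma, sweeps):
    """Repeatedly apply Q(s,a) <- r + gamma * max_a' Q(s',a').

    transitions[s][a] = (reward, next_state) describes a deterministic MDP.
    """
    q = {s: {a: 0.0 for a in actions} for s, actions in transitions.items()}
    for _ in range(sweeps):
        new_q = {}
        for s, actions in transitions.items():
            new_q[s] = {}
            for a, (reward, next_state) in actions.items():
                new_q[s][a] = reward + gamma * max(q[next_state].values())
        q = new_q
    return q


# Hypothetical toy MDP: staying in B pays 2 per step, moving from A to B pays 1.
toy_mdp = {
    "A": {"stay": (0.0, "A"), "go": (1.0, "B")},
    "B": {"stay": (2.0, "B"), "back": (0.0, "A")},
}
q_star = q_value_iteration(toy_mdp, gamma=0.5, sweeps=60)
print(q_star)
```

In this toy problem the fixed point can be checked by hand: staying in B forever is worth $2/(1-0.5)=4$, so going from A to B is worth $1+0.5\cdot4=3$.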

In deep Q learning, we use a deep neural network with weights $\theta$ as a function approximator for $Q^*$. The '''naive way''' to do this is to design a deep neural net that takes the state $s$ and action $a$ as input, and produces an approximation to $Q^*(s,a)$.

* Suppose our neural net weights are $\theta_i$ at iteration $i$.
* We want to train our neural net on the case when we are at $s$, take action $a$, get reward $r$, and reach $s'$.
* To find out what action is best from $s'$, i.e. $a'$, we have to simulate all actions from $s'$. We can do this after we complete this iteration, then run $s',a_x$ for all $a_x\in\mathbb{A}$. But, we don't know how to complete this iteration without knowing this $a'$. So, another way is to simulate all actions from $s'$ using last known set of weights $\theta_{i-1}$. We just simulate state $s'$, action $a_x$ for all $a_x\in\mathbb{A}$ from the previous state and get $Q^*(s',a_x;\theta_{i-1})$. ('''Note''' that some papers do not use the set of weights from the previous iteration $\theta_{i-1}$. Instead they fix the weights for finding the best action for every $\tau$ steps to $\theta^-$, and do $Q^*(s',a_x;\theta^-)$ for $a_x\in\mathbb{A}$ and use this for the target value.)
* Now we can compute our loss function using the Bellman equation, and backpropagate.
$$
\mbox{loss}=(\mbox{target}-\mbox{prediction})^2=\big(r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})-Q^*(s,a;\theta_i)\big)^2
$$

The '''problem''' with this approach is that at every iteration $i$, we have to do $|\mathbb{A}|$ forward passes on the previous set of weights $\theta_{i-1}$ to find out the best action $a'$ at $s'$. This becomes infeasible quickly with more possible actions.

The authors of [[#References|[5]]] therefore use another kind of architecture. This architecture takes the state $s$ as input and computes the values $Q^*(s,a_x)$ for all $a_x\in\mathbb{A}$ at once, so there are $|\mathbb{A}|$ outputs. This parallelizes the forward passes, so that $r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})$ can be computed with a single forward pass.
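The target and loss computation with such a multi-output network can be sketched as follows. This is a minimal illustration, assuming a toy linear "network" with one weight row per action in place of a real deep net, and a squared error between target and prediction; all function names are hypothetical.

```python
def q_values(state, weights):
    """Toy linear Q-network: one forward pass returns one value per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def td_target(reward, next_state, target_weights, gamma):
    """r + gamma * max_a Q(s', a), using the older (target) weights."""
    return reward + gamma * max(q_values(next_state, target_weights))


def td_loss(state, action, reward, next_state, weights, target_weights, gamma):
    """Squared difference between the target and the current prediction Q(s, a)."""
    prediction = q_values(state, weights)[action]
    target = td_target(reward, next_state, target_weights, gamma)
    return (target - prediction) ** 2


# Two actions, two state features; identity weights keep the arithmetic easy to follow.
weights = [[1.0, 0.0], [0.0, 1.0]]
loss = td_loss(
    state=[1.0, 2.0], action=0, reward=1.0, next_state=[3.0, 0.5],
    weights=weights, target_weights=weights, gamma=0.5,
)
print(loss)
```

Here a single call to <code>q_values</code> on $s'$ yields all $|\mathbb{A}|$ values, so the maximum needs no extra forward passes.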

[[File:DQN_arch.png|thumb||||600px|Source: [https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/DQNBreakoutBlocks.png leonardoaraujosantos.gitbooks.io]]]

'''Note:''' When I say state $s$ as an input, I mean some representation of $s$. Since the environment is a partially observable MDP, it is hard to know $s$. So, we can for example, apply a convnet on the frames and get an idea of what the current state is. We pass this output to the input of the DNN (DNN is the fully connected layer for the convnet then).

=== Experience Replay ===

Authors of this paper borrow the concept of experience replay from [[#References|[5,6]]]. In experience replay, we do training in episodes. In each episode, we play and store consecutive $(s,a,r,s')$ tuples in the experience replay buffer. Then after the play, we choose random samples from this buffer and do our training.

Advantages of experience replay over simple online Q learning[[#References|[5]]]:
* '''Better data efficiency''': It is better to use one transition many times to learn again and again, rather than just learn once from it.
* Learning from consecutive samples is difficult because of correlated data. Experience replay breaks this correlation.
* Online learning means the next input is decided by the previous action. So, if the maximising action is to go left in some game, the next inputs would be about what happens when we go left. This can cause the optimiser to get stuck in a feedback loop, or even diverge, as [[#References|[7]]] points out.
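A replay buffer of this kind can be sketched in a few lines. This is a minimal illustration, not the implementation used in [[#References|[5]]]; the fixed capacity with oldest-first eviction and the class name <code>ReplayBuffer</code> are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores (s, a, r, s') tuples and returns uniformly random minibatches."""

    def __init__(self, capacity):
        # Assumption: once full, the oldest transitions are dropped first.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks the correlation
        # between consecutive transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step, 0, 1.0, step + 1)
batch = buffer.sample(2, rng=random.Random(0))
print(len(buffer), batch)
```

Because each stored transition can be drawn many times, the same experience contributes to several updates.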

== Double Q Learning ==

=== Problem with Q Learning[[#References|[8]]] ===

For a simple neural network, each update tries to shift the current $Q^*$ estimate to a new value:

$$
Q^*(s,a) \leftarrow r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x)
$$

Suppose the neural net has some inherent noise $\epsilon$. So, the neural net actually stores a value $\mathbb{Q}^*$ given by

$$
\mathbb{Q}^* = Q^*+\epsilon
$$

Even if $\epsilon$ has zero mean in the beginning, using the $\max$ operator at the update steps will start propagating $\gamma\cdot\max \mathbb{Q}^*$. This leads to a non zero mean subsequently. The problem is that "max causes overestimation because it does not preserve the zero-mean property of the errors of its operands." ([[#References|[8]]]) Thus, Q learning is more likely to choose overoptimistic values.
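The effect can be reproduced in a small simulation. This is a minimal sketch, assuming every true action value is $0$ and the noise is Gaussian; both are illustrative assumptions and are not taken from [[#References|[8]]].

```python
import random


def mean_of_max(num_actions, noise_std, trials, seed=0):
    """Average of max_a (Q(a) + eps_a) when every true Q(a) is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(noisy_q)
    return total / trials


single = mean_of_max(num_actions=1, noise_std=1.0, trials=10000)
many = mean_of_max(num_actions=10, noise_std=1.0, trials=10000)
print(single, many)
```

With one action the average stays near zero, while with ten actions the average of the maximum is clearly positive, even though each individual error has zero mean.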

=== How does Double Q Learning work? [[#References|[9]]] ===

The problem can be solved by using two sets of weights $\theta$ and $\Theta$. The $\mbox{target}$ can be broken up as

$$
\mbox{target} = r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta) = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta) = r+\gamma Q^*(s',a';\theta)
$$

Using double Q learning, we '''select''' the best action using current weights $\theta$ and '''evaluate''' the $Q^*$ value to decide the target value using $\Theta$.

$$
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\Theta) = r+\gamma Q^*(s',a';\Theta)
$$

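Because the action is selected with one set of weights and evaluated with another, an action whose value happens to be overestimated under $\theta$ is not automatically assigned an overestimated value under $\Theta$. This reduces the overestimation bias.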

=== Double Deep Q Learning ===

[[#References|[9]]] further talks about how to use this for deep learning without much additional overhead. The suggestion is to use $\theta^-$ as $\Theta$.

$$
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta^-) = r+\gamma Q^*(s',a';\theta^-)
$$
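The difference between the two targets is easiest to see side by side. This is a minimal sketch, assuming the $Q$ values at $s'$ under $\theta$ and $\theta^-$ have already been computed and are passed in as plain lists; the function names are illustrative.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, q_next_target, gamma):
    """Standard target: the same weights select and evaluate the action."""
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, q_next_online, q_next_target, gamma):
    """Double DQN target: select with theta, evaluate with theta^-."""
    best_action = argmax(q_next_online)
    return reward + gamma * q_next_target[best_action]


q_online = [1.0, 3.0, 2.0]   # Q(s', . ; theta)
q_target = [5.0, 0.5, 4.0]   # Q(s', . ; theta^-)
standard = dqn_target(1.0, q_target, gamma=0.5)
double = double_dqn_target(1.0, q_online, q_target, gamma=0.5)
print(standard, double)
```

In this example the standard target uses the largest value under $\theta^-$, while the double target uses the value of the action preferred under $\theta$, which is lower here.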

== Bootstrapped DQN ==

The authors propose an architecture that has a shared network and $K$ bootstrap heads. So, suppose our experience buffer $E$ has $n$ datapoints, where each datapoint is a $(s,a,r,s')$ tuple. Each bootstrap head trains on a different buffer $E_i$, where each $E_i$ has been constructed by sampling $n$ datapoints from the original experience buffer $E$ with replacement ('''bootstrap method''').

Because each head trains on a different buffer, each models a different $Q^*$ function (say $Q^*_k$). Now, for each episode, we first choose a specific $Q^*_k=Q^*_s$. This $Q^*_s$ helps us create the experience buffer for the episode: from any state $s_t$, we populate the experience buffer by choosing the next action $a_t$ that maximises $Q^*_s$ (similar to '''Thompson Sampling''').

$$
a_t = \displaystyle\arg\max_a Q^*_s(s_t,a)
$$
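The per-episode use of a single head can be sketched as follows. This is a minimal illustration, assuming the head is drawn uniformly at random and that each head is a plain function mapping a state to a list of $Q$ values; the environment function <code>step</code> and all names are hypothetical.

```python
import random


def greedy_action(q_values):
    """Index of the action with the largest Q value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_episode(head_q_functions, initial_state, step, horizon, rng):
    """Pick one head for the whole episode and act greedily with respect to it."""
    head = rng.randrange(len(head_q_functions))  # assumption: uniform choice
    state = initial_state
    transitions = []
    for _ in range(horizon):
        action = greedy_action(head_q_functions[head](state))
        next_state, reward = step(state, action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return head, transitions
```

The important point is that the head is fixed for the entire episode rather than resampled at every timestep, so the agent follows one consistent hypothesis about $Q^*$ for many steps in a row.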

Also, along with $s_t,a_t,r_t,s_{t+1}$, they push a bootstrap mask $m_t$. This mask is a binary vector of size $K$ that tells which $Q_k$ should be affected by this datapoint if it is chosen as a training point. For example, if $K=5$ and there is an experience tuple $(s_t,a_t,r_t,s_{t+1},m_t)$ where $m_t=(0,1,1,0,1)$, then $(s_t,a_t,r_t,s_{t+1})$ should only affect $Q_2,Q_3$ and $Q_5$.

So, at each iteration, we just choose few points from this buffer and train the respective $Q_{(\cdot)}$ based on the bootstrap masks.

=== How to generate masks? ===

Masks are created by sampling from the '''masking distribution'''. Now, there are many ways to choose this masking distribution:

* If for each datapoint $D_i$ ($i=1$ to $n$), we mask from $\mbox{Bernoulli}(0.5)$, this will roughly allow us to have half the points from the original buffer. To get to size $n$, we duplicate these points by doubling the weights for each datapoint. This essentially gives us a '''double or nothing''' bootstrap[[#References|[10]]].
* If the mask is $(1, 1 \cdots 1)$, then this becomes an '''ensemble learning''' method.
* $m_t[k]\sim\mbox{Poi}(1)$ (Poisson distribution)
* $m_t[k]\sim\mbox{Exp}(1)$ (exponential distribution)

For this paper's results, the authors used a $\mbox{Bernoulli}(p)$ distribution.
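Sampling a Bernoulli mask and reading off which heads it selects can be sketched as follows. This is a minimal illustration with hypothetical function names; it covers only the Bernoulli case, not the other masking distributions listed above.

```python
import random


def sample_mask(num_heads, p, rng):
    """Bernoulli(p) bootstrap mask: entry k is 1 if head k may train on the transition."""
    return [1 if rng.random() < p else 0 for _ in range(num_heads)]


def heads_to_update(mask):
    """1-based indices of the heads whose mask entry is 1."""
    return [k + 1 for k, m in enumerate(mask) if m == 1]


rng = random.Random(0)
mask = sample_mask(num_heads=5, p=0.5, rng=rng)
print(mask, heads_to_update(mask))
print(heads_to_update([0, 1, 1, 0, 1]))
```

With $p=1$ every entry of the mask is $1$, which corresponds to the ensemble case in which all heads train on every datapoint.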

== Related Work ==

The authors mention the method described in [[#References|[11]]]. The authors of [[#References|[11]]] talk about the principle of "optimism in the face of uncertainty" and modify the reward function to encourage state-action pairs that have not been seen often:

$$
R(s,a) \leftarrow R(s,a)+\beta\cdot\mbox{novelty}(s,a)
$$

According to the authors, [[#References|[11]]]'s DQN algorithm relies on a lot of hand tuning and is only good for non stochastic problems. The authors further compare their results to [[#References|[11]]]'s results on Atari.

The authors also mention an existing algorithm PSRL[[#References|[12,13]]], or posterior sampling based RL. However, this algorithm requires a solved MDP, which is not feasible for large systems. Bootstrapped DQN approximates this idea by sampling from approximate $Q^*$ functions.

Further, the authors mention that the work in [[#References|[12,13]]] has been followed by RLSVI[[#References|[14]]], which solves the problem for linear cases.

== Deep Exploration: Why is Bootstrapped DQN so good at it? ==

The authors consider a simple example to demonstrate the effectiveness of bootstrapped DQN at deep exploration.

[[File:deep_exploration_example.png|thumb||center||700px|Source: this paper, section 5.1]]

In this example, the agent starts at $s_2$. There are $N$ states, and each episode has $N+9$ timesteps to generate the experience buffer. The agent is said to have learned the optimal policy if it achieves the best possible reward of $10$ (go to the rightmost state in $N-1$ timesteps, then stay there for $10$ timesteps), for at least $100$ such episodes. The results they got:

[[File:deep_exploration_results.png|thumb||center||700px|Source: this paper, section 5.1]]

The blue dots indicate when the agent learnt the optimal policy. If this took more than $2000$ episodes, they indicate it with a red dot. Thompson DQN is DQN with posterior sampling at every timestep. Ensemble DQN is the same as bootstrapped DQN, except that the mask is always $(1,1 \cdots 1)$. It is evident from the graphs that bootstrapped DQN achieves deep exploration better than these two methods and plain DQN.

=== But why is it better? ===

The authors say that this is because bootstrapped DQN constructs different approximations to the posterior $Q^*$ with the same initial data. This diversity of approximations comes from the random initialization of weights for the $Q^*_k$ heads. This means that these heads start out trying random actions (because of the diverse random initial $Q^*_k$), but when some head finds a good state and generalises to it, some (but not all) of the heads will learn from it, because of the bootstrapping. Eventually the other heads will either find other good states, or end up learning the best good states found by the other heads.

So, the architecture explores well and once a head achieves the optimal policy, eventually, all heads achieve the policy.

== Results ==

The authors test their architecture on 49 Atari games. They mention that there has been recent work to improve the performance of DDQNs, but those are tweaks whose intentions are orthogonal to this paper's idea. So, they don't compare their results with them.

=== Scale: What values of $K$, $p$ are best? ===

[[File:scale_k_p.png|thumb||center||800px|Source: this paper, section 6.1]]

Recall that $K$ is the number of bootstrap heads and $p$ is the parameter for the masking distribution (Bernoulli). The authors say that the performance is already close to its peak at around $K=10$, so this is a reasonable choice.

$p$ also represents the amount of data sharing between heads. Each datapoint is included in a given head's bootstrapped dataset $D_i$ with probability $p$ (due to the Bernoulli distribution), so a smaller $p$ means a datapoint is seen by fewer heads and less data is shared, while $p=1$ means every head sees every datapoint. However, the value of $p$ doesn't seem to affect the rewards achieved over time. The authors give the following reasons for it:

* The heads start with random weights for $Q^*$, so the targets (which use $Q^*$) turn out to be different. So the update rules are different.
* Atari is deterministic.
* Because of the initial diversity, the heads will learn differently even if they predict the same action for the given state.

$p=1$ is the value they finally use, because every head then trains on every datapoint, which reduces the computation time.

=== Performance on Atari ===

In general, the results tell us that bootstrapped DQN achieves better results.

[[File:atari_results_bootstrapped_dqn.png|thumb||center||800px|Source: this paper, section 6.2]]

The authors plot the improvement they achieved with bootstrapped DQN with the games. They define '''improvement''' to be $x$ if bootstrapped DQN achieves a better result than DQN in $\frac{1}{x}$ frames.

[[File:bdqn_improvement.png|thumb||center||1000px|Source: this paper, section 6.2]]

The authors say that bootstrapped DQN doesn't work well on all Atari games. They point out that there are some challenging games where exploration is key but bootstrapped DQN doesn't do well enough (although it does better than DQN). Some of these games are Frostbite and Montezuma’s Revenge. They say that even better exploration may help, but also point out that there may be other problems, such as network instability, reward clipping and temporally extended rewards.

=== Improvement: Highest Score Reached & how fast is this high score reached? ===

The authors plot the improvement graphs after 20m and 200m frames.

[[File:cumulative_rewards_bdqn.png|thumb||center||700px|Source: this paper, section 6.3]]

=== Visualisation of Results ===

One of the authors' [https://www.youtube.com/playlist?list=PLdy8eRAW78uLDPNo1jRv8jdTx7aup1ujM youtube playlist] can be found online.

The authors also point out that just purely using bootstrapped DQN as an exploitative strategy is pretty good by itself, better than vanilla DQN. This is because of the deep exploration capabilities of bootstrapped DQN, since it can use the best states it knows and also plan to try out states it doesn't have any information about. Even in the videos, it can be seen that the heads agree at all the crucial decisions, but stay diverse at other less important steps.

== Critique ==

It would be a very interesting addition to the experimental section of the paper if the authors had compared their method with the asynchronous methods for exploring the state space first introduced in [[#References|[15]]]. Unfortunately, the authors compared bootstrapped DQN only with the original DQN and not with the other variations in the literature.

=== Different way to do exploration-exploitation? ===

Instead of choosing the next action $a_t$ that maximises $Q^*_s$, they could have chosen different actions $a_i$ with probabilities

$$
\mathbb{P}(s_t,a_i) = \frac{Q^*_s(s_t,a_i)}{\displaystyle \sum_{j=1}^{|\mathbb{A}|} Q^*_s(s_t,a_j)}
$$

In my opinion, this is closer to Thompson Sampling.

=== Why use Bernoulli? ===

The choice of a Bernoulli masking distribution doesn't seem to help them in the end, since the algorithm does well because of the initial diversity. Maybe they could use some other masking distribution?

=== Unanswered Questions & Miscellaneous ===
* Why does Thompson DQN perform poorly?
* The actual algorithm is hidden in the appendix. It could have been helpful if it were in the main paper.

== References ==

# [https://bandits.wikischolars.columbia.edu/file/view/Lecture+4.pdf Learning and optimization for sequential decision making, Columbia University, Lec 4]
# [https://www.thoughtco.com/what-is-bootstrapping-in-statistics-3126172 Thoughtco, What is bootstrapping in statistics?]
# [https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf Bootstrap confidence intervals, Class 24, 18.05, MIT Open Courseware]
# [https://arxiv.org/abs/1506.02142 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.]
# [https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf Mnih et al., Playing Atari with Deep Reinforcement Learning, 2015]
# Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
# John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997.
# S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning, 1993.
# [https://arxiv.org/pdf/1509.06461.pdf Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning, 2015.]
# [https://pdfs.semanticscholar.org/d623/c2cbf100d6963ba7dafe55158890d43c78b6.pdf Dean Eckles and Maurits Kaptein, Thompson Sampling with the Online Bootstrap, 2014, Pg 3]
# [https://arxiv.org/abs/1507.00814 Bradly C. Stadie, Sergey Levine, Pieter Abbeel, Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models, 2015.]
# Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling, NIPS 2013.
# Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension, NIPS 2014.
# [https://arxiv.org/abs/1402.0635 Ian Osband, Benjamin Van Roy, Zheng Wen, Generalization and Exploration via Randomized Value Functions, 2014.]
# Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.
Other helpful links (unsorted):
* [http://pemami4911.github.io/paper-summaries/deep-rl/2016/08/16/Deep-exploration.html pemami4911.github.io]
* [http://www.stat.yale.edu/~pollard/Courses/241.fall97/Poisson.pdf Poisson Approximations]

Deep Exploration via Bootstrapped DQN

2017-11-20T02:45:39Z

A2prasad: /* Critique */

== Details ==

'''Title''': Deep Exploration via Bootstrapped DQN

'''Authors''': Ian Osband {1,2}, Charles Blundell {2}, Alexander Pritzel {2}, Benjamin Van Roy {1}

'''Organisations''':
# Stanford University
# Google DeepMind

'''Conference''': NIPS 2016

'''URL''': [https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn papers.nips.cc]

'''Online code sources'''
* [https://github.com/iassael/torch-bootstrapped-dqn github.com/iassael/torch-bootstrapped-dqn]

This summary contains background knowledge from Section 2-7 (except Section 5). Feel free to skip if you already know.

== Intro to Reinforcement Learning ==

In reinforcement learning, an agent interacts with an environment with the goal to maximize its long term reward. A common application of reinforcement learning is to the [https://en.wikipedia.org/wiki/Multi-armed_bandit multi armed bandit problem]. In a multi armed bandit problem, there is a gambler and there are $n$ slot machines, and the gambler can choose to play any specific slot machine at any time. All the slot machines have their own probability distributions by which they churn out rewards, but this is unknown to the gambler. So the question is, how can the gambler learn how to get the maximum long term reward?

There are two things the gambler can do at any instance: either he can try a new slot machine, or he can play the slot machine he has tried before (and he knows he will get some reward). However, even though trying a new slot machine feels like it would bring less reward to the gambler, it is possible that the gambler finds out a new slot machine that gives a better reward than the current best slot machine. This is the dilemma of '''exploration vs exploitation'''. Trying out a new slot machine is '''exploration''', while redoing the best move so far is '''exploiting''' the currently understood perception of the reward.

[[File:multiarmedbandit.jpg|thumb|Source: [https://blogs.mathworks.com/images/loren/2016/multiarmedbandit.jpg blogs.mathworks.com]]]

There are many strategies to approach this '''exploration-exploitation dilemma'''. Some [https://web.stanford.edu/class/msande338/lec9.pdf common strategies] for optimizing in an exploration-exploitation setting are Random Walk, Curiosity-Driven Exploration, and Thompson Sampling. A lot of these approaches are provably efficient, but assume that the state space is not very large. For instance, the approach called Curiosity-Driven Exploration aims to take actions that lead to immediate additional information. This requires the model to search “every possible cell in the grid” which is not desirable if state space is very large. Strategies for large state spaces often just either ignore exploration, or do something naive like $\epsilon$-greedy, where you exploit with $1-\epsilon$ probability and explore "randomly" in rest of the cases.

This paper tries to use a Thompson sampling like approach to make decisions.

== Thompson Sampling[[#References|[1]]] ==

In Thompson sampling, our goal is to reach a belief that resembles the truth. Let's consider a case of coin tosses (2-armed bandit). Suppose we want to be able to reach a satisfactory pdf for $\mathbb{P}_h$ (heads). Assuming that this is a Bernoulli bandit problem, i.e. the rewards are $0$ or $1$, we can start off with $\mathbb{P}_h^{(0)}=\beta(1,1)$. The $\beta(x,y)$ distribution is a very good choice for a possible pdf because it works well for Bernoulli rewards. Further, $\beta(1,1)$ is the uniform distribution on $[0,1]$.

Now, at every iteration $t$, we observe the reward $R^{(t)}$ and try to make our belief close to the truth by doing a Bayesian computation. Assuming $p$ is the probability of getting a heads,

$$
\begin{align*}
\mathbb{P}(p|D) &\propto \mathbb{P}(D|p) \cdot \mathbb{P}(p) \\
\mathbb{P}_h^{(t+1)}&\propto \mbox{likelihood}\cdot\mbox{prior} \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \mathbb{P}_h^{(t)} \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \beta(x_t, y_t) \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot p^{x_t-1}(1-p)^{y_t-1} \\
&\propto p^{x_t+R^{(t)}-1}(1-p)^{y_t+(1-R^{(t)})-1} \\
&\propto \beta(x_t+R^{(t)}, y_t+1-R^{(t)})
\end{align*}
$$

[[File:thompson sampling coin example.png|thumb||||600px|Source: [https://www.quora.com/What-is-Thompson-sampling-in-laymans-terms Quora]]]

This means that with successive sampling, our belief can become better at approximating the truth. There are similar update rules if we use a non Bernoulli setting, say, Gaussian. In the Gaussian case, we start with $\mathbb{P}_h^{(0)}=\mathbb{N}(0,1)$ and given that $\mathbb{P}_h^{(t)}\propto\mathbb{N}(\mu, \sigma)$ it is possible to show that the update rule looks like

$$
\mathbb{P}_h^{(t+1)} \propto \mathbb{N}\bigg(\frac{t\mu+R^{(t)}}{t+1},\frac{\sigma}{\sigma+1}\bigg)
$$

=== How can we use this in reinforcement learning? ===

We can use this idea to decide when to explore and when to exploit. We start with an initial belief, choose an action, observe the reward and based on the kind of reward, we update our belief about what action to choose next.
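For a Bernoulli bandit, this loop can be sketched in a few lines. This is a minimal illustration, assuming $\beta(1,1)$ priors for every arm and an update in which a reward of $1$ increments the first parameter and a reward of $0$ increments the second; the arm probabilities and the function name are hypothetical.

```python
import random


def thompson_bernoulli(true_probs, steps, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors."""
    rng = random.Random(seed)
    alpha = [1] * len(true_probs)  # 1 + number of observed rewards of 1
    beta = [1] * len(true_probs)   # 1 + number of observed rewards of 0
    pulls = [0] * len(true_probs)
    for _ in range(steps):
        # Draw one plausible success probability per arm from its current belief.
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        arm = max(range(len(samples)), key=lambda k: samples[k])
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return alpha, beta, pulls


alpha, beta, pulls = thompson_bernoulli([0.2, 0.8], steps=2000)
print(pulls)
```

Early on the sampled probabilities vary widely, so both arms get tried (exploration); as the beliefs sharpen, the better arm wins the draw almost every time (exploitation).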

== Bootstrapping [[#References|[2,3]]] ==

This idea may be unfamiliar to some people, so I thought it would be a good idea to include this. In statistics, bootstrapping is a method to generate new samples from a given sample. Suppose that we have a given population, and we want to study a measure $\theta$. So, we just find $n$ sample points (sample $\{D_i\}_{i=1}^n$), calculate this measure $\hat{\theta}$ for these $n$ points, and make our inference.

If we later wish to put a bound on $\theta$, i.e. we want to say that $\delta_1 \leq \theta \leq \delta_2$ with a confidence of $c$, then we can use bootstrapping for this.

Using bootstrapping, we can create a new sample $\{D'_i\}_{i=1}^{n'}$ by '''randomly sampling $n'$ times from $D$, with replacement'''. So, if $D=\{1,2,3,4\}$, a $D'$ of size $n'=10$ could be $\{1,4,4,3,2,2,2,1,3,4\}$. We do this a sufficiently large number of times $k$, calculate $\hat{\theta}$ each time, and thus get a distribution $\{\hat{\theta}_i\}_{i=1}^k$. Now, we can choose the $100\cdot\frac{1-c}{2}$th and $100\cdot\frac{1+c}{2}$th percentiles of this distribution (let them be $\hat{\theta}_\alpha$ and $\hat{\theta}_\beta$ respectively) and say

$$\hat{\theta}_\alpha \leq \theta \leq \hat{\theta}_\beta, \mbox{with confidence }c$$
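The procedure can be sketched as follows. This is a minimal illustration of a percentile bootstrap interval, assuming resamples of the same size as the data and a central interval that leaves $(1-c)/2$ of the bootstrap estimates on each side; the function name and defaults are hypothetical.

```python
import random
import statistics


def bootstrap_ci(data, statistic, confidence=0.95, num_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic([rng.choice(data) for _ in range(n)])
        for _ in range(num_resamples)
    )
    lower_idx = int((1 - confidence) / 2 * num_resamples)
    upper_idx = int((1 + confidence) / 2 * num_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


data = list(range(1, 101))
lower, upper = bootstrap_ci(data, statistics.mean)
print(lower, upper)
```

The same function works for any statistic that maps a list of numbers to a number, for example <code>statistics.median</code>.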

== Why choose bootstrap and not dropout? ==

There is previous work[[#References|[4]]] that establishes dropout as a good way to train NNs on a posterior such that the trained NN works like a function approximator that is close to the actual posterior. But, there are several problems with the predictions of this trained NN. The figures below are from the appendix of this paper. The left image is the NN trained by the authors of this paper on a sample noisy distribution and the right image is from the accompanying web demo from [[#References|[4]]], where the authors of [[#References|[4]]] show that their NN converges around the mean with a good confidence.

[[File:dropout_results.png|thumb||center||700px|Source: this paper's appendix]]

According to the authors of this paper,
# Even though [[#References|[4]]] says that dropout converges around the mean, their experiment actually behaves weirdly around a reasonable point like $x=0.75$. They think that this happens because dropout only affects the region local to the original data.
# Samples from the NN trained on the original data do not look like a reasonable posterior (very spiky).
# The trained NN collapses to zero uncertainty at the data points from the original data.

== Q Learning and Deep Q Networks [[#References|[5]]] ==

At any point of time, our rewards dictate what our actions should be. Also, in general, we want good long term rewards. For example, if we are playing a first person shooter game, it is a good idea to go out of cover to kill an enemy, even if some health is lost. Similarly, in reinforcement learning, we want to maximize our long term reward. So if at each time $t$, the reward is $r_t$, then a naive way is to say we want to maximise

$$
R_t = \sum_{i=0}^{\infty}r_{t+i}
$$

However, this sum can be unbounded: if the rewards do not shrink over time, it can diverge to $\infty$. This is why we use a '''discounted reward'''.

$$
R_t = \sum_{i=0}^{\infty}\gamma^i r_{t+i}
$$

Here, we take $0\leq \gamma \lt 1$. This means that we value the current reward the most ($r_t$ has a coefficient of $\gamma^0=1$), but we also take possible future rewards into account. Suppose we had two choices: get $+4$ now and $0$ at all other timesteps, or get $-2$ now and then, starting $3$ timesteps later, $+2$ at each of $20$ consecutive timesteps. With $\gamma=0.9$ we choose the latter, because $(+4) < (-2)+0.9^3(2+0.9\cdot2+\cdots+0.9^{19}\cdot2) \approx 10.81$.

A '''policy''' $\pi: \mathbb{S} \rightarrow \mathbb{A}$ is just a function that tells us what action to take in a given state $s\in \mathbb{S}$. Our goal is to find the best policy $\pi^*$ that maximises the expected return from a given state $s$. The '''value function''' of a policy $\pi$ is the expected return when the agent is in state $s$ at timestep $t$ and follows $\pi$ from then on: $V^\pi(s) = \mathbb{E}[R_t \mid s_t=s, \pi]$. The optimal value function is then simply

$$
V^*(s)=\displaystyle\max_{\pi}V^\pi(s)
$$

For convenience, however, it is better to work with the '''Q function''' $Q: \mathbb{S}\times\mathbb{A} \rightarrow \mathbb{R}$. $Q$ is defined similarly to $V$: it is the expected return after taking an action $a$ in the given state $s$ and following $\pi$ afterwards. So, $Q^\pi(s,a)=\mathbb{E}[R_t|s,a]$. The optimal $Q$ function is

$$
Q^*(s,a)=\displaystyle\max_{\pi}Q^\pi(s,a)
$$

Suppose that we know $Q^*$. Then, if we know that we are supposed to start at $s$ and take an action $a$ right now, what is the best course of action from the next time step? We just choose the optimal action $a'$ at the next state $s'$ that we reach. The optimal action $a'$ at state $s'$ is simply the argument $a_x$ that maximises our $Q^*(s',\cdot)$.

$$
a'=\displaystyle\arg\max_{a_x} Q^*(s',a_x)
$$

So, our best expected reward from $s$ taking action $a$ is $\mathbb{E}[r_t+\gamma\mathbb{E}[R_{t+1}]]$. This is known as the '''Bellman equation''':

$$
Q^*(s,a)=\mathbb{E}[r_t+\gamma \displaystyle\max_{a_x} Q^*(s',a_x)]
$$

In deep Q learning, we use a deep neural network with weights $\theta$ as a function approximator for $Q^*$. The '''naive way''' to do this is to design a deep neural net that takes the state $s$ and action $a$ as input, and produces an approximation to $Q^*(s,a)$.

* Suppose our neural net weights are $\theta_i$ at iteration $i$.
* We want to train our neural net on the case when we are at $s$, take action $a$, get reward $r$, and reach $s'$.
* To find out what action is best from $s'$, i.e. $a'$, we have to simulate all actions from $s'$. We can do this after we complete this iteration, then run $s',a_x$ for all $a_x\in\mathbb{A}$. But, we don't know how to complete this iteration without knowing this $a'$. So, another way is to simulate all actions from $s'$ using last known set of weights $\theta_{i-1}$. We just simulate state $s'$, action $a_x$ for all $a_x\in\mathbb{A}$ from the previous state and get $Q^*(s',a_x;\theta_{i-1})$. ('''Note''' that some papers do not use the set of weights from the previous iteration $\theta_{i-1}$. Instead they fix the weights for finding the best action for every $\tau$ steps to $\theta^-$, and do $Q^*(s',a_x;\theta^-)$ for $a_x\in\mathbb{A}$ and use this for the target value.)
* Now we can compute our loss function using the Bellman equation, and backpropagate.
$$
\mbox{loss}=(\mbox{target}-\mbox{prediction})^2=\big(r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})-Q^*(s,a;\theta_i)\big)^2
$$

The '''problem''' with this approach is that at every iteration $i$, we have to do $|\mathbb{A}|$ forward passes on the previous set of weights $\theta_{i-1}$ to find out the best action $a'$ at $s'$. This becomes infeasible quickly with more possible actions.

The authors of [[#References|[5]]] therefore use another kind of architecture. This architecture takes the state $s$ as input and computes the values $Q^*(s,a_x)$ for all $a_x\in\mathbb{A}$ at once, so there are $|\mathbb{A}|$ outputs. This parallelizes the forward passes, so that $r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})$ can be computed with a single forward pass.

[[File:DQN_arch.png|thumb||||600px|Source: [https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/DQNBreakoutBlocks.png leonardoaraujosantos.gitbooks.io]]]

'''Note:''' When I say state $s$ as an input, I mean some representation of $s$. Since the environment is a partially observable MDP, it is hard to know $s$. So, we can for example, apply a convnet on the frames and get an idea of what the current state is. We pass this output to the input of the DNN (DNN is the fully connected layer for the convnet then).

=== Experience Replay ===

Authors of this paper borrow the concept of experience replay from [[#References|[5,6]]]. In experience replay, we do training in episodes. In each episode, we play and store consecutive $(s,a,r,s')$ tuples in the experience replay buffer. Then after the play, we choose random samples from this buffer and do our training.

Advantages of experience replay over simple online Q learning[[#References|[5]]]:
* '''Better data efficiency''': It is better to use one transition many times to learn again and again, rather than just learn once from it.
* Learning from consecutive samples is difficult because of correlated data. Experience replay breaks this correlation.
* Online learning means the next input is decided by the previous action. So, if the maximising action is to go left in some game, the next inputs would be about what happens when we go left. This can cause the optimiser to get stuck in a feedback loop, or even diverge, as [[#References|[7]]] points out.

== Double Q Learning ==

=== Problem with Q Learning[[#References|[8]]] ===

For a simple neural network, each update tries to shift the current $Q^*$ estimate to a new value:

$$
Q^*(s,a) \leftarrow r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x)
$$

Suppose the neural net has some inherent noise $\epsilon$. So, the neural net actually stores a value $\mathbb{Q}^*$ given by

$$
\mathbb{Q}^* = Q^*+\epsilon
$$

Even if $\epsilon$ has zero mean in the beginning, using the $\max$ operator at the update steps will start propagating $\gamma\cdot\max \mathbb{Q}^*$, whose error no longer has zero mean. The problem is that "max causes overestimation because it does not preserve the zero-mean property of the errors of its operands." ([[#References|[8]]]) Thus, Q learning tends to produce overoptimistic value estimates.
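
The effect of the $\max$ operator on zero-mean noise can be checked numerically. The sketch below is only an illustration and is not taken from [[#References|[8]]]: the number of actions, the Gaussian noise, and the assumption that every true $Q^*$ value is $0$ are arbitrary choices made for the example.

```python
import random

def mean_of_max(num_actions, noise_std, trials, seed=0):
    """Average of max_a (Q*(s',a) + eps_a) when every true Q* value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value (0) plus zero-mean Gaussian noise.
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(noisy_q)
    return total / trials

# The true maximum is 0 in both cases.
bias_one_action = mean_of_max(num_actions=1, noise_std=1.0, trials=20000)
bias_four_actions = mean_of_max(num_actions=4, noise_std=1.0, trials=20000)
print(bias_one_action, bias_four_actions)
```

With a single action the average stays close to zero, while with four actions the average of the noisy maximum is clearly positive, even though no individual estimate is biased.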

=== How does Double Q Learning work? [[#References|[9]]] ===

The problem can be solved by using two sets of weights $\theta$ and $\Theta$. The $\mbox{target}$ can be broken up as

$$
\mbox{target} = r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta) = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta) = r+\gamma Q^*(s',a';\theta)
$$

Using double Q learning, we '''select''' the best action using the current weights $\theta$, but '''evaluate''' its $Q^*$ value for the target using $\Theta$.

$$
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\Theta) = r+\gamma Q^*(s',a';\Theta)
$$

Because the action is selected and evaluated with different weights, noise that inflates the selection is less likely to also inflate the evaluation, which reduces the overestimation.

=== Double Deep Q Learning ===

[[#References|[9]]] further discusses how to use this for deep learning without much additional overhead. The suggestion is to use the periodically frozen weights $\theta^-$ as $\Theta$.

$$
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta^-) = r+\gamma Q^*(s',a';\theta^-)
$$

== Bootstrapped DQN ==

The authors propose an architecture that has a shared network and $K$ bootstrap heads. So, suppose our experience buffer $E$ has $n$ datapoints, where each datapoint is a $(s,a,r,s')$ tuple. Each bootstrap head trains on a different buffer $E_i$, where each $E_i$ has been constructed by sampling $n$ datapoints from the original experience buffer $E$ with replacement ('''bootstrap method''').

Because each head trains on a different buffer, each one models a different $Q^*$ function (say $Q^*_k$). Now, for each episode, we first choose a specific $Q^*_k=Q^*_s$. This $Q^*_s$ helps us create the experience buffer for the episode. From any state $s_t$, we populate the experience buffer by choosing the next action $a_t$ that maximises $Q^*_s$ (similar to '''Thompson Sampling''').

$$
a_t = \displaystyle\arg\max_a Q^*_s(s_t,a)
$$

Along with $s_t,a_t,r_t,s_{t+1}$, they also push a bootstrap mask $m_t$. This mask is a binary vector of size $K$ that tells which $Q_k$ should be affected by this datapoint if it is chosen as a training point. For example, if $K=5$ and there is an experience tuple $(s_t,a_t,r_t,s_{t+1},m_t)$ where $m_t=(0,1,1,0,1)$, then $(s_t,a_t,r_t,s_{t+1})$ should only affect $Q_2$, $Q_3$ and $Q_5$.

So, at each iteration, we choose a few points from this buffer and train the respective $Q_{(\cdot)}$ based on the bootstrap masks.

=== How to generate masks? ===

Masks are created by sampling from the '''masking distribution'''. Now, there are many ways to choose this masking distribution:

* If, for each datapoint $D_i$ ($i=1$ to $n$), we draw the mask from $\mbox{Bernoulli}(0.5)$, each head keeps roughly half the points from the original buffer. To get back to size $n$, we double the weight of each kept datapoint. This essentially gives us a '''double or nothing''' bootstrap[[#References|[10]]].
* If the mask is $(1, 1 \cdots 1)$, then this becomes an '''ensemble learning''' method.
* $m_t[k]\sim\mbox{Poi}(1)$ (Poisson distribution)
* $m_t[k]\sim\mbox{Exp}(1)$ (exponential distribution)

For this paper's results, the authors used a $\mbox{Bernoulli}(p)$ distribution.
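
As a rough illustration of Bernoulli masking, the sketch below draws a mask for one transition and lists the heads it may train. The helper names `sample_bernoulli_mask` and `heads_to_update` are made up for this example and do not come from the paper or its code.

```python
import random

def sample_bernoulli_mask(num_heads, p, rng):
    """Draw a binary mask m_t: entry k is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(num_heads)]

def heads_to_update(mask):
    """Return the 1-based indices k of the heads Q_k this transition may train."""
    return [k + 1 for k, bit in enumerate(mask) if bit == 1]

rng = random.Random(0)
mask = sample_bernoulli_mask(num_heads=5, p=0.5, rng=rng)

# The mask m_t = (0, 1, 1, 0, 1) affects Q_2, Q_3 and Q_5.
example_heads = heads_to_update([0, 1, 1, 0, 1])

# With p = 1 every head sees every transition (the ensemble case).
full_mask = sample_bernoulli_mask(num_heads=5, p=1.0, rng=rng)
```

Storing the mask next to the transition means the decision is made once, when the datapoint enters the buffer, and not every time it is sampled for training.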

== Related Work ==

The authors mention the method described in [[#References|[11]]]. The authors of [[#References|[11]]] talk about the principle of "optimism in the face of uncertainty" and modify the reward function to encourage state-action pairs that have not been seen often:

$$
R(s,a) \leftarrow R(s,a)+\beta\cdot\mbox{novelty}(s,a)
$$

According to the authors, [[#References|[11]]]'s DQN algorithm relies on a lot of hand tuning and is only good for non stochastic problems. The authors further compare their results to [[#References|[11]]]'s results on Atari.

The authors also mention an existing algorithm PSRL[[#References|[12,13]]], or posterior sampling based RL. However, this algorithm requires a solved MDP, which is not feasible for large systems. Bootstrapped DQN approximates this idea by sampling from approximate $Q^*$ functions.

Further, the authors mention that the work in [[#References|[12,13]]] has been followed by RLSVI[[#References|[14]]], which solves the problem for linear cases.

== Deep Exploration: Why is Bootstrapped DQN so good at it? ==

The authors consider a simple example to demonstrate the effectiveness of bootstrapped DQN at deep exploration.

[[File:deep_exploration_example.png|thumb||center||700px|Source: this paper, section 5.1]]

In this example, the agent starts at $s_2$. The chain has $N$ states, and each episode lasts $N+9$ timesteps, during which the experience buffer is generated. The agent is said to have learned the optimal policy if it achieves the best possible reward of $10$ (go to the rightmost state in $N-1$ timesteps, then stay there for $10$ timesteps) for at least $100$ such episodes. The results they got:

[[File:deep_exploration_results.png|thumb||center||700px|Source: this paper, section 5.1]]

The blue dots indicate when the agent learnt the optimal policy. If this took more than $2000$ episodes, they indicate it with a red dot. Thompson DQN is DQN with posterior sampling at every timestep. Ensemble DQN is the same as bootstrapped DQN, except that every mask is $(1,1 \cdots 1)$. It is evident from the graphs that bootstrapped DQN achieves deep exploration better than these two methods and plain DQN.

=== But why is it better? ===

The authors say that this is because bootstrapped DQN constructs different approximations to the posterior $Q^*$ with the same initial data. This diversity of approximations comes from the random initialization of weights for the $Q^*_k$ heads. This means that the heads start out trying random actions (because of the diverse random initial $Q^*_k$), but when some head finds a good state and generalises to it, some (but not all) of the heads will learn from it, because of the bootstrapping. Eventually, the other heads will either find other good states or end up learning the best states found by the other heads.

So, the architecture explores well and once a head achieves the optimal policy, eventually, all heads achieve the policy.

== Results ==

The authors test their architecture on 49 Atari games. They mention that there has been recent work to improve the performance of DDQNs, but those are tweaks whose intentions are orthogonal to this paper's idea. So, they don't compare their results with them.

=== Scale: What values of $K$, $p$ are best? ===

[[File:scale_k_p.png|thumb||center||800px|Source: this paper, section 6.1]]

Recall that $K$ is the number of bootstrap heads and $p$ is the parameter for the masking distribution (Bernoulli). The authors say that performance is already close to its peak at around $K=10$, so $K=10$ is a reasonable choice.

$p$ also controls the amount of data sharing between heads. A smaller $p$ means there is a smaller chance (due to the Bernoulli distribution) that a given datapoint is included in the bootstrapped dataset $D_i$ of head $i$, so the heads have fewer datapoints in common. A larger $p$ means the heads share more of their datapoints, and at $p=1$ every head trains on every datapoint. However, the value of $p$ doesn't seem to affect the rewards achieved over time. The authors give the following reasons for it:

* The heads start with random weights for $Q^*$, so the targets (which use $Q^*$) turn out to be different. So the update rules are different.
* Atari is deterministic.
* Because of the initial diversity, the heads will learn differently even if they predict the same action for the given state.

$p=1$ is the value they finally use, because all heads then train on the same datapoints, which reduces computation time.

=== Performance on Atari ===

In general, the results tell us that bootstrapped DQN achieves better results.

[[File:atari_results_bootstrapped_dqn.png|thumb||center||800px|Source: this paper, section 6.2]]

The authors plot the improvement they achieved with bootstrapped DQN for each game. They define the '''improvement''' to be $x$ if bootstrapped DQN achieves a better result than DQN in $\frac{1}{x}$ of the frames.

[[File:bdqn_improvement.png|thumb||center||1000px|Source: this paper, section 6.2]]

The authors say that bootstrapped DQN doesn't work well on all Atari games. They point out that there are some challenging games where exploration is key but bootstrapped DQN doesn't do well enough (although it does better than DQN). Some of these games are Frostbite and Montezuma’s Revenge. They say that even better exploration may help, but also point out that there may be other problems, such as network instability, reward clipping and temporally extended rewards.

=== Improvement: Highest Score Reached & how fast is this high score reached? ===

The authors plot the improvement graphs after 20m and 200m frames.

[[File:cumulative_rewards_bdqn.png|thumb||center||700px|Source: this paper, section 6.3]]

=== Visualisation of Results ===

One of the authors has posted a [https://www.youtube.com/playlist?list=PLdy8eRAW78uLDPNo1jRv8jdTx7aup1ujM YouTube playlist] of the results.

The authors also point out that just purely using bootstrapped DQN as an exploitative strategy is pretty good by itself, better than vanilla DQN. This is because of the deep exploration capabilities of bootstrapped DQN, since it can use the best states it knows and also plan to try out states it doesn't have any information about. Even in the videos, it can be seen that the heads agree at all the crucial decisions, but stay diverse at other less important steps.

== Critique ==

It would be a great addition to the experimental section of the paper if the authors had compared against the asynchronous methods for exploring the state space first introduced in [[#References|[15]]]. Unfortunately, the authors only compared their bootstrapped DQN with the original DQN, and not with all the other variations in the literature.

=== Different way to do exploration-exploitation? ===

Instead of choosing the next action $a_t$ that maximises $Q^*_s$, they could have chosen different actions $a_i$ with probabilities

$$
\mathbb{P}(s_t,a_i) = \frac{Q^*_s(s_t,a_i)}{\displaystyle \sum_{j=1}^{|\mathbb{A}|} Q^*_s(s_t,a_j)}
$$

In my view, this is closer to Thompson Sampling.
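
A minimal sketch of this proportional rule is given below. One caveat that the formula leaves implicit is that it only yields valid probabilities when all the $Q^*_s$ values are positive; the check for this, and the function name `proportional_action_probs`, are additions made for the example.

```python
import random

def proportional_action_probs(q_values):
    """P(s_t, a_i) = Q(s_t, a_i) / sum_j Q(s_t, a_j); needs strictly positive Q values."""
    if any(q <= 0 for q in q_values):
        raise ValueError("the proportional rule needs strictly positive Q values")
    total = sum(q_values)
    return [q / total for q in q_values]

# Hypothetical Q values of the selected head for three actions.
probs = proportional_action_probs([1.0, 3.0, 4.0])

# Draw one action index according to these probabilities.
rng = random.Random(0)
action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With Q values that can be negative (as is common in practice), some other mapping to probabilities would be needed, so this should be read only as a literal transcription of the formula.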

=== Why use Bernoulli? ===

The choice of a Bernoulli masking distribution eventually doesn't help them at all, since the algorithm does well because of the initial diversity. Maybe they could use some other masking distribution?

=== Unanswered Questions & Miscellaneous ===
* Why does Thompson DQN perform poorly?
* The actual algorithm is hidden in the appendix. It could have been helpful if it were in the main paper.

== References ==

# [https://bandits.wikischolars.columbia.edu/file/view/Lecture+4.pdf Learning and optimization for sequential decision making, Columbia University, Lec 4]
# [https://www.thoughtco.com/what-is-bootstrapping-in-statistics-3126172 Thoughtco, What is bootstrapping in statistics?]
# [https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf Bootstrap confidence intervals, Class 24, 18.05, MIT Open Courseware]
# [https://arxiv.org/abs/1506.02142 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.]
# [https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf Mnih et al., Playing Atari with Deep Reinforcement Learning, 2015]
# Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
# John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997.
# S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning, 1993.
# [https://arxiv.org/pdf/1509.06461.pdf Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning, 2015.]
# [https://pdfs.semanticscholar.org/d623/c2cbf100d6963ba7dafe55158890d43c78b6.pdf Dean Eckles and Maurits Kaptein, Thompson Sampling with the Online Bootstrap, 2014, Pg 3]
# [https://arxiv.org/abs/1507.00814 Bradly C. Stadie, Sergey Levine, Pieter Abbeel, Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models, 2015.]
# Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling, NIPS 2013.
# Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension, NIPS 2014.
# [https://arxiv.org/abs/1402.0635 Ian Osband, Benjamin Van Roy, Zheng Wen, Generalization and Exploration via Randomized Value Functions, 2014.]
# Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.
Other helpful links (unsorted):
* [http://pemami4911.github.io/paper-summaries/deep-rl/2016/08/16/Deep-exploration.html pemami4911.github.io]
* [http://www.stat.yale.edu/~pollard/Courses/241.fall97/Poisson.pdf Poisson Approximations]

Deep Exploration via Bootstrapped DQN

2017-11-20T02:41:19Z

A2prasad: /* References */

== Details ==

'''Title''': Deep Exploration via Bootstrapped DQN

'''Authors''': Ian Osband {1,2}, Charles Blundell {2}, Alexander Pritzel {2}, Benjamin Van Roy {1}

'''Organisations''':
# Stanford University
# Google Deepmind

'''Conference''': NIPS 2016

'''URL''': [https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn papers.nips.cc]

'''Online code sources'''
* [https://github.com/iassael/torch-bootstrapped-dqn github.com/iassael/torch-bootstrapped-dqn]

This summary contains background knowledge from Section 2-7 (except Section 5). Feel free to skip if you already know.

== Intro to Reinforcement Learning ==

In reinforcement learning, an agent interacts with an environment with the goal to maximize its long term reward. A common application of reinforcement learning is to the [https://en.wikipedia.org/wiki/Multi-armed_bandit multi armed bandit problem]. In a multi armed bandit problem, there is a gambler and there are $n$ slot machines, and the gambler can choose to play any specific slot machine at any time. All the slot machines have their own probability distributions by which they churn out rewards, but this is unknown to the gambler. So the question is, how can the gambler learn how to get the maximum long term reward?

There are two things the gambler can do at any instance: either he can try a new slot machine, or he can play the slot machine he has tried before (and he knows he will get some reward). However, even though trying a new slot machine feels like it would bring less reward to the gambler, it is possible that the gambler finds out a new slot machine that gives a better reward than the current best slot machine. This is the dilemma of '''exploration vs exploitation'''. Trying out a new slot machine is '''exploration''', while redoing the best move so far is '''exploiting''' the currently understood perception of the reward.

[[File:multiarmedbandit.jpg|thumb|Source: [https://blogs.mathworks.com/images/loren/2016/multiarmedbandit.jpg blogs.mathworks.com]]]

There are many strategies to approach this '''exploration-exploitation dilemma'''. Some [https://web.stanford.edu/class/msande338/lec9.pdf common strategies] for optimizing in an exploration-exploitation setting are Random Walk, Curiosity-Driven Exploration, and Thompson Sampling. A lot of these approaches are provably efficient, but assume that the state space is not very large. For instance, the approach called Curiosity-Driven Exploration aims to take actions that lead to immediate additional information. This requires the model to search “every possible cell in the grid” which is not desirable if state space is very large. Strategies for large state spaces often just either ignore exploration, or do something naive like $\epsilon$-greedy, where you exploit with $1-\epsilon$ probability and explore "randomly" in rest of the cases.

This paper tries to use a Thompson-sampling-like approach to make decisions.

== Thompson Sampling[[#References|[1]]] ==

In Thompson sampling, our goal is to reach a belief that resembles the truth. Let's consider a case of coin tosses (2-armed bandit). Suppose we want to be able to reach a satisfactory pdf for $\mathbb{P}_h$ (heads). Assuming that this is a Bernoulli bandit problem, i.e. the rewards are $0$ or $1$, we can start off with $\mathbb{P}_h^{(0)}=\beta(1,1)$. The $\beta(x,y)$ distribution is a very good choice for a possible pdf because it is the conjugate prior for Bernoulli rewards. Further, $\beta(1,1)$ is the uniform distribution on $[0,1]$.

Now, at every iteration $t$, we observe the reward $R^{(t)}$ and try to make our belief close to the truth by doing a Bayesian computation. Assuming $p$ is the probability of getting a heads,

$$
\begin{align*}
\mathbb{P}(p|R^{(t)}) &\propto \mathbb{P}(R^{(t)}|p) \cdot \mathbb{P}(p) \\
\mathbb{P}_h^{(t+1)}&\propto \mbox{likelihood}\cdot\mbox{prior} \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \mathbb{P}_h^{(t)} \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \beta(x_t, y_t) \\
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot p^{x_t-1}(1-p)^{y_t-1} \\
&\propto p^{x_t+R^{(t)}-1}(1-p)^{y_t+(1-R^{(t)})-1} \\
&\propto \beta(x_t+R^{(t)}, y_t+1-R^{(t)})
\end{align*}
$$

[[File:thompson sampling coin example.png|thumb||||600px|Source: [https://www.quora.com/What-is-Thompson-sampling-in-laymans-terms Quora]]]

This means that with successive sampling, our belief can become better at approximating the truth. There are similar update rules if we use a non Bernoulli setting, say, Gaussian. In the Gaussian case, we start with $\mathbb{P}_h^{(0)}=\mathbb{N}(0,1)$ and given that $\mathbb{P}_h^{(t)}\propto\mathbb{N}(\mu, \sigma)$ it is possible to show that the update rule looks like

$$
\mathbb{P}_h^{(t+1)} \propto \mathbb{N}\bigg(\frac{t\mu+R^{(t)}}{t+1},\frac{\sigma}{\sigma+1}\bigg)
$$

=== How can we use this in reinforcement learning? ===

We can use this idea to decide when to explore and when to exploit. We start with an initial belief, choose an action, observe the reward and based on the kind of reward, we update our belief about what action to choose next.
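
To make this loop concrete, here is a minimal sketch of Thompson sampling on a Bernoulli bandit. It uses the standard Beta-Bernoulli conjugate update, $\beta(x,y)\rightarrow\beta(x+R,\,y+1-R)$. The arm probabilities, the number of steps and the function name `thompson_bernoulli` are assumptions made for the illustration.

```python
import random

def thompson_bernoulli(true_probs, steps, seed=0):
    """Run Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors."""
    rng = random.Random(seed)
    # params[i] = [x, y] of the Beta(x, y) belief about arm i.
    params = [[1, 1] for _ in true_probs]
    pulls = [0] * len(true_probs)
    for _ in range(steps):
        # Sample one plausible success probability per arm from the current belief.
        samples = [rng.betavariate(x, y) for x, y in params]
        arm = max(range(len(samples)), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate update: Beta(x, y) -> Beta(x + R, y + 1 - R).
        params[arm][0] += reward
        params[arm][1] += 1 - reward
        pulls[arm] += 1
    return params, pulls

# Hypothetical two-armed bandit: the second arm pays off far more often.
params, pulls = thompson_bernoulli(true_probs=[0.2, 0.8], steps=2000)
```

Exploration happens because an arm with a wide belief can still produce a high sample. Exploitation takes over as the beliefs sharpen, so the better arm ends up being pulled most of the time.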

== Bootstrapping [[#References|[2,3]]] ==

This idea may be unfamiliar to some people, so I thought it would be a good idea to include this. In statistics, bootstrapping is a method to generate new samples from a given sample. Suppose that we have a given population, and we want to study a measure $\theta$. So, we just find $n$ sample points (sample $\{D_i\}_{i=1}^n$), calculate this measure $\hat{\theta}$ for these $n$ points, and make our inference.

If we later wish to put bounds on $\theta$, i.e. suppose we want to say that $\delta_1 \leq \theta \leq \delta_2$ with a confidence of $c$, then we can use bootstrapping for this.

Using bootstrapping, we can create a new sample $\{D'_i\}_{i=1}^{n'}$ by '''randomly sampling $n'$ times from $D$, with replacement'''. So, if $D=\{1,2,3,4\}$, a $D'$ of size $n'=10$ could be $\{1,4,4,3,2,2,2,1,3,4\}$. We do this a sufficient number of times $k$, calculate $\hat{\theta}$ each time, and thus get a distribution $\{\hat{\theta}_i\}_{i=1}^k$. Now, we can choose the $100\cdot\frac{1-c}{2}$th and $100\cdot\frac{1+c}{2}$th percentiles of this distribution (let them be $\hat{\theta}_\alpha$ and $\hat{\theta}_\beta$ respectively) and say

$$\hat{\theta}_\alpha \leq \theta \leq \hat{\theta}_\beta, \mbox{with confidence }c$$
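
The procedure can be sketched in a few lines. This is an illustrative percentile bootstrap, assuming $n'=n$ and cutting off $(1-c)/2$ of the bootstrap distribution in each tail; the function name `bootstrap_interval` and the number of resamples are arbitrary choices for the example.

```python
import random

def bootstrap_interval(data, statistic, confidence, num_resamples=2000, seed=0):
    """Percentile bootstrap interval for `statistic` at the given confidence level."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(num_resamples):
        # Resample n points from the data with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    # Cut off (1 - c) / 2 of the bootstrap distribution in each tail.
    lower_idx = int((1 - confidence) / 2 * (num_resamples - 1))
    upper_idx = int((1 + confidence) / 2 * (num_resamples - 1))
    return estimates[lower_idx], estimates[upper_idx]

def mean(xs):
    return sum(xs) / len(xs)

data = [1, 2, 3, 4]
low, high = bootstrap_interval(data, mean, confidence=0.9)
print(low, high)
```

Here the measure is the mean, but any function of a sample can be passed as `statistic`, which is what makes the bootstrap convenient when no closed-form interval is available.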

== Why choose bootstrap and not dropout? ==

There is previous work[[#References|[4]]] that establishes dropout as a good way to train NNs on a posterior such that the trained NN works like a function approximator that is close to the actual posterior. But, there are several problems with the predictions of this trained NN. The figures below are from the appendix of this paper. The left image is the NN trained by the authors of this paper on a sample noisy distribution and the right image is from the accompanying web demo from [[#References|[4]]], where the authors of [[#References|[4]]] show that their NN converges around the mean with a good confidence.

[[File:dropout_results.png|thumb||center||700px|Source: this paper's appendix]]

According to the authors of this paper,
# Even though [[#References|[4]]] says that dropout converges around the mean, their experiment actually behaves weirdly around a reasonable point like $x=0.75$. They think that this happens because dropout only affects the region local to the original data.
# Samples from the NN trained on the original data do not look like a reasonable posterior (very spiky).
# The trained NN collapses to zero uncertainty at the data points from the original data.

== Q Learning and Deep Q Networks [[#References|[5]]] ==

At any point of time, our rewards dictate what our actions should be. Also, in general, we want good long term rewards. For example, if we are playing a first person shooter game, it is a good idea to go out of cover to kill an enemy, even if some health is lost. Similarly, in reinforcement learning, we want to maximize our long term reward. So if at each time $t$, the reward is $r_t$, then a naive way is to say we want to maximise

$$
R_t = \sum_{i=0}^{\infty}r_{t+i}
$$

But this sum is unbounded, so it can diverge to $\infty$ in many cases. This is why we use a '''discounted reward''':

$$
R_t = \sum_{i=0}^{\infty}\gamma^i r_{t+i}
$$

Here, we take $0\leq \gamma \lt 1$. What this means is that we value our current reward the most (the current reward $r_t$ has a coefficient of $1$), but we also consider the possible future rewards. So if we had two choices: get $+4$ now and $0$ at all other timesteps, or get $-2$ now and $+2$ after $3$ timesteps for $20$ timesteps, we choose the latter (with $\gamma=0.9$). This is because $(+4) < (-2)+0.9^3(2+0.9\cdot2+\cdots+0.9^{19}\cdot2)$.
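
The arithmetic in this comparison is easy to verify. The snippet below is a small check of the two options with $\gamma=0.9$; the helper `discounted_return` is a name introduced for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i] over a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

gamma = 0.9
# Option 1: +4 now, 0 afterwards.
option_now = discounted_return([4], gamma)
# Option 2: -2 now, nothing for two steps, then +2 for 20 consecutive timesteps.
option_later = discounted_return([-2, 0, 0] + [2] * 20, gamma)
print(option_now, option_later)
```

The second option comes out at roughly $10.8$, well above $4$, so the delayed rewards win despite the initial loss and the discounting.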

A '''policy''' $\pi: \mathbb{S} \rightarrow \mathbb{A}$ is just a function that tells us what action to take in a given state $s\in \mathbb{S}$. Our goal is to find the best policy $\pi^*$ that maximises the reward from a given state $s$. So, a '''value function''' is defined from $s$ (which the agent is in, at timestep $t$) and following the policy $\pi$ as $V^\pi(s) = \mathbb{E}[R_t]$. The optimal value function is then simply

$$
V^*(s)=\displaystyle\max_{\pi}V^\pi(s)
$$

For convenience however, it is better to work with the '''Q function''' $Q: \mathbb{S}\times\mathbb{A} \rightarrow \mathbb{R}$. $Q$ is defined similarly as $V$. It is the expected return after taking an action $a$ in the given state $s$. So, $Q^\pi(s,a)=\mathbb{E}[R_t|s,a]$. The optimal $Q$ function is

$$
Q^*(s,a)=\displaystyle\max_{\pi}Q^\pi(s,a)
$$

Suppose that we know $Q^*$. Then, if we know that we are supposed to start at $s$ and take an action $a$ right now, what is the best course of action from the next time step? We just choose the optimal action $a'$ at the next state $s'$ that we reach. The optimal action $a'$ at state $s'$ is simply the argument $a_x$ that maximises our $Q^*(s',\cdot)$.

$$
a'=\displaystyle\arg\max_{a_x} Q^*(s',a_x)
$$

So, our best expected return from $s$ taking action $a$ is $\mathbb{E}[r_t+\gamma\mathbb{E}[R_{t+1}]]$, where the future return is obtained by acting optimally from $s'$. This is known as the '''Bellman equation''':

$$
Q^*(s,a)=\mathbb{E}[r_t+\gamma \displaystyle\max_{a_x} Q^*(s',a_x)]
$$

In deep Q learning, we use a deep neural network with weights $\theta$ as a function approximator for $Q^*$. The '''naive way''' to do this is to design a deep neural net that takes as input the state $s$ and action $a$, and produces an approximation to $Q^*$.

* Suppose our neural net weights are $\theta_i$ at iteration $i$.
* We want to train our neural net on the case when we are at $s$, take action $a$, get reward $r$, and reach $s'$.
* To find out which action $a'$ is best from $s'$, we have to evaluate all actions from $s'$. We cannot use the weights of the current iteration for this, because the update that produces them itself needs $a'$. So we instead evaluate all actions from $s'$ using the last known set of weights $\theta_{i-1}$, i.e. we compute $Q^*(s',a_x;\theta_{i-1})$ for all $a_x\in\mathbb{A}$. ('''Note''' that some papers do not use the weights from the previous iteration $\theta_{i-1}$. Instead, they freeze a copy of the weights, $\theta^-$, refresh it only every $\tau$ steps, and use $Q^*(s',a_x;\theta^-)$ for $a_x\in\mathbb{A}$ to compute the target value.)
* Now we can compute our loss function using the Bellman equation, and backpropagate.
$$
\mbox{loss}=(\mbox{target}-\mbox{prediction})^2=\Big(\big(r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})\big)-Q^*(s,a;\theta_i)\Big)^2
$$

The '''problem''' with this approach is that at every iteration $i$, we have to do $|\mathbb{A}|$ forward passes on the previous set of weights $\theta_{i-1}$ to find out the best action $a'$ at $s'$. This becomes infeasible quickly with more possible actions.

Authors of [[#References|[5]]] therefore use another kind of architecture. This architecture takes the state $s$ as input and computes the values $Q^*(s,a_x)$ for all $a_x\in\mathbb{A}$, so there are $|\mathbb{A}|$ outputs. This effectively parallelizes the forward passes, so that $r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})$ can be computed with just a single forward pass.
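
The following sketch shows why one pass is enough. A linear map stands in for the deep network, and the weights, the transition and the function names are all hypothetical. The target uses a discount factor $\gamma$ and the loss is a squared error, as is common for DQN; both are stated here as assumptions of the sketch.

```python
def q_values(weights, state):
    """One forward pass: returns Q(s, a_x) for every action a_x at once.

    `weights` has one row per action; this linear map is a stand-in for a deep network.
    """
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def td_target(reward, next_state, old_weights, gamma):
    """r + gamma * max_{a_x} Q(s', a_x; theta_{i-1}) from a single pass on s'."""
    return reward + gamma * max(q_values(old_weights, next_state))

def squared_td_error(weights, old_weights, transition, gamma):
    """(target - prediction)^2 for one (s, a, r, s') transition."""
    state, action, reward, next_state = transition
    prediction = q_values(weights, state)[action]
    return (td_target(reward, next_state, old_weights, gamma) - prediction) ** 2

theta_old = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical weights theta_{i-1}, 2 actions
theta_new = [[0.5, 0.5], [0.0, 1.0]]   # hypothetical weights theta_i
transition = ([1.0, 1.0], 0, 1.0, [0.0, 1.0])  # (s, a, r, s')
loss = squared_td_error(theta_new, theta_old, transition, gamma=0.9)
```

The maximum over actions is taken over the output list of a single call to `q_values`, instead of calling the network once per action.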

[[File:DQN_arch.png|thumb||||600px|Source: [https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/DQNBreakoutBlocks.png leonardoaraujosantos.gitbooks.io]]]

'''Note:''' When I say state $s$ as an input, I mean some representation of $s$. Since the environment is a partially observable MDP, it is hard to know $s$ exactly. We can, for example, apply a convnet to the frames to get an idea of what the current state is, and pass its output to the DNN (the DNN then acts as the fully connected layers on top of the convnet).

=== Experience Replay ===

Authors of this paper borrow the concept of experience replay from [[#References|[5,6]]]. In experience replay, we do training in episodes. In each episode, we play and store consecutive $(s,a,r,s')$ tuples in the experience replay buffer. Then after the play, we choose random samples from this buffer and do our training.

Advantages of experience replay over simple online Q learning[[#References|[5]]]:
* '''Better data efficiency''': It is better to use one transition many times to learn again and again, rather than just learn once from it.
* Learning from consecutive samples is difficult because of correlated data. Experience replay breaks this correlation.
* Online learning means the input is decided by the previous action. So, if the maximising action is to go left in some game, the next inputs would be about what happens when we go left. This can cause the optimiser to get stuck in a feedback loop, or even diverge, as [[#References|[7]]] points out.
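
A replay buffer of this kind can be sketched as follows. This is a generic illustration, not the implementation used in [[#References|[5]]]: the fixed capacity with oldest-first eviction and the class name `ReplayBuffer` are assumptions of the example.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s') transitions with uniform sampling."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the ordering of consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=100)
for t in range(150):
    # Toy transitions: state t, action 0, reward 1.0, next state t + 1.
    buffer.push(t, 0, 1.0, t + 1)
batch = buffer.sample(8)
```

Each stored transition can be drawn in many different minibatches, which is where the data efficiency comes from, and a minibatch mixes transitions from different points in time, which is what breaks the correlation.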

== Double Q Learning ==

=== Problem with Q Learning[[#References|[8]]] ===

For a simple neural network, each update tries to shift the current $Q^*$ estimate to a new value:

$$
Q^*(s,a) \leftarrow r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x)
$$

Suppose the neural net has some inherent noise $\epsilon$. So, the neural net actually stores a value $\mathbb{Q}^*$ given by

$$
\mathbb{Q}^* = Q^*+\epsilon
$$

Even if $\epsilon$ has zero mean in the beginning, using the $\max$ operator at the update steps will start propagating $\gamma\cdot\max \mathbb{Q}^*$, whose error no longer has zero mean. The problem is that "max causes overestimation because it does not preserve the zero-mean property of the errors of its operands." ([[#References|[8]]]) Thus, Q learning tends to produce overoptimistic value estimates.

=== How does Double Q Learning work? [[#References|[9]]] ===

The problem can be solved by using two sets of weights $\theta$ and $\Theta$. The $\mbox{target}$ can be broken up as

$$
\mbox{target} = r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta) = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta) = r+\gamma Q^*(s',a';\theta)
$$

Using double Q learning, we '''select''' the best action using the current weights $\theta$, but '''evaluate''' its $Q^*$ value for the target using $\Theta$.

$$
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\Theta) = r+\gamma Q^*(s',a';\Theta)
$$

Because the action is selected and evaluated with different weights, noise that inflates the selection is less likely to also inflate the evaluation, which reduces the overestimation.

=== Double Deep Q Learning ===

[[#References|[9]]] further discusses how to use this for deep learning without much additional overhead. The suggestion is to use the periodically frozen weights $\theta^-$ as $\Theta$.

$$
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta^-) = r+\gamma Q^*(s',a';\theta^-)
$$
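
The difference between the two targets shows up in a tiny example. The sketch below takes the Q values at $s'$ under $\theta$ and under $\theta^-$ as plain lists. The numbers, the function names and the explicit discount factor `gamma` are assumptions made for illustration.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma):
    """r + gamma * Q(s', argmax_a Q(s', a; theta); theta^-)."""
    # Select the action with the online weights theta ...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... but evaluate it with the target weights theta^-.
    return reward + gamma * next_q_target[best_action]

def dqn_target(reward, next_q_target, gamma):
    """Standard target: selection and evaluation both use the same values."""
    return reward + gamma * max(next_q_target)

next_q_online = [1.0, 3.0, 2.0]   # hypothetical Q(s', . ; theta)
next_q_target = [4.0, 1.0, 0.0]   # hypothetical Q(s', . ; theta^-)
double_target = double_dqn_target(0.0, next_q_online, next_q_target, gamma=0.5)
single_target = dqn_target(0.0, next_q_target, gamma=0.5)
```

In this example the online weights pick the second action, so the double target is built from the value $1.0$ that $\theta^-$ assigns to that action, whereas the standard target takes the largest value under $\theta^-$, which is $4.0$.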

== Bootstrapped DQN ==

The authors propose an architecture that has a shared network and $K$ bootstrap heads. So, suppose our experience buffer $E$ has $n$ datapoints, where each datapoint is a $(s,a,r,s')$ tuple. Each bootstrap head trains on a different buffer $E_i$, where each $E_i$ has been constructed by sampling $n$ datapoints from the original experience buffer $E$ with replacement ('''bootstrap method''').

Because each head trains on a different buffer, each one models a different $Q^*$ function (say $Q^*_k$). Now, for each episode, we first choose a specific $Q^*_k=Q^*_s$. This $Q^*_s$ helps us create the experience buffer for the episode. From any state $s_t$, we populate the experience buffer by choosing the next action $a_t$ that maximises $Q^*_s$ (similar to '''Thompson Sampling''').

$$
a_t = \displaystyle\arg\max_a Q^*_s(s_t,a)
$$
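
The per-episode behaviour can be sketched as below. The heads are plain functions returning a list of action values, the toy environment `step` is made up for the example, and choosing the head uniformly at random is an assumption of this sketch.

```python
import random

def run_episode(head_q_functions, initial_state, step, horizon, rng):
    """Sample one head for the whole episode and act greedily with respect to it."""
    k = rng.randrange(len(head_q_functions))   # head index chosen once per episode
    q_k = head_q_functions[k]
    state, transitions = initial_state, []
    for _ in range(horizon):
        q_vals = q_k(state)                    # list of Q_k(s_t, a) for all actions
        action = max(range(len(q_vals)), key=q_vals.__getitem__)
        next_state, reward = step(state, action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return k, transitions

# Toy setup (hypothetical): states are integers, action 0 moves left, action 1 moves right.
def step(state, action):
    next_state = state + (1 if action == 1 else -1)
    return next_state, float(next_state > 3)

heads = [lambda s: [1.0, 0.0], lambda s: [0.0, 1.0]]  # head 0 prefers left, head 1 right
k, transitions = run_episode(heads, initial_state=2, step=step, horizon=5,
                             rng=random.Random(0))
```

Keeping the same head for the whole episode is what makes the behaviour consistent over many timesteps, in contrast to resampling at every step.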

Along with $s_t,a_t,r_t,s_{t+1}$, they also push a bootstrap mask $m_t$. This mask is a binary vector of size $K$ that tells which $Q_k$ should be affected by this datapoint if it is chosen as a training point. For example, if $K=5$ and there is an experience tuple $(s_t,a_t,r_t,s_{t+1},m_t)$ where $m_t=(0,1,1,0,1)$, then $(s_t,a_t,r_t,s_{t+1})$ should only affect $Q_2$, $Q_3$ and $Q_5$.

So, at each iteration, we choose a few points from this buffer and train the respective $Q_{(\cdot)}$ based on the bootstrap masks.

=== How to generate masks? ===

Masks are created by sampling from the '''masking distribution'''. Now, there are many ways to choose this masking distribution:

* If, for each datapoint $D_i$ ($i=1$ to $n$), we draw the mask from $\mbox{Bernoulli}(0.5)$, each head keeps roughly half the points from the original buffer. To get back to size $n$, we double the weight of each kept datapoint. This essentially gives us a '''double or nothing''' bootstrap[[#References|[10]]].
* If the mask is $(1, 1 \cdots 1)$, then this becomes an '''ensemble learning''' method.
* $m_t[k]\sim\mbox{Poi}(1)$ (Poisson distribution)
* $m_t[k]\sim\mbox{Exp}(1)$ (exponential distribution)

For this paper's results, the authors used a $\mbox{Bernoulli}(p)$ distribution.

== Related Work ==

The authors mention the method described in [[#References|[11]]]. The authors of [[#References|[11]]] talk about the principle of "optimism in the face of uncertainty" and modify the reward function to encourage state-action pairs that have not been seen often:

$$
R(s,a) \leftarrow R(s,a)+\beta\cdot\mbox{novelty}(s,a)
$$

According to the authors, [[#References|[11]]]'s DQN algorithm relies on a lot of hand tuning and is only good for non stochastic problems. The authors further compare their results to [[#References|[11]]]'s results on Atari.

The authors also mention an existing algorithm PSRL[[#References|[12,13]]], or posterior sampling based RL. However, this algorithm requires a solved MDP, which is not feasible for large systems. Bootstrapped DQN approximates this idea by sampling from approximate $Q^*$ functions.

Further, the authors mention that the work in [[#References|[12,13]]] has been followed by RLSVI[[#References|[14]]], which solves the problem for linear cases.

== Deep Exploration: Why is Bootstrapped DQN so good at it? ==

The authors consider a simple example to demonstrate the effectiveness of bootstrapped DQN at deep exploration.

[[File:deep_exploration_example.png|thumb||center||700px|Source: this paper, section 5.1]]

In this example, the agent starts at $s_2$. The chain has $N$ states, and each episode lasts $N+9$ timesteps, which generate the experience buffer. The agent is said to have learned the optimal policy if it achieves the best possible reward of $10$ (go to the rightmost state in $N-1$ timesteps, then stay there for $10$ timesteps), for at least $100$ such episodes. The results they got:

[[File:deep_exploration_results.png|thumb||center||700px|Source: this paper, section 5.1]]

The blue dots indicate when the agent learnt the optimal policy. If this took more than $2000$ episodes, they indicate it with a red dot. Thompson DQN is DQN with posterior sampling at every timestep. Ensemble DQN is the same as bootstrapped DQN except that the mask is always $(1,1 \cdots 1)$. It is evident from the graphs that bootstrapped DQN can achieve deep exploration better than these two methods, and better than DQN.

=== But why is it better? ===

The authors say that this is because bootstrapped DQN constructs different approximations to the posterior $Q^*$ with the same initial data. This diversity of approximations comes from the random initialization of weights for the $Q^*_k$ heads. This means that these heads start out trying random actions (because of diverse random initial $Q^*_k$), but when some head finds a good state and generalises to it, some (but not all) of the heads will learn from it, because of the bootstrapping. Eventually other heads will either find other good states, or end up learning the best good states found by the other heads.

So, the architecture explores well and once a head achieves the optimal policy, eventually, all heads achieve the policy.

== Results ==

The authors test their architecture on 49 Atari games. They mention that there has been recent work to improve the performance of DDQNs, but those are tweaks whose intentions are orthogonal to this paper's idea. So, they don't compare their results with them.

=== Scale: What values of $K$, $p$ are best? ===

[[File:scale_k_p.png|thumb||center||800px|Source: this paper, section 6.1]]

Recall that $K$ is the number of bootstrap heads and $p$ is the parameter for the masking distribution (Bernoulli). The authors say that performance is close to its peak at around $K=10$, so this is a reasonable choice.

$p$ also controls the amount of data sharing between heads. A smaller $p$ means each datapoint is less likely (under the Bernoulli distribution) to be included in a given head's bootstrapped dataset $D_i$, so fewer heads share any given datapoint; a larger $p$ means more heads train on the same datapoints. However, the value of $p$ doesn't seem to affect the rewards achieved over time. The authors give the following reasons for it:

* The heads start with random weights for $Q^*$, so the targets (which use $Q^*$) turn out to be different. So the update rules are different.
* Atari is deterministic.
* Because of the initial diversity, the heads will learn differently even if they predict the same action for the given state.

$p=1$ is the value they finally use, because all heads then share every datapoint, which reduces computation time.

=== Performance on Atari ===

In general, the results tell us that bootstrapped DQN achieves better results.

[[File:atari_results_bootstrapped_dqn.png|thumb||center||800px|Source: this paper, section 6.2]]

The authors plot the improvement that bootstrapped DQN achieves on each game. They define '''improvement''' to be $x$ if bootstrapped DQN achieves a better result than DQN in $\frac{1}{x}$ of the frames.

[[File:bdqn_improvement.png|thumb||center||1000px|Source: this paper, section 6.2]]

The authors say that bootstrapped DQN doesn't work well on all Atari games. They point out that there are some challenging games where exploration is key but bootstrapped DQN doesn't do well enough (although it does better than DQN). Some of these games are Frostbite and Montezuma’s Revenge. They say that even better exploration may help, but also point out that there may be other problems, such as network instability, reward clipping and temporally extended rewards.

=== Improvement: Highest Score Reached & how fast is this high score reached? ===

The authors plot the improvement graphs after 20m and 200m frames.

[[File:cumulative_rewards_bdqn.png|thumb||center||700px|Source: this paper, section 6.3]]

=== Visualisation of Results ===

A [https://www.youtube.com/playlist?list=PLdy8eRAW78uLDPNo1jRv8jdTx7aup1ujM YouTube playlist] by one of the authors is available online.

The authors also point out that just purely using bootstrapped DQN as an exploitative strategy is pretty good by itself, better than vanilla DQN. This is because of the deep exploration capabilities of bootstrapped DQN, since it can use the best states it knows and also plan to try out states it doesn't have any information about. Even in the videos, it can be seen that the heads agree at all the crucial decisions, but stay diverse at other less important steps.

== Critique ==

=== Different way to do exploration-exploitation? ===

Instead of choosing the next action $a_t$ that maximises $Q^*_s$, they could have chosen different actions $a_i$ with probabilities

$$
\mathbb{P}(s_t,a_i) = \frac{Q^*_s(s_t,a_i)}{\displaystyle \sum_{j=1}^{|\mathbb{A}|} Q^*_s(s_t,a_j)}
$$

In my opinion, this is closer to Thompson Sampling.
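A minimal sketch of this proportional action selection is given below. Note that the ratio above only defines a probability distribution when all $Q$ values are non-negative, so the sketch rejects negative values; that check, the uniform fallback when all values are zero, and the function names are assumptions added for the illustration.

```python
import numpy as np


def proportional_action_probs(q_values):
    """Turn non-negative Q-values into action probabilities Q_i / sum_j Q_j."""
    q = np.asarray(q_values, dtype=float)
    if np.any(q < 0):
        raise ValueError("proportional sampling needs non-negative Q-values")
    total = q.sum()
    if total == 0:
        # All values are zero: fall back to a uniform choice (assumption).
        return np.full(len(q), 1.0 / len(q))
    return q / total


def sample_action(q_values, rng):
    """Sample an action index with probability proportional to its Q-value."""
    probs = proportional_action_probs(q_values)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
action = sample_action([1.0, 1.0, 2.0], rng)
```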

=== Why use Bernoulli? ===

The choice of a Bernoulli masking distribution ultimately doesn't help them at all, since the algorithm does well because of the initial diversity. Maybe they could use some other masking distribution?

=== Unanswered Questions & Miscellaneous ===
* Why does Thompson DQN perform poorly?
* The actual algorithm is hidden in the appendix. It could have been helpful if it were in the main paper.

== References ==

# [https://bandits.wikischolars.columbia.edu/file/view/Lecture+4.pdf Learning and optimization for sequential decision making, Columbia University, Lec 4]
# [https://www.thoughtco.com/what-is-bootstrapping-in-statistics-3126172 Thoughtco, What is bootstrapping in statistics?]
# [https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf Bootstrap confidence intervals, Class 24, 18.05, MIT Open Courseware]
# [https://arxiv.org/abs/1506.02142 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.]
# [https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf Mnih et al., Playing Atari with Deep Reinforcement Learning, 2015]
# Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
# John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997.
# S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning, 1993.
# [https://arxiv.org/pdf/1509.06461.pdf Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning, 2015.]
# [https://pdfs.semanticscholar.org/d623/c2cbf100d6963ba7dafe55158890d43c78b6.pdf Dean Eckles and Maurits Kaptein, Thompson Sampling with the Online Bootstrap, 2014, Pg 3]
# [https://arxiv.org/abs/1507.00814 Bradly C. Stadie, Sergey Levine, Pieter Abbeel, Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models, 2015.]
# Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling, NIPS 2013.
# Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension, NIPS 2014.
# [https://arxiv.org/abs/1402.0635 Ian Osband, Benjamin Van Roy, Zheng Wen, Generalization and Exploration via Randomized Value Functions, 2014.]
# Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.
Other helpful links (unsorted):
* [http://pemami4911.github.io/paper-summaries/deep-rl/2016/08/16/Deep-exploration.html pemami4911.github.io]
* [http://www.stat.yale.edu/~pollard/Courses/241.fall97/Poisson.pdf Poisson Approximations]

Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition

2017-11-20T01:36:11Z

A2prasad: /* Related Work */

==Introduction==

Action recognition deals with recognizing and classifying the actions or activities performed by humans or other agents in a video clip. Contexts contribute semantic clues for action recognition in video (see the figure below [8]). Conventional convolutional neural networks (CNNs) [1,2,3] and their shifted version, 3D CNNs [4,5,6], have been employed in action recognition, but they identify and aggregate the contexts only at later stages.
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]

The authors have come up with a strategy to identify contexts in the videos as early as possible and leverage their evolution for action recognition. The networks themselves involve many layers, and the first layers, which typically have small receptive fields (RFs), output only very local features. As we go deeper into the layers, the receptive fields expand and we start getting the contexts. The authors identified that increasing the number of layers only adds burden in terms of handling the parameters, and that contexts could be obtained even in the earlier stages. The authors also cite the papers [9,10] that relate CNNs to the visual systems of our brain, one remarkable difference being the abundant recurrent connections in our brain compared to the forward connections in CNNs.

The main contributions in the paper can be summarized as follows:
* A Deep Alternative Neural Network (DANN) is proposed for action recognition.
* DANN consists of alternative volumetric convolutional and recurrent layers.
* An adaptive method to determine the temporal size of the video clip
* A volumetric pyramid pooling layer to resize the output before fully connected layers.

===Related Work===
There already exists a closely related paper ([11]) in the literature which proposed a similar alternation architecture. In particular, the similarity between the authors' work and the aforementioned paper is that they both propose alternating CNN-RNN architectures. This similarity between the two works was noted by Reviewer 1 in the NIPS review process.

=== Optic Flow ===
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.
It can be used for affordance perception, the ability to discern possibilities for action within the environment.

==Deep Alternative Neural Network:==
===Adaptive Network Input===
The input size of the video clip is generally determined empirically, and various approaches have been taken in the past with different numbers of frames. For instance, many previous papers suggested using shorter intervals of between 1 and 16 frames. However, more recent work [9] recognized that human-based actions often “span tens or hundreds of frames” and that longer intervals, such as 60 frames, outperform shorter ones. There is still no systematic way of determining the number of frames for the input size of the network, which motivates the authors of this paper to develop this adaptive method. Past research shows that motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation as shown in Figure 1, and is particularly suitable for temporal estimation because:
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion
* it is relatively robust to changes in camera viewpoint.

The authors have come up with an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy, which exhibits regular periodicity. According to Wikipedia, optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene, and optical flow methods try to calculate the motion between two image frames which are taken at different times. The optical flow energy of an optical field $(v_{x},v_{y})$ is defined as follows:

:<math>e(I)=\sum_{(x,y)\in\mathbb{P}} \|(v_{x}(x,y),v_{y}(x,y))\|_{2}</math>

Here, $\mathbb{P}$ is the pixel-level set of selected interest points. They locate the local minima and maxima landmarks $\{t\}$ of $\epsilon = \{e(I_1),\dots,e(I_t)\}$ and, for each pair of consecutive landmarks, create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.
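The energy computation and landmark-based segmentation can be sketched as below. This is only an illustration under stated assumptions: the flow fields are taken as given NumPy arrays (no particular optical flow method is implied), strict inequalities are used to detect local extrema, and the function names are hypothetical.

```python
import numpy as np


def flow_energy(vx, vy, points=None):
    """Sum of the L2 norms of the flow vectors over the interest points.

    vx, vy: 2-D arrays with the horizontal and vertical flow of one frame.
    points: optional non-empty list of (row, col) interest points;
            all pixels are used when it is None.
    """
    magnitude = np.sqrt(vx ** 2 + vy ** 2)
    if points is None:
        return float(magnitude.sum())
    rows, cols = zip(*points)
    return float(magnitude[list(rows), list(cols)].sum())


def local_extrema(energies):
    """Indices of strict local minima and maxima of the energy sequence."""
    e = np.asarray(energies, dtype=float)
    landmarks = []
    for t in range(1, len(e) - 1):
        is_max = e[t] > e[t - 1] and e[t] > e[t + 1]
        is_min = e[t] < e[t - 1] and e[t] < e[t + 1]
        if is_max or is_min:
            landmarks.append(t)
    return landmarks


def fragments(landmarks):
    """(start, end) frame indices between each pair of consecutive landmarks."""
    return list(zip(landmarks[:-1], landmarks[1:]))


energies = [1.0, 3.0, 2.0, 0.0, 4.0, 4.0]
landmarks = local_extrema(energies)
clips = fragments(landmarks)
```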

[[File:golfswing.png]]

===Alternative Layer===
This is a key layer consisting of a standard volumetric convolutional layer followed by a designed recurrent layer. The volumetric convolutional layer extracts features from local neighborhoods, and a recurrent layer is applied to its output, proceeding iteratively for T steps. The input of a unit at position (x,y,z) in the jth feature map of the ith AL at time t is given by:

:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\
u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz})
</math>

* $u_{ij}^{xyz}(0)$: feed-forward output of the volumetric convolutional layer
* $u_{ij}^{xyz}(t-1)$: recurrent input from the previous time step
* $w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels, respectively
* $f$: ReLU function
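The recurrence can be unrolled as in the following sketch. It is a simplified illustration, not the authors' Torch code: the recurrent kernel is represented by an arbitrary callable <code>recurrent_op</code> (a plain scaling in the example rather than a volumetric convolution), and the function names are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def alternative_layer(u0, recurrent_op, bias, T):
    """Unroll u(t) = u(0) + f(w_r * u(t-1)) + b for T steps.

    u0:           feed-forward output of the convolutional layer, u(0)
    recurrent_op: callable standing in for the recurrent kernel (assumption)
    bias:         scalar or array broadcastable to u0
    """
    u = u0
    for _ in range(T):
        u = u0 + relu(recurrent_op(u)) + bias
    return u


u0 = np.array([1.0, 2.0])
u2 = alternative_layer(u0, lambda u: 0.5 * u, bias=0.1, T=2)
```

With these toy values, the first step gives $(1.6, 3.1)$ and the second gives $(1.9, 3.65)$.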

Figure 3 depicts this structure:
[[File:unfolded.PNG|1000px]]

===Volumetric Pyramid Pooling Layer===

[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]
The authors have replaced the last pooling layer with a volumetric pyramid pooling layer, because the fully connected layers need fixed-length vectors while the AL accepts video clips of arbitrary sizes and produces outputs of variable sizes. The authors use max pooling to pool the responses of each kernel in each volumetric bin. The outputs are $kM$-dimensional vectors, where:

M: number of bins

k: number of kernels in the last alternative layer.

This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.
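A minimal NumPy sketch of such a pooling layer is given below. The pyramid levels, the division of each level into $n \times n \times n$ bins, and the function names are assumptions made for the illustration; the summary above only specifies max pooling over $M$ volumetric bins for each of the $k$ kernels.

```python
import math

import numpy as np


def bin_edges(size, n):
    """Split range(size) into n (possibly overlapping) non-empty bins."""
    return [(math.floor(i * size / n), math.ceil((i + 1) * size / n))
            for i in range(n)]


def volumetric_pyramid_pool(features, levels=(1, 2)):
    """Max-pool a (k, T, H, W) volume into a fixed-length vector of size k * M.

    M is the total number of bins, here sum(n ** 3 for n in levels).
    """
    k, t, h, w = features.shape
    pooled = []
    for n in levels:
        if min(t, h, w) < n:
            raise ValueError("input is smaller than the number of bins")
        for t0, t1 in bin_edges(t, n):
            for h0, h1 in bin_edges(h, n):
                for w0, w1 in bin_edges(w, n):
                    block = features[:, t0:t1, h0:h1, w0:w1]
                    # One maximum per kernel for this volumetric bin.
                    pooled.append(block.max(axis=(1, 2, 3)))
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
short_clip = rng.normal(size=(4, 5, 7, 6))
long_clip = rng.normal(size=(4, 8, 3, 9))
out_a = volumetric_pyramid_pool(short_clip)
out_b = volumetric_pyramid_pool(long_clip)
```

Both outputs have length $k \cdot M = 4 \cdot (1 + 8) = 36$, even though the two input volumes differ in every spatial and temporal dimension.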

==Overall Architecture==
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]
The following are the components of the DANN (as shown in Figure 3)
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps
* 5 ReLU and volumetric pooling layers
* 1 volumetric pyramid pooling layer
* 3 fully connected layers of size 2048 each
* A softmax layer

==Implementation details==
The authors used the Torch toolbox for the implementations of volumetric convolutions, recurrent layers and optimization. For data augmentation they used a technique called random clipping, in which they select a point randomly from the input video to obtain a clip of fixed size 80x80xt, after determining the temporal size t. To train the network, the authors used SGD applied to mini-batches of size 30 with a negative log likelihood criterion. Training is done by minimizing the cross-entropy loss function using the backpropagation through time (BPTT) algorithm. During testing, each video is divided into 80x80xt clips with a stride of 4 frames, and each clip is tested with 10 crops. The final score is the average of all clip-level scores and the crop scores.
Data augmentation techniques such as the multi-scale cropping method have been evaluated because of the recent state-of-the-art performance displayed by Very Deep Two-stream ConvNets. Intuitively, the corner cropping strategy could provide better results (based on trade-off degree), since the receptive fields can focus harder on the central regions of the video frames [7].

==Evaluations==
===Datasets:===
* The datasets used in the evaluation are UCF101 and HMDB51
* UCF101 – 13K videos annotated into 101 classes
* HMDB51 – 6.8K videos with 51 actions.
* Three training and test splits are provided
* Performance measured by mean classification accuracy across the splits.
* UCF101 split – 9.5K videos; HMDB51 – 3.7K training videos.

===Quantitative Results===
The authors used three types of input, viz., sparse optical flow, RGB and TVL1 optical flow, and found that TVL1 is the most suitable, as action recognition is easier to learn from motion information than from raw pixel values. The influence of data augmentation is also studied. With a sliding window with 75% overlap as the baseline, the authors observed that random clipping and multi-scale clipping outperformed the baseline on UCF101 split 1. The authors showed that the adaptive temporal length gave a boost of 4.2% compared with architectures that had a fixed temporal length. Experiments were also conducted to see whether what is learned on one dataset could improve the accuracy on another dataset. Fine-tuning on HMDB51 from UCF101 boosted the performance from 56.4% to 62.5%. The authors also observed that increasing the number of ALs improves the performance, as larger contexts are embedded into the DANN. The DANN achieved an overall accuracy of 65.9% and 91.6% on HMDB51 and UCF101 respectively.

[[File:Performance Comparison of different input modalities.png]]

===Qualitative Analysis===
The authors discuss the quality of the predictions using two example scenes, bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN could aggregate more reasonable semantic structures, and hence it leveraged reasonable video clips as input. On the other hand, the performance on the haircut video clip was not up to the mark, as the rich contexts provided by the DANN were not helpful in a setting with simple actions performed against a simple background.

==Conclusion==
* Deep alternative neural network is introduced for action recognition.
* DANN consists of a volumetric convolutional layer and a recurrent layer.
* The authors have experimented with different datasets like HMDB51 and UCF101 with different scenarios and compared the performance of DANN with other approaches.
* The spatial size is still chosen in an ad hoc manner and this can be an area of improvement.
* There are prospects for studying action tube which is a more compact input.

Github code: https://github.com/wangjinzhuo/DANN

==References==

[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li FeiFei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014

[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.

[3] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 4305–4314, 2015.

[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013.

[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015.

[6] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016.

[7] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream ConvNets. arXiv preprint arXiv:1507.02159, 2015.

[8] IEEE International Symposium on Multimedia 2013

[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016.

[10] https://en.wikipedia.org/wiki/Optical_flow

[11] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016.

[36] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition, pages 214–223. 2007.

A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html

Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition

2017-11-20T01:35:23Z

A2prasad: /* Related Work */

==Introduction==

Action recognition deals with recognizing and classifying the actions or activities performed by humans or other agents in a video clip. Contexts contribute semantic clues for action recognition in video (see the figure below [8]). Convolutional Neural Networks (CNNs) [1,2,3] and their shifted version, 3D CNNs [4,5,6], have been employed in action recognition, but they identify and aggregate contexts only at later stages.
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]

The authors have come up with a strategy to identify contexts in videos as early as possible and to leverage their evolution for action recognition. Such networks typically involve many layers, and the units in the first layer have small receptive fields (RFs), so they output only highly local features. As we go deeper into the layers, the receptive fields expand and we start getting the contexts. The authors argue that increasing the number of layers only adds the burden of handling more parameters, whereas contexts could be obtained even in the earlier stages. The authors also cite papers [9,10] that relate CNNs to the visual system of the brain, one remarkable difference being the abundant recurrent connections in the brain compared to the purely forward connections in CNNs.

The main contributions in the paper can be summarized as follows:
* A Deep Alternative Neural Network (DANN) is proposed for action recognition.
* DANN consists of alternative volumetric convolutional and recurrent layers.
* An adaptive method is used to determine the temporal size of the video clip.
* A volumetric pyramid pooling layer is used to resize the output before the fully connected layers.

==Related Work==
There already exists a closely related paper [11] in the literature which proposes a similar alternating architecture. In particular, both the authors' work and that paper propose alternating CNN-RNN architectures. This relation was noted by Reviewer 1 in the NIPS review process.

=== Optic Flow ===
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.
It can be used for affordance perception, the ability to discern possibilities for action within the environment.

==Deep Alternative Neural Network:==
===Adaptive Network Input===
The input size of the video clip is generally determined empirically, and past approaches have used different numbers of frames. For instance, many previous papers used short intervals of 1 to 16 frames. However, more recent work [9] recognized that human actions often “span tens or hundreds of frames” and that longer intervals, such as 60 frames, outperform shorter ones. There is still no systematic way of determining the number of frames for the network input, which motivates the authors to develop an adaptive method. Past research shows that the motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation, as shown in Figure 1, and is particularly suitable for temporal estimation because:
* the local minima and maxima landmarks probably correspond to characteristic gestures and motions;
* it is relatively robust to changes in camera viewpoint.

The authors propose an adaptive method that automatically selects the most discriminative video fragments using the density of optical flow energy, which exhibits regular periodicity. Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene [10]; optical flow methods try to calculate the motion between two image frames taken at different times. The optical flow energy of an optical flow field $(v_{x},v_{y})$ is defined as follows:

:<math>e(I)=\sum_{(x,y)\in\mathbb{P}} \|(v_{x}(x,y),v_{y}(x,y))\|_{2}</math>

Here, P is the pixel-level set of selected interest points. The authors locate the local minima and maxima landmarks $\{t\}$ of the energy sequence $\epsilon = \{e(I_1),\dots,e(I_t)\}$, and for each pair of consecutive landmarks they create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.

[[File:golfswing.png]]
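
The fragment selection described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (<code>flow_energy</code>, <code>landmarks</code>, <code>fragments</code>) are hypothetical, the flow fields are assumed to be given as arrays, every pixel is treated as an interest point when no set P is supplied, and landmarks are taken to be strict local extrema of the energy sequence.

```python
import numpy as np

def flow_energy(vx, vy, points=None):
    """Sum of the L2 norms of the flow vectors (v_x, v_y) over the interest points P."""
    mag = np.sqrt(vx ** 2 + vy ** 2)
    if points is None:
        # Assumption: with no interest points given, use every pixel.
        return float(mag.sum())
    rows, cols = zip(*points)
    return float(mag[list(rows), list(cols)].sum())

def landmarks(energies):
    """Frame indices that are strict local minima or maxima of the energy sequence."""
    e = np.asarray(energies, dtype=float)
    idx = []
    for t in range(1, len(e) - 1):
        is_max = e[t] > e[t - 1] and e[t] > e[t + 1]
        is_min = e[t] < e[t - 1] and e[t] < e[t + 1]
        if is_max or is_min:
            idx.append(t)
    return idx

def fragments(energies):
    """Frame-index ranges spanned by each pair of consecutive landmarks."""
    marks = landmarks(energies)
    return [list(range(a, b + 1)) for a, b in zip(marks[:-1], marks[1:])]
```

For an energy sequence such as <code>[1, 3, 2, 0, 4, 5, 1]</code>, the landmarks fall at frames 1, 3 and 5, so two fragments are produced, each with its own temporal length.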

===Alternative Layer===
This is the key layer, consisting of a standard volumetric convolutional layer followed by a specially designed recurrent layer. The volumetric convolution extracts features from local neighborhoods; a recurrent layer is then applied to its output and proceeds iteratively for T steps. The input of a unit at position (x,y,z) in the jth feature map of the ith AL at time t is given by

:<math>\begin{align}
u_{ij}^{xyz}(t) &= u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\
u_{ij}^{xyz}(0) &= f(w_{i-1}^{c}u_{(i-1)j}^{xyz})
\end{align}</math>

* U(0): feed-forward output of the volumetric convolutional layer
* U(t-1): recurrent input from the previous time step
* $w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels, respectively
* f: the ReLU function

The figure below depicts this unfolded structure:
[[File:unfolded.PNG|1000px]]
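
To make the recurrence above concrete, the following sketch unfolds it for T steps on a single feature map. It is a simplification under stated assumptions rather than the authors' layer: the feed-forward and recurrent kernels are replaced by scalars <code>w_c</code> and <code>w_r</code> (as if the kernels were 1x1x1, so convolution reduces to scaling), and the function name <code>alternative_layer</code> is hypothetical.

```python
import numpy as np

def relu(x):
    """The activation f in the recurrence."""
    return np.maximum(x, 0.0)

def alternative_layer(prev, w_c, w_r, b, T):
    """Unfold u(t) = u(0) + f(w_r * u(t-1)) + b for T steps.

    prev     : output of the previous layer, shape (X, Y, Z)
    w_c, w_r : scalars standing in for the feed-forward and recurrent kernels
               (assumption: 1x1x1 kernels, so convolution becomes scaling)
    b        : scalar bias
    """
    u0 = relu(w_c * prev)            # u(0): feed-forward output
    u = u0
    for _ in range(T):
        u = u0 + relu(w_r * u) + b   # recurrent update from the previous step
    return u
```

With T = 0 the layer reduces to a plain feed-forward layer; each additional step re-injects the feed-forward output u(0), so the recurrent path refines the same feature map rather than replacing it.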

===Volumetric Pyramid Pooling Layer===

[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]
The authors replaced the last pooling layer with a volumetric pyramid pooling layer, because the fully connected layers need fixed-length vectors while the ALs accept video clips of arbitrary size and produce outputs of variable size. Max pooling is used to pool the responses of each kernel in each volumetric bin. The outputs are KM-dimensional vectors, where:

M: number of bins

K: number of kernels in the last alternative layer.

This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.
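
A minimal sketch of such a pooling layer is given below. It assumes the feature maps are stored as an array of shape (K, T, H, W) and that pyramid level n splits each of the three axes into n bins (n³ volumetric bins per level); the bin layout, the default levels and the function name <code>volumetric_pyramid_pool</code> are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def volumetric_pyramid_pool(features, levels=(1, 2)):
    """Max-pool a (K, T, H, W) array into a fixed-length vector of K * M entries.

    Level n splits every axis into n bins, so M = sum(n**3 for n in levels).
    Assumption: each of T, H, W is at least max(levels), so no bin is empty.
    """
    K, T, H, W = features.shape
    pooled = []
    for n in levels:
        # Integer bin edges along each axis.
        t_edges = [(i * T) // n for i in range(n + 1)]
        h_edges = [(i * H) // n for i in range(n + 1)]
        w_edges = [(i * W) // n for i in range(n + 1)]
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    block = features[:,
                                     t_edges[i]:t_edges[i + 1],
                                     h_edges[j]:h_edges[j + 1],
                                     w_edges[k]:w_edges[k + 1]]
                    # One maximum per kernel for this volumetric bin.
                    pooled.append(block.max(axis=(1, 2, 3)))
    return np.concatenate(pooled)
```

With K = 2 kernels and levels (1, 2), M = 1 + 8 = 9, so inputs of shape (2, 4, 6, 6) and (2, 7, 5, 9) both map to vectors of length 18, which is what lets the fully connected layers accept clips of different sizes.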

==Overall Architecture==
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]
The following are the components of the DANN (as shown in Figure 3):
* 6 alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps
* 5 ReLU and volumetric pooling layers
* 1 volumetric pyramid pooling layer
* 3 fully connected layers of size 2048 each
* A softmax layer

==Implementation details==
The authors used the Torch toolbox to implement the volumetric convolutions, recurrent layers and optimization. For data augmentation they used a technique called random clipping: after determining the temporal size t, they select a point at random in the input video and crop a clip of fixed size 80x80xt. The network is trained with SGD on mini-batches of size 30, minimizing the cross-entropy (negative log-likelihood) loss with the backpropagation through time (BPTT) algorithm. During testing, each video is divided into 80x80xt clips with a stride of 4 frames, and each clip is tested with 10 crops. The final score is the average of all clip-level and crop-level scores.
Data augmentation techniques such as multi-scale cropping were also evaluated, motivated by the state-of-the-art performance recently achieved by Very Deep Two-stream ConvNets. Intuitively, the corner cropping strategy could provide better results (depending on the degree of trade-off), since the receptive fields can focus harder on the central regions of the video frames [7].
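
The clipping and score-averaging steps can be sketched as follows. This is a hedged illustration only: the video is assumed to be an array of shape (T, H, W), the helper names (<code>random_clip</code>, <code>clip_starts</code>, <code>video_score</code>) are hypothetical, and the 10-crop scheme is not reproduced; only the averaging of whatever per-clip or per-crop scores are supplied is shown.

```python
import numpy as np

def random_clip(video, t, size=80, rng=None):
    """Crop a random size x size x t sub-volume from a (T, H, W) video."""
    rng = np.random.default_rng() if rng is None else rng
    T, H, W = video.shape[:3]
    t0 = rng.integers(0, T - t + 1)      # random start frame
    y0 = rng.integers(0, H - size + 1)   # random top-left corner
    x0 = rng.integers(0, W - size + 1)
    return video[t0:t0 + t, y0:y0 + size, x0:x0 + size]

def clip_starts(num_frames, t, stride=4):
    """Start frames of the test-time clips taken with a fixed temporal stride."""
    return list(range(0, num_frames - t + 1, stride))

def video_score(clip_scores):
    """Average per-clip (or per-crop) class scores into one video-level score."""
    return np.mean(np.asarray(clip_scores, dtype=float), axis=0)
```

For a 20-frame video and t = 8, a stride of 4 yields clips starting at frames 0, 4, 8 and 12; their class scores are then averaged to give the final prediction.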

==Evaluations==
===Datasets:===
* The datasets used in the evaluation are UCF101 and HMDB51.
* UCF101 contains 13K videos annotated into 101 classes.
* HMDB51 contains 6.8K videos with 51 actions.
* Three training and test splits are provided for each dataset.
* Performance is measured by the mean classification accuracy across the splits.
* Each UCF101 split contains 9.5K training videos; each HMDB51 split contains 3.7K training videos.

===Quantitative Results===
The authors compared three types of input, viz. sparse optical flow, RGB and TVL1 optical flow [36], and found TVL1 to be suitable, as action recognition is easier to learn from motion information than from raw pixel values. The influence of data augmentation was also studied: with a sliding window with 75% overlap as the baseline, random clipping and multi-scale clipping both outperformed the baseline on UCF101 split 1. The authors showed that the adaptive temporal length gives a boost of 4.2% compared with architectures that use a fixed-size temporal length. Experiments were also conducted to see whether training on one dataset could improve the accuracy on another: fine-tuning on HMDB51 from a model trained on UCF101 boosted the performance from 56.4% to 62.5%. The authors also observed that increasing the number of ALs improves performance, as larger contexts are embedded into the DANN. The DANN achieved an overall accuracy of 65.9% on HMDB51 and 91.6% on UCF101.

[[File:Performance Comparison of different input modalities.png]]

===Qualitative Analysis===
The authors discuss the quality of the predictions using example video clips from two different scenes, bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN could aggregate more reasonable semantic structures, and hence it leveraged reasonable video clips as input. On the other hand, the performance on the haircut clip was not up to the mark, as the rich contexts provided by the DANN were not helpful in a setting with simple actions performed against a simple background.

==Conclusion==
* A deep alternative neural network (DANN) is introduced for action recognition.
* DANN consists of volumetric convolutional layers and recurrent layers.
* The authors have experimented with the HMDB51 and UCF101 datasets under different scenarios and compared the performance of DANN with other approaches.
* The spatial size is still chosen in an ad hoc manner, which can be an area of improvement.
* There are prospects for studying action tubes, which are a more compact input.

Github code: https://github.com/wangjinzhuo/DANN

==References==

[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.

[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.

[3] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 4305–4314, 2015.

[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013.

[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.

[6] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016.

[7] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream ConvNets. arXiv preprint arXiv:1507.02159, 2015.

[8] IEEE International Symposium on Multimedia 2013

[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016.

[10] https://en.wikipedia.org/wiki/Optical_flow

[11] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016.

[36] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition, pages 214–223, 2007.

A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html

Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition

2017-11-20T01:34:52Z

A2prasad: /* References */

==Introduction==

Action recognition deals with recognizing and classifying the actions or activities performed by humans or other agents in a video clip. Contexts contribute semantic clues for action recognition in video (see the figure below [8]). Convolutional Neural Networks (CNNs) [1,2,3] and their shifted version, 3D CNNs [4,5,6], have been employed in action recognition, but they identify and aggregate contexts only at later stages.
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]

The authors have come up with a strategy to identify contexts in videos as early as possible and to leverage their evolution for action recognition. Such networks typically involve many layers, and the units in the first layer have small receptive fields (RFs), so they output only highly local features. As we go deeper into the layers, the receptive fields expand and we start getting the contexts. The authors argue that increasing the number of layers only adds the burden of handling more parameters, whereas contexts could be obtained even in the earlier stages. The authors also cite papers [9,10] that relate CNNs to the visual system of the brain, one remarkable difference being the abundant recurrent connections in the brain compared to the purely forward connections in CNNs.

The main contributions in the paper can be summarized as follows:
* A Deep Alternative Neural Network (DANN) is proposed for action recognition.
* DANN consists of alternative volumetric convolutional and recurrent layers.
* An adaptive method is used to determine the temporal size of the video clip.
* A volumetric pyramid pooling layer is used to resize the output before the fully connected layers.

==Related Work==
There already exists a closely related paper [11] in the literature which proposes a similar alternating architecture. In particular, both the authors' work and that paper propose alternating CNN-RNN architectures. This relation was noted by Reviewer 1 in the NIPS review process.

=== Optic Flow ===
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.
It can be used for affordance perception, the ability to discern possibilities for action within the environment.

==Deep Alternative Neural Network:==
===Adaptive Network Input===
The input size of the video clip is generally determined empirically, and past approaches have used different numbers of frames. For instance, many previous papers used short intervals of 1 to 16 frames. However, more recent work [9] recognized that human actions often “span tens or hundreds of frames” and that longer intervals, such as 60 frames, outperform shorter ones. There is still no systematic way of determining the number of frames for the network input, which motivates the authors to develop an adaptive method. Past research shows that the motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation, as shown in Figure 1, and is particularly suitable for temporal estimation because:
* the local minima and maxima landmarks probably correspond to characteristic gestures and motions;
* it is relatively robust to changes in camera viewpoint.

The authors propose an adaptive method that automatically selects the most discriminative video fragments using the density of optical flow energy, which exhibits regular periodicity. Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene [10]; optical flow methods try to calculate the motion between two image frames taken at different times. The optical flow energy of an optical flow field $(v_{x},v_{y})$ is defined as follows:

:<math>e(I)=\sum_{(x,y)\in\mathbb{P}} \|(v_{x}(x,y),v_{y}(x,y))\|_{2}</math>

Here, P is the pixel-level set of selected interest points. The authors locate the local minima and maxima landmarks $\{t\}$ of the energy sequence $\epsilon = \{e(I_1),\dots,e(I_t)\}$, and for each pair of consecutive landmarks they create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.

[[File:golfswing.png]]

===Alternative Layer===
This is the key layer, consisting of a standard volumetric convolutional layer followed by a specially designed recurrent layer. The volumetric convolution extracts features from local neighborhoods; a recurrent layer is then applied to its output and proceeds iteratively for T steps. The input of a unit at position (x,y,z) in the jth feature map of the ith AL at time t is given by

:<math>\begin{align}
u_{ij}^{xyz}(t) &= u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\
u_{ij}^{xyz}(0) &= f(w_{i-1}^{c}u_{(i-1)j}^{xyz})
\end{align}</math>

* U(0): feed-forward output of the volumetric convolutional layer
* U(t-1): recurrent input from the previous time step
* $w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels, respectively
* f: the ReLU function

The figure below depicts this unfolded structure:
[[File:unfolded.PNG|1000px]]

===Volumetric Pyramid Pooling Layer===

[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]
The authors replaced the last pooling layer with a volumetric pyramid pooling layer, because the fully connected layers need fixed-length vectors while the ALs accept video clips of arbitrary size and produce outputs of variable size. Max pooling is used to pool the responses of each kernel in each volumetric bin. The outputs are KM-dimensional vectors, where:

M: number of bins

K: number of kernels in the last alternative layer.

This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.

==Overall Architecture==
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]
The following are the components of the DANN (as shown in Figure 3):
* 6 alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps
* 5 ReLU and volumetric pooling layers
* 1 volumetric pyramid pooling layer
* 3 fully connected layers of size 2048 each
* A softmax layer

==Implementation details==
The authors used the Torch toolbox to implement the volumetric convolutions, recurrent layers and optimization. For data augmentation they used a technique called random clipping: after determining the temporal size t, they select a point at random in the input video and crop a clip of fixed size 80x80xt. The network is trained with SGD on mini-batches of size 30, minimizing the cross-entropy (negative log-likelihood) loss with the backpropagation through time (BPTT) algorithm. During testing, each video is divided into 80x80xt clips with a stride of 4 frames, and each clip is tested with 10 crops. The final score is the average of all clip-level and crop-level scores.
Data augmentation techniques such as multi-scale cropping were also evaluated, motivated by the state-of-the-art performance recently achieved by Very Deep Two-stream ConvNets. Intuitively, the corner cropping strategy could provide better results (depending on the degree of trade-off), since the receptive fields can focus harder on the central regions of the video frames [7].

==Evaluations==
===Datasets:===
* The datasets used in the evaluation are UCF101 and HMDB51
* UCF101 – 13K videos annotated into 101 classes
* HMDB51 – 6.8K videos with 51 actions.
* Three training and test splits are provided
* Performance measured by mean classification accuracy across the splits.
* Each UCF101 split has 9.5K training videos; each HMDB51 split has 3.7K training videos.

===Quantitative Results===
The authors compared three input modalities (sparse, RGB and TVL1) and found TVL1 suitable, since action recognition is easier to learn from motion information than from raw pixel values. The influence of data augmentation was also studied: with a sliding window with 75% overlap as the baseline, random clipping and multi-scale clipping both outperformed the baseline on UCF101 split 1. The authors reported that the adaptive temporal length gave a boost of 4.2% over architectures with a fixed temporal length. Experiments were also conducted to see whether learning on one dataset could improve accuracy on another: fine-tuning on HMDB51 from UCF101 raised the performance from 56.4% to 62.5%. The authors also observed that increasing the number of ALs improves performance, as larger contexts are embedded into the DANN. The DANN achieved an overall accuracy of 65.9% on HMDB51 and 91.6% on UCF101.

[[File:Performance Comparison of different input modalities.png]]

===Qualitative Analysis===
The authors discuss the quality of the predictions using two example scenes, bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN aggregates more reasonable semantic structures and hence selects reasonable video clips as input. On the other hand, performance on the haircut clip was not up to the mark, as the rich contexts provided by the DANN were not helpful in a setting with simple actions performed against a simple background.

==Conclusion==
* A deep alternative neural network (DANN) is introduced for action recognition.
* DANN consists of alternating volumetric convolutional and recurrent layers.
* The authors experimented with the HMDB51 and UCF101 datasets under different scenarios and compared the performance of DANN with other approaches.
* The spatial size is still chosen in an ad hoc manner, which is an area for improvement.
* There are prospects for studying action tubes, which are a more compact input.

Github code: https://github.com/wangjinzhuo/DANN

==References==

[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.

[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.

[3] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 4305–4314, 2015.

[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013.

[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.

[6] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016.

[7] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream ConvNets. arXiv preprint arXiv:1507.02159, 2015.

[8] IEEE International Symposium on Multimedia 2013

[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016.

[10] https://en.wikipedia.org/wiki/Optical_flow

[11] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016.

[36] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition, pages 214–223, 2007.

A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html

Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition

2017-11-20T01:33:51Z

A2prasad:

==Introduction==

Action recognition deals with recognizing and classifying the actions or activities performed by humans or other agents in a video clip. Contexts contribute semantic clues for action recognition in video (see the figure below [8]). Convolutional neural networks (CNNs) [1,2,3] and their shifted version, 3D CNNs [4,5,6], have been employed for action recognition, but they identify and aggregate contexts only at later stages.
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]

The authors propose a strategy to identify contexts in videos as early as possible and to leverage their evolution for action recognition. Such networks have many layers, and the units in the first layers have small receptive fields (RFs), so they output only extremely local features. As we go deeper, the receptive fields expand and contexts start to emerge. The authors observe that simply adding layers increases the burden of handling more parameters, whereas contexts can be obtained even in the earlier stages. The authors also cite papers [9,10] that relate CNNs to the visual system of the brain, one remarkable difference being the abundant recurrent connections in the brain compared with the purely feed-forward connections in CNNs.

The main contributions in the paper can be summarized as follows:
* A Deep Alternative Neural Network (DANN) is proposed for action recognition.
* DANN consists of alternative volumetric convolutional and recurrent layers.
* An adaptive method to determine the temporal size of the video clip
* A volumetric pyramid pooling layer to resize the output before fully connected layers.

==Related Work==
A closely related paper ([14]) already exists in the literature and proposes a similar alternating architecture. In particular, both the authors' work and the aforementioned paper propose alternating CNN-RNN architectures. This relation was noted by Reviewer 1 in the NIPS review process.

=== Optic Flow ===
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.
It can be used for affordance perception, the ability to discern possibilities for action within the environment.

==Deep Alternative Neural Network:==
===Adaptive Network Input===
The input size of the video clip is generally determined empirically, and past approaches have used different numbers of frames. For instance, many previous papers suggested using short intervals of 1 to 16 frames. However, more recent work [9] recognized that human actions often “span tens or hundreds of frames” and that longer intervals, such as 60 frames, outperform shorter ones. Still, there is no systematic way of determining the number of frames for the network input, which motivated the authors of this paper to develop an adaptive method. Past research shows that the motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation, as shown in Figure 1, and is particularly suitable for temporal estimation because:
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion
* it is relatively robust to changes in camera viewpoint.

The authors have come up with an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy, which exhibits regular periodicity. Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene [10]; optical flow methods try to calculate the motion between two image frames taken at different times. The optical flow energy of an optical flow field $(v_{x},v_{y})$ is defined as follows:

:<math>e(I)=\sum_{(x,y)\in\mathbb{P}} \left\|\left(v_{x}(x,y),v_{y}(x,y)\right)\right\|_{2}</math>

Here, $\mathbb{P}$ is the pixel-level set of selected interest points. The authors locate the local minima and maxima landmarks $\{t\}$ of the energy sequence $\epsilon = \{e(I_1),\dots,e(I_t)\}$ and, for each pair of consecutive landmarks, create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.
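The fragment selection described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the strict-inequality test for landmarks, and the representation of the flow field as two NumPy arrays indexed as <code>[y, x]</code> are all assumptions made for the example.

```python
import numpy as np

def flow_energy(vx, vy, points):
    """Sum of L2 norms of the flow vectors at the selected interest points."""
    return float(sum(np.hypot(vx[y, x], vy[y, x]) for (x, y) in points))

def landmarks(energies):
    """Indices of strict local minima and maxima of the energy sequence."""
    idx = []
    for t in range(1, len(energies) - 1):
        prev, cur, nxt = energies[t - 1], energies[t], energies[t + 1]
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            idx.append(t)
    return idx

def fragments(marks):
    """Frame-index ranges spanning each pair of consecutive landmarks."""
    return [list(range(a, b + 1)) for a, b in zip(marks, marks[1:])]

# Hypothetical per-frame energies for a short video.
energies = [1.0, 3.0, 2.0, 0.5, 2.5, 4.0, 1.5]
marks = landmarks(energies)   # [1, 3, 5]
clips = fragments(marks)      # [[1, 2, 3], [3, 4, 5]]
```

Each resulting fragment starts at one landmark and ends at the next, so its temporal length adapts to the periodicity of the motion energy.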

[[File:golfswing.png]]

===Alternative Layer===
This is the key layer, consisting of a standard volumetric convolutional layer followed by a specially designed recurrent layer. The volumetric convolution extracts features from local neighborhoods, and the recurrent layer is then applied to its output iteratively for T steps. The input of a unit at position (x,y,z) in the jth feature map of the ith AL at time t is given by:

:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij}</math>
:<math>u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz})</math>

* $u_{ij}^{xyz}(0)$: feed-forward output of the volumetric convolutional layer
* $u_{ij}^{xyz}(t-1)$: recurrent input from the previous time step
* $w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels respectively
* f: ReLU function

The figure below depicts the unfolded structure:
[[File:unfolded.PNG|1000px]]
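The recurrence of the alternative layer can be sketched as below. This is a simplified illustration under stated assumptions: dense matrices <code>w_c</code> and <code>w_r</code> stand in for the vectorized volumetric and recurrent kernels, a single bias vector <code>b</code> is used, and the function name <code>alternative_layer</code> is hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def alternative_layer(x, w_c, w_r, b, T):
    """Unfold u(t) = u(0) + f(w_r u(t-1)) + b for T recurrent steps."""
    u0 = relu(w_c @ x)   # feed-forward output u(0)
    u = u0
    for _ in range(T):
        # The feed-forward term u(0) is re-injected at every step.
        u = u0 + relu(w_r @ u) + b
    return u
```

Because every step applies the recurrent kernel to the previous state, the effective receptive field of each unit grows with T without adding new layers, which is how contexts are gathered early in the network.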

===Volumetric Pyramid Pooling Layer===

[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]
The authors replace the last pooling layer with a volumetric pyramid pooling layer, because the fully connected layers need fixed-length vectors while the alternative layers (ALs) accept video clips of arbitrary size and produce outputs of variable size. Max pooling is used to pool the responses of each kernel in each volumetric bin. The outputs are $kM$-dimensional vectors, where:

M: number of bins

k: number of kernels in the last alternative layer.

This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.
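A minimal sketch of such a pooling layer is given below. The pyramid levels <code>(1, 2)</code>, the integer bin boundaries and the function names are assumptions chosen for illustration; the summary does not state the bin configuration used in the paper. The input must have at least as many entries along each axis as the largest level.

```python
import numpy as np

def bin_edges(size, n):
    """Integer boundaries that split an axis of length `size` into n bins."""
    return [(i * size) // n for i in range(n + 1)]

def volumetric_pyramid_pool(features, levels=(1, 2)):
    """Max-pool a (k, T, H, W) response volume into a k*M vector.

    Level n splits each of the three axes into n bins (n**3 bins per level);
    M is the total number of bins over all levels.
    """
    k, T, H, W = features.shape
    pooled = []
    for n in levels:
        te, he, we = bin_edges(T, n), bin_edges(H, n), bin_edges(W, n)
        for a in range(n):
            for b in range(n):
                for c in range(n):
                    block = features[:, te[a]:te[a + 1],
                                     he[b]:he[b + 1],
                                     we[c]:we[c + 1]]
                    pooled.append(block.max(axis=(1, 2, 3)))
    return np.concatenate(pooled)
```

With these assumed levels M = 1 + 8 = 9, so a volume with k = 4 kernels always yields a 36-dimensional vector, whatever its temporal length or spatial size.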

==Overall Architecture==
[[File:DANN Architecture.png|thumb|550px|Figure 3: DANN Architecture]]
The following are the components of the DANN (as shown in Figure 3):
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps
* 5 ReLU and volumetric pooling layers
* 1 volumetric pyramid pooling layer
* 3 fully connected layers of size 2048 each
* A softmax layer

==Implementation details==
The authors used the Torch toolbox to implement the volumetric convolutions, recurrent layers and optimization. For data augmentation they used a technique called random clipping: after determining the temporal size t, they select a random point in the input video and extract a clip of fixed size 80x80xt there. The network is trained with SGD on mini-batches of size 30, minimizing the cross-entropy (negative log-likelihood) loss with backpropagation through time (BPTT). At test time, each video is divided into 80x80xt clips with a stride of 4 frames, and each clip is tested with 10 crops. The final score is the average of all clip-level and crop-level scores.

Data augmentation techniques such as multi-scale cropping were also evaluated, motivated by the state-of-the-art performance of Very Deep Two-stream ConvNets [7]. Intuitively, the corner cropping strategy could provide better results (depending on the degree of trade-off), since the receptive fields can focus harder on the central regions of the video frames [7].
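The test-time procedure can be illustrated with a small sketch. The helper names are hypothetical, and the per-clip scores are assumed to be class-probability vectors already produced by the network (one row per clip or crop).

```python
import numpy as np

def clip_starts(num_frames, t, stride=4):
    """Start indices of the length-t clips taken every `stride` frames."""
    return list(range(0, num_frames - t + 1, stride))

def video_score(scores):
    """Average clip-level and crop-level class scores into one video-level score."""
    return np.mean(np.asarray(scores, dtype=float), axis=0)
```

For a 20-frame video and t = 8, <code>clip_starts(20, 8)</code> gives clips starting at frames 0, 4, 8 and 12; the predicted class is then the arg-max of the averaged score vector.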

==Evaluations==
===Datasets:===
* The datasets used in the evaluation are UCF101 and HMDB51
* UCF101 – 13K videos annotated into 101 classes
* HMDB51 – 6.8K videos with 51 actions.
* Three training and test splits are provided
* Performance measured by mean classification accuracy across the splits.
* Each UCF101 split has 9.5K training videos; each HMDB51 split has 3.7K training videos.

===Quantitative Results===
The authors compared three input modalities (sparse, RGB and TVL1) and found TVL1 suitable, since action recognition is easier to learn from motion information than from raw pixel values. The influence of data augmentation was also studied: with a sliding window with 75% overlap as the baseline, random clipping and multi-scale clipping both outperformed the baseline on UCF101 split 1. The authors reported that the adaptive temporal length gave a boost of 4.2% over architectures with a fixed temporal length. Experiments were also conducted to see whether learning on one dataset could improve accuracy on another: fine-tuning on HMDB51 from UCF101 raised the performance from 56.4% to 62.5%. The authors also observed that increasing the number of ALs improves performance, as larger contexts are embedded into the DANN. The DANN achieved an overall accuracy of 65.9% on HMDB51 and 91.6% on UCF101.

[[File:Performance Comparison of different input modalities.png]]

===Qualitative Analysis===
The authors discuss the quality of the predictions using two example scenes, bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN aggregates more reasonable semantic structures and hence selects reasonable video clips as input. On the other hand, performance on the haircut clip was not up to the mark, as the rich contexts provided by the DANN were not helpful in a setting with simple actions performed against a simple background.

==Conclusion==
* A deep alternative neural network (DANN) is introduced for action recognition.
* DANN consists of alternating volumetric convolutional and recurrent layers.
* The authors experimented with the HMDB51 and UCF101 datasets under different scenarios and compared the performance of DANN with other approaches.
* The spatial size is still chosen in an ad hoc manner, which is an area for improvement.
* There are prospects for studying action tubes, which are a more compact input.

Github code: https://github.com/wangjinzhuo/DANN

==References==

[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.

[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.

[3] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 4305–4314, 2015.

[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013.

[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.

[6] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016.

[7] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream ConvNets. arXiv preprint arXiv:1507.02159, 2015.

[8] IEEE International Symposium on Multimedia 2013

[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016.

[10] https://en.wikipedia.org/wiki/Optical_flow

[36] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition, pages 214–223, 2007.

A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html

Universal Style Transfer via Feature Transforms

2017-11-20T00:56:39Z

A2prasad: /* Additional Results and Figures */

=Introduction=
When viewing an image, whether it is a photograph or a painting, two distinct types of information are present. First, there is the content of the image, such as a person in a portrait. However, the content does not uniquely define the image. If multiple artists paint a portrait of the same subject, the results vary even though the content is the same. The variation is rooted in the style of each particular artist. Style transfer between two images therefore leaves the content unaffected while copying the style. Style transfer is an important image editing task which enables the creation of new artistic works. Typically one image is termed the content/reference image, whose style is discarded. The other image is called the style image, whose style, but not its content, is copied to the content image.

Deep learning techniques have been shown to be effective for implementing style transfer. Previous methods have been successful but have several key limitations and often trade off between generalization, quality and efficiency: either they are fast but support very few styles, or they can handle arbitrary styles but are no longer efficient. The presented paper establishes a compromise between these two extremes by using only whitening and colouring transforms (WCT) to transfer a style within a feed-forward image reconstruction architecture. No training of the underlying deep network is required per style.

=Related Work=
Gatys et al. developed a new method for generating textures from sample images in 2015 [1] and extended their approach to style transfer by 2016 [2]. They proposed the use of a pre-trained convolutional neural network (CNN) to separate the content and style of input images. As the approach proved successful, a number of improvements quickly followed, reducing computational time, increasing the diversity of transferable styles, and improving the quality of the results. Central to these approaches, and to the present paper, is the use of a CNN.

In 2017, Mechrez et al. [12] proposed an approach that takes as input a stylized image and makes it more photorealistic. Their approach relied on the Screened Poisson Equation, maintaining the fidelity of the stylized image while constraining the gradients to those of the original input image. The method they proposed was fast, simple, fully automatic and showed positive progress in making a stylized image photorealistic.

Alternative attempts to transfer multiple styles with a single network include models conditioned on binary selection units [13], a network that learns a set of new filters for every new style [15], and a novel conditional normalization layer that learns normalization parameters for each style [3].
==How Content and Style are Extracted using CNNs==
A CNN was chosen due to its ability to extract high-level features from images. These features can be interpreted in two ways. Within layer <math> l </math> there are <math> N_l </math> feature maps, each of size <math> M_l </math>. For a particular input image, the feature responses are given by <math> F_{i,j}^l </math>, the activation of feature map <math> i </math> at position <math> j </math> in that layer. Starting with a white noise image and a reference (content) image, the features can be transferred by minimizing

<center>
<math> \mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F_{i,j}^l - P_{i,j}^l \right)^2 </math>
</center>

where <math> P_{i,j}^l </math> denotes the feature map output caused by the white noise image. This loss function therefore preserves the content of the reference image. The style is described using a Gram matrix given by

<center>
<math>
G_{i,j}^l = \sum_k F_{i,k}^l F_{j,k}^l
</math>
</center>

The Gram matrix $G$ of a set of vectors $v_1,\dots,v_n$ is the matrix of all possible inner products, with entries $G_{ij}=v_i^Tv_j$. The loss function that describes the difference in style between two images is:

<center>
<math>
\mathcal{L}_{style} = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G_{i,j}^l - A_{i,j}^l \right)^2
</math>
</center>

where <math> G_{i,j}^l </math> and <math> A_{i,j}^l </math> are the Gram matrices of the generated image and the style image respectively. Three images are therefore required: a style image, a content image and an initial white noise image. Iterative optimization is then used to add content from one image to the white noise image, and style from the other. An additional parameter is used to balance the ratio of these loss functions.
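The two losses can be written compactly with NumPy. The sketch below assumes that the features of one layer are stored as an <math> N_l \times M_l </math> matrix (one row per feature map, one column per spatial position); the function names are illustrative.

```python
import numpy as np

def gram(F):
    """Gram matrix of an (N_l, M_l) feature matrix: G[i, j] = sum_k F[i, k] * F[j, k]."""
    return F @ F.T

def content_loss(F, P):
    """Half the squared difference between two feature matrices."""
    return 0.5 * np.sum((F - P) ** 2)

def style_loss(F_generated, F_style):
    """Squared Gram-matrix difference, scaled by 1 / (4 N_l^2 M_l^2)."""
    N, M = F_generated.shape
    diff = gram(F_generated) - gram(F_style)
    return np.sum(diff ** 2) / (4.0 * N ** 2 * M ** 2)
```

The Gram matrix sums over spatial positions, so the style loss ignores where a feature occurs and only compares which feature maps tend to be active together.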

The 19-layer ImageNet-trained VGG network was chosen by Gatys et al. VGG-19 is still commonly used in more recent works, including the presented paper, although training datasets vary. Such CNNs are typically used in classification problems by finalizing their output through a series of fully connected layers. For content and style extraction it is the convolutional layers that are required. The method of Gatys et al. is style independent, since the CNN does not need to be trained for each style image. However, the process of iterative optimization to generate the output image is computationally expensive.

==Other Methods==
Other methods avoid the inefficiency of iterative optimization by training a network/networks on a set of styles. The network then directly transfers the style from the style image to the content image without solving the iterative optimization problem. V. Dumoulin et al. trained a single network on $N$ styles [3]. This improved upon previous work where a network was required per style [4]. The stylized output image was generated by simply running a feedforward pass of the network on the content image. While efficiency is high, the method is no longer able to apply an arbitrary style without retraining.

=Methodology=
Li et al. have proposed a novel method for generating the stylized image. A CNN is still used as in Gatys et al. to extract content and style. However, the stylized image is not generated through iterative optimization or a feed-forward pass as required by previous methods. Instead, whitening and colour transforms are used.

==Image Reconstruction==
[[File:image_resconstruction.png|thumb|150px|right|alt=Training a single decoder.|Training a single decoder. X denotes the layer of the VGG encoder that the decoder receives as input.]]
An auto-encoder network is used to first encode an input image into a set of feature maps, and then decode it back to an image, as shown in the adjacent figure. The encoder network is VGG-19, which is responsible for obtaining the feature maps (as in Gatys et al.). The output of each of the first five layers is fed into a corresponding decoder network, which is a mirrored version of VGG-19. Each decoder network then decodes the feature maps of the $l$th layer, producing an output image. Style transfer is implemented by manipulating the feature maps between the encoder and decoder networks.

First, the auto-encoder network needs to be trained. The following loss function is used:

<center>
<math>
\mathcal{L} = || I_{output} - I_{input} ||_2^2 + \lambda || \Phi(I_{output}) - \Phi(I_{input})||_2^2
</math>
</center>

where $I_{input}$ and $I_{output}$ are the input and output images of the auto-encoder, $\Phi$ is the VGG encoder, and $\lambda$ balances the two terms. The first term of the loss is the pixel reconstruction loss, while the second term is the feature loss. Recall from "Related Work" that the feature maps correspond to the content of the image. The second term can therefore also be seen as penalising content differences that arise due to the encoder network. The network was trained using the Microsoft COCO dataset.
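As a small illustration, the reconstruction loss can be computed as below. Here <code>phi</code> is a placeholder for the VGG encoder (any callable that maps an image array to a feature array), and the default value of <code>lam</code> is an arbitrary assumption, since this summary does not state the value of $\lambda$.

```python
import numpy as np

def reconstruction_loss(i_input, i_output, phi, lam=1.0):
    """Pixel reconstruction loss plus lam times the feature loss."""
    pixel = np.sum((i_output - i_input) ** 2)
    feature = np.sum((phi(i_output) - phi(i_input)) ** 2)
    return pixel + lam * feature
```

A perfect reconstruction gives zero for both terms; the feature term additionally penalises outputs that look close in pixel space but are encoded differently.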

Whitening and colouring transforms are then used to directly transform $f_c$ (the VGG feature map of the content image at a certain layer) so that its covariance matrix matches that of $f_s$ (the VGG feature map of the style image). This process consists of two steps: a whitening transform and a colouring transform. Note that the decoder will reconstruct the original content image if $f_c$ is fed into it directly.

==Whitening Transform==
Whitening first decorrelates the data, so that its covariance matrix becomes diagonal. This is done by computing the eigenvalues and eigenvectors of the covariance matrix. Whitening then rescales the data so that the diagonal elements are all equal, giving an identity covariance matrix. For a feature map from VGG this is achieved through the following steps.

# The feature map $f_c$ is extracted from a layer of the encoder network after activation on the content image. This is the data to be whitened.
# $f_c$ is centered by subtracting its mean vector $m_c$.
# Then, the eigenvectors $E_c$ and eigenvalues $D_c$ are found for the covariance matrix of $f_c$.
# The whitened feature map is then given by $\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c$, where $f_c$ is the centered feature map and $D_c$ is the diagonal matrix of eigenvalues.

The derivation of the whitening equation can be found in [5]. Li et al. found that whitening removes style information from the image.
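The whitening steps can be sketched in NumPy as follows. The feature map is assumed to be flattened into a matrix with one row per channel and one column per spatial position. Normalising the covariance by N - 1 and clipping the eigenvalues at a small <code>eps</code> are choices made for this example, not details taken from the paper.

```python
import numpy as np

def whiten(f_c, eps=1e-8):
    """Whiten a (C, N) feature matrix so that its covariance becomes the identity."""
    m_c = f_c.mean(axis=1, keepdims=True)
    centered = f_c - m_c                               # step 2: subtract the mean
    cov = centered @ centered.T / (centered.shape[1] - 1)
    d_c, e_c = np.linalg.eigh(cov)                     # step 3: eigenvalues, eigenvectors
    d_c = np.clip(d_c, eps, None)                      # guard against zero eigenvalues
    return e_c @ np.diag(d_c ** -0.5) @ e_c.T @ centered   # step 4
```

<code>np.linalg.eigh</code> is used because the covariance matrix is symmetric; it returns real eigenvalues and orthonormal eigenvectors.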

==Colour Transform==
However, whitening does not transfer style from the style image. It only uses feature maps from the content image. The colour transform uses both $\hat{f}_c$ from above and $f_s$, the feature map from the style image.

# $f_s$ is centered by subtracting its mean vector $m_s$.
# Then, the eigenvectors $E_s$ and eigenvalues $D_s$ are calculated for the covariance matrix of $f_s$.
# The colour transform is given by $\hat{f}_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c$.
# Recenter $\hat{f}_{cs}$ using $m_s$.

Intuitively, colouring gives $\hat{f}_{cs}$ the same covariance matrix as $f_s$, that is, the same correlations between feature maps as in the style image. This is where the style transfer takes place.
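The colouring steps admit a similar sketch. It assumes that the input <code>f_c_hat</code> has already been whitened (zero mean, identity covariance) and that both feature maps are stored as channel-by-position matrices; as before, the N - 1 normalisation is an assumption of the example.

```python
import numpy as np

def colour(f_c_hat, f_s):
    """Give whitened (C, N) content features the covariance and mean of f_s."""
    m_s = f_s.mean(axis=1, keepdims=True)
    centered = f_s - m_s                               # step 1: centre the style features
    cov = centered @ centered.T / (centered.shape[1] - 1)
    d_s, e_s = np.linalg.eigh(cov)                     # step 2: eigenvalues, eigenvectors
    d_s = np.clip(d_s, 0.0, None)                      # remove tiny negative round-off
    f_cs_hat = e_s @ np.diag(d_s ** 0.5) @ e_s.T @ f_c_hat   # step 3
    return f_cs_hat + m_s                              # step 4: re-centre with m_s
```

The content and style feature maps may have different numbers of spatial positions, since only the C x C covariance of the style features enters the transform.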

==Content/Style Balance==
Using just $\hat{f}_{cs}$ as the input to the decoder may create a result that is too extreme in style. To balance content and style, a new parameter $\alpha$ is defined:

<center>
<math>
\hat{f}_{cs} = \alpha \hat{f}_{cs} + (1 - \alpha) f_c
</math>
</center>

The authors use $\alpha = 0.6$ in the style transfer experiments.
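The blending step is a simple linear interpolation between the transformed and the original content features. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def blend(f_cs_hat, f_c, alpha=0.6):
    """Interpolate between stylised features (alpha = 1) and content features (alpha = 0)."""
    return alpha * np.asarray(f_cs_hat) + (1.0 - alpha) * np.asarray(f_c)
```

Setting <code>alpha</code> to 0 feeds the unmodified content features to the decoder, which then reconstructs the content image, while 1 applies the full style.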

==Using Multiple Layers==
As mentioned previously, multiple decoders were trained, one for each of the first five layers of the encoder network. Each layer of a CNN perceives features at a different level. Layers close to the input image detect lower-level local features such as edges, while layers deeper in the network detect more complex global features. The style transfer algorithm is applied at each of these levels, which raises the question of which of the results, shown below, to use.

[[File:multilevel_features.png|thumb|700px|center|alt=Results of style transfer from each of the first five layers of the encoder network.|Results of style transfer from each of the first five layers of the encoder network.]]

Ideally, the results of each layer should be used to build the final output image, so that the entire range of features detected by the encoder network is captured. First, one full pass of the network is performed. The stylised image from the deepest layer (Relu_5_1 in this case) is then taken and used as the content image for another iteration of the algorithm, in which the next layer (Relu_4_1) is used as the output. These steps are repeated until the final image is produced from the shallowest layer. This process is summarised in the figure below.

[[File:process_summary.png|thumb|700px|center|alt=Process summary of the multi-level stylization algorithm.|The content (C) and style (S) are fed to the VGG encoding network. The output image (I) after a whitening and colour transform (WCT) is taken from the deepest level's decoder. The process is iteratively repeated until the most shallow layer is reached.]]

The authors note that the transformations must be applied first at the highest-level (most abstract) layers, which capture complicated local structures, and that the transformed image is then passed to lower layers, which improve the details. They observe that reversing this order (lowest to highest) leads to images with low visual quality, as low-level information cannot be preserved after manipulating high-level features.

[[File:Universal_Style_Transfer_Coarse_to_Fine.JPG|thumb|700px|center|alt=(a)-(c) Output from intermediate layers. (d) Reversed transformation order.|(a)-(c) Output from intermediate layers. (d) Reversed transformation order.]]
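The coarse-to-fine procedure can be summarised in a few lines. In this sketch <code>encoders</code> and <code>decoders</code> are hypothetical lists of callables ordered from the shallowest layer (Relu_1_1) to the deepest (Relu_5_1), and <code>wct</code> is any function that applies the whitening and colouring transform to a pair of feature maps; none of these names come from the paper's code.

```python
def multi_level_stylize(content, style, encoders, decoders, wct):
    """Apply WCT from the deepest layer down to the shallowest one."""
    image = content
    # Iterate over the levels in reverse, so the deepest layer is processed first.
    for encode, decode in reversed(list(zip(encoders, decoders))):
        f_c = encode(image)    # features of the current (partly stylised) image
        f_s = encode(style)    # features of the style image at the same level
        image = decode(wct(f_c, f_s))
    return image
```

The output of each level becomes the content image of the next, shallower level, which matches the order the authors found to preserve low-level detail.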

=Evaluation=
The success of style transfer might appear hard to quantify as it relies on qualitative judgement. However, the extremes of transferring no style, or transferring only style can be considered as performing poorly. Consistent transfer of style throughout the entire image is another parameter of success. Ideally, the viewer can recognize the content of the image, while seeing it expressed in an alternative style. Quantitatively, the quality of the style transfer can be calculated by taking the covariance matrix difference $L_s$ between the resulting image and the original style. The results of the presented paper also need to be considered within the contexts of generality, efficiency and training requirements.

==Style Transfer==
A number of style transfer examples are presented relative to other works.

[[File:transfer_results_label.jpg|thumb|700px|center|alt=Style transfer results of the presented paper.|A: See [6]. B: See [7]. C: See [8]. D: Gatys et al. iterative optimization, see [2]. E: This paper's results.]]

Li et al. then obtained the average $L_s$ using 10 random content images across 40 style images. They had the lowest average $\log(L_s)$ of all referenced works, at 6.3. The next lowest was Gatys et al. [2] with $\log(L_s) = 6.7$. It should be noted that while $L_s$ quantifies the success of the style transfer, results are still subject to the viewer's impression. In the transfer results, rows five and six for Gatys et al.'s method show local minimization issues. However, their method still achieves a competitive $L_s$ score.

==Transfer Efficiency==
Li et al. hypothesized that using WCT would enable faster run-times than [2] while still supporting arbitrary style transfer. For a 256x256 image on a 12GB TITAN X, they achieved a transfer time of 1.5 seconds, whereas Gatys et al.'s method [2] required 21.2 seconds. The pure feed-forward approaches [7] and [8] had times equal to or less than 0.2 seconds, and [6] had a time comparable to the presented paper's method. However, [6,7,8] do not generalize well to multiple styles, as training is required. This paper therefore obtained a roughly 14x speed-up (21.2 s / 1.5 s) for a style-agnostic transfer algorithm compared with leading previous work. The authors also note that WCT was performed on the CPU; they intend to port WCT to the GPU and expect the computational time to be reduced further.

==Other Applications==
Li et al.'s method can also be used for texture synthesis. This was the original work of Gatys et al. before they applied their algorithm to style transfer problems. Texture synthesis takes a reference texture/image and creates new textures from it. With proper boundary conditions enforced, these synthesized textures can be tileable. Alternatively, higher-resolution textures can be generated. Texture synthesis has applications in areas such as computer graphics, allowing large surfaces to be texture mapped.

The content image is set as white noise, similar to how [2] initializes their output image. Then the reference texture/image is set as the style image. Since the content image is initially random white noise, the features generated by the encoder from this image are also random. Li et al. state that this increases the diversity of the resulting output textures.

[[File:texture_synthesis_label.jpg|thumb|700px|center|alt=Texture synthesis results.|A: Reference image/texture. B: Result from [8]. C: Result of present paper.]]

Reviewing the examples from the above figure, it can be observed that the method from this paper repeats fewer local features from the image than a competing feed forward network method [8]. While the analysis is qualitative, the authors claim that their method produces "more visually pleasing results".

=Conclusion=
CNNs were first used to stylize images only a couple of years ago. Today, a host of improvements have been developed, optimizing the original work of Gatys et al. for a number of different situations. Using additional training per style image, computational efficiency and image quality can be increased. However, the trained network then depends on that specific style image, or in some cases such as in [3], a set of style images. Until now, limited work has taken place in improving Gatys et al.'s method for arbitrary style images. The authors of this paper developed and evaluated a novel method for arbitrary style transfer. Their method and Gatys et al.'s method share the use of a VGG-19 CNN as the initial processing step. However, the authors replaced iterative optimization with whitening and colour transforms, which can be applied in a single step. This yields a decrease in computational time while maintaining generality with respect to the style image. After their CNN auto-encoder is initially trained, no further training is required. This allows their method to be style agnostic. Their method also performs favourably, in terms of image quality, when compared to other current work.

=Critique=
In the paper, the authors only experimented with layers of VGG-19. Given that architectures such as ResNet and Xception perform better on image recognition tasks, it would be interesting to see how residual layers and/or Inception modules may be applied to the task of disentangling style and content, and whether performance would improve relative to the results presented in the current paper if the encoder were to use layers from these alternative convolutional architectures. Additionally, it is worth exploring whether one can invent a probabilistic and/or generative version of the encoder-decoder architecture used in the paper. More precisely, is it possible to come up with something in the spirit of variational autoencoders, wherein the bottleneck layer can be used to sample noise vectors, which can then be input into each of the decoder units to generate synthetic style and content images?
Alternative attempts could also involve the study of generative adversarial networks (GANs) with a perturbation threshold value. GANs can produce surreal images in which the underlying structure (content) is preserved (in CNNs the filters learn the edges, surfaces and shapes of the image), provided the discriminator is trained for style classification (the training set consists of images in the style that is to be transferred).

=Additional Results and Figures=
This section gives additional universal style transfer figures from the paper's supplementary file. They typically show larger image sizes and a wider variety of styles.
#[[File:style-1.PNG]]
#[[File:style-2.PNG]]
#[[File:style-3.PNG]]

=References=
[1] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.

[2] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

[3] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.

[4] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

[5] R. Picard. MAS 622J/1.126J: Pattern Recognition and Analysis, Lecture 4. http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf

[6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.

[7] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. arXiv preprint arXiv:1703.06868, 2017.

[8] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.

[9] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, A Neural Algorithm of Artistic Style, https://arxiv.org/abs/1508.06576

[10] Karen Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition

[11] VGG Architectures - [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ More Details]

[12] Mechrez, R., Shechtman, E., & Zelnik-Manor, L. (2017). Photorealistic Style Transfer with Screened Poisson Equation. arXiv preprint arXiv:1709.09828.

[13] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In CVPR, 2017.

[14] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In CVPR, 2017.

Implementation Example: https://github.com/titu1994/Neural-Style-Transfer

Universal Style Transfer via Feature Transforms

2017-11-20T00:53:56Z

A2prasad: /* Additional Results and Figures */

=Introduction=
When viewing an image, whether it is a photograph or a painting, two types of mutually exclusive data are present. First, there is the content of the image, such as a person in a portrait. However, the content does not uniquely define the image. Consider a case where multiple artists paint a portrait of an identical subject: the results would vary despite the content being invariant. The cause of the variance is rooted in the style of each particular artist. Therefore, style transfer between two images results in the content being unaffected but the style being copied. Style transfer is an important image editing task which enables the creation of new artistic works. Typically one image is termed the content/reference image, whose style is discarded. The other image is called the style image, whose style, but not its content, is copied to the content image.

Deep learning techniques have been shown to be effective methods for implementing style transfer. Previous methods have been successful but with several key limitations and often trade off between generalization, quality and efficiency. Either they are fast, but have very few styles that can be transferred or they can handle arbitrary styles but are no longer efficient. The presented paper establishes a compromise between these two extremes by using only whitening and coloring transforms (WCT) to transfer a style within a feedforward image reconstruction architecture. No training of the underlying deep network is required per style.

=Related Work=
Gatys et al. developed a new method for generating textures from sample images in 2015 [1] and extended their approach to style transfer by 2016 [2]. They proposed the use of a pre-trained convolutional neural network (CNN) to separate content and style of input images. After this approach proved successful, a number of improvements quickly followed, reducing computational time, increasing the diversity of transferable styles, and improving the quality of the results. Central to these approaches and to the present paper is the use of a CNN.

In 2017, Mechrez et al. [12] proposed an approach that takes as input a stylized image and makes it more photorealistic. Their approach relied on the Screened Poisson Equation, maintaining the fidelity of the stylized image while constraining the gradients to those of the original input image. The method they proposed was fast, simple, fully automatic and showed positive progress in making a stylized image photorealistic.

Alternative attempts that use a single network to transfer multiple styles include models conditioned on binary selection units [13], a network that learns a set of new filters for every new style [14], and a novel conditional normalization layer that learns normalization parameters for each style [3].
==How Content and Style are Extracted using CNNs==
A CNN was chosen due to its ability to extract high-level features from images. These features can be interpreted in two ways: as content and as style. Within layer <math> l </math> there are <math> N_l </math> feature maps of size <math> M_l </math>. With a particular input image, the feature maps are given by <math> F_{i,j}^l </math> where <math> i </math> indexes the feature map and <math> j </math> the position within it. Starting with a white noise image and a reference (content) image, the features can be transferred by minimizing

<center>
<math> \mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F_{i,j}^l - P_{i,j}^l \right)^2 </math>
</center>

where <math> P_{i,j}^l </math> denotes the feature map output caused by the white noise image. Therefore this loss function preserves the content of the reference image. The style is described using a Gram matrix given by

<center>
<math>
G_{i,j}^l = \sum_k F_{i,k}^l F_{j,k}^l
</math>
</center>

The Gram matrix $G$ of a set of vectors $v_1,\dots,v_n$ is the matrix of all possible inner products, with entries given by $G_{ij}=v_i^Tv_j$. The loss function that describes a difference in style between two images is equal to:

<center>
<math>
\mathcal{L}_{style} = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G_{i,j}^l - A_{i,j}^l \right)^2
</math>
</center>

where <math> A_{i,j}^l </math> and <math> G_{i,j}^l </math> are the Gram matrices of the generated image and style image respectively. Therefore three images are required: a style image, a content image and an initial white noise image. Iterative optimization is then used to add content from one image to the white noise image, and style from the other. An additional parameter is used to balance the ratio of these loss functions.
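
The two losses above can be sketched in NumPy. This is only an illustration under stated assumptions: each layer's feature maps are assumed to be flattened into an array of shape $(N_l, M_l)$, and the function names are hypothetical rather than taken from the paper or from any library.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix G[i, j] = sum_k F[i, k] * F[j, k].

    features: array of shape (N_l, M_l), one flattened feature map per row
    (this layout is an assumption made for the sketch).
    """
    return features @ features.T

def content_loss(f, p):
    """0.5 * sum of squared differences between two sets of feature maps."""
    return 0.5 * np.sum((f - p) ** 2)

def style_loss(f_a, f_b):
    """Squared Gram matrix difference scaled by 1 / (4 * N_l^2 * M_l^2)."""
    n_l, m_l = f_a.shape
    diff = gram_matrix(f_a) - gram_matrix(f_b)
    return np.sum(diff ** 2) / (4.0 * n_l ** 2 * m_l ** 2)

# Toy feature maps: 4 maps with 16 positions each.
rng = np.random.default_rng(0)
F = rng.standard_normal((4, 16))
S = rng.standard_normal((4, 16))
```

Both losses are zero when the two inputs coincide, and the Gram matrix has shape $(N_l, N_l)$ regardless of the spatial size $M_l$, which is why it discards spatial layout and keeps only feature co-occurrence statistics.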

The 19-layer ImageNet-trained VGG network was chosen by Gatys et al. VGG-19 is still commonly used in more recent works, as will be shown in the presented paper, although training datasets vary. Such CNNs are typically used in classification problems by finalizing their output through a series of fully connected layers. For content and style extraction it is the convolutional layers that are required. The method of Gatys et al. is style independent, since the CNN does not need to be trained for each style image. However, the process of iterative optimization to generate the output image is computationally expensive.

==Other Methods==
Other methods avoid the inefficiency of iterative optimization by training a network/networks on a set of styles. The network then directly transfers the style from the style image to the content image without solving the iterative optimization problem. V. Dumoulin et al. trained a single network on $N$ styles [3]. This improved upon previous work where a network was required per style [4]. The stylized output image was generated by simply running a feedforward pass of the network on the content image. While efficiency is high, the method is no longer able to apply an arbitrary style without retraining.

=Methodology=
Li et al. have proposed a novel method for generating the stylized image. A CNN is still used as in Gatys et al. to extract content and style. However, the stylized image is not generated through iterative optimization or a feed-forward pass as required by previous methods. Instead, whitening and colour transforms are used.

==Image Reconstruction==
[[File:image_resconstruction.png|thumb|150px|right|alt=Training a single decoder.|Training a single decoder. X denotes the layer of the VGG encoder that the decoder receives as input.]]
An auto-encoder network is used to first encode an input image into a set of feature maps, and then decode it back to an image as shown in the adjacent figure. The encoder network used is VGG-19. This network is responsible for obtaining feature maps (similar to Gatys et al.). The output of each of the first five layers is then fed into a corresponding decoder network, which is a mirrored version of VGG-19. Each decoder network then decodes the feature maps of the $l$th layer, producing an output image. A mechanism for transferring style will be implemented by manipulating the feature maps between the encoder and decoder networks.

First, the auto-encoder network needs to be trained. The following loss function is used:

<center>
<math>

\mathcal{L} = || I_{output} - I_{input} ||_2^2 + \lambda || \Phi(I_{output}) - \Phi(I_{input})||_2^2

</math>
</center>

where $I_{input}$ and $I_{output}$ are the input and output images of the auto-encoder. $\Phi$ is the VGG encoder. The first term of the loss is the pixel reconstruction loss, while the second term is feature loss. Recall from "Related Work" that the feature maps correspond to the content of the image. Therefore the second term can also be seen as penalising for content differences that arise due to the encoder network. The network was trained using the Microsoft COCO dataset.
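
The reconstruction loss above can be sketched as follows. This is a minimal illustration, not the authors' training code: <code>toy_encoder</code> is a hypothetical stand-in for the VGG encoder $\Phi$ (2x2 average pooling), and the default value of <code>lam</code> is an assumption.

```python
import numpy as np

def reconstruction_loss(i_input, i_output, encoder, lam=1.0):
    """Pixel reconstruction loss plus lam times the feature loss.

    encoder plays the role of Phi; lam plays the role of lambda.
    """
    pixel_term = np.sum((i_output - i_input) ** 2)
    feature_term = np.sum((encoder(i_output) - encoder(i_input)) ** 2)
    return pixel_term + lam * feature_term

def toy_encoder(image):
    """Hypothetical stand-in for VGG: 2x2 average pooling of a 2-D image
    whose height and width are even."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

x = np.zeros((4, 4))
y = np.ones((4, 4))
loss = reconstruction_loss(x, y, toy_encoder, lam=0.5)
```

In this toy case the pixel term is 16 (sixteen unit differences) and the feature term is 4 (four pooled unit differences), so the loss with <code>lam=0.5</code> is 18.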

They use whitening and coloring transforms to directly transform $f_c$ (the VGG feature map of the content image at a certain layer) to match the covariance matrix of $f_s$ (the VGG feature map of the style image). This process consists of two steps: a whitening transform and a coloring transform. Note that the decoder will reconstruct the original content image if $f_c$ is directly fed into it.

==Whitening Transform==
Whitening first decorrelates the data so that its covariance is a diagonal matrix. This is done via the eigendecomposition of the covariance matrix, which yields its eigenvalue and eigenvector matrices. Whitening then rescales the data so that the diagonal elements of the eigenvalue matrix are all the same. This is achieved for a feature map from VGG through the following steps.

# The feature map $f_c$ is extracted from a layer of the encoder network after activation on the content image. This is the data to be whitened.
# $f_c$ is centered by subtracting its mean vector $m_c$.
# Then, the eigenvectors $E_c$ and eigenvalues $D_c$ are found for the covariance matrix of $f_c$.
# The whitened feature map is then given by $\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c$.

If interested, the derivation of the whitening equation can be seen in [5]. Li et al. found that whitening removed styles from the image.
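
The whitening steps above can be sketched in NumPy. The sketch assumes the feature map has been reshaped to a matrix of shape (channels, positions) and uses the sample covariance; the function name, the <code>eps</code> guard against near-zero eigenvalues, and the normalisation by $n - 1$ are choices made for this illustration.

```python
import numpy as np

def whiten(f_c, eps=1e-8):
    """Whiten a feature matrix f_c of shape (channels, positions)."""
    m_c = f_c.mean(axis=1, keepdims=True)        # mean vector m_c
    centered = f_c - m_c                         # step 2: centre the features
    cov = centered @ centered.T / (centered.shape[1] - 1)
    d_c, e_c = np.linalg.eigh(cov)               # step 3: eigenvalues, eigenvectors
    d_c = np.clip(d_c, eps, None)                # avoid dividing by ~0
    # step 4: E_c D_c^{-1/2} E_c^T applied to the centred features
    return e_c @ np.diag(d_c ** -0.5) @ e_c.T @ centered

# Toy correlated features: 3 channels, 500 positions.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
f_c = mix @ rng.standard_normal((3, 500)) + 2.0
f_c_hat = whiten(f_c)
```

For full-rank input such as this toy example, the covariance of <code>f_c_hat</code> is the identity matrix up to numerical error, which is the sense in which the channel correlations (the "style") have been removed.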

==Colour Transform==
However, whitening does not transfer style from the style image. It only uses feature maps from the content image. The colour transform uses both $\hat{f}_c$ from above and $f_s$, the feature map from the style image.

# $f_s$ is centered by subtracting its mean vector $m_s$.
# Then, the eigenvectors $E_s$ and eigenvalues $D_s$ are calculated for the covariance matrix of $f_s$.
# The colour transform is given by $\hat{f}_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c$.
# Recenter $\hat{f}_{cs}$ using $m_s$.

Intuitively, colouring results in a correlation between the $\hat{f}_c$ and $f_s$ feature maps. This is where the style transfer takes place.
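
A matching sketch of the colour transform is given below. It assumes the whitened input already has zero mean and identity covariance, and it uses the same (channels, positions) layout and sample-covariance convention as an illustration choice; the function name is hypothetical.

```python
import numpy as np

def colour(f_c_hat, f_s):
    """Colour whitened content features with the statistics of f_s.

    f_c_hat: whitened content features, shape (channels, n_content)
    f_s:     style features, shape (channels, n_style)
    """
    m_s = f_s.mean(axis=1, keepdims=True)        # step 1: style mean m_s
    centered = f_s - m_s
    cov = centered @ centered.T / (centered.shape[1] - 1)
    d_s, e_s = np.linalg.eigh(cov)               # step 2: eigendecomposition
    d_s = np.clip(d_s, 0.0, None)                # guard tiny negative round-off
    f_cs = e_s @ np.diag(d_s ** 0.5) @ e_s.T @ f_c_hat   # step 3
    return f_cs + m_s                            # step 4: re-centre with m_s
```

Because $E_s D_s^{1/2} E_s^T$ is symmetric, applying it to features with identity covariance gives features whose covariance is $E_s D_s E_s^T$, that is, the covariance of $f_s$.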

==Content/Style Balance==
Using just $\hat{f}_{cs}$ as the input to the decoder may create a result that is too extreme in style. To balance content and style a new parameter $\alpha$ is defined.

<center>
<math>

\hat{f}_{cs} = \alpha \hat{f}_{cs} + (1 - \alpha) f_c

</math>
</center>

The authors use $\alpha = 0.6$ in the style transfer experiments.
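
The blending step is a plain convex combination, sketched below. The function name and the range check are illustrative assumptions; only the formula and the value 0.6 come from the text above.

```python
import numpy as np

def blend(f_cs, f_c, alpha=0.6):
    """Return alpha * f_cs + (1 - alpha) * f_c.

    alpha = 1 keeps only the transformed features, alpha = 0 keeps only the
    original content features.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return alpha * f_cs + (1.0 - alpha) * f_c

stylised = np.ones((2, 3))
content = np.zeros((2, 3))
mixed = blend(stylised, content)   # every entry equals 0.6
```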

==Using Multiple Layers==
It has been previously mentioned that multiple decoders were trained, one for each of the first five layers of the encoder network. Each layer of a CNN perceives features at different levels. Levels close to the input image will detect lower level local features such as edges. Those levels deeper into the network will detect more complex global features. The style transfer algorithm is applied at each of these levels, which raises the question of which of the results, shown below, to use.

[[File:multilevel_features.png|thumb|700px|center|alt=Results of style transfer from each of the first five layers of the encoder network.|Results of style transfer from each of the first five layers of the encoder network.]]

Ideally, the results of each layer should be used to build the final output image. This captures the entire range of features detected by the encoder network. First, one full pass of the network is performed. Then the stylised image from the deepest layer (Relu_5_1 in this case) is taken and used as the content image for another iteration of the algorithm, where then the next layer (Relu_4_1) is used as the output. These steps are repeated until the final image is produced from the shallowest layer. This process is summarised in the figure below.

[[File:process_summary.png|thumb|700px|center|alt=Process summary of the multi-level stylization algorithm.|The content (C) and style (S) are fed to the VGG encoding network. The output image (I) after a whitening and colour transform (WCT) is taken from the deepest level's decoder. The process is iteratively repeated until the most shallow layer is reached.]]

The authors note that the transformations must be applied first at the highest level (most abstract) layers, which capture complicated local structures. The transformed image is then passed to lower layers, which improve on details. They observe that reversing this order (lowest to highest) leads to images with low visual quality, as low-level information cannot be preserved after manipulating high-level features.
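
The coarse-to-fine procedure can be sketched as a simple loop. All callables here are hypothetical placeholders: <code>encoders</code> and <code>decoders</code> stand for the truncated VGG encoders and their trained decoders, ordered from the deepest layer to the shallowest, and <code>wct</code> stands for the whitening and colour transform (including any content/style blending).

```python
def multi_level_stylize(content, style, encoders, decoders, wct):
    """Apply the transform level by level, deepest layer first.

    The output image of one level becomes the content image of the next,
    shallower level. The style image is re-encoded at every level.
    """
    image = content
    for encode, decode in zip(encoders, decoders):
        f_c = encode(image)              # features of the current image
        f_s = encode(style)              # features of the style image
        image = decode(wct(f_c, f_s))    # transform, then reconstruct
    return image

# Toy scalar stand-ins, used only to show the data flow.
toy_encoders = [lambda x: x * 2, lambda x: x * 3]
toy_decoders = [lambda f: f / 2, lambda f: f / 3]
toy_wct = lambda f_c, f_s: 0.5 * f_c + 0.5 * f_s
result = multi_level_stylize(1.0, 10.0, toy_encoders, toy_decoders, toy_wct)
```

Reversing the two lists would correspond to the fine-to-coarse order that the authors report as giving low visual quality.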

[[File:Universal_Style_Transfer_Coarse_to_Fine.JPG|thumb|700px|center|alt=(a)-(c) Output from intermediate layers. (d) Reversed transformation order.|(a)-(c) Output from intermediate layers. (d) Reversed transformation order.]]

=Evaluation=
The success of style transfer might appear hard to quantify as it relies on qualitative judgement. However, the extremes of transferring no style, or transferring only style can be considered as performing poorly. Consistent transfer of style throughout the entire image is another parameter of success. Ideally, the viewer can recognize the content of the image, while seeing it expressed in an alternative style. Quantitatively, the quality of the style transfer can be calculated by taking the covariance matrix difference $L_s$ between the resulting image and the original style. The results of the presented paper also need to be considered within the contexts of generality, efficiency and training requirements.
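
One possible reading of the covariance matrix difference is sketched below. The exact definition of $L_s$ is not given in this summary, so the squared Frobenius norm, the (channels, positions) feature layout and the function name are all assumptions made for illustration.

```python
import numpy as np

def covariance_difference(f_out, f_style):
    """Squared Frobenius norm of the difference between feature covariances.

    f_out, f_style: feature matrices of shape (channels, positions); the
    number of positions may differ between the two.
    """
    return np.sum((np.cov(f_out) - np.cov(f_style)) ** 2)

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 200))
```

Under this reading the score ignores the feature means and depends only on second-order statistics, so a lower value indicates that the output's feature correlations are closer to those of the style image.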

==Style Transfer==
A number of style transfer examples are presented relative to other works.

[[File:transfer_results_label.jpg|thumb|700px|center|alt=Style transfer results of the presented paper.|A: See [6]. B: See [7]. C: See [8]. D: Gatys et al. iterative optimization, see [2]. E: This paper's results.]]

Li et al. then obtained the average $L_s$ using 10 random content images across 40 style images. They had the lowest average $\log(L_s)$ of all referenced works at 6.3. The next lowest was Gatys et al. [2] with $\log(L_s) = 6.7$. It should be noted that while $L_s$ quantifies the success of the style transfer, results are still subject to the viewer's impression. Reviewing the transfer results, rows five and six for Gatys et al.'s method show issues with local minima. However, their method still achieves a competitive $L_s$ score.

==Transfer Efficiency==
Li et al. hypothesized that using WCT would enable faster run-times than [2] while still supporting arbitrary style transfer. For a 256x256 image, using a 12GB TITAN X, they achieved a transfer time of 1.5 seconds. Gatys et al.'s method [2] required 21.2 seconds. The pure feed-forward approaches [7] and [8] had times equal to or less than 0.2 seconds. [6] had a time comparable to the presented paper's method. However, [6,7,8] do not generalize well to multiple styles as training is required. Therefore this paper obtained a roughly 14x speed-up (21.2 s versus 1.5 s) for a style-agnostic transfer algorithm when compared to leading previous work. The authors also note that WCT was done using the CPU. They intend to port WCT to the GPU and expect the computational time to be further reduced.

==Other Applications==
Li et al.'s method can also be used for texture synthesis. This was the original work of Gatys et al. [1] before they applied their algorithm to style transfer problems. Texture synthesis takes a reference texture/image and creates new textures from it. With proper boundary conditions enforced, these synthesized textures can be tileable. Alternatively, higher resolution textures can be generated. Texture synthesis has applications in areas such as computer graphics, allowing for large surfaces to be texture mapped.

The content image is set as white noise, similar to how [2] initializes their output image. Then the reference texture/image is set as the style image. Since the content image is initially random white noise, the features generated by the encoder from this image are also random. Li et al. state that this increases the diversity of the resulting output textures.
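
The texture synthesis setup can be sketched as follows. Here <code>stylize</code> is a hypothetical callable standing in for the full style transfer pipeline, and Gaussian noise is an assumption about what "white noise" means in practice.

```python
import numpy as np

def synthesize_texture(texture, stylize, seed=None):
    """Use a white noise image as content and the texture as style.

    stylize(content, style) is a placeholder for the style transfer method.
    Different seeds give different noise images and hence different outputs.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(texture.shape)   # random content image
    return stylize(noise, texture)

# Toy stand-in for the pipeline, used only to make the sketch runnable.
toy_stylize = lambda content, style: 0.5 * content + 0.5 * style
texture = np.ones((8, 8))
sample_a = synthesize_texture(texture, toy_stylize, seed=0)
sample_b = synthesize_texture(texture, toy_stylize, seed=1)
```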

[[File:texture_synthesis_label.jpg|thumb|700px|center|alt=Texture synthesis results.|A: Reference image/texture. B: Result from [8]. C: Result of present paper.]]

Reviewing the examples from the above figure, it can be observed that the method from this paper repeats fewer local features from the image than a competing feed forward network method [8]. While the analysis is qualitative, the authors claim that their method produces "more visually pleasing results".

=Conclusion=
CNNs were first used to stylize images only a couple of years ago. Today, a host of improvements have been developed, optimizing the original work of Gatys et al. for a number of different situations. Using additional training per style image, computational efficiency and image quality can be increased. However, the trained network then depends on that specific style image, or in some cases such as in [3], a set of style images. Until now, limited work has taken place in improving Gatys et al.'s method for arbitrary style images. The authors of this paper developed and evaluated a novel method for arbitrary style transfer. Their method and Gatys et al.'s method share the use of a VGG-19 CNN as the initial processing step. However, the authors replaced iterative optimization with whitening and colour transforms, which can be applied in a single step. This yields a decrease in computational time while maintaining generality with respect to the style image. After their CNN auto-encoder is initially trained, no further training is required. This allows their method to be style agnostic. Their method also performs favourably, in terms of image quality, when compared to other current work.

=Critique=
In the paper, the authors only experimented with layers of VGG-19. Given that architectures such as ResNet and Xception perform better on image recognition tasks, it would be interesting to see how residual layers and/or Inception modules may be applied to the task of disentangling style and content, and whether performance would improve relative to the results presented in the current paper if the encoder were to use layers from these alternative convolutional architectures. Additionally, it is worth exploring whether one can invent a probabilistic and/or generative version of the encoder-decoder architecture used in the paper. More precisely, is it possible to come up with something in the spirit of variational autoencoders, wherein the bottleneck layer can be used to sample noise vectors, which can then be input into each of the decoder units to generate synthetic style and content images?
Alternative attempts could also involve the study of generative adversarial networks (GANs) with a perturbation threshold value. GANs can produce surreal images in which the underlying structure (content) is preserved (in CNNs the filters learn the edges, surfaces and shapes of the image), provided the discriminator is trained for style classification (the training set consists of images in the style that is to be transferred).

=Additional Results and Figures=

#[[File:style-1.PNG]]
#[[File:style-2.PNG]]
#[[File:style-3.PNG]]

=References=
[1] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.

[2] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

[3] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.

[4] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

[5] R. Picard. MAS 622J/1.126J: Pattern Recognition and Analysis, Lecture 4. http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf

[6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.

[7] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. arXiv preprint arXiv:1703.06868, 2017.

[8] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.

[9] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, A Neural Algorithm of Artistic Style, https://arxiv.org/abs/1508.06576

[10] Karen Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition

[11] VGG Architectures - [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ More Details]

[12] Mechrez, R., Shechtman, E., & Zelnik-Manor, L. (2017). Photorealistic Style Transfer with Screened Poisson Equation. arXiv preprint arXiv:1709.09828.

[13] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In CVPR, 2017.

[14] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In CVPR, 2017.

Implementation Example: https://github.com/titu1994/Neural-Style-Transfer

Universal Style Transfer via Feature Transforms

2017-11-20T00:53:31Z

A2prasad: /* Additional Results and Figures */

=Introduction=
When viewing an image, whether it is a photograph or a painting, two types of mutually exclusive data are present. First, there is the content of the image, such as a person in a portrait. However, the content does not uniquely define the image. Consider a case where multiple artists paint a portrait of an identical subject: the results would vary despite the content being invariant. The cause of the variance is rooted in the style of each particular artist. Therefore, style transfer between two images results in the content being unaffected but the style being copied. Style transfer is an important image editing task which enables the creation of new artistic works. Typically one image is termed the content/reference image, whose style is discarded. The other image is called the style image, whose style, but not its content, is copied to the content image.

Deep learning techniques have been shown to be effective methods for implementing style transfer. Previous methods have been successful but with several key limitations and often trade off between generalization, quality and efficiency. Either they are fast, but have very few styles that can be transferred or they can handle arbitrary styles but are no longer efficient. The presented paper establishes a compromise between these two extremes by using only whitening and coloring transforms (WCT) to transfer a style within a feedforward image reconstruction architecture. No training of the underlying deep network is required per style.

=Related Work=
Gatys et al. developed a new method for generating textures from sample images in 2015 [1] and extended their approach to style transfer by 2016 [2]. They proposed the use of a pre-trained convolutional neural network (CNN) to separate content and style of input images. After this approach proved successful, a number of improvements quickly followed, reducing computational time, increasing the diversity of transferable styles, and improving the quality of the results. Central to these approaches and to the present paper is the use of a CNN.

In 2017, Mechrez et al. [12] proposed an approach that takes as input a stylized image and makes it more photorealistic. Their approach relied on the Screened Poisson Equation, maintaining the fidelity of the stylized image while constraining the gradients to those of the original input image. The method they proposed was fast, simple, fully automatic and showed positive progress in making a stylized image photorealistic.

Alternative attempts that use a single network to transfer multiple styles include models conditioned on binary selection units [13], a network that learns a set of new filters for every new style [14], and a novel conditional normalization layer that learns normalization parameters for each style [3].
==How Content and Style are Extracted using CNNs==
A CNN was chosen due to its ability to extract high-level features from images. These features can be interpreted in two ways: as content and as style. Within layer <math> l </math> there are <math> N_l </math> feature maps of size <math> M_l </math>. With a particular input image, the feature maps are given by <math> F_{i,j}^l </math> where <math> i </math> indexes the feature map and <math> j </math> the position within it. Starting with a white noise image and a reference (content) image, the features can be transferred by minimizing

<center>
<math> \mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F_{i,j}^l - P_{i,j}^l \right)^2 </math>
</center>

where <math> P_{i,j}^l </math> denotes the feature map output caused by the white noise image. Therefore this loss function preserves the content of the reference image. The style is described using a Gram matrix given by

<center>
<math>
G_{i,j}^l = \sum_k F_{i,k}^l F_{j,k}^l
</math>
</center>

The Gram matrix $G$ of a set of vectors $v_1,\dots,v_n$ is the matrix of all possible inner products, with entries given by $G_{ij}=v_i^Tv_j$. The loss function that describes a difference in style between two images is equal to:

<center>
<math>
\mathcal{L}_{style} = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G_{i,j}^l - A_{i,j}^l \right)^2
</math>
</center>

where <math> A_{i,j}^l </math> and <math> G_{i,j}^l </math> are the Gram matrices of the generated image and style image respectively. Therefore three images are required: a style image, a content image and an initial white noise image. Iterative optimization is then used to add content from one image to the white noise image, and style from the other. An additional parameter is used to balance the ratio of these loss functions.

The 19-layer ImageNet-trained VGG network was chosen by Gatys et al. VGG-19 is still commonly used in more recent works, as will be shown in the presented paper, although training datasets vary. Such CNNs are typically used in classification problems by finalizing their output through a series of fully connected layers. For content and style extraction it is the convolutional layers that are required. The method of Gatys et al. is style independent, since the CNN does not need to be trained for each style image. However, the process of iterative optimization to generate the output image is computationally expensive.

==Other Methods==
Other methods avoid the inefficiency of iterative optimization by training a network/networks on a set of styles. The network then directly transfers the style from the style image to the content image without solving the iterative optimization problem. V. Dumoulin et al. trained a single network on $N$ styles [3]. This improved upon previous work where a network was required per style [4]. The stylized output image was generated by simply running a feedforward pass of the network on the content image. While efficiency is high, the method is no longer able to apply an arbitrary style without retraining.

=Methodology=
Li et al. have proposed a novel method for generating the stylized image. A CNN is still used as in Gatys et al. to extract content and style. However, the stylized image is not generated through iterative optimization or a feed-forward pass as required by previous methods. Instead, whitening and colour transforms are used.

==Image Reconstruction==
[[File:image_resconstruction.png|thumb|150px|right|alt=Training a single decoder.|Training a single decoder. X denotes the layer of the VGG encoder that the decoder receives as input.]]
An auto-encoder network is used to first encode an input image into a set of feature maps, and then decode it back to an image as shown in the adjacent figure. The encoder network used is VGG-19. This network is responsible for obtaining feature maps (similar to Gatys et al.). The output of each of the first five layers is then fed into a corresponding decoder network, which is a mirrored version of VGG-19. Each decoder network then decodes the feature maps of the $l$th layer, producing an output image. Style transfer is implemented by manipulating the feature maps between the encoder and decoder networks.

First, the auto-encoder network needs to be trained. The following loss function is used

<center>
<math>

\mathcal{L} = || I_{output} - I_{input} ||_2^2 + \lambda || \Phi(I_{output}) - \Phi(I_{input})||_2^2

</math>
</center>

where $I_{input}$ and $I_{output}$ are the input and output images of the auto-encoder. $\Phi$ is the VGG encoder. The first term of the loss is the pixel reconstruction loss, while the second term is feature loss. Recall from "Related Work" that the feature maps correspond to the content of the image. Therefore the second term can also be seen as penalising for content differences that arise due to the encoder network. The network was trained using the Microsoft COCO dataset.

They use whitening and coloring transforms to directly transform $f_c$ (the VGG feature map of the content image at a certain layer) so that its covariance matrix matches that of $f_s$ (the VGG feature map of the style image). This process consists of two steps: a whitening transform and a coloring transform. Note that the decoder will reconstruct the original content image if $f_c$ is directly fed into it.

==Whitening Transform==
Whitening transforms the data so that its covariance matrix is the identity. The covariance matrix is first diagonalized by solving for its eigenvalues and eigenvectors, which decorrelates the features. Each decorrelated direction is then rescaled by the inverse square root of its eigenvalue, so that all diagonal elements become equal to one. For a feature map from VGG this is achieved through the following steps.

# The feature map $f_c$ is extracted from a layer of the encoder network after activation on the content image. This is the data to be whitened.
# $f_c$ is centered by subtracting its mean vector $m_c$.
# Then, the eigenvectors $E_c$ and eigenvalues $D_c$ are found for the covariance matrix of $f_c$.
# The whitened feature map is then given by $\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c$.

If interested, the derivation of the whitening equation can be seen in [5]. Li et al. found that whitening removed styles from the image.

==Colour Transform==
However, whitening does not transfer style from the style image. It only uses feature maps from the content image. The colour transform uses both $\hat{f}_c$ from above and $f_s$, the feature map from the style image.

# $f_s$ is centered by subtracting its mean vector $m_s$.
# Then, the eigenvectors $E_s$ and eigenvalues $D_s$ are calculated for the covariance matrix of $f_s$.
# The colour transform is given by $\hat{f}_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c$.
# $\hat{f}_{cs}$ is re-centered by adding the style mean vector $m_s$.

Intuitively, colouring gives $\hat{f}_{cs}$ the same covariance matrix, and therefore the same correlations between feature maps, as $f_s$. This is where the style transfer takes place.

==Content/Style Balance==
Using just $\hat{f}_{cs}$ as the input to the decoder may create a result that is too extreme in style. To balance content and style a new parameter $\alpha$ is defined.

<center>
<math>

\hat{f}_{cs} = \alpha \hat{f}_{cs} + (1 - \alpha) f_c

</math>
</center>

The authors use $\alpha = 0.6$ in the style transfer experiments.

==Using Multiple Layers==
As previously mentioned, multiple decoders were trained, one for each of the first five layers of the encoder network. Each layer of a CNN perceives features at a different level. Layers close to the input image detect lower level local features such as edges. Layers deeper in the network detect more complex global features. The style transfer algorithm can be applied at each of these levels, which raises the question of which of the results shown below to use.

[[File:multilevel_features.png|thumb|700px|center|alt=Results of style transfer from each of the first five layers of the encoder network.|Results of style transfer from each of the first five layers of the encoder network.]]

Ideally, the results of each layer should be used to build the final output image. This captures the entire range of features detected by the encoder network. First, one full pass of the network is performed. Then the stylised image from the deepest layer (Relu_5_1 in this case) is taken and used as the content image for another iteration of the algorithm, where then the next layer (Relu_4_1) is used as the output. These steps are repeated until the final image is produced from the shallowest layer. This process is summarised in the figure below.

[[File:process_summary.png|thumb|700px|center|alt=Process summary of the multi-level stylization algorithm.|The content (C) and style (S) are fed to the VGG encoding network. The output image (I) after a whitening and colour transform (WCT) is taken from the deepest level's decoder. The process is iteratively repeated until the most shallow layer is reached.]]

The authors note that the transformations must be applied first at the highest level (most abstract) layers, which capture complicated local structures. The transformed image is then passed to the lower layers, which improve on details. They observe that reversing this order (lowest to highest) leads to images with low visual quality, as low-level information cannot be preserved after manipulating high level features.

[[File:Universal_Style_Transfer_Coarse_to_Fine.JPG|thumb|700px|center|alt=(a)-(c) Output from intermediate layers. (d) Reversed transformation order.|(a)-(c) Output from intermediate layers. (d) Reversed transformation order.]]

=Evaluation=
The success of style transfer might appear hard to quantify as it relies on qualitative judgement. However, the extremes of transferring no style, or transferring only style can be considered as performing poorly. Consistent transfer of style throughout the entire image is another parameter of success. Ideally, the viewer can recognize the content of the image, while seeing it expressed in an alternative style. Quantitatively, the quality of the style transfer can be calculated by taking the covariance matrix difference $L_s$ between the resulting image and the original style. The results of the presented paper also need to be considered within the contexts of generality, efficiency and training requirements.

==Style Transfer==
A number of style transfer examples are presented relative to other works.

[[File:transfer_results_label.jpg|thumb|700px|center|alt=Style transfer results of the presented paper.|A: See [6]. B: See [7]. C: See [8]. D: Gatys et al. iterative optimization, see [2]. E: This paper's results.]]

Li et al. then obtained the average $L_s$ using 10 random content images across 40 style images. They had the lowest average $\log(L_s)$ of all referenced works at 6.3. The next lowest was Gatys et al. [2] with $\log(L_s) = 6.7$. It should be noted that while $L_s$ quantifies the success of the style transfer, results are still subject to the viewer's impression. In the transfer results, rows five and six for Gatys et al.'s method show local minima issues. However, their method still achieves a competitive $L_s$ score.

==Transfer Efficiency==
Li et al. hypothesized that using WCT would enable faster run-times than [2] while still supporting arbitrary style transfer. For a 256x256 image, using a 12GB TITAN X, they achieved a transfer time of 1.5 seconds. Gatys et al.'s method [2] required 21.2 seconds. The pure feed-forward approaches [7] and [8] had times equal to or less than 0.2 seconds, and [6] had a time comparable to the presented paper's method. However, [6,7,8] do not generalize well to multiple styles as training is required. This paper therefore obtained a roughly 14x speed-up (21.2 seconds versus 1.5 seconds) for a style agnostic transfer algorithm when compared to leading previous work. The authors also note that WCT was done using the CPU. They intend to port WCT to the GPU and expect the computational time to be reduced further.

==Other Applications==
Li et al.'s method can also be used for texture synthesis. This was the original work of Gatys et al. [1] before they applied their algorithm to style transfer problems. Texture synthesis takes a reference texture/image and creates new textures from it. With proper boundary conditions enforced, these synthesized textures can be tileable. Alternatively, higher resolution textures can be generated. Texture synthesis has applications in areas such as computer graphics, allowing for large surfaces to be texture mapped.

The content image is set as white noise, similar to how [2] initializes their output image. The reference texture/image is then set as the style image. Since the content image is random white noise, the features that the encoder generates from it are also random. Li et al. state that this increases the diversity of the resulting output textures.

[[File:texture_synthesis_label.jpg|thumb|700px|center|alt=Texture synthesis results.|A: Reference image/texture. B: Result from [8]. C: Result of present paper.]]

Reviewing the examples from the above figure, it can be observed that the method from this paper repeats fewer local features from the image than a competing feed forward network method [8]. While the analysis is qualitative, the authors claim that their method produces "more visually pleasing results".

=Conclusion=
CNNs were first used to stylize images only a couple of years ago. Today, a host of improvements have been developed, optimizing the original work of Gatys et al. for a number of different situations. Using additional training per style image, computational efficiency and image quality can be increased. However, the trained network then depends on that specific style image, or in some cases such as in [3], a set of style images. Until now, limited work has taken place in improving Gatys et al.'s method for arbitrary style images. The authors of this paper developed and evaluated a novel method for arbitrary style transfer. Their method and Gatys et al.'s method share the use of a VGG-19 CNN as the initial processing step. However, the authors replaced iterative optimization with whitening and colour transforms, which can be applied in a single step. This yields a decrease in computational time while maintaining generality with respect to the style image. After their CNN auto-encoder is initially trained, no further training is required. This allows their method to be style agnostic. Their method also performs favourably, in terms of image quality, when compared to other current work.

=Critique=
In the paper, the authors only experimented with layers of VGG-19. Given that architectures such as ResNet and Xception perform better on image recognition tasks, it would be interesting to see how residual layers and/or Inception modules could be applied to the task of disentangling style and content, and whether an encoder built from these alternative convolutional architectures would improve on the results presented in the current paper. Additionally, it is worth exploring whether one can devise a probabilistic and/or generative version of the encoder-decoder architecture used in the paper. More precisely, is it possible to come up with something in the spirit of variational autoencoders, wherein the bottleneck layer can be used to sample noise vectors, which can then be input into each of the decoder units to generate synthetic style and content images?
Alternative attempts would also involve the study of generative adversarial networks with a perturbation threshold value. GANs can produce surreal images in which the underlying structure (content) is preserved (in CNNs the filters learn the edges, surfaces and shapes of the image), provided the discriminator is trained for style classification (its training set consists of images in the style that is to be transferred).

=Additional Results and Figures=

[[File:style-1.PNG]]
[[File:style-2.PNG]]
[[File:style-3.PNG]]

=References=
[1] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.

[2] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

[3] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.

[4] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

[5] R. Picard. MAS 622J/1.126J: Pattern Recognition and Analysis, Lecture 4. http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf

[6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.

[7] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. arXiv preprint arXiv:1703.06868, 2017.

[8] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.

[9] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, A Neural Algorithm of Artistic Style, https://arxiv.org/abs/1508.06576

[10] Karen Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition

[11] VGG Architectures - [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ More Details]

[12] Mechrez, R., Shechtman, E., & Zelnik-Manor, L. (2017). Photorealistic Style Transfer with Screened Poisson Equation. arXiv preprint arXiv:1709.09828.

[13] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In CVPR, 2017.

[14] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In CVPR, 2017.

Implementation Example: https://github.com/titu1994/Neural-Style-Transfer

Universal Style Transfer via Feature Transforms

2017-11-20T00:53:04Z

A2prasad:

=Introduction=
When viewing an image, whether it is a photograph or a painting, two types of mutually exclusive data are present. First, there is the content of the image, such as a person in a portrait. However, the content does not uniquely define the image. Consider a case where multiple artists paint a portrait of an identical subject: the results would vary despite the content being invariant. The cause of the variance is rooted in the style of each particular artist, which is the second type of data. Style transfer between two images therefore leaves the content unaffected while the style is copied. Style transfer is an important image editing task which enables the creation of new artistic works. Typically one image is termed the content/reference image, whose style is discarded. The other image is called the style image, whose style, but not its content, is copied to the content image.

Deep learning techniques have been shown to be effective methods for implementing style transfer. Previous methods have been successful but with several key limitations and often trade off between generalization, quality and efficiency. Either they are fast, but have very few styles that can be transferred or they can handle arbitrary styles but are no longer efficient. The presented paper establishes a compromise between these two extremes by using only whitening and coloring transforms (WCT) to transfer a style within a feedforward image reconstruction architecture. No training of the underlying deep network is required per style.

=Related Work=
Gatys et al. developed a new method for generating textures from sample images in 2015 [1] and extended their approach to style transfer by 2016 [2]. They proposed the use of a pre-trained convolutional neural network (CNN) to separate the content and style of input images. After this approach proved successful, a number of improvements quickly followed, reducing computational time, increasing the diversity of transferable styles, and improving the quality of the results. Central to these approaches, and to the present paper, is the use of a CNN.

In 2017, Mechrez et al. [12] proposed an approach that takes as input a stylized image and makes it more photorealistic. Their approach relied on the Screened Poisson Equation, maintaining the fidelity of the stylized image while constraining the gradients to those of the original input image. The method they proposed was fast, simple, fully automatic and showed positive progress in making a stylized image photorealistic.

Alternative attempts that use a single network to transfer multiple styles include models conditioned on binary selection units [13], a network that learns a set of new filters for every new style [14], and a novel conditional normalization layer that learns normalization parameters for each style [3].

==How Content and Style are Extracted using CNNs==
A CNN was chosen due to its ability to extract high level features from images. These features can be interpreted in two ways. Within layer <math> l </math> there are <math> N_l </math> feature maps, each of size <math> M_l </math>. For a particular input image, the feature maps are given by <math> F_{i,j}^l </math>, where <math> i </math> indexes the feature map and <math> j </math> the position within it. Starting with a white noise image and a reference (content) image, the features can be transferred by minimizing

<center>
<math> \mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F_{i,j}^l - P_{i,j}^l \right)^2 </math>
</center>

where <math> F_{i,j}^l </math> and <math> P_{i,j}^l </math> denote the feature maps of the generated image (initialized as white noise) and of the content image respectively. Minimizing this loss therefore transfers the content of the reference image to the generated image. The style is described using a Gram matrix given by

<center>
<math>
G_{i,j}^l = \sum_k F_{i,k}^l F_{j,k}^l
</math>
</center>

In general, the Gram matrix $G$ of a set of vectors $v_1,\dots,v_n$ is the matrix of all possible inner products, whose entries are given by $G_{ij}=v_i^Tv_j$. Here the vectors are the vectorized feature maps of layer $l$. The loss function that describes a difference in style between two images is equal to:

<center>
<math>
\mathcal{L}_{style} = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G_{i,j}^l - A_{i,j}^l \right)^2
</math>
</center>

where <math> G_{i,j}^l </math> and <math> A_{i,j}^l </math> are the Gram matrices of the generated image and the style image respectively. Three images are therefore required: a style image, a content image and an initial white noise image. Iterative optimization then adds content from one image and style from the other to the white noise image. An additional parameter balances the ratio of these two loss functions.
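
As a minimal NumPy sketch of the content and style losses above, assume the activations of one layer are stored as an array of shape (N_l, M_l), with one row per vectorized feature map. The function names and the random stand-in features are illustrative assumptions and are not taken from Gatys et al.'s implementation.

```python
import numpy as np

def gram(features):
    """Gram matrix of a layer: inner products between all pairs of feature maps."""
    return features @ features.T

def content_loss(feat_generated, feat_content):
    """Half the summed squared difference between two sets of feature maps."""
    return 0.5 * np.sum((feat_generated - feat_content) ** 2)

def style_loss(feat_generated, feat_style):
    """Summed squared Gram matrix difference, scaled by 1 / (4 N_l^2 M_l^2)."""
    n_maps, map_size = feat_generated.shape
    diff = gram(feat_generated) - gram(feat_style)
    return np.sum(diff ** 2) / (4.0 * n_maps ** 2 * map_size ** 2)

# Random stand-ins for real VGG activations (assumption for illustration).
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((4, 6))
feat_b = rng.standard_normal((4, 6))
shuffled = feat_a[:, rng.permutation(6)]  # same maps, spatial positions shuffled

print(content_loss(feat_a, feat_b), style_loss(feat_a, feat_b))
print(style_loss(shuffled, feat_a))  # close to zero
```

Shuffling the spatial positions leaves the Gram matrix unchanged, which illustrates why the Gram matrix describes style rather than spatial layout.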

The 19-layer VGG network trained on ImageNet was chosen by Gatys et al. VGG-19 is still commonly used in more recent works, including the presented paper, although training datasets vary. Such CNNs are typically used in classification problems by finalizing their output through a series of fully connected layers. For content and style extraction it is the convolutional layers that are required. The method of Gatys et al. is style independent, since the CNN does not need to be trained for each style image. However, the iterative optimization used to generate the output image is computationally expensive.

==Other Methods==
Other methods avoid the inefficiency of iterative optimization by training a network/networks on a set of styles. The network then directly transfers the style from the style image to the content image without solving the iterative optimization problem. V. Dumoulin et al. trained a single network on $N$ styles [3]. This improved upon previous work where a network was required per style [4]. The stylized output image was generated by simply running a feedforward pass of the network on the content image. While efficiency is high, the method is no longer able to apply an arbitrary style without retraining.

=Methodology=
Li et al. have proposed a novel method for generating the stylized image. A CNN is still used as in Gatys et al. to extract content and style. However, the stylized image is not generated through iterative optimization or a feed-forward pass as required by previous methods. Instead, whitening and colour transforms are used.

==Image Reconstruction==
[[File:image_resconstruction.png|thumb|150px|right|alt=Training a single decoder.|Training a single decoder. X denotes the layer of the VGG encoder that the decoder receives as input.]]
An auto-encoder network is used to first encode an input image into a set of feature maps, and then decode it back to an image as shown in the adjacent figure. The encoder network used is VGG-19. This network is responsible for obtaining feature maps (similar to Gatys et al.). The output of each of the first five layers is then fed into a corresponding decoder network, which is a mirrored version of VGG-19. Each decoder network then decodes the feature maps of the $l$th layer, producing an output image. Style transfer is implemented by manipulating the feature maps between the encoder and decoder networks.

First, the auto-encoder network needs to be trained. The following loss function is used

<center>
<math>

\mathcal{L} = || I_{output} - I_{input} ||_2^2 + \lambda || \Phi(I_{output}) - \Phi(I_{input})||_2^2

</math>
</center>

where $I_{input}$ and $I_{output}$ are the input and output images of the auto-encoder. $\Phi$ is the VGG encoder. The first term of the loss is the pixel reconstruction loss, while the second term is feature loss. Recall from "Related Work" that the feature maps correspond to the content of the image. Therefore the second term can also be seen as penalising for content differences that arise due to the encoder network. The network was trained using the Microsoft COCO dataset.
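
The reconstruction loss above can be sketched in NumPy as follows. The encoder here is a hypothetical stand-in (a fixed random linear map followed by a ReLU) rather than VGG-19, and the default weight of 1.0 for $\lambda$ is an assumption made only for illustration.

```python
import numpy as np

def reconstruction_loss(img_in, img_out, encoder, lam=1.0):
    """Pixel reconstruction loss plus lam times the feature loss."""
    pixel = np.sum((img_out - img_in) ** 2)
    feature = np.sum((encoder(img_out) - encoder(img_in)) ** 2)
    return pixel + lam * feature

# Hypothetical stand-in for the VGG encoder Phi (not the real network).
rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 12))

def toy_encoder(img):
    return np.maximum(weights @ img.ravel(), 0.0)

img = rng.random((3, 2, 2))                          # tiny 3-channel "image"
noisy = img + 0.1 * rng.standard_normal(img.shape)   # imperfect reconstruction

print(reconstruction_loss(img, img, toy_encoder))    # 0.0 for a perfect reconstruction
print(reconstruction_loss(img, noisy, toy_encoder))  # positive otherwise
```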

They use whitening and coloring transforms to directly transform $f_c$ (the VGG feature map of the content image at a certain layer) so that its covariance matrix matches that of $f_s$ (the VGG feature map of the style image). This process consists of two steps: a whitening transform and a coloring transform. Note that the decoder will reconstruct the original content image if $f_c$ is directly fed into it.

==Whitening Transform==
Whitening transforms the data so that its covariance matrix is the identity. The covariance matrix is first diagonalized by solving for its eigenvalues and eigenvectors, which decorrelates the features. Each decorrelated direction is then rescaled by the inverse square root of its eigenvalue, so that all diagonal elements become equal to one. For a feature map from VGG this is achieved through the following steps.

# The feature map $f_c$ is extracted from a layer of the encoder network after activation on the content image. This is the data to be whitened.
# $f_c$ is centered by subtracting its mean vector $m_c$.
# Then, the eigenvectors $E_c$ and eigenvalues $D_c$ are found for the covariance matrix of $f_c$.
# The whitened feature map is then given by $\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c$.

If interested, the derivation of the whitening equation can be seen in [5]. Li et al. found that whitening removed styles from the image.
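
The whitening steps can be sketched in NumPy as below, assuming the feature map has been reshaped to (channels, height x width). Normalizing the covariance by the number of positions minus one and clipping very small eigenvalues are assumptions of this sketch, and the correlated random data stands in for real VGG features.

```python
import numpy as np

def whiten(f_c, eps=1e-8):
    """Whiten a feature map of shape (channels, height * width)."""
    m_c = f_c.mean(axis=1, keepdims=True)              # mean vector m_c
    centered = f_c - m_c                               # centre the features
    cov = centered @ centered.T / (f_c.shape[1] - 1)   # covariance matrix
    d_c, e_c = np.linalg.eigh(cov)                     # eigenvalues, eigenvectors
    d_c = np.clip(d_c, eps, None)                      # guard against zero eigenvalues (assumption)
    return e_c @ np.diag(d_c ** -0.5) @ e_c.T @ centered

rng = np.random.default_rng(0)
mix = np.array([[2.0, 1.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 0.0, 0.5]])
f_c = mix @ rng.standard_normal((3, 500)) + 3.0        # correlated, non-zero mean
f_hat = whiten(f_c)

print(np.round(np.cov(f_hat), 3))  # approximately the identity matrix
```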

==Colour Transform==
However, whitening does not transfer style from the style image. It only uses feature maps from the content image. The colour transform uses both $\hat{f}_c$ from above and $f_s$, the feature map from the style image.

# $f_s$ is centered by subtracting its mean vector $m_s$.
# Then, the eigenvectors $E_s$ and eigenvalues $D_s$ are calculated for the covariance matrix of $f_s$.
# The colour transform is given by $\hat{f}_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c$.
# $\hat{f}_{cs}$ is re-centered by adding the style mean vector $m_s$.

Intuitively, colouring gives $\hat{f}_{cs}$ the same covariance matrix, and therefore the same correlations between feature maps, as $f_s$. This is where the style transfer takes place.
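
A minimal NumPy sketch of whitening followed by colouring is given below, again assuming feature maps of shape (channels, positions). The helper names, the covariance normalization and the eigenvalue clipping are assumptions of this sketch, not details confirmed by the paper; the random inputs stand in for VGG features of the content and style images.

```python
import numpy as np

def _centre_and_eig(f, eps=1e-8):
    """Return mean, centred features, eigenvalues and eigenvectors of the covariance."""
    m = f.mean(axis=1, keepdims=True)
    centered = f - m
    cov = centered @ centered.T / (f.shape[1] - 1)
    d, e = np.linalg.eigh(cov)
    return m, centered, np.clip(d, eps, None), e

def whiten_and_colour(f_c, f_s):
    """Whiten the content features, then colour them with the style statistics."""
    _, c_centered, d_c, e_c = _centre_and_eig(f_c)
    m_s, _, d_s, e_s = _centre_and_eig(f_s)
    whitened = e_c @ np.diag(d_c ** -0.5) @ e_c.T @ c_centered
    coloured = e_s @ np.diag(d_s ** 0.5) @ e_s.T @ whitened
    return coloured + m_s                          # re-centre with the style mean

rng = np.random.default_rng(1)
mix_c = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
mix_s = np.array([[3.0, 0.0, 0.0], [1.0, 2.0, 0.0], [0.0, 1.0, 0.5]])
f_c = mix_c @ rng.standard_normal((3, 400))        # content features
f_s = mix_s @ rng.standard_normal((3, 300)) + 2.0  # style features, other spatial size
f_cs = whiten_and_colour(f_c, f_s)

print(np.allclose(np.cov(f_cs), np.cov(f_s)))      # True
```

The output keeps the spatial size of the content features while taking on the mean and covariance of the style features.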

==Content/Style Balance==
Using just $\hat{f}_{cs}$ as the input to the decoder may create a result that is too extreme in style. To balance content and style a new parameter $\alpha$ is defined.

<center>
<math>

\hat{f}_{cs} = \alpha \hat{f}_{cs} + (1 - \alpha) f_c

</math>
</center>

The authors use $\alpha = 0.6$ in the style transfer experiments.
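
The blending step is a simple linear interpolation, sketched here in NumPy. The function name and the range check on $\alpha$ are assumptions of this sketch.

```python
import numpy as np

def blend(f_cs, f_c, alpha=0.6):
    """Interpolate between stylized features f_cs and content features f_c."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return alpha * f_cs + (1.0 - alpha) * f_c

f_c = np.zeros((2, 3))   # stand-in content features
f_cs = np.ones((2, 3))   # stand-in stylized features

print(blend(f_cs, f_c))             # every entry is 0.6
print(blend(f_cs, f_c, alpha=0.0))  # content features only
```

Setting $\alpha = 1$ passes the fully stylized features to the decoder, while $\alpha = 0$ reconstructs the content image.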

==Using Multiple Layers==
As previously mentioned, multiple decoders were trained, one for each of the first five layers of the encoder network. Each layer of a CNN perceives features at a different level. Layers close to the input image detect lower level local features such as edges. Layers deeper in the network detect more complex global features. The style transfer algorithm can be applied at each of these levels, which raises the question of which of the results shown below to use.

[[File:multilevel_features.png|thumb|700px|center|alt=Results of style transfer from each of the first five layers of the encoder network.|Results of style transfer from each of the first five layers of the encoder network.]]

Ideally, the results of each layer should be used to build the final output image. This captures the entire range of features detected by the encoder network. First, one full pass of the network is performed. Then the stylised image from the deepest layer (Relu_5_1 in this case) is taken and used as the content image for another iteration of the algorithm, where then the next layer (Relu_4_1) is used as the output. These steps are repeated until the final image is produced from the shallowest layer. This process is summarised in the figure below.

[[File:process_summary.png|thumb|700px|center|alt=Process summary of the multi-level stylization algorithm.|The content (C) and style (S) are fed to the VGG encoding network. The output image (I) after a whitening and colour transform (WCT) is taken from the deepest level's decoder. The process is iteratively repeated until the most shallow layer is reached.]]

The authors note that the transformations must be applied first at the highest level (most abstract) layers, which capture complicated local structures. The transformed image is then passed to the lower layers, which improve on details. They observe that reversing this order (lowest to highest) leads to images with low visual quality, as low-level information cannot be preserved after manipulating high level features.
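
The coarse-to-fine procedure can be sketched as a loop over encoder/decoder pairs ordered from the deepest layer to the shallowest. Everything passed in below is a hypothetical stand-in: the scaling "encoders" with exact inverse "decoders" replace the VGG-19 layers and trained decoders, and a simple average replaces the whitening and colouring transform.

```python
import numpy as np

def multi_level_stylize(content, style, encoders, decoders, transform):
    """Apply a feature transform coarse to fine.

    encoders and decoders are ordered from the deepest layer to the
    shallowest one; transform(f_c, f_s) stands in for WCT.
    """
    image = content
    for encode, decode in zip(encoders, decoders):
        f_c = encode(image)   # features of the current (partly stylized) image
        f_s = encode(style)   # features of the style image at the same layer
        image = decode(transform(f_c, f_s))
    return image

# Hypothetical stand-ins for the real networks and for WCT.
scales = [2.0, 3.0]
encoders = [lambda x, s=s: x * s for s in scales]
decoders = [lambda f, s=s: f / s for s in scales]
average = lambda f_c, f_s: 0.5 * (f_c + f_s)

content = np.zeros((2, 2))
style = np.ones((2, 2))
result = multi_level_stylize(content, style, encoders, decoders, average)

print(result)  # every entry is 0.75
```

The output of each level becomes the content input of the next, so the style image is re-encoded at every level while the content image is only used directly at the deepest one.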

[[File:Universal_Style_Transfer_Coarse_to_Fine.JPG|thumb|700px|center|alt=(a)-(c) Output from intermediate layers. (d) Reversed transformation order.|(a)-(c) Output from intermediate layers. (d) Reversed transformation order.]]

=Evaluation=
The success of style transfer might appear hard to quantify as it relies on qualitative judgement. However, the extremes of transferring no style, or transferring only style can be considered as performing poorly. Consistent transfer of style throughout the entire image is another parameter of success. Ideally, the viewer can recognize the content of the image, while seeing it expressed in an alternative style. Quantitatively, the quality of the style transfer can be calculated by taking the covariance matrix difference $L_s$ between the resulting image and the original style. The results of the presented paper also need to be considered within the contexts of generality, efficiency and training requirements.
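
One way to compute a covariance matrix difference between two feature maps is sketched below. Using the squared Frobenius norm at a single layer is an assumption of this sketch; the exact norm and the way the paper aggregates over layers are not specified in this summary.

```python
import numpy as np

def covariance_difference(f_out, f_style):
    """Squared Frobenius norm of the difference between two covariance matrices.

    Both inputs have shape (channels, positions); the number of positions may differ.
    """
    diff = np.cov(f_out) - np.cov(f_style)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
f_style = rng.standard_normal((3, 200))   # stand-in style features
f_out = 2.0 * rng.standard_normal((3, 150))  # stand-in output features

print(covariance_difference(f_style, f_style))        # 0.0
print(covariance_difference(f_out, f_style))          # positive
print(covariance_difference(f_style + 5.0, f_style))  # close to zero: the mean is ignored
```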

==Style Transfer==
A number of style transfer examples are presented relative to other works.

[[File:transfer_results_label.jpg|thumb|700px|center|alt=Style transfer results of the presented paper.|A: See [6]. B: See [7]. C: See [8]. D: Gatys et al. iterative optimization, see [2]. E: This paper's results.]]

Li et al. then obtained the average $L_s$ using 10 random content images across 40 style images. They had the lowest average $\log(L_s)$ of all referenced works at 6.3. The next lowest was Gatys et al. [2] with $\log(L_s) = 6.7$. It should be noted that while $L_s$ quantifies the success of the style transfer, results are still subject to the viewer's impression. In the transfer results, rows five and six for Gatys et al.'s method show local minima issues. However, their method still achieves a competitive $L_s$ score.

==Transfer Efficiency==
Li et al. hypothesized that using WCT would enable faster run-times than [2] while still supporting arbitrary style transfer. For a 256x256 image, using a 12GB TITAN X, they achieved a transfer time of 1.5 seconds. Gatys et al.'s method [2] required 21.2 seconds. The pure feed-forward approaches [7] and [8] had times equal to or less than 0.2 seconds, and [6] had a time comparable to the presented paper's method. However, [6,7,8] do not generalize well to multiple styles as training is required. This paper therefore obtained a roughly 14x speed-up (21.2 seconds versus 1.5 seconds) for a style agnostic transfer algorithm when compared to leading previous work. The authors also note that WCT was done using the CPU. They intend to port WCT to the GPU and expect the computational time to be reduced further.

==Other Applications==
Li et al.'s method can also be used for texture synthesis. This was the original work of Gatys et al. [1] before they applied their algorithm to style transfer problems. Texture synthesis takes a reference texture/image and creates new textures from it. With proper boundary conditions enforced, these synthesized textures can be tileable. Alternatively, higher resolution textures can be generated. Texture synthesis has applications in areas such as computer graphics, allowing for large surfaces to be texture mapped.

The content image is set as white noise, similar to how [2] initializes their output image. The reference texture/image is then set as the style image. Since the content image is random white noise, the features that the encoder generates from it are also random. Li et al. state that this increases the diversity of the resulting output textures.

[[File:texture_synthesis_label.jpg|thumb|700px|center|alt=Texture synthesis results.|A: Reference image/texture. B: Result from [8]. C: Result of present paper.]]

Reviewing the examples from the above figure, it can be observed that the method from this paper repeats fewer local features from the image than a competing feed forward network method [8]. While the analysis is qualitative, the authors claim that their method produces "more visually pleasing results".

=Conclusion=
CNNs were first used to stylize images only a couple of years ago. Today, a host of improvements have been developed, optimizing the original work of Gatys et al. for a number of different situations. Using additional training per style image, computational efficiency and image quality can be increased. However, the trained network then depends on that specific style image, or in some cases such as in [3], a set of style images. Until now, limited work has taken place in improving Gatys et al.'s method for arbitrary style images. The authors of this paper developed and evaluated a novel method for arbitrary style transfer. Their method and Gatys et al.'s method share the use of a VGG-19 CNN as the initial processing step. However, the authors replaced iterative optimization with whitening and colour transforms, which can be applied in a single step. This yields a decrease in computational time while maintaining generality with respect to the style image. After their CNN auto-encoder is initially trained, no further training is required. This allows their method to be style agnostic. Their method also performs favourably, in terms of image quality, when compared to other current work.

=Critique=
In the paper, the authors only experimented with layers of VGG-19. Given that architectures such as ResNet and Xception perform better on image recognition tasks, it would be interesting to see how residual layers and/or Inception modules could be applied to the task of disentangling style and content, and whether performance would improve relative to the results presented in the paper if the encoder used layers from these alternative convolutional architectures. Additionally, it is worth exploring whether one can devise a probabilistic and/or generative version of the encoder-decoder architecture used in the paper. More precisely, is it possible to come up with something in the spirit of variational autoencoders, wherein the bottleneck layer can be used to sample noise vectors, which can then be input into each of the decoder units to generate synthetic style and content images?
Alternative attempts could also involve the study of generative adversarial networks with a perturbation threshold value. GANs can produce surreal images in which the underlying structure (content) is preserved (in CNNs the filters learn the edges, surfaces and shapes of the image), provided the discriminator is trained for style classification (the training set consists of images in the style that is to be transferred).

=Additional Results and Figures=

[[File:style-1.png]]
[[File:style-2.png]]
[[File:style-3.png]]

=References=
[1] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.

[2] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

[3] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.

[4] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016

[5] R. Picard. MAS 622J/1.126J: Pattern Recognition and Analysis, Lecture 4. http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf

[6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.

[7] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. arXiv preprint arXiv:1703.06868, 2017.

[8] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.

[9] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, A Neural Algorithm of Artistic Style, https://arxiv.org/abs/1508.06576

[10] Karen Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition

[11] VGG Architectures - [http://www.robots.ox.ac.uk/~vgg/research/very_deep/ More Details]

[12] Mechrez, R., Shechtman, E., & Zelnik-Manor, L. (2017). Photorealistic Style Transfer with Screened Poisson Equation. arXiv preprint arXiv:1709.09828.

[13] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In CVPR, 2017

[14] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In CVPR, 2017

Implementation Example: https://github.com/titu1994/Neural-Style-Transfer

File:style-3.PNG

2017-11-20T00:52:46Z

A2prasad:

File:style-2.PNG

2017-11-20T00:51:17Z

A2prasad:

File:style-1.PNG

2017-11-20T00:48:03Z

A2prasad:

Modular Multitask Reinforcement Learning with Policy Sketches

2017-11-20T00:20:09Z

A2prasad: /* Conclusion & Critique */

='''Introduction & Background'''=
[[File:MRL0.png|border|right|400px]]
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments, but vanilla policies struggle with tasks featuring extremely delayed rewards. Most existing approaches require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.

This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task that describe its components, as shown in Figure 1. Symbols may be shared across different tasks (the predicate "get wood" appears in sketches for both the tasks "make planks" and "make sticks"), but the learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.

The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. The authors find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (e.g. third-party Mechanical Turk workers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. The paper shows that this approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance, and that using sketches to parameterize modular subpolicies makes better use of them than conditioning on them directly.

The modular structure of this approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be easily recombined. The authors evaluate the approach in a variety of different data conditions:
# Learning the full collection of tasks jointly via reinforcement learning
# In a zero-shot setting where a policy sketch is available for a held-out task
# In an adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.

The code has been released at http://github.com/jacobandreas/psketch.

='''Related Work'''=
The approach in this paper is a specific case of the options framework developed by Sutton et al., 1999 [2]. In that work, options are introduced as "closed-loop policies for taking action over a period of time". The authors show that options enable temporally abstract information to be included in reinforcement learning algorithms, though the work was published before neural networks became widely popular for reinforcement learning.

Other authors have recently explored techniques for learning policies which require less prior knowledge of the environment than the method presented in this paper. For example, in Vezhnevets et al. (2016) [10], the authors propose an RNN architecture to build "implicit plans" only through interacting with the environment, as in the classic reinforcement learning problem formulation.

One closely related line of work is the Hierarchical Abstract Machines (HAM) framework introduced by Parr & Russell, 1998 [11]. Like the approach used in this paper, HAMs begin with a representation of a high-level policy as an automaton (or a more general computer program; Andre & Russell, 2001 [7]; Marthi et al., 2004 [12]) and use reinforcement learning to fill in low-level details.

='''Learning Modular Policies from Sketches'''=
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with
* $S$ a set of states
* $A$ a set of low-level actions
* $P : S \times A \times S \to \mathbb{R}$ a transition probability distribution
* $\gamma$ a discount factor

Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to \mathbb{R}$ a task-specific reward function and $\rho_t : S \to \mathbb{R}$ an initial distribution over states. For a fixed sequence $\{(s_i, a_i)\}$ of states and actions obtained from a rollout of a given policy, we denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$, each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.
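
As a small illustration of the empirical return defined above, the following sketch computes $q_i$ for a finite rollout using the recursion $q_i = R(s_{i+1}) + \gamma q_{i+1}$. The function name and the truncation of the infinite sum at the end of the rollout are assumptions made for this example, not details taken from the paper.

```python
def empirical_returns(rewards, gamma):
    """Return q_i = sum_{j > i} gamma^(j-i-1) * R(s_j) for a finite rollout.

    rewards[j] holds R(s_j) for the visited states s_0 .. s_{n-1}.
    Assumption: the rollout ends at s_{n-1}, so the sum is truncated there.
    """
    n = len(rewards)
    q = [0.0] * n
    # Work backwards: q_i = R(s_{i+1}) + gamma * q_{i+1}, with q_{n-1} = 0.
    for i in range(n - 2, -1, -1):
        q[i] = rewards[i + 1] + gamma * q[i + 1]
    return q


# A sparse-reward rollout: reward 1 only in the final state.
returns = empirical_returns([0.0, 0.0, 1.0], gamma=0.9)
```

Note that the return at index ''i'' excludes the reward of $s_i$ itself, which follows from the sum starting at $j = i+1$.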

==Model==
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.

[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]

At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented action space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that maps a representation of the current state to a distribution over $A^+$ will work with the approach.

In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).

Given a fixed sketch $(b_1, b_2, \ldots)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to $\pi_{b_{i+1}}$. We may thus think of $\Pi_r$ as inducing a Markov chain over the state space $S \times B$, with transitions:
[[File:MRL1.png|center|border|]]
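
The control flow of a concatenated policy can be sketched as follows. This is a minimal illustration under simplifying assumptions: the subpolicies are deterministic hand-written functions rather than learned networks, the environment is a one-dimensional toy world, and all names (<code>run_sketch</code>, <code>env_step</code>, the symbols in the sketch) are hypothetical.

```python
STOP = "STOP"


def run_sketch(sketch, subpolicies, env_step, state, max_steps=100):
    """Execute the subpolicies named in `sketch` one after another.

    Each subpolicy maps a state to a low-level action or to STOP.
    On STOP, control passes to the next symbol in the sketch.
    """
    i = 0  # index of the active subpolicy
    trajectory = []
    for _ in range(max_steps):
        if i >= len(sketch):
            break  # every subpolicy has terminated
        action = subpolicies[sketch[i]](state)
        if action == STOP:
            i += 1  # STOP does not change the environment state
            continue
        trajectory.append((state, action, sketch[i]))
        state = env_step(state, action)
    return state, trajectory


# Toy 1-D world (an assumption for illustration): the state is an integer position.
def env_step(state, action):
    return state + action


subpolicies = {
    "go_right_to_3": lambda s: 1 if s < 3 else STOP,
    "go_left_to_1": lambda s: -1 if s > 1 else STOP,
}

final_state, trajectory = run_sketch(
    ["go_right_to_3", "go_left_to_1"], subpolicies, env_step, state=0
)
```

Each trajectory entry records which symbol was active, which is the information needed later to route gradients to the right subpolicy.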

Note that $\Pi_r$ is semi-Markov with respect to the projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ is an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward
[[File:MRL2.png|center|border|]]
across all tasks $t \in T$.

==Policy Optimization==

Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:
[[File:MRL3.png|center|border|]]

where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.
[[File:MRL4.png|frame|]]

To generalize to modular policies built by sequencing subpolicies, the authors suggest using one subpolicy per symbol but one critic per task. This is because a subpolicy $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$. Individual subpolicies are therefore not uniquely associated with a value function. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that
[[File:MRL5.png|center|border|]]
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:
[[File:MRL6.png|center|border|]]
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.

Minimizing the gradient variance now requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Each $c_t$ may be implemented with an arbitrary function approximator with parameters $\eta_t$, which is trained to minimize a squared error criterion, with gradients given by
[[File:MRL7.png|center|border|]]
Alternative forms of the advantage estimator (e.g. the TD residual $R_t(s_i) + \gamma V_t(s_{i+1}) - V_t(s_i)$ or any other member of the generalized advantage estimator family) can be substituted by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.
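
The idea of one critic per task can be sketched in a few lines. The example below assumes linear critics (one weight vector per task), plain gradient descent on the squared error, and hypothetical function names; it illustrates the decoupling only and is not the authors' implementation.

```python
def critic_value(weights, state):
    """Linear critic c_t(s): dot product of task weights and state features."""
    return sum(w * x for w, x in zip(weights, state))


def advantages(samples, critics):
    """Advantage q_i - c_t(s_i), using the critic of the task that produced the sample.

    samples: list of (task, state_features, empirical_return) tuples.
    """
    return [q - critic_value(critics[task], state) for task, state, q in samples]


def critic_step(samples, critics, beta):
    """One gradient step on 0.5 * (c_t(s_i) - q_i)^2, accumulated separately per task."""
    grads = {task: [0.0] * len(w) for task, w in critics.items()}
    for task, state, q in samples:
        err = critic_value(critics[task], state) - q
        for k, x in enumerate(state):
            grads[task][k] += err * x
    for task, g in grads.items():
        critics[task] = [w - beta * gk for w, gk in zip(critics[task], g)]


def td_residual(reward, v_s, v_next, gamma):
    """TD residual R_t(s_i) + gamma * V_t(s_{i+1}) - V_t(s_i) for one task's critic."""
    return reward + gamma * v_next - v_s


# Two tasks share the same state but have different returns, so one shared
# baseline could not fit both; per-task critics can.
critics = {"make_planks": [0.0], "make_sticks": [0.0]}
samples = [("make_planks", [1.0], 1.0), ("make_sticks", [1.0], 0.0)]
before = advantages(samples, critics)
critic_step(samples, critics, beta=0.5)
after = advantages(samples, critics)
```

In this toy case the advantage for the first task shrinks after the critic update while the second task is unaffected, which is the per-task variance reduction the text describes.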

The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.

==Curriculum Learning==

For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that no longer generalize or exhibit the desired structural properties.

To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$. Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task $t$. Then at each timestep, tasks are sampled in proportion to $1-Er_t$, which by assumption must be positive.
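
A minimal sketch of this sampling rule is given below. The task names, sketch lengths and reward estimates are made-up values for illustration, and the helper names are hypothetical.

```python
import random


def curriculum_weights(reward_estimates, sketch_lengths, l_max):
    """Unnormalized sampling weight 1 - Er_t for every task whose sketch fits the length limit."""
    return {
        task: 1.0 - er
        for task, er in reward_estimates.items()
        if sketch_lengths[task] <= l_max
    }


def sample_task(weights, rng):
    """Draw one task with probability proportional to its weight."""
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]


# Illustrative numbers only: "planks" is nearly solved, "bridge" is still too long.
reward_estimates = {"planks": 0.9, "sticks": 0.5, "bridge": 0.0}
sketch_lengths = {"planks": 2, "sticks": 2, "bridge": 4}

weights = curriculum_weights(reward_estimates, sketch_lengths, l_max=2)
task = sample_task(weights, random.Random(0))
```

With these numbers the half-solved task is sampled about five times as often as the nearly solved one, and the long task is not sampled at all until the length limit grows.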

Intuitively, the tasks that provide the strongest learning signal are those in which
# The agent does not on average achieve reward close to the upper bound
# Many episodes result in high reward.

The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.

The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.
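
The outer loop described in this paragraph can be sketched as follows. The <code>train_step</code> callback stands in for Algorithm 1 and is assumed to return updated reward estimates for the tasks it was given; the stopping rule (every active task must exceed $r_{good}$) and the iteration cap are assumptions of this sketch.

```python
def train_with_curriculum(tasks, sketch_lengths, train_step, r_good, max_len, max_iters=1000):
    """Grow the sketch-length limit once all currently active tasks exceed r_good.

    train_step(weights) is a stand-in for Algorithm 1: it receives the sampling
    weights 1 - Er_t and returns new reward estimates for those tasks.
    """
    estimates = {t: 0.0 for t in tasks}
    for l_max in range(1, max_len + 1):
        active = [t for t in tasks if sketch_lengths[t] <= l_max]
        if not active:
            continue
        for _ in range(max_iters):  # cap is an assumption to guarantee termination
            weights = {t: 1.0 - estimates[t] for t in active}
            estimates.update(train_step(weights))
            if min(estimates[t] for t in active) > r_good:
                break
    return estimates


# Stand-in learner for illustration: every call improves each sampled task a little.
progress = {"get_wood": 0.0, "make_planks": 0.0}


def fake_train_step(weights):
    for t in weights:
        progress[t] = min(0.99, progress[t] + 0.3)
    return {t: progress[t] for t in weights}


final_estimates = train_with_curriculum(
    ["get_wood", "make_planks"],
    {"get_wood": 1, "make_planks": 2},
    fake_train_step,
    r_good=0.8,
    max_len=2,
)
```

The longer task only starts receiving training steps after the length-1 task has passed the threshold.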

='''Experiments'''=
[[File:MRL8.png|border|right|400px]]
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors; and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.

In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.

==Implementation==
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.
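
The update described above (clipping the gradient to unit norm, then an RMSProp step with step size 0.001) can be sketched in plain Python. The decay rate 0.9 and the epsilon term are common defaults assumed here; the summary does not state which values the authors used.

```python
import math


def clip_to_unit_norm(grad):
    """Rescale the gradient so that its Euclidean norm is at most 1."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= 1.0:
        return list(grad)
    return [g / norm for g in grad]


def rmsprop_step(params, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update on a clipped gradient.

    `cache` is the running average of squared gradients.
    decay and eps are assumed defaults, not values from the paper.
    This is the descent form; negate the gradient to ascend on expected reward.
    """
    grad = clip_to_unit_norm(grad)
    new_cache = [decay * c + (1.0 - decay) * g * g for c, g in zip(cache, grad)]
    new_params = [
        p - lr * g / (math.sqrt(c) + eps)
        for p, g, c in zip(params, grad, new_cache)
    ]
    return new_params, new_cache


clipped = clip_to_unit_norm([3.0, 4.0])
params, cache = rmsprop_step([1.0], [2.0], [0.0])
```

Clipping bounds the size of any single update, which matters here because sparse rewards can produce occasional very large returns.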

==Environments==

The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.

The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.

The cliff environment (Figure 3b) is intended to demonstrate the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.

==Multitask Learning==

[[File:MRL9.png|border|center|800px]]
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically, the authors compare their '''modular''' approach to:

# Structured hierarchical reinforcement learners:
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup [1]
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)
# Alternative ways of incorporating sketch data into standard policy gradient methods:
#* learning an '''independent''' policy for each task
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch

The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.

Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, the modular approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.

==Ablations==
[[File:MRL10.png|border|right|400px]]
In addition to the overall modular parameter tying structure induced by sketches, the other critical components of the training procedure are the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.

To evaluate the critic, consider three ablations:
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.

Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task dependence or state dependence on its own improves this performance; combining them gives the best result.

==Zero-shot and Adaptation Learning==
[[File:MRL11.png|border|left|320px]]
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. They consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and an adaptation setting, in which no sketch is provided, leaving the model to learn the form of a suitable sketch via interaction in the new task. They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy described by the sketches of the held-out tasks is formed and repeatedly executed (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, they consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.

='''Conclusion & Critique'''=
The paper's contributions are:

* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.

* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.

The authors have described an approach for multitask learning of deep policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalization when further sketches are available, and hierarchical reinforcement learning when they are not.

One critique of this approach is that building a separate neural network for each subtask could lead to overly complicated models, which is not in the spirit of building efficient structures.

='''References'''=
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Workshop, 2015.

[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.

[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJCAI, volume 7, pp. 895–900, 2007.

[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.

[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

[7] Andre, David and Russell, Stuart. Programmable reinforcement learning agents. In Advances in Neural Information Processing Systems, 2001.

[8] Andre, David and Russell, Stuart. State abstraction for programmable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2002.

[9] Author Jacob Andreas presenting the paper - https://www.youtube.com/watch?v=NRIcDEB64x8

[10] Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., & Agapiou, J. (2016). Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems (pp. 3486-3494).

[11] Parr, Ron and Russell, Stuart. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1998.

[12] Marthi, Bhaskara, Lantham, David, Guestrin, Carlos, and Russell, Stuart. Concurrent hierarchical reinforcement learning. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2004.

STAT946F17/ Teaching Machines to Describe Images via Natural Language Feedback

2017-11-16T02:04:15Z

A2prasad: /* Crowd-sourcing Human Feedback */

= Introduction =
In the era of Artificial Intelligence, one should ideally be able to educate the robot about its mistakes, possibly without needing to dig into the underlying software. Reinforcement learning (RL) has become a standard way of training artificial agents that interact with an environment. Several works explored the idea of incorporating humans into the learning process, in order to help the reinforcement learning agent to learn faster. In most cases, the guidance comes in the form of a simple numerical (or “good”/“bad”) reward. In this work, natural language is used as a way to guide an RL agent. The author argues that a sentence provides a much stronger learning signal than a numeric reward in that we can easily point to where the mistakes occur and suggest how to correct them.

Here the goal is to allow a non-expert human teacher to give feedback to an RL agent in the form of natural language, just as one would to a learning child. The author has focused on the problem of image captioning, a task where the content of an image is described using sentences. This can also be seen as a multimodal problem where the whole network/model needs to combine the solution space of learning in both the image processing and text-generation domain. Image captioning is an application where the quality of the output can easily be judged by non-experts.

= Related Works =
Several works incorporate human feedback to help an RL agent learn faster.
#Thomaz et al. (2006) exploit humans in the loop to teach an agent to cook in a virtual kitchen. The users watch the agent learn and may intervene at any time to give a scalar reward. Reward shaping (Ng et al., 1999) is used to incorporate this information in the Markov Decision Process (MDP).
#Judah et al. (2010) iterate between “practice”, during which the agent interacts with the real environment, and a critique session where a human labels any subset of the chosen actions as good or bad.
#Griffith et al. (2013) propose policy shaping, which incorporates right/wrong feedback by utilizing it as direct policy labels.
#Mao et al. propose a multimodal Recurrent Neural Network (m-RNN) for image captioning, evaluated on four datasets: IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO [14]. Their approach combines a deep RNN for sentence generation with a deep CNN for image learning.

The above approaches mostly assume that humans provide a numeric reward, unlike in this work where feedback is given in natural language. A few attempts have been made to advise an RL agent using language.
# Maclin et al. (1994) translated advice to a short program which was then implemented as a neural network. The units in this network represent Boolean concepts, which recognize whether the observed state satisfies the constraints given by the program. In such a case, the advice network will encourage the policy to take the suggested action.
# Weston (2016) incorporates human feedback to improve a text-based question answering agent.
# Kaplan et al. (2017) exploit textual advice to improve training time of the A3C algorithm in playing an Atari game.

The authors propose a phrase-based image captioning model, which is similar to most image captioning models except that it exploits attention and linguistic information. Several recent approaches trained the captioning model with policy gradients in order to directly optimize for the desired performance metrics. This work follows the same line.

There are also similar efforts on dialogue-based visual representation learning and conversation modeling. These models aim to mimic human-to-human conversations, while in this work the human converses with and guides an artificial learning agent.

= Methodology =
The framework consists of a new phrase-based captioning model trained with Policy Gradients that incorporates natural language feedback provided by a human teacher. The phrase-based captioning model allows natural guidance by a non-expert.
=== Phrase-based Image Captioning ===
The captioning model uses a hierarchical recurrent neural network (RNN). The model is composed of a two-level LSTM, a phrase RNN at the top level, and a word RNN that generates a sequence of words for each phrase. One can think of the phrase RNN as providing a “topic” at each time step, which instructs the word RNN what to talk about. The structure of the model is explained through Figure 1.

[[File:modelham.png|center|500px|thumb|Figure 1: Hierarchical phrase-based captioning model, composed of a phrase-RNN at the top level, and a word level RNN which outputs a sequence of words for each phrase.]]

A convolutional neural network is used to extract a set of feature vectors $a = (a_1, \dots, a_n)$, where $a_j$ is the feature at location j in the input image. These feature vectors are given to the attention layer. The attention layer has two more inputs: the current hidden state of the phrase-RNN and the output of the label unit. The label unit predicts one of four possible phrase labels: noun phrase (NP), preposition phrase (PP), verb phrase (VP), and conjunction phrase (CP). This information could be useful for the attention layer. For example, for an NP the model may look at objects in the image, while for a VP it may focus on more global information. The computations can be expressed with the following equations:

$$
\begin{align*}
\small{\textrm{hidden state of the phrase-RNN at time step t}} \leftarrow h_t &= f_{phrase}(h_{t-1}, l_{t-1}, c_{t-1}, e_{t-1}) \\
\small{\text{output of the label unit}} \leftarrow l_t &= softmax(f_{phrase-label}(h_t)) \\
\small{\text{output of the attention layer}} \leftarrow c_t &= f_{att}(h_t, l_t, a)
\end{align*}
$$

After deciding about phrases, the outputs of phrase-RNN go to another LSTM to produce words for each phrase. $w_{t,i}$ denotes the i-th word output of the word-RNN in the t-th phrase. There is an additional <EOP> token in word-RNN’s vocabulary, which signals the end-of-phrase. Furthermore, $h_{t,i}$ denotes the i-th hidden state of the word-RNN for the t-th phrase.
$$
h_{t,i} = f_{word}(h_{t,i-1}, c_t, w_{t,i}) \\
w_{t,i} = f_{out}(h_{t,i}, c_t, w_{t,i-1}) \\
e_t = f_{word-phrase}(w_{t,1}, \dots ,w_{t,n})
$$

Note that $e_t$ encodes the generated phrase via simple mean-pooling over the words, which provides additional word-level context to the next phrase.
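
As a rough illustration of the two operations named above, the following NumPy sketch computes an attention context $c_t$ and a mean-pooled phrase encoding $e_t$. The additive scoring function, the weight matrices <code>W_h</code>, <code>W_l</code>, <code>W_a</code>, <code>v</code>, and all dimensions are assumptions made for this example; the summary only states that $c_t = f_{att}(h_t, l_t, a)$ and that $e_t$ is obtained by mean pooling.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(h_t, l_t, a, W_h, W_l, W_a, v):
    """c_t = f_att(h_t, l_t, a), with an ASSUMED additive scoring function."""
    # One score per image location j, conditioned on phrase state and label.
    scores = np.tanh(a @ W_a + h_t @ W_h + l_t @ W_l) @ v
    alpha = softmax(scores)          # attention weights over the n locations
    return alpha @ a, alpha          # context vector c_t and the weights

def phrase_encoding(word_vectors):
    """e_t: mean pooling over the word vectors of one generated phrase."""
    return np.mean(word_vectors, axis=0)

rng = np.random.default_rng(0)
n, d_a, d_h, k = 6, 8, 5, 4          # hypothetical sizes
a = rng.normal(size=(n, d_a))        # CNN features a_1..a_n
h_t = rng.normal(size=d_h)           # phrase-RNN hidden state
l_t = softmax(rng.normal(size=4))    # label distribution over NP, PP, VP, CP
W_a = rng.normal(size=(d_a, k))
W_h = rng.normal(size=(d_h, k))
W_l = rng.normal(size=(4, k))
v = rng.normal(size=k)

c_t, alpha = attend(h_t, l_t, a, W_h, W_l, W_a, v)
e_t = phrase_encoding(np.array([[1.0, 2.0], [3.0, 4.0]]))
```

Here <code>c_t</code> is a convex combination of the image features, so a phrase label that favours objects can shift the weights <code>alpha</code> toward object locations.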

=== Crowd-sourcing Human Feedback ===
The authors created a web interface for collecting feedback. Figure 2 depicts the interface and an example of caption correction. There are two rounds of annotation. In the first round, the annotator is shown a captioned image and is asked to assess the quality of the caption by choosing between: perfect, acceptable, grammar mistakes only, minor errors, or major errors. Annotators are asked to choose minor or major errors if the caption contains errors in semantics. They are advised to choose minor for small errors such as wrong or missing attributes or awkward prepositions, and major whenever any object or action naming is wrong. A visualization of this web-based interface is provided in Figure 3.

[[File:crowd.png|600px|center|thumb|Figure 2: An example of a generated caption and its corresponding feedback]]
[[File:teaching 1.PNG|600px|center|thumb|Figure 3: Web based feedback collection interface]]
For the next (more detailed, and thus more costly) round of annotation, the authors select only captions that were not marked as either perfect or acceptable in the first round. Since these captions contain errors, the new annotator is required to provide detailed feedback about the mistakes. Annotators are asked to:
#Choose the type of required correction (something “should be replaced”, “is missing”, or “should be deleted”)
#Write feedback in natural language (annotators are asked to describe a single mistake at a time)
#Mark the type of mistake (whether the mistake corresponds to an error in object, action, attribute, preposition, counting, or grammar)
#Highlight the word/phrase that contains the mistake
#Correct the chosen word/phrase
#Evaluate the quality of the caption after correction (it could be bad even after one round of correction)

The figure below shows the statistics of the evaluations before and after one round of the correction task. The authors acknowledge the costliness of the second round of annotation.

[[File:ham1.png|660px|center|thumb|Figure 3: Caption quality evaluation by the human annotators. Plot on the left shows evaluation for captions generated with the reference model (MLE). The right plot shows evaluation of the human-corrected captions (after completing at least one round of feedback).]]

=== Feedback Network ===

The collected feedback provides a strong supervisory signal which can be used in the RL framework. In particular, the authors design a neural network (feedback network or FBN) which provides an additional reward based on the feedback sentence.

RL training will require us to generate samples (captions) from the model. Thus, during training, the sampled captions for each training image will differ from the reference maximum likelihood estimation (MLE) caption for which the feedback is provided. The goal of the feedback network is to read a newly sampled caption, and judge the correctness of each phrase conditioned on the feedback. This network performs the following computations:

[[File:fbn.JPG|550px|right|thumb|Figure 4: The architecture of the feedback network (FBN) that classifies each phrase in a sampled sentence (top left) as either correct, wrong or not relevant, by conditioning on the feedback sentence.]]

$$
h_t^{caption} = f_{sent}(h_{t-1}^{caption}, \omega_t^c) \\
h_t^{feedback} = f_{sent}(h_{t-1}^{feedback}, \omega_t^f) \\
q_i = f_{phrase}(\omega_{i,1}^c, \omega_{i,2}^c, \dots, \omega_{i,N}^c) \\
o_i = f_{fbn}(h_T^{caption}, h_T^{feedback }, q_i, m) \\
$$

Here, $\omega_t^c$ and $\omega_t^f$ denote the one-hot encodings of the t-th word in the sampled caption and in the feedback sentence, respectively. FBN encodes both the caption and the feedback using an LSTM ($f_{sent}$), performs mean pooling ($f_{phrase}$) over the words in a phrase to represent phrase i with $q_i$, and passes this information through a 3-layer MLP ($f_{fbn}$). The MLP also accepts information about the mistake type (e.g., wrong object/action), encoded as a one-hot vector m.
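
The last two steps of the FBN (phrase pooling and the 3-layer MLP) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the final LSTM states are stood in for by random vectors, and the helper <code>init_mlp</code>, the ReLU activations, the hidden size and the vector dimensions are all assumptions. The six mistake types are the ones listed in the annotation instructions above.

```python
import numpy as np

LABELS = ("correct", "wrong", "not relevant")

def init_mlp(rng, in_dim, hidden_dim, n_classes=len(LABELS)):
    """Random weights for a 3-layer MLP (illustrative initialization only)."""
    dims = [in_dim, hidden_dim, hidden_dim, n_classes]
    return [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def fbn_forward(h_caption, h_feedback, phrase_words, m, mlp):
    """o_i = f_fbn(h_T^caption, h_T^feedback, q_i, m) as class probabilities."""
    q_i = np.mean(phrase_words, axis=0)            # f_phrase: mean pooling
    x = np.concatenate([h_caption, h_feedback, q_i, m])
    for W, b in mlp[:-1]:
        x = np.maximum(0.0, x @ W + b)             # ReLU is an assumption
    W, b = mlp[-1]
    logits = x @ W + b
    p = np.exp(logits - logits.max())              # stable softmax
    return p / p.sum()

rng = np.random.default_rng(0)
d = 16                                             # hypothetical vector size
h_caption = rng.normal(size=d)                     # stand-in for the LSTM state
h_feedback = rng.normal(size=d)                    # stand-in for the LSTM state
phrase_words = rng.normal(size=(3, d))             # three words in phrase i
m = np.eye(6)[0]                                   # one-hot mistake type: "object"
mlp = init_mlp(rng, in_dim=3 * d + 6, hidden_dim=32)

probs = fbn_forward(h_caption, h_feedback, phrase_words, m, mlp)
label = LABELS[int(np.argmax(probs))]
```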

=== Policy Gradient Optimization using Natural Language Feedback ===

One can think of a caption decoder as an agent following a parameterized policy $p_\theta$ that selects an action at each time step. An “action” in this case is choosing a word from the vocabulary (for the word RNN), or a phrase label (for the phrase RNN). The objective for learning the parameters of the model is the expected reward received when completing the caption $\omega^s = (\omega^s_1, \dots ,\omega^s_T)$. Here, $\omega_t^s$ is the word sampled from the model at time step t.

$$
L(\theta) = -\mathop{{}\mathbb{E}}_{\omega^s \sim p_\theta}[r(\omega^s)]
$$
Because the reward is computed on discrete sampled words, this objective cannot be differentiated through the sampling step directly. Policy gradients are therefore used, as in [13], to obtain the gradient of the objective function:
$$
\nabla_\theta L(\theta) = - \mathop{{}\mathbb{E}}_{\omega^s \sim p_\theta}[r(\omega^s)\nabla_\theta \log p_\theta(\omega^s)]
$$
This expectation is estimated using a single Monte-Carlo sample:
$$
\nabla_\theta L(\theta) \approx - r(\omega^s)\nabla_\theta \log p_\theta(\omega^s)
$$
Then a baseline $b = r(\hat \omega)$ is used, where $\hat{\omega}$ is the caption obtained by greedy decoding. A baseline does not change the expected gradient but can drastically reduce the variance.
$$
\hat{\omega}_t = \arg\max \ p(\omega_t|h_t) \\
\nabla_\theta L(\theta) \approx - (r(\omega^s) - r(\hat{\omega}))\nabla_\theta \log p_\theta(\omega^s)
$$
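
The single-sample estimator with a greedy baseline can be illustrated on a toy problem. The sketch below is not the captioning model: the "policy" is a softmax over three actions with the logits as its only parameters, and the reward function, learning rate and number of steps are assumptions chosen for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    g = -softmax(logits)
    g[action] += 1.0
    return g

def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """One update using -(r(sample) - r(greedy)) * grad log p(sample)."""
    probs = softmax(logits)
    sampled = int(rng.choice(len(logits), p=probs))   # Monte-Carlo sample
    greedy = int(np.argmax(probs))                    # baseline action
    advantage = reward_fn(sampled) - reward_fn(greedy)
    grad_loss = -advantage * grad_log_prob(logits, sampled)
    return logits - lr * grad_loss                    # gradient descent on L

rng = np.random.default_rng(0)
logits = np.zeros(3)
reward = lambda action: 1.0 if action == 2 else 0.0  # toy reward (assumption)
for _ in range(200):
    logits = reinforce_step(logits, reward, rng)
best_action = int(np.argmax(logits))
```

When the sampled action equals the greedy one the advantage is zero and nothing changes, so the policy is only updated by samples that do better or worse than its current greedy choice.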
'''Reward:''' A sentence reward is defined as a weighted sum of the BLEU scores. BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Additionally, it was one of the first metrics to claim a high correlation with human judgements of quality [10, 11, 12] and remains one of the most popular automated and inexpensive metrics (more information about the BLEU score [http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf here] and a nice discussion on it [https://www.youtube.com/watch?v=ORHVgR-DVGg here]).

$$
r(\omega^s) = \beta \sum_i \lambda_i \cdot BLEU_i(\omega^s, ref)
$$
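
To make the weighted sum concrete, here is a simplified sketch in which each $BLEU_i$ is approximated by the clipped i-gram precision alone. This is an assumption made for brevity: the full BLEU score also includes a brevity penalty and combines several n-gram orders, and the weights $\lambda_i$ used below are hypothetical. The factor <code>beta</code> plays the role of $\beta$ in the formula above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, references, n):
    """Clipped n-gram precision (a simplified stand-in for BLEU_n)."""
    cand = ngrams(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / total

def sentence_reward(candidate, references, lambdas, beta=1.0):
    """r = beta * sum_i lambda_i * BLEU_i(candidate, references)."""
    return beta * sum(lam * ngram_precision(candidate, references, i)
                      for i, lam in enumerate(lambdas, start=1))

candidate = "a dog runs".split()
references = ["a dog sits".split()]
# Unigram precision is 2/3 and bigram precision is 1/2 for this pair.
r = sentence_reward(candidate, references, lambdas=[0.5, 0.5], beta=0.8)
```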

As reference captions for computing the reward, the authors use either captions generated by a snapshot of the model that were evaluated as having no minor or major errors, or ground-truth captions. In addition, they weight the reward by the caption quality as provided by the annotators (e.g. $\beta = 1$ for perfect and $\beta = 0.8$ for acceptable). They further incorporate the reward provided by the feedback network:
$$
r(\omega_t^p) = r(\omega^s) + \lambda_f f_{fbn}(\omega^s, feedback, \omega_t^p)
$$
where $\omega^p_t$ denotes the sequence of words in the t-th phrase. Note that FBN produces a classification of each phrase. This can be converted into a reward by assigning 1 to correct, -1 to wrong, and 0 to not relevant. So the final gradient takes the following form:
$$
\nabla_\theta L(\theta) = - \sum_{p=1}^{P}(r(\omega^p) - r(\hat{\omega}^p))\nabla_\theta \log p_\theta(\omega^p)
$$
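
The phrase-level reward and the per-phrase weights that multiply $\nabla_\theta \log p_\theta(\omega^p)$ can be sketched as below. The function names, the value of $\lambda_f$ and the example numbers are hypothetical; only the mapping of FBN classes to 1, -1 and 0 comes from the text above.

```python
# Mapping from FBN class to reward, as described in the text.
FBN_REWARD = {"correct": 1.0, "wrong": -1.0, "not relevant": 0.0}

def phrase_rewards(sentence_reward, fbn_labels, lambda_f):
    """r(phrase_t) = r(sentence) + lambda_f * FBN reward for phrase t."""
    return [sentence_reward + lambda_f * FBN_REWARD[label]
            for label in fbn_labels]

def phrase_advantages(sampled_rewards, baseline_rewards):
    """Per-phrase weights r(w^p) - r(w_hat^p) used in the final gradient."""
    return [rs - rb for rs, rb in zip(sampled_rewards, baseline_rewards)]

# Hypothetical numbers: a sampled caption with three phrases.
sampled = phrase_rewards(0.5, ["correct", "wrong", "not relevant"], lambda_f=0.2)
baseline = phrase_rewards(0.4, ["not relevant"] * 3, lambda_f=0.2)
advantages = phrase_advantages(sampled, baseline)
```

A phrase judged wrong by the FBN gets a smaller weight than its neighbours in the same caption, which is how the feedback localizes the learning signal.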

= Experimental Results =
The authors used the MS-COCO dataset [9]. COCO is a large-scale object detection, segmentation, and captioning dataset. Its features include object segmentation, recognition in context, superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, and 250,000 people with keypoints. The authors used 82K images for training, 2K for validation, and 4K for testing. To collect feedback, they randomly chose 7K images from the training set, as well as all 2K images from validation. They use a word vocabulary of size 23,115.

=== Phrase-based captioning model ===
The authors analyze different instantiations of their phrase-based captioning in the following table. To sanity check, they compare it to a flat approach (word-RNN only). Overall, their model performs slightly worse (0.66 points). However, the main strength of their model is that it allows a more natural integration with feedback.

[[File:table2.JPG|center]]

=== Feedback network ===
The authors use 9000 images to collect feedback; 5150 of them are evaluated as containing errors. Finally, they use 4174 images for the second round of annotation. They randomly select 9/10 of them to serve as a training set for the feedback network, and 1/10 as a test set. The model achieves its highest accuracy of 74.66% when it is provided with the kind of mistake the reference caption had (e.g. an object, action, etc.). This is not particularly surprising, as this setting gives the model the most additional information, though it also takes the most time to annotate.

=== RL with Natural Language Feedback ===
The following table reports the performance of several instantiations of the RL models. All models have been pre-trained using cross-entropy loss (MLE) on the full MS-COCO training set. For the next rounds of training, all the models are trained only on the 9K images.

The authors define “C” captions as all captions that were corrected by the annotators and were not evaluated as containing minor or major errors, with ground-truth (GT) captions used for the rest of the images. For “A”, they use all captions (including captions which were evaluated as correct) that did not have minor or major errors, and GT for the rest. A detailed breakdown of these captions is reported in the following table. The authors test their model in two separate cases:

*They first test a model using the standard cross-entropy loss, but which now also has access to the corrected captions in addition to the 5 GT captions. This model (MLEC) improves over the original MLE model by 1.4 points. They then test the RL model by optimizing the metric with respect to the 5 GT captions. This brings an additional point, a gain of 2.4 over the MLE model. Next, the RL agent is given access to 3 GT captions, the “C” captions and feedback sentences. This model outperforms the no-feedback baseline by 0.5 points. If the RL agent has access to 4 GT captions and feedback descriptions, it gains a total of 1.1 points over the baseline RL model and 3.5 over the MLE model.

*They also test a more realistic scenario, in which the models have access to either a single GT caption, “C” (or “A”), and feedback. This mimics a scenario in which the human teacher observes the agent and either gives feedback about the agent’s mistakes, or, if the agent’s caption is completely wrong, the teacher writes a new caption. Interestingly, RL when provided with the corrected captions performs better than when given GT captions. Overall, their model outperforms the base RL (no feedback) by 1.2 points.

[[File:table3.PNG|center]]

These experiments make an important point. Instead of giving the RL agent a completely new target (caption), a better strategy is to “teach” the agent about the mistakes it is making and suggest a correction. Intuitively, informing the agent of its error conveys more information than showing it a completely correct answer, because the latter forces the network to learn from a sample that is, at least seemingly, unrelated to what it previously produced.

= Conclusion =
In this paper, a human teacher is enabled to provide feedback to the learning agent in the form of natural language. The authors focused on the problem of image captioning. They proposed a hierarchical phrase-based RNN as the captioning model, which allowed natural integration with human feedback.
They also crowd-sourced feedback, and showed how to incorporate it in policy gradient optimization.

= Comments =
In the hierarchical phrase-based RNN, human involvement is a key part of improving the performance of the network. According to this paper, the feedback LSTM network is capable of handling simple sentences. What if the feedback is weak or even ambiguous? Is there a threshold for the feedback such that the network can reject wrong feedback? Following this architecture, it would be interesting to see whether such a feedback strategy can be applied to machine translation.

= References =
[1] Huan Ling and Sanja Fidler. Teaching Machines to Describe Images via Natural Language Feedback. In arXiv:1706.00130, 2017.

[2] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L. Isbell, and Andrea Lockerd Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In NIPS, 2013.

[3] K. Judah, S. Roy, A. Fern, and T. Dietterich. Reinforcement learning via practice and critique advice. In AAAI, 2010.

[4] A. Thomaz and C. Breazeal. Reinforcement learning with human teachers: Evidence of feedback and guidance. In AAAI, 2006.

[5] Richard Maclin and Jude W. Shavlik. Incorporating advice into agents that learn from reinforcements. In National Conference on Artificial Intelligence, pages 694–699, 1994.

[6] Jason Weston. Dialog-based language learning. In arXiv:1604.06045, 2016.

[7] Russell Kaplan, Christopher Sauer, and Alexander Sosa. Beating atari with natural language guided reinforcement learning. In arXiv:1704.05539, 2017.

[8] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, pages 278–287, 1999.

[9] COCO Dataset http://cocodataset.org/#home

[10] https://en.wikipedia.org/wiki/BLEU

[11] Papineni, K., Roukos, S., Ward, T., Henderson, J and Reeder, F. (2002). “Corpus-based Comprehensive and Diagnostic MT Evaluation: Initial Arabic, Chinese, French, and Spanish Results” in Proceedings of Human Language Technology 2002, San Diego, pp. 132–137

[12] Callison-Burch, C., Osborne, M. and Koehn, P. (2006) "Re-evaluating the Role of BLEU in Machine Translation Research" in 11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006 pp. 249–256

[13] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, Kevin Murphy. "Improved Image Captioning via Policy Gradient optimization of SPIDEr". Under review for ICCV 2017.

[14] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille. "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)". arXiv:1412.6632

STAT946F17/ Teaching Machines to Describe Images via Natural Language Feedback

2017-11-16T02:01:53Z

A2prasad: /* Crowd-sourcing Human Feedback */

= Introduction =
In the era of Artificial Intelligence, one should ideally be able to educate the robot about its mistakes, possibly without needing to dig into the underlying software. Reinforcement learning (RL) has become a standard way of training artificial agents that interact with an environment. Several works have explored the idea of incorporating humans into the learning process in order to help the reinforcement learning agent learn faster. In most cases, the guidance comes in the form of a simple numerical (or “good”/“bad”) reward. In this work, natural language is used as a way to guide an RL agent. The authors argue that a sentence provides a much stronger learning signal than a numeric reward, in that one can easily point to where the mistakes occur and suggest how to correct them.

Here the goal is to allow a non-expert human teacher to give feedback to an RL agent in the form of natural language, just as one would to a learning child. The authors focus on the problem of image captioning, a task where the content of an image is described using sentences. This can also be seen as a multimodal problem, in which the model must combine learning in both the image-processing and text-generation domains. Image captioning is an application where the quality of the output can easily be judged by non-experts.

= Related Works =
Several works incorporate human feedback to help an RL agent learn faster.
#Thomaz et al. (2006) exploit humans in the loop to teach an agent to cook in a virtual kitchen. The users watch the agent learn and may intervene at any time to give a scalar reward. Reward shaping (Ng et al., 1999) is used to incorporate this information in the Markov Decision Process (MDP).
#Judah et al. (2010) iterate between “practice”, during which the agent interacts with the real environment, and a critique session where a human labels any subset of the chosen actions as good or bad.
#Griffith et al. (2013) propose policy shaping, which incorporates right/wrong feedback by utilizing it as direct policy labels.
#Mao et al. propose a multimodal Recurrent Neural Network (m-RNN) for image captioning, evaluated on four datasets: IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO [14]. Their approach combines a deep RNN for sentence generation with a deep CNN for image learning.

The above approaches mostly assume that humans provide a numeric reward, unlike in this work where feedback is given in natural language. A few attempts have been made to advise an RL agent using language.
# Maclin et al. (1994) translated advice to a short program which was then implemented as a neural network. The units in this network represent Boolean concepts, which recognize whether the observed state satisfies the constraints given by the program. In such a case, the advice network will encourage the policy to take the suggested action.
# Weston (2016) incorporates human feedback to improve a text-based question answering agent.
# Kaplan et al. (2017) exploit textual advice to improve training time of the A3C algorithm in playing an Atari game.

The authors propose a phrase-based image captioning model, which is similar to most image captioning models except that it exploits attention and linguistic information. Several recent approaches trained the captioning model with policy gradients in order to directly optimize for the desired performance metrics. This work follows the same line.

There are also similar efforts on dialogue-based visual representation learning and conversation modeling. These models aim to mimic human-to-human conversations, while in this work the human converses with and guides an artificial learning agent.

= Methodology =
The framework consists of a new phrase-based captioning model trained with Policy Gradients that incorporates natural language feedback provided by a human teacher. The phrase-based captioning model allows natural guidance by a non-expert.
=== Phrase-based Image Captioning ===
The captioning model uses a hierarchical recurrent neural network (RNN). The model is composed of a two-level LSTM, a phrase RNN at the top level, and a word RNN that generates a sequence of words for each phrase. One can think of the phrase RNN as providing a “topic” at each time step, which instructs the word RNN what to talk about. The structure of the model is explained through Figure 1.

[[File:modelham.png|center|500px|thumb|Figure 1: Hierarchical phrase-based captioning model, composed of a phrase-RNN at the top level, and a word level RNN which outputs a sequence of words for each phrase.]]

A convolutional neural network is used to extract a set of feature vectors $a = (a_1, \dots, a_n)$, where $a_j$ is the feature at location j in the input image. These feature vectors are given to the attention layer. The attention layer has two more inputs: the current hidden state of the phrase-RNN and the output of the label unit. The label unit predicts one of four possible phrase labels: noun phrase (NP), preposition phrase (PP), verb phrase (VP), and conjunction phrase (CP). This information could be useful for the attention layer. For example, for an NP the model may look at objects in the image, while for a VP it may focus on more global information. The computations can be expressed with the following equations:

$$
\begin{align*}
\small{\textrm{hidden state of the phrase-RNN at time step t}} \leftarrow h_t &= f_{phrase}(h_{t-1}, l_{t-1}, c_{t-1}, e_{t-1}) \\
\small{\text{output of the label unit}} \leftarrow l_t &= softmax(f_{phrase-label}(h_t)) \\
\small{\text{output of the attention layer}} \leftarrow c_t &= f_{att}(h_t, l_t, a)
\end{align*}
$$

After deciding about phrases, the outputs of phrase-RNN go to another LSTM to produce words for each phrase. $w_{t,i}$ denotes the i-th word output of the word-RNN in the t-th phrase. There is an additional <EOP> token in word-RNN’s vocabulary, which signals the end-of-phrase. Furthermore, $h_{t,i}$ denotes the i-th hidden state of the word-RNN for the t-th phrase.
$$
h_{t,i} = f_{word}(h_{t,i-1}, c_t, w_{t,i}) \\
w_{t,i} = f_{out}(h_{t,i}, c_t, w_{t,i-1}) \\
e_t = f_{word-phrase}(w_{t,1}, \dots ,w_{t,n})
$$

Note that $e_t$ encodes the generated phrase via simple mean-pooling over the words, which provides additional word-level context to the next phrase.

=== Crowd-sourcing Human Feedback ===
The authors created a web interface for collecting feedback. Figure 2 depicts the interface and an example of caption correction. There are two rounds of annotation. In the first round, the annotator is shown a captioned image and is asked to assess the quality of the caption by choosing between: perfect, acceptable, grammar mistakes only, minor errors, or major errors. Annotators are asked to choose minor or major errors if the caption contains errors in semantics. They are advised to choose minor for small errors such as wrong or missing attributes or awkward prepositions, and major whenever any object or action naming is wrong.

[[File:crowd.png|600px|center|thumb|Figure 2: An example of a generated caption and its corresponding feedback]]
[[File:teaching 1.PNG|600px|center|thumb|Figure 3: Web based feedback collection interface]]
For the next (more detailed, and thus more costly) round of annotation, the authors select only captions that were not marked as either perfect or acceptable in the first round. Since these captions contain errors, the new annotator is required to provide detailed feedback about the mistakes. Annotators are asked to:
#Choose the type of required correction (something “should be replaced”, “is missing”, or “should be deleted”)
#Write feedback in natural language (annotators are asked to describe a single mistake at a time)
#Mark the type of mistake (whether the mistake corresponds to an error in object, action, attribute, preposition, counting, or grammar)
#Highlight the word/phrase that contains the mistake
#Correct the chosen word/phrase
#Evaluate the quality of the caption after correction (it could be bad even after one round of correction)

The figure below shows the statistics of the evaluations before and after one round of the correction task. The authors acknowledge the costliness of the second round of annotation.

[[File:ham1.png|660px|center|thumb|Figure 3: Caption quality evaluation by the human annotators. Plot on the left shows evaluation for captions generated with the reference model (MLE). The right plot shows evaluation of the human-corrected captions (after completing at least one round of feedback).]]

=== Feedback Network ===

The collected feedback provides a strong supervisory signal which can be used in the RL framework. In particular, the authors design a neural network (feedback network or FBN) which provides an additional reward based on the feedback sentence.

RL training will require us to generate samples (captions) from the model. Thus, during training, the sampled captions for each training image will differ from the reference maximum likelihood estimation (MLE) caption for which the feedback is provided. The goal of the feedback network is to read a newly sampled caption, and judge the correctness of each phrase conditioned on the feedback. This network performs the following computations:

[[File:fbn.JPG|550px|right|thumb|Figure 4: The architecture of the feedback network (FBN) that classifies each phrase in a sampled sentence (top left) as either correct, wrong or not relevant, by conditioning on the feedback sentence.]]

$$
h_t^{caption} = f_{sent}(h_{t-1}^{caption}, \omega_t^c) \\
h_t^{feedback} = f_{sent}(h_{t-1}^{feedback}, \omega_t^f) \\
q_i = f_{phrase}(\omega_{i,1}^c, \omega_{i,2}^c, \dots, \omega_{i,N}^c) \\
o_i = f_{fbn}(h_T^{caption}, h_T^{feedback }, q_i, m) \\
$$

Here, $\omega_t^c$ and $\omega_t^f$ denote the one-hot encodings of the t-th word in the sampled caption and in the feedback sentence, respectively. FBN encodes both the caption and the feedback using an LSTM ($f_{sent}$), performs mean pooling ($f_{phrase}$) over the words in a phrase to represent phrase i with $q_i$, and passes this information through a 3-layer MLP ($f_{fbn}$). The MLP also accepts information about the mistake type (e.g., wrong object/action), encoded as a one-hot vector m.

=== Policy Gradient Optimization using Natural Language Feedback ===

One can think of a caption decoder as an agent following a parameterized policy $p_\theta$ that selects an action at each time step. An “action” in this case is choosing a word from the vocabulary (for the word RNN), or a phrase label (for the phrase RNN). The objective for learning the parameters of the model is the expected reward received when completing the caption $\omega^s = (\omega^s_1, \dots ,\omega^s_T)$. Here, $\omega_t^s$ is the word sampled from the model at time step t.

$$
L(\theta) = -\mathop{{}\mathbb{E}}_{\omega^s \sim p_\theta}[r(\omega^s)]
$$
Because the reward is computed on discrete sampled words, this objective cannot be differentiated through the sampling step directly. Policy gradients are therefore used, as in [13], to obtain the gradient of the objective function:
$$
\nabla_\theta L(\theta) = - \mathop{{}\mathbb{E}}_{\omega^s \sim p_\theta}[r(\omega^s)\nabla_\theta \log p_\theta(\omega^s)]
$$
This expectation is estimated using a single Monte-Carlo sample:
$$
\nabla_\theta L(\theta) \approx - r(\omega^s)\nabla_\theta \log p_\theta(\omega^s)
$$
Then a baseline $b = r(\hat \omega)$ is used, where $\hat{\omega}$ is the caption obtained by greedy decoding. A baseline does not change the expected gradient but can drastically reduce the variance.
$$
\hat{\omega}_t = \arg\max \ p(\omega_t|h_t) \\
\nabla_\theta L(\theta) \approx - (r(\omega^s) - r(\hat{\omega}))\nabla_\theta \log p_\theta(\omega^s)
$$
'''Reward:''' A sentence reward is defined as a weighted sum of the BLEU scores. BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Additionally, it was one of the first metrics to claim a high correlation with human judgements of quality [10, 11, 12] and remains one of the most popular automated and inexpensive metrics (more information about the BLEU score [http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf here] and a nice discussion on it [https://www.youtube.com/watch?v=ORHVgR-DVGg here]).

$$
r(\omega^s) = \beta \sum_i \lambda_i \cdot BLEU_i(\omega^s, ref)
$$

As reference captions for computing the reward, the authors use either captions generated by a snapshot of the model that were evaluated as having no minor or major errors, or ground-truth captions. In addition, they weight the reward by the caption quality as provided by the annotators (e.g. $\beta = 1$ for perfect and $\beta = 0.8$ for acceptable). They further incorporate the reward provided by the feedback network:
$$
r(\omega_t^p) = r(\omega^s) + \lambda_f f_{fbn}(\omega^s, feedback, \omega_t^p)
$$
where $\omega^p_t$ denotes the sequence of words in the t-th phrase. Note that FBN produces a classification of each phrase. This can be converted into a reward by assigning 1 to correct, -1 to wrong, and 0 to not relevant. So the final gradient takes the following form:
$$
\nabla_\theta L(\theta) = - \sum_{p=1}^{P}(r(\omega^p) - r(\hat{\omega}^p))\nabla_\theta \log p_\theta(\omega^p)
$$

= Experimental Results =
The authors used the MS-COCO dataset [9]. COCO is a large-scale object detection, segmentation, and captioning dataset. Its features include object segmentation, recognition in context, superpixel stuff segmentation, 330K images (more than 200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, and 250,000 people with keypoints. They used 82K images for training, 2K for validation, and 4K for testing. To collect feedback, they randomly chose 7K images from the training set, as well as all 2K images from validation. In addition, they use a word vocabulary of size 23,115.

=== Phrase-based captioning model ===
The authors analyze different instantiations of their phrase-based captioning model in the following table. As a sanity check, they compare it to a flat approach (word-RNN only). Overall, their model performs slightly worse (by 0.66 points). However, its main strength is that it allows a more natural integration with feedback.

[[File:table2.JPG|center]]

=== Feedback network ===
The authors use 9000 images to collect feedback; 5150 of them are evaluated as containing errors. Finally, they use 4174 images for the second round of annotation. They randomly select 9/10 of them to serve as the training set for the feedback network and the remaining 1/10 as the test set. The model achieves its highest accuracy of 74.66% when it is given the kind of mistake the reference caption had (e.g. an object, action, etc.). This is not particularly surprising, as this setting requires the most additional information to train the model and the most time to compile the dataset.

=== RL with Natural Language Feedback ===
The following table reports the performance of several instantiations of the RL models. All models have been pre-trained using cross-entropy loss (MLE) on the full MS-COCO training set. For the next rounds of training, all the models are trained only on the 9K images.

The authors define “C” captions as all captions that were corrected by the annotators and were not evaluated as containing a minor or major error, with ground-truth (GT) captions used for the rest of the images. For “A”, they use all captions (including captions which were evaluated as correct) that did not have minor or major errors, and GT for the rest. A detailed breakdown of these captions is reported in the following table. The authors test their model in two separate cases:

*They first test a model using the standard cross-entropy loss, but which now also has access to the corrected captions in addition to the 5 GT captions. This model (MLEC) improves over the original MLE model by 1.4 points. They then test the RL model by optimizing the metric with respect to the 5 GT captions. This brings an additional point, for a total gain of 2.4 over the MLE model. Next, the RL agent is given access to 3 GT captions, the “C” captions, and the feedback sentences. This model outperforms the no-feedback baseline by 0.5 points. If the RL agent has access to 4 GT captions and feedback descriptions, it gains a total of 1.1 points over the baseline RL model and 3.5 over the MLE model.

*They also test a more realistic scenario, in which the models have access to either a single GT caption, “C” (or “A”), and feedback. This mimics a scenario in which the human teacher observes the agent and either gives feedback about the agent’s mistakes, or, if the agent’s caption is completely wrong, the teacher writes a new caption. Interestingly, RL when provided with the corrected captions performs better than when given GT captions. Overall, their model outperforms the base RL (no feedback) by 1.2 points.

[[File:table3.PNG|center]]

These experiments make an important point. Instead of giving the RL agent a completely new target (caption), a better strategy is to “teach” the agent about the mistakes it is making and suggest a correction. This is intuitive: informing the agent of its error conveys more information than teaching it a completely correct answer, because the latter forces the network to "train" its memory from a sample which is, at least seemingly, insulated from its prior memory.

= Conclusion =
In this paper, a human teacher is enabled to provide feedback to the learning agent in the form of natural language. The authors focused on the problem of image captioning. They proposed a hierarchical phrase-based RNN as the captioning model, which allowed natural integration with human feedback.
They also crowd-sourced feedback, and showed how to incorporate it in policy gradient optimization.

= Comments =
In the hierarchical phrase-based RNN, human involvement is a key part of improving the performance of the network. According to this paper, the feedback LSTM network is capable of handling simple sentences. What if the feedback is weak or even ambiguous? Is there a threshold for the feedback such that the network can reject wrong feedback? Following this architecture, it would be interesting to see whether such a feedback strategy can be applied in machine translation.

= References =
[1] Huan Ling and Sanja Fidler. Teaching Machines to Describe Images via Natural Language Feedback. In arXiv:1706.00130, 2017.

[2] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L. Isbell, and Andrea Lockerd Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In NIPS, 2013.

[3] K. Judah, S. Roy, A. Fern, and T. Dietterich. Reinforcement learning via practice and critique advice. In AAAI, 2010.

[4] A. Thomaz and C. Breazeal. Reinforcement learning with human teachers: Evidence of feedback and guidance. In AAAI, 2006.

[5] Richard Maclin and Jude W. Shavlik. Incorporating advice into agents that learn from reinforcements. In National Conference on Artificial Intelligence, pages 694–699, 1994.

[6] Jason Weston. Dialog-based language learning. In arXiv:1604.06045, 2016.

[7] Russell Kaplan, Christopher Sauer, and Alexander Sosa. Beating atari with natural language guided reinforcement learning. In arXiv:1704.05539, 2017.

[8] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, pages 278–287, 1999.

[9] COCO Dataset http://cocodataset.org/#home

[10] https://en.wikipedia.org/wiki/BLEU

[11] Papineni, K., Roukos, S., Ward, T., Henderson, J and Reeder, F. (2002). “Corpus-based Comprehensive and Diagnostic MT Evaluation: Initial Arabic, Chinese, French, and Spanish Results” in Proceedings of Human Language Technology 2002, San Diego, pp. 132–137

[12] Callison-Burch, C., Osborne, M. and Koehn, P. (2006) "Re-evaluating the Role of BLEU in Machine Translation Research" in 11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006 pp. 249–256

[13] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, Kevin Murphy. "Improved Image Captioning via Policy Gradient optimization of SPIDEr". Under review for ICCV 2017.

[14] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille. "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)". arXiv:1412.6632

STAT946F17/ Teaching Machines to Describe Images via Natural Language Feedback

2017-11-16T02:00:30Z

A2prasad: /* Crowd-sourcing Human Feedback */

= Introduction =
In the era of Artificial Intelligence, one should ideally be able to educate the robot about its mistakes, possibly without needing to dig into the underlying software. Reinforcement learning (RL) has become a standard way of training artificial agents that interact with an environment. Several works have explored the idea of incorporating humans into the learning process in order to help the reinforcement learning agent learn faster. In most cases, the guidance comes in the form of a simple numerical (or “good”/“bad”) reward. In this work, natural language is used as a way to guide an RL agent. The authors argue that a sentence provides a much stronger learning signal than a numeric reward, in that one can easily point to where the mistakes occur and suggest how to correct them.

Here the goal is to allow a non-expert human teacher to give feedback to an RL agent in the form of natural language, just as one would to a learning child. The authors focus on the problem of image captioning, a task where the content of an image is described using sentences. This can also be seen as a multimodal problem in which the model needs to combine learning in both the image-processing and text-generation domains. Image captioning is an application where the quality of the output can easily be judged by non-experts.

= Related Works =
Several works incorporate human feedback to help an RL agent learn faster.
#Thomaz et al. (2006) exploit humans in the loop to teach an agent to cook in a virtual kitchen. The users watch the agent learn and may intervene at any time to give a scalar reward. Reward shaping (Ng et al., 1999) is used to incorporate this information in the Markov Decision Process (MDP).
#Judah et al. (2010) iterate between “practice”, during which the agent interacts with the real environment, and a critique session in which a human labels any subset of the chosen actions as good or bad.
#Griffith et al. (2013) propose policy shaping, which incorporates right/wrong feedback by utilizing it as direct policy labels.
#Mao et al. propose a multimodal Recurrent Neural Network (m-RNN) for image captioning and evaluate it on four benchmark datasets: IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO [14]. Their approach uses a double network consisting of a deep RNN for sentence generation and a deep CNN for image learning.

The above approaches mostly assume that humans provide a numeric reward, unlike in this work, where feedback is given in natural language. A few attempts have been made to advise an RL agent using language.
# Maclin et al. (1994) translated advice to a short program which was then implemented as a neural network. The units in this network represent Boolean concepts, which recognize whether the observed state satisfies the constraints given by the program. In such a case, the advice network will encourage the policy to take the suggested action.
# Weston (2016) incorporates human feedback to improve a text-based question answering agent.
# Kaplan et al. (2017) exploit textual advice to improve the training time of the A3C algorithm in playing an Atari game.

The authors propose the Phrase-based Image Captioning Model, which is similar to most image captioning models except that it exploits attention and linguistic information. Several recent approaches trained the captioning model with policy gradients in order to directly optimize for the desired performance metrics. This work follows the same line.

There are also similar efforts on dialogue-based visual representation learning and conversation modeling. These models aim to mimic human-to-human conversations, while in this work the human converses with and guides an artificial learning agent.

= Methodology =
The framework consists of a new phrase-based captioning model trained with Policy Gradients that incorporates natural language feedback provided by a human teacher. The phrase-based captioning model allows natural guidance by a non-expert.
=== Phrase-based Image Captioning ===
The captioning model uses a hierarchical recurrent neural network (RNN). The model is composed of a two-level LSTM, a phrase RNN at the top level, and a word RNN that generates a sequence of words for each phrase. One can think of the phrase RNN as providing a “topic” at each time step, which instructs the word RNN what to talk about. The structure of the model is explained through Figure 1.

[[File:modelham.png|center|500px|thumb|Figure 1: Hierarchical phrase-based captioning model, composed of a phrase-RNN at the top level, and a word level RNN which outputs a sequence of words for each phrase.]]

A convolutional neural network is used to extract a set of feature vectors $a = (a_1, \dots, a_n)$, where $a_j$ is the feature at location j in the input image. These feature vectors are given to the attention layer. The attention layer has two further inputs: the current hidden state of the phrase-RNN and the output of the label unit. The label unit predicts one of four possible phrase labels, i.e., noun (NP), preposition (PP), verb (VP), or conjunction phrase (CP). This information could be useful for the attention layer. For example, for an NP the model may look at objects in the image, while for a VP it may focus on more global information. The computations can be expressed with the following equations:

$$
\begin{align*}
\small{\textrm{hidden state of the phrase-RNN at time step t}} \leftarrow h_t &= f_{phrase}(h_{t-1}, l_{t-1}, c_{t-1}, e_{t-1}) \\
\small{\text{output of the label unit}} \leftarrow l_t &= softmax(f_{phrase-label}(h_t)) \\
\small{\text{output of the attention layer}} \leftarrow c_t &= f_{att}(h_t, l_t, a)
\end{align*}
$$

After deciding about phrases, the outputs of phrase-RNN go to another LSTM to produce words for each phrase. $w_{t,i}$ denotes the i-th word output of the word-RNN in the t-th phrase. There is an additional <EOP> token in word-RNN’s vocabulary, which signals the end-of-phrase. Furthermore, $h_{t,i}$ denotes the i-th hidden state of the word-RNN for the t-th phrase.
$$
h_{t,i} = f_{word}(h_{t,i-1}, c_t, w_{t,i}) \\
w_{t,i} = f_{out}(h_{t,i}, c_t, w_{t,i-1}) \\
e_t = f_{word-phrase}(w_{t,1}, \dots ,w_{t,n})
$$

Note that $e_t$ encodes the generated phrase via simple mean-pooling over the words, which provides additional word-level context to the next phrase.
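
The mean-pooling step can be illustrated with a minimal sketch. The embedding table, its size, and the function name <code>encode_phrase</code> are assumptions made for illustration only; the paper's actual embeddings are learned.

```python
import numpy as np

def encode_phrase(word_ids, embedding):
    """Mean-pool the embeddings of the words in one phrase (the role of f_word-phrase)."""
    # embedding has shape (vocab_size, dim); word_ids selects one row per word.
    return embedding[np.asarray(word_ids)].mean(axis=0)

# Toy embedding table: 5 words, 3 dimensions (values are arbitrary).
embedding = np.arange(15, dtype=float).reshape(5, 3)

# A phrase made of words 0 and 2 is encoded as the average of rows 0 and 2.
e_t = encode_phrase([0, 2], embedding)
```

Because the encoding is an average, it has the same dimension regardless of how many words the phrase contains, which is what allows it to be fed back into the phrase-RNN at the next step.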

=== Crowd-sourcing Human Feedback ===
The authors have created a web interface for collecting feedback. Figure 2 depicts the interface and an example of caption correction. There are two rounds of annotation. In the first round, the annotator is shown a captioned image and is asked to assess the quality of the caption by choosing between: perfect, acceptable, grammar mistakes only, minor or major errors. Annotators are asked to choose minor or major error if the caption contains errors in semantics. They are advised to choose minor for small errors such as wrong or missing attributes or awkward prepositions, and major whenever any object or action naming is wrong.

[[File:crowd.png|600px|center|thumb|Figure 2: An example of a generated caption and its corresponding feedback]]
[[File:teaching 1.png|600px|center|thumb|Figure 2 (continued): Web-based feedback collection interface]]
For the next (more detailed, and thus more costly) round of annotation, they select only captions which were not marked as either perfect or acceptable in the first round. Since these captions contain errors, the new annotator is required to provide detailed feedback about the mistakes. Annotators are asked to:
#Choose the type of required correction (something “should be replaced”, “is missing”, or “should be deleted”)
#Write feedback in natural language (annotators are asked to describe a single mistake at a time)
#Mark the type of mistake (whether the mistake corresponds to an error in object, action, attribute, preposition, counting, or grammar)
#Highlight the word/phrase that contains the mistake
#Correct the chosen word/phrase
#Evaluate the quality of the caption after correction (it could be bad even after one round of correction)

Figure 3 shows the statistics of the evaluations before and after one round of the correction task. The authors acknowledge the costliness of the second round of annotation.

[[File:ham1.png|660px|center|thumb|Figure 3: Caption quality evaluation by the human annotators. Plot on the left shows evaluation for captions generated with the reference model (MLE). The right plot shows evaluation of the human-corrected captions (after completing at least one round of feedback).]]

=== Feedback Network ===

The collected feedback provides a strong supervisory signal which can be used in the RL framework. In particular, the authors design a neural network (feedback network or FBN) which provides an additional reward based on the feedback sentence.

RL training will require us to generate samples (captions) from the model. Thus, during training, the sampled captions for each training image will differ from the reference maximum likelihood estimation (MLE) caption for which the feedback is provided. The goal of the feedback network is to read a newly sampled caption, and judge the correctness of each phrase conditioned on the feedback. This network performs the following computations:

[[File:fbn.JPG|550px|right|thumb|Figure 4: The architecture of the feedback network (FBN) that classifies each phrase in a sampled sentence (top left) as either correct, wrong or not relevant, by conditioning on the feedback sentence.]]

$$
h_t^{caption} = f_{sent}(h_{t-1}^{caption}, \omega_t^c) \\
h_t^{feedback} = f_{sent}(h_{t-1}^{feedback}, \omega_t^f) \\
q_i = f_{phrase}(\omega_{i,1}^c, \omega_{i,2}^c, \dots, \omega_{i,N}^c) \\
o_i = f_{fbn}(h_T^{caption}, h_T^{feedback}, q_i, m)
$$

Here, $\omega_t^c$ and $\omega_t^f$ denote the one-hot encoding of words in the sampled caption and feedback sentence for the t-th phrase, respectively. FBN encodes both the caption and feedback using an LSTM ($f_{sent}$), performs mean pooling ($f_{phrase}$) over the words in a phrase to represent the phrase i with $q_i$, and passes this information through a 3-layer MLP ($f_{fbn}$). The MLP accepts additional information about the mistake type (e.g., wrong object/action) encoded as a one hot vector m.
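
A rough sketch of the classification head $f_{fbn}$ is given below. It only covers the final MLP: the two LSTM encodings and the phrase vector are replaced by random vectors, and the layer sizes, the ReLU activations, and the random weights are all assumptions that are not taken from the paper.

```python
import numpy as np

LABELS = ("correct", "wrong", "not relevant")

def fbn_head(h_caption, h_feedback, q_i, m, weights):
    """Classify one phrase from the concatenated FBN inputs with a 3-layer MLP."""
    x = np.concatenate([h_caption, h_feedback, q_i, m])
    (W1, b1), (W2, b2), (W3, b3) = weights
    x = np.maximum(0.0, W1 @ x + b1)       # first layer (ReLU is an assumption)
    x = np.maximum(0.0, W2 @ x + b2)       # second layer
    logits = W3 @ x + b3                   # third layer: one logit per label
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

rng = np.random.default_rng(0)
d_h, d_q, n_types, d_hidden = 8, 4, 6, 16  # hypothetical sizes; 6 mistake types
d_in = 2 * d_h + d_q + n_types
weights = [
    (rng.normal(size=(d_hidden, d_in)), np.zeros(d_hidden)),
    (rng.normal(size=(d_hidden, d_hidden)), np.zeros(d_hidden)),
    (rng.normal(size=(len(LABELS), d_hidden)), np.zeros(len(LABELS))),
]

m = np.eye(n_types)[0]                     # one-hot mistake type, e.g. "object"
probs = fbn_head(rng.normal(size=d_h), rng.normal(size=d_h),
                 rng.normal(size=d_q), m, weights)
predicted = LABELS[int(np.argmax(probs))]
```

The head is applied once per phrase $q_i$, while the caption and feedback encodings are shared across all phrases of the same sampled caption.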

=== Policy Gradient Optimization using Natural Language Feedback ===

One can think of a caption decoder as an agent following a parameterized policy $p_\theta$ that selects an action at each time step. An “action” in this case is the choice of a word from the vocabulary (for the word RNN) or of a phrase label (for the phrase RNN). The objective for learning the parameters of the model is the expected reward received when completing the caption $\omega^s = (\omega^s_1, \dots ,\omega^s_T)$. Here, $\omega_t^s$ is the word sampled from the model at time step t.

$$
L(\theta) = -\mathop{{}\mathbb{E}}_{\omega^s \sim p_\theta}[r(\omega^s)]
$$
Such an objective function is non-differentiable, since the reward is computed on discrete sampled words. Thus policy gradients are used as in [13] to find the gradient of the objective function:
$$
\nabla_\theta L(\theta) = - \mathop{{}\mathbb{E}}_{\omega^s \sim p_\theta}[r(\omega^s)\nabla_\theta \log p_\theta(\omega^s)]
$$
This expectation is estimated using a single Monte-Carlo sample:
$$
\nabla_\theta L(\theta) \approx - r(\omega^s)\nabla_\theta \log p_\theta(\omega^s)
$$
Then a baseline $b = r(\hat \omega)$ is used. A baseline does not change the expected gradient but can drastically reduce the variance.
$$
\hat{\omega}_t = \arg\max_{\omega_t} \ p(\omega_t \mid h_t) \\
\nabla_\theta L(\theta) \approx - (r(\omega^s) - r(\hat{\omega}))\nabla_\theta \log p_\theta(\omega^s)
$$
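
The baselined estimator can be sketched for a toy one-step policy, where $p_\theta$ is simply a softmax over four actions. The reward values and the function name <code>pg_estimate</code> are hypothetical; the real model samples whole captions rather than single actions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pg_estimate(theta, reward, sampled):
    """Single-sample estimate of grad L(theta) with a greedy-decoding baseline."""
    p = softmax(theta)
    greedy = int(np.argmax(p))             # omega-hat: the greedy output
    grad_log_p = -p                        # gradient of log p_theta(sampled) w.r.t. theta ...
    grad_log_p[sampled] += 1.0             # ... equals one_hot(sampled) - p for a softmax
    advantage = reward[sampled] - reward[greedy]
    return -advantage * grad_log_p

theta = np.zeros(4)
reward = np.array([0.1, 0.9, 0.3, 0.5])    # hypothetical rewards r(omega)
p = softmax(theta)

# Averaging the estimate over all actions recovers the exact gradient of
# L(theta) = -E[r], so the baseline leaves the expected gradient unchanged.
expected = sum(p[a] * pg_estimate(theta, reward, a) for a in range(len(p)))
exact = -p * (reward - p @ reward)
```

When the sampled action coincides with the greedy one, the advantage is zero and no update is made, so the model is only pushed towards samples that beat its own greedy output and away from samples that do worse.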
'''Reward:''' A sentence reward is defined as a weighted sum of the BLEU scores. BLEU is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. Quality is taken to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". This is the central idea behind BLEU. It was also one of the first metrics to claim a high correlation with human judgements of quality [10, 11, 12] and remains one of the most popular automated and inexpensive metrics (more information about the BLEU score is available [http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf here], and a discussion of it [https://www.youtube.com/watch?v=ORHVgR-DVGg here]).

$$
r(\omega^s) = \beta \sum_i \lambda_i \cdot BLEU_i(\omega^s, ref)
$$
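
The shape of this reward can be sketched as follows. Note the simplification: clipped n-gram precision against a single reference is used as a stand-in for $BLEU_i$ (no brevity penalty, no multiple references), and the weights $\lambda_i$ and the example sentences are made up.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, used here as a simplified stand-in for BLEU_n."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def sentence_reward(candidate, reference, lambdas, beta=1.0):
    """r = beta * sum_i lambda_i * BLEU_i(candidate, reference)."""
    return beta * sum(
        lam * ngram_precision(candidate, reference, n)
        for n, lam in enumerate(lambdas, start=1)
    )

candidate = "a cat sits on a mat".split()
reference = "a cat is on a mat".split()

# beta = 0.8 corresponds to a reference caption rated "acceptable".
reward = sentence_reward(candidate, reference, lambdas=[0.5, 0.5], beta=0.8)
```

In this example 5 of 6 unigrams and 3 of 5 bigrams match, so the reward is 0.8 times the average of the two precisions.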

As reference captions to compute the reward, the authors use either the captions generated by a snapshot of the model that were evaluated as having neither minor nor major errors, or the ground-truth captions. In addition, they weight the reward by the caption quality as provided by the annotators (e.g. $\beta = 1$ for perfect and $\beta = 0.8$ for acceptable). They further incorporate the reward provided by the feedback network:
$$
r(\omega_t^p) = r(\omega^s) + \lambda_f f_{fbn}(\omega^s, feedback, \omega_t^p)
$$
where $\omega^p_t$ denotes the sequence of words in the t-th phrase. Note that the FBN produces a classification of each phrase. This can be converted into a reward by assigning 1 to correct, -1 to wrong, and 0 to not relevant. The final gradient therefore takes the following form:
$$
\nabla_\theta L(\theta) = - \sum_{p=1}^{P}(r(\omega^p) - r(\hat{\omega}^p))\nabla_\theta \log p_\theta(\omega^p)
$$
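
The conversion from FBN labels to phrase-level rewards can be written out directly. The sentence reward, the value of $\lambda_f$, and the labels below are illustrative values, not numbers from the paper.

```python
# Mapping from the FBN classification of a phrase to a scalar reward.
FBN_REWARD = {"correct": 1.0, "wrong": -1.0, "not relevant": 0.0}

def phrase_rewards(sentence_reward, fbn_labels, lambda_f):
    """r(omega_t^p) = r(omega^s) + lambda_f * f_fbn(...) for every phrase t."""
    return [sentence_reward + lambda_f * FBN_REWARD[label] for label in fbn_labels]

# A sampled caption with three phrases and a sentence-level reward of 0.6.
rewards = phrase_rewards(0.6, ["correct", "wrong", "not relevant"], lambda_f=0.5)
```

Each phrase thus shares the sentence-level reward but is shifted up or down by the feedback network's judgement, so the gradient can credit or penalize individual phrases rather than the whole caption.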

= Experimental Results =
The authors used the MS-COCO dataset [9]. COCO is a large-scale object detection, segmentation, and captioning dataset. Its features include object segmentation, recognition in context, superpixel stuff segmentation, 330K images (more than 200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, and 250,000 people with keypoints. They used 82K images for training, 2K for validation, and 4K for testing. To collect feedback, they randomly chose 7K images from the training set, as well as all 2K images from validation. In addition, they use a word vocabulary of size 23,115.

=== Phrase-based captioning model ===
The authors analyze different instantiations of their phrase-based captioning model in the following table. As a sanity check, they compare it to a flat approach (word-RNN only). Overall, their model performs slightly worse (by 0.66 points). However, its main strength is that it allows a more natural integration with feedback.

[[File:table2.JPG|center]]

=== Feedback network ===
The authors use 9000 images to collect feedback; 5150 of them are evaluated as containing errors. Finally, they use 4174 images for the second round of annotation. They randomly select 9/10 of them to serve as the training set for the feedback network and the remaining 1/10 as the test set. The model achieves its highest accuracy of 74.66% when it is given the kind of mistake the reference caption had (e.g. an object, action, etc.). This is not particularly surprising, as this setting requires the most additional information to train the model and the most time to compile the dataset.

=== RL with Natural Language Feedback ===
The following table reports the performance of several instantiations of the RL models. All models have been pre-trained using cross-entropy loss (MLE) on the full MS-COCO training set. For the next rounds of training, all the models are trained only on the 9K images.

The authors define “C” captions as all captions that were corrected by the annotators and were not evaluated as containing a minor or major error, with ground-truth (GT) captions used for the rest of the images. For “A”, they use all captions (including captions which were evaluated as correct) that did not have minor or major errors, and GT for the rest. A detailed breakdown of these captions is reported in the following table. The authors test their model in two separate cases:

*They first test a model using the standard cross-entropy loss, but which now also has access to the corrected captions in addition to the 5 GT captions. This model (MLEC) improves over the original MLE model by 1.4 points. They then test the RL model by optimizing the metric with respect to the 5 GT captions. This brings an additional point, for a total gain of 2.4 over the MLE model. Next, the RL agent is given access to 3 GT captions, the “C” captions, and the feedback sentences. This model outperforms the no-feedback baseline by 0.5 points. If the RL agent has access to 4 GT captions and feedback descriptions, it gains a total of 1.1 points over the baseline RL model and 3.5 over the MLE model.

*They also test a more realistic scenario, in which the models have access to either a single GT caption, “C” (or “A”), and feedback. This mimics a scenario in which the human teacher observes the agent and either gives feedback about the agent’s mistakes, or, if the agent’s caption is completely wrong, the teacher writes a new caption. Interestingly, RL when provided with the corrected captions performs better than when given GT captions. Overall, their model outperforms the base RL (no feedback) by 1.2 points.

[[File:table3.PNG|center]]

These experiments make an important point. Instead of giving the RL agent a completely new target (caption), a better strategy is to “teach” the agent about the mistakes it is making and suggest a correction. This is intuitive: informing the agent of its error conveys more information than teaching it a completely correct answer, because the latter forces the network to "train" its memory from a sample which is, at least seemingly, insulated from its prior memory.

= Conclusion =
In this paper, a human teacher is enabled to provide feedback to the learning agent in the form of natural language. The authors focused on the problem of image captioning. They proposed a hierarchical phrase-based RNN as the captioning model, which allowed natural integration with human feedback.
They also crowd-sourced feedback, and showed how to incorporate it in policy gradient optimization.

= Comments =
In the hierarchical phrase-based RNN, human involvement is a key part of improving the performance of the network. According to this paper, the feedback LSTM network is capable of handling simple sentences. What if the feedback is weak or even ambiguous? Is there a threshold for the feedback such that the network can reject wrong feedback? Following this architecture, it would be interesting to see whether such a feedback strategy can be applied in machine translation.

= References =
[1] Huan Ling and Sanja Fidler. Teaching Machines to Describe Images via Natural Language Feedback. In arXiv:1706.00130, 2017.

[2] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L. Isbell, and Andrea Lockerd Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In NIPS, 2013.

[3] K. Judah, S. Roy, A. Fern, and T. Dietterich. Reinforcement learning via practice and critique advice. In AAAI, 2010.

[4] A. Thomaz and C. Breazeal. Reinforcement learning with human teachers: Evidence of feedback and guidance. In AAAI, 2006.

[5] Richard Maclin and Jude W. Shavlik. Incorporating advice into agents that learn from reinforcements. In National Conference on Artificial Intelligence, pages 694–699, 1994.

[6] Jason Weston. Dialog-based language learning. In arXiv:1604.06045, 2016.

[7] Russell Kaplan, Christopher Sauer, and Alexander Sosa. Beating atari with natural language guided reinforcement learning. In arXiv:1704.05539, 2017.

[8] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, pages 278–287, 1999.

[9] COCO Dataset http://cocodataset.org/#home

[10] https://en.wikipedia.org/wiki/BLEU

[11] Papineni, K., Roukos, S., Ward, T., Henderson, J and Reeder, F. (2002). “Corpus-based Comprehensive and Diagnostic MT Evaluation: Initial Arabic, Chinese, French, and Spanish Results” in Proceedings of Human Language Technology 2002, San Diego, pp. 132–137

[12] Callison-Burch, C., Osborne, M. and Koehn, P. (2006) "Re-evaluating the Role of BLEU in Machine Translation Research" in 11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006 pp. 249–256

[13] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, Kevin Murphy. "Improved Image Captioning via Policy Gradient optimization of SPIDEr". Under review for ICCV 2017.

[14] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille. "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)". arXiv:1412.6632

File:teaching 1.PNG

2017-11-16T01:56:41Z

A2prasad:

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

2017-11-16T01:10:09Z

A2prasad: /* Remarks */

= Introduction =

The study of natural language processing has been around for more than fifty years. It began in the 1950s, when the specific field of natural language processing (NLP) was still embedded in the subject of linguistics (Hirschberg & Manning, 2015). After the emergence of strong computational power, computational linguistics began to evolve and gradually branched out into various applications in NLP, such as text classification, speech recognition and question answering (Brownlee, 2017). Computational linguistics, or natural language processing, is usually defined as a “subfield of computer science concerned with using computational techniques to learn, understand, and produce human language content” (Hirschberg & Manning, 2015, p. 261).

With the development of deep neural networks, one type of neural network, namely the recurrent neural network (RNN), has performed very well in many natural language processing tasks. The reason is that an RNN takes into account past inputs as well as the current input, and gated variants such as the LSTM do so while largely mitigating the vanishing gradient problem. More detail on how RNNs work in the context of NLP is given in the section on NLP using RNN. However, one limitation of RNNs in NLP is the enormous size of the input vocabulary. This results in a very complex RNN model with too many parameters to train, which makes the training process both time- and memory-consuming. This is the major motivation for the paper’s authors to develop a new technique for RNNs that is particularly efficient at processing large vocabularies in many NLP tasks, namely LightRNN.

= Motivations =

In language modelling, researchers used to represent words by arbitrary codes, e.g. “Id143” as the code for “dog” (“Vector Representations of Words,” 2017). Such a coding of words is completely arbitrary, and it loses the meaning of the words and (more importantly) their connection with other words. Nowadays, the one-hot representation of words is commonly used, in which a word is represented by a vector whose dimension equals the size of the vocabulary. In an RNN, all words in the vocabulary are coded using the one-hot representation and then mapped to an embedding vector (Li, Qin, Yang, Hu, & Liu, 2016). Such an embedding lives in “a continuous vector space where semantically similar words are mapped to nearby points” (“Vector Representations of Words” 2017, para. 6). A popular RNN structure used in NLP tasks is the long short-term memory (LSTM). In order to predict the next word, the last layer of the network needs to calculate a probability distribution over all words in the vocabulary. This is the most time-consuming operation in RNNs, since it requires multiplying the output-embedding matrix by the hidden state at each position of a sequence. Lastly, an activation function (commonly the softmax function) is applied, and the word with the highest probability is selected as the next word.
This method has 3 major limitations:

1. Memory Constraint

When the input vocabulary contains an enormous number of unique words, which is very common in NLP tasks, the model becomes very large. The number of trainable parameters is then so large that it is difficult to fit the model on a regular GPU device.

2. Computationally Heavy to Train

As previously mentioned, the probability distribution over all words in the vocabulary needs to be computed to determine the predicted word. When the vocabulary is large, this calculation is computationally heavy.

3. Low Compressibility

Due to the memory- and computation-intensive nature of RNNs applied to NLP tasks, mobile devices usually cannot handle such models, which limits their usage.

Previous work has focused on reducing the computational complexity of the softmax layer. By building a hierarchical binary tree in which each leaf stands for a word, the time complexity is reduced to $O(\log(|V|))$. However, the space complexity remains the same. In addition, some techniques, such as character-level convolution filters, reduce the model size by shrinking the input-embedding matrix, but bring no improvement in terms of speed.

An alternative approach to handling the overhead is to leverage weak lower-level learners through boosting. A drawback is that this technique has so far been implemented only for a few specific tasks, such as time series prediction [Boné et al.].

= LightRNN Structure =

The authors of the paper proposed a new structure that effectively reduces the size of the model by arranging all words in the vocabulary into a word table, which is referred to as “2-Component (2C) shared embedding for word representation”. This is done by factorizing a vocabulary's embedding into two shared components (row and column). Thus, a word is indexed by its location in the table, which in turn is characterized by the corresponding row and column components. Each row component is a unique row vector, and each column component is a unique column vector. By organizing the vocabulary in this manner, multiple words can share the same row component or column component, which reduces the number of trainable parameters significantly.
The next question is how to construct such a word table. More specifically, how should each word in the vocabulary be allocated to a position so that semantically similar words are in the same row or column? The authors proposed a bootstrap method to solve this problem. Essentially, we first randomly distribute the words into the table. Then, we let the model “learn” a better position for each word by minimizing the training error. By repeating this process, each word can be allocated to a particular position within the table so that similar words share common row or column components. More details of these 2 parts of the LightRNN structure are discussed in the following sections.

There are 2 major benefits of the proposed technique:

1. Computationally efficient

The name “LightRNN” reflects the small model size and fast training speed. Because of these features of the new RNN architecture, it is possible to deploy such a model on a regular GPU or even on mobile devices.

2. Higher scalability

The authors briefly explain that this algorithm is scalable: if parallel computing is needed to train such a model, combining the smaller models is not difficult.

== Part I: 2-Component Shared Embedding ==

The key aspect of the LightRNN structure is its innovative method of word representation, namely 2-Component Shared Embedding. All words in the vocabulary are organized into a table with row components and column components. Each pair of one row component and one column component corresponds to a unique word in the vocabulary. For instance, the <math>i^{th}</math> row and <math>j^{th}</math> column are the row and column indexes for <math>X_{ij}</math>. As shown in the following graph, <math>x_{1}</math> corresponds to the word “January”. In the 2C shared embedding table, it is indexed by 2 elements, <math>x^{r}_{1}</math> and <math>x^{c}_{1}</math>, where the subscript indicates which row component and column component this word belongs to. Ideally, words that share similar semantic features should be assigned to the same row or column. The shared embedding word table in Figure 1 serves as a good example: the words “one” and “January” are assigned to the same column, while the words “one” and “two” are allocated to the same row.

[[File:2C shared embedding.png|700px|thumb|centre|Fig 1. 2-Component Shared Embedding for Word Representation]]

The main advantage of this word representation is that it reduces the number of vectors needed for input word embedding. For instance, if there are 25 unique words in the vocabulary, the shared embedding word table is a 5 by 5 matrix, so the number of vectors needed to represent all the words is 10, namely 5 row vectors and 5 column vectors. In general, the number of vectors needed to represent <math>|V|</math> words is <math>2\sqrt{|V|}</math>.
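The count above can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and rounding the table side up to the next integer (for vocabularies that are not perfect squares) and filling the table row by row are assumptions made for illustration.

```python
import math

def table_side(vocab_size):
    """Smallest side s such that an s-by-s table can hold every word (assumed rounding)."""
    return math.isqrt(vocab_size - 1) + 1 if vocab_size > 0 else 0

def shared_vector_count(vocab_size):
    """Vectors needed with 2C shared embedding: one per row plus one per column."""
    return 2 * table_side(vocab_size)

def word_position(word_id, vocab_size):
    """(row, column) of a word id when the table is filled row by row (assumed layout)."""
    return divmod(word_id, table_side(vocab_size))

for v in (25, 10_000, 1_000_000):
    print(v, "words ->", shared_vector_count(v), "shared vectors instead of", v)
```

For the 25-word example this gives 10 vectors, and word id 7 sits at row 1, column 2 of the 5 by 5 table.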

== Part II: How 2C Shared Embedding is Used in LightRNN ==

After constructing the word representation table, the 2-component shared embedding matrices are fed into the recurrent neural network. Figure 2 shows a portion of the LightRNN structure (left) in comparison with a regular RNN (right). Whereas a regular RNN is fed a single input <math>x_{t-1}</math> at each step, LightRNN is fed the 2 elements of that input: <math>x^{r}_{t-1}</math> and <math>x^{c}_{t-1}</math>.

[[File:LightRNN.PNG |700px|thumb|centre|Fig 2. LightRNN Structure & Regular RNN]]

As mentioned before, the last hidden layer will produce the probabilities of <math>word_{t}</math>. Based on the diagram above, the following formulas are used:
Let $n$ be the dimension of a row or column input vector, and let <math>X^{c}, X^{r} \in \mathbb{R}^{n \times \sqrt{|V|}}</math> denote the input-embedding matrices:
<center>
: row vector <math>x^{r}_{t-1} \in \mathbb{R}^n</math>
: column vector <math>x^{c}_{t-1} \in \mathbb{R}^n</math>
</center>

Let <math>h^{c}_{t-1}, h^{r}_{t-1} \in \mathbb{R}^m</math> denote the two hidden layers, where m is the dimension of the hidden layer:
<center>
: <math>h^{c}_{t-1} = f(W x_{t-1}^{c} + U h_{t-1}^{r} + b) </math>
: <math>h^{r}_{t} = f(W x_{t}^{r} + U h_{t-1}^{c} + b) </math>
</center>
where <math>W \in \mathbb{R}^{m \times n}</math>, <math>U \in \mathbb{R}^{m \times m}</math>, and <math>b \in \mathbb{R}^m</math> and <math>f</math> is a nonlinear activation function
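The two hidden-state updates can be traced by hand on a toy example. The sketch below assumes $f = \tanh$ and uses made-up values for $W$, $U$, $b$ and the input vectors; it illustrates the alternation between column and row inputs only, not the LSTM-based model used in the paper's experiments.

```python
import math

def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cell(W, U, b, x, h_prev):
    """One application of h = f(W x + U h_prev + b), with f = tanh (assumed)."""
    Wx, Uh = matvec(W, x), matvec(U, h_prev)
    return [math.tanh(a + c + d) for a, c, d in zip(Wx, Uh, b)]

# Toy sizes: n = 2 (embedding dimension), m = 2 (hidden dimension); all values are hypothetical.
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.4, 0.0], [0.0, 0.4]]
b = [0.0, 0.1]

x_c_prev = [1.0, 0.0]  # column vector of word t-1
x_r_next = [0.0, 1.0]  # row vector of word t
h_r_prev = [0.0, 0.0]  # h^r_{t-1}

h_c_prev = cell(W, U, b, x_c_prev, h_r_prev)  # h^c_{t-1}, used to predict the row of word t
h_r_next = cell(W, U, b, x_r_next, h_c_prev)  # h^r_t, used to predict the column of word t
print(h_c_prev, h_r_next)
```

Note that the same parameters $W$, $U$ and $b$ are applied at both half-steps; only the input (column vector or row vector) and the incoming hidden state change.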

The final step in LightRNN is to calculate <math>P_{r}(w_{t})</math> and <math>P_{c}(w_{t})</math>, the probabilities of the row and the column of the word w at time t, using the following formulas:
<center>
: <math>P_{r}(w_t) = \frac{exp(h_{t-1}^{c} y_{r(w)}^{r})}{\sum\nolimits_{i \in S_r} exp(h_{t-1}^{c} y_{i}^{r}) }</math>
: <math>P_{c}(w_t) = \frac{exp(h_{t}^{r} y_{c(w)}^{c})}{\sum\nolimits_{i \in S_c} exp(h_{t}^{r} y_{i}^{c}) }</math>
: <math> P(w_t) = P_{r}(w_t) P_{c}(w_t) </math>
</center>
where
<center>
:<math> r(w) </math> = row index of word w
:<math> c(w) </math> = column index of word w
:<math> y_{i}^{r} \in \mathbb{R}^m </math> = i-th vector of <math> Y^r \in \mathbb{R}^{m \times \sqrt{|V|}}</math>
:<math> y_{i}^{c} \in \mathbb{R}^m </math> = i-th vector of <math> Y^c \in \mathbb{R}^{m \times \sqrt{|V|}}</math>
:<math> S_r </math> = the set of rows of the word table
:<math> S_c </math> = the set of columns of the word table
</center>

We can see that by using the above equations, we effectively reduce the computation of the probability of the next word from a $|V|$-way normalization (in standard RNN models) to two $\sqrt {|V|}$-way normalizations. Note that we do not see the t-th word before predicting it. So in the above diagram, given the input column vector <math>x^c_{t-1} </math> of the (t-1)-th word, we first infer the row probability <math>P_r(w_t)</math> of the t-th word, and then choose the index of the row with the largest probability in <math>P_r(w_t)</math> to look up the next input row vector <math>x^r_{t} </math>. Similarly, we can infer the column probability <math>P_c(w_t)</math> of the t-th word.

Essentially, in LightRNN, the word at time t (<math> w_t </math>) is predicted from the word at time t-1 (<math> w_{t-1} </math>) by selecting the indexes <math> r </math> and <math> c </math> with the highest probabilities <math> P_{r}(w_t) </math> and <math> P_{c}(w_t) </math>. The probability of each word is the product of <math> P_{r}(w_t) </math> and <math> P_{c}(w_t) </math>.
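The factorized probability can be sketched as two small softmaxes. The code below is an illustration under stated assumptions: the output embeddings are stored as lists of vectors (<code>Y_r[i]</code> and <code>Y_c[j]</code>), all numeric values are made up, and <code>h_r</code> is taken to be the hidden state obtained after feeding the row vector of the chosen row.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def word_probability(h_c_prev, h_r, Y_r, Y_c, row, col):
    """P(w_t) = P_r(w_t) * P_c(w_t) for the word at table position (row, col)."""
    p_row = softmax([dot(h_c_prev, y) for y in Y_r])  # sqrt(|V|)-way normalization
    p_col = softmax([dot(h_r, y) for y in Y_c])       # sqrt(|V|)-way normalization
    return p_row[row] * p_col[col]

# Hypothetical 2-by-2 table (|V| = 4) with hidden dimension m = 2.
Y_r = [[0.2, -0.1], [0.0, 0.3]]
Y_c = [[0.1, 0.1], [-0.2, 0.4]]
h_c_prev = [0.5, 0.2]
h_r = [-0.3, 0.6]

p = word_probability(h_c_prev, h_r, Y_r, Y_c, row=1, col=0)
print(p)
```

Each call normalizes over $\sqrt{|V|}$ rows and $\sqrt{|V|}$ columns rather than over all $|V|$ words, which is where the saving in computation comes from.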

== Part III: Bootstrap for Word Allocation ==

As mentioned before, the major innovative aspect of LightRNN is the development of 2-component shared embedding, which is used to build a recurrent neural network called LightRNN. However, how the word table should be constructed is the key to building a successful LightRNN model. In this section, the procedure for constructing the 2C shared embedding structure is explained.
The fundamental idea is to use a bootstrap method that minimizes a loss function (namely, the negative log-likelihood function). The detailed procedure is as follows:

Step 1: First, all words in a vocabulary are randomly assigned to individual position within the word table

Step 2: Train LightRNN model based on word table produced in step 1 until certain criteria are met

Step 3: Holding fixed the input and output embedding matrices learned in step 2 (<math>X^r, X^c, Y^r, Y^c</math>), adjust the position of the words by minimizing the loss function over all the words. Then, repeat from step 2

The authors define the overall loss as the negative log-likelihood (NLL), which decomposes into a sum of per-word terms as follows:
<center>
<math> NLL = \sum\limits_{t=1}^T -\log P(w_t) = \sum\limits_{t=1}^T -\log[P_{r}(w_t) P_{c}(w_t)] = \sum\limits_{t=1}^T \left( -\log[P_{r}(w_t)] - \log[P_{c}(w_t)] \right) = \sum\limits_{w=1}^{|V|} NLL_w </math>
</center>
where <math> NLL_w </math> is the negative log-likelihood of a word w.

Since in 2-component shared embedding structure, a word (w) is represented by one row vector and one column vector, <math> NLL_w </math> can be rewritten as <math> l(w, r(w), c(w)) </math> where <math> r(w) </math> and <math> c(w) </math> are the position index of word w in the word table. Next, the authors defined 2 more terms to explain the meaning of <math> NLL_w </math>: <math> l_r(w,r(w)) </math> and <math> l_c(w,c(w)) </math>, namely the row component and column component of <math> l(w, r(w), c(w)) </math>. The above can be summarised by the following formulas:
<center>
<math> NLL_w = \sum\limits_{t \in S_w} -\log P(w_t) = l(w, r(w), c(w)) </math> 
<math> = \sum\limits_{t \in S_w} -\log P_r(w_t) + \sum\limits_{t \in S_w} -\log P_c(w_t) = l_r(w,r(w)) + l_c(w,c(w))</math> 
<math> = \sum\limits_{t \in S_w} -\log \left(\frac{\exp(h_{t-1}^{c} y_{r(w)}^{r})}{\sum\nolimits_{k} \exp(h_{t-1}^{c} y_{k}^{r})}\right) + \sum\limits_{t \in S_w} -\log \left(\frac{\exp(h_{t}^{r} y_{c(w)}^{c})}{\sum\nolimits_{k} \exp(h_{t}^{r} y_{k}^{c}) }\right) </math> 
where <math> S_w </math> is the set of all positions in the corpus at which word w occurs
</center>
In summary, the loss for moving word w to position [i, j] is obtained by replacing <math> r(w) </math> and <math> c(w) </math> with i and j: it is the sum of the row loss and the column loss at that position, <math> l(w, i, j) = l_r(w, i) + l_c(w, j)</math>. Thus, to update the table by reallocating each word, we look for the position [i, j] for each word w that minimizes the total loss, written mathematically as follows:
<center>
<math> \min\limits_{a} \sum\limits_{w,i,j} l(w,i,j)a(w,i,j) </math> such that 
<math> \sum\limits_{(i,j)} a(w,i,j) = 1 \; \forall w \in V, \quad \sum\limits_{w} a(w,i,j) = 1 \; \forall i \in S_r, j \in S_c</math> 
<math> a(w,i,j) \in \{0,1\}, \; \forall w \in V, i \in S_r, j \in S_c</math> 
where <math> a(w,i,j) =1 </math> indicates moving word w to position [i, j]
</center>

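After calculating $l(w, i, j)$ for all possible $w, i, j$, the optimization above sets $a(w, i, j) = 1$ for the position assigned to word $w$ and 0 elsewhere. Because the second constraint allows only one word per position, words cannot each simply take their own lowest-loss position; this is an assignment (minimum-weight perfect matching) problem between words and table positions, whose solution gives the best joint placement of all words in the table.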
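The reallocation step can be illustrated on a tiny table. The sketch below solves the assignment problem by exhaustive search over permutations, which is feasible only for toy sizes and is not the solver used by the authors; the loss values and the flattening of position [i, j] into a single index are hypothetical.

```python
from itertools import permutations

def best_allocation(loss):
    """Return the one-to-one word -> position assignment with minimum total loss.

    loss[w][p] is the loss of placing word w at flattened position p.
    Brute force over all permutations: illustration only, factorial cost.
    """
    n = len(loss)
    best = min(permutations(range(n)),
               key=lambda perm: sum(loss[w][perm[w]] for w in range(n)))
    return list(best)

# Hypothetical losses for 4 words on a 2-by-2 table (position p = 2*i + j).
loss = [
    [1.0, 2.0, 9.0, 9.0],
    [1.5, 9.0, 9.0, 9.0],
    [9.0, 9.0, 1.0, 3.0],
    [9.0, 9.0, 2.0, 1.0],
]

allocation = best_allocation(loss)                                        # joint optimum
greedy = [min(range(4), key=lambda p: loss[w][p]) for w in range(4)]      # per-word argmin
print("matching:", [divmod(p, 2) for p in allocation])
print("per-word argmin:", [divmod(p, 2) for p in greedy])
```

In this example words 0 and 1 both prefer position [0, 0], so the per-word argmin is infeasible, while the matching moves word 0 to [0, 1] at a small extra cost.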

= LightRNN Example =

After describing the theoretical background of the LightRNN algorithm, the authors applied this method to 2 datasets (2013 ACL Workshop Morphological Language Dataset (ACLW) & One-Billion-Word Benchmark Dataset (BillionW)) and compared its performance with several other state-of-the-art RNN algorithms. The following table shows some summary statistics of those 2 datasets:

[[File:Table1YH.PNG|700px|thumb|centre|Table 1. Summary Statistics of Datasets]]

The goal of a probabilistic language model is either to compute the probability of a given sequence of words (e.g. <math> P(W) = P(w_1, w_2, \ldots , w_n)</math>) or to compute the probability of the next word given some previous words (e.g. <math> P(w_5 | w_1, w_2, w_3, w_4)</math>) (Jurafsky, 2017). In this paper, the evaluation metric for the performance of the LightRNN algorithm is perplexity <math> PPL </math>, which is defined as follows:
<center>
<math> PPL = exp(\frac{NLL}{T})</math> 
where T = number of tokens in the test set
</center>
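The definition can be checked numerically with a short sketch. The function name and the probability values below are made up for illustration; the input is the list of probabilities that a model assigns to each token of a test sequence.

```python
import math

def perplexity(token_probs):
    """PPL = exp(NLL / T), where NLL is the summed negative log-probability of T tokens."""
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))

uniform = perplexity([0.25] * 8)        # a model that always spreads mass over 4 words
confident = perplexity([0.5, 0.5])      # higher probability on the observed tokens
uncertain = perplexity([0.1, 0.1])      # lower probability on the observed tokens
print(uniform, confident, uncertain)
```

A model that assigns probability 1/4 to every observed token has perplexity 4, which matches the usual reading of perplexity as an effective number of equally likely choices.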

Based on the mathematical definition of PPL, a better-performing model has a lower perplexity.
The authors then trained “LSTM-based LightRNN using stochastic gradient descent with truncated backpropagation through time” (Li, Qin, Yang, Hu, & Liu, 2016). To begin with, the authors used the ACLW French dataset to determine the size of the embedding. From the results shown in Table 2, a larger embedding size corresponds to better accuracy (that is, lower perplexity). Therefore, they adopted an embedding size of 1000 for LightRNN when analyzing the ACLW datasets.

[[File:Table2YH.PNG|700px|thumb|centre|Table 2. Testing PPL of LightRNN on ACLW-French dataset w.r.t. embedding size]]

* Figure 3, taken from the official implementation's GitHub repository, shows the training process of LightRNN on the ACLW-French dataset.
[[File:ACLWFR.png|700px|thumb|centre|Figure 3. Training process on ACLW-French]]

'''Advantage 1: small model size'''

One of the major advantages of using LightRNN on NLP tasks is its significantly reduced model size, which means fewer parameters to estimate. The authors compared LightRNN with two other RNN algorithms and with a baseline language model using Kneser-Ney smoothing. The two RNN algorithms are HSM, which uses an LSTM RNN with hierarchical softmax for word prediction, and C-HSM, which uses both hierarchical softmax and character-level convolutional filters for input embedding. From the results table shown below, we can see that LightRNN has the lowest perplexity while keeping the model size significantly smaller than the other three algorithms.

[[File:Table5YH.PNG|700px|thumb|centre|Table 3. PPL Results in test set on ACLW datasets]]
Italic results are the previous state-of-the-art. #P denotes the number of parameters.

'''Advantage 2: high training efficiency'''

Another advantage of the LightRNN model is its shorter training time at the same level of perplexity compared to other RNN algorithms. Compared with both C-HSM and HSM (shown below in Table 4), LightRNN takes only half the runtime but achieves the same level of perplexity when applied to both the ACLW and BillionW datasets. The last column of Table 4 presents the time spent on word table reconstruction as a percentage of the total runtime. As we can see, word reallocation takes up only a very small proportion of the total runtime. Moreover, the resulting reconstructed word table is itself a valuable output, which is further explained in the next section.

[[File:Table3YH.PNG|700px|thumb|centre|Table 4. Runtime comparisons in order to achieve the HSMs’ baseline PPL]]

'''Advantage 3: semantically valid word allocation table'''

As explained in the previous section, LightRNN uses a word allocation table that is updated in every iteration of the algorithm. The optimal table assigns semantically similar words to the same row or column in order to reduce the number of parameters to estimate. Below is a snapshot of the reconstructed word table used in the LightRNN algorithm. In row 887, all URL addresses are grouped together, and in row 872 all verbs in the past tense are grouped together. As the authors explain in the paper, LightRNN does not assume that words are independent of each other but instead uses a shared embedding table. In this way, it reduces the model size by sharing embedding elements of the table, and it also uses the learned allocation to improve the efficiency of the algorithm.

[[File:Table6YH.PNG|700px|thumb|centre|Table 5. Sample Word Allocation Table]]

= Remarks =

In summary, the method proposed in this paper is mainly a new way of using word embeddings. Words with similar semantic meanings are embedded using similar vectors. Those vectors are then divided into row and column components, and similar words are grouped together by sharing row and column components in the word representation table. Thus, from a computational and application perspective, this paper makes two key contributions:

# Reduction in the size of the word embedding matrix.
# Reduction in the computation of word probabilities.

These two points ensure that one does not need hierarchical softmax or Monte Carlo estimations of the model's training cost.
This is indeed a dimensionality reduction, i.e. it uses the row and column "semantic vectors" to approximate the coded word. Because of this structural change in the input word embedding, the RNN model needs to adapt by having both row and column components fed into the network. However, the fundamental structure of the RNN model does not change. Therefore, personally, I would say it is a new word embedding technique rather than a new development in model construction. One major confusion I have when reading this paper is how the row and column components in the word allocation table are determined. From the paper itself, the authors did not explain how they are constructed.

Shared word embedding techniques of this kind are widely used in NLP. For instance, in language translation, similar words from different languages are grouped together so that the machine can translate sentences from one language to another. In Socher et al. (2013a), English and Chinese words are embedded in the same space so that we can find similar English (Chinese) words for Chinese (English) words (Zou, Socher, Cer, & Manning, 2013). Word2vec is also a commonly used technique for word embedding, which uses a two-layer neural network to transform text into numeric vectors where similar words have similar numeric values. The key feature of word2vec is that semantically similar words (which are now represented by numeric vectors) can be grouped together (“Word2vec,” n.d.; Bengio, Ducharme, & Vincent, 2001; Bengio, Ducharme, Vincent, & Jauvin, 2003).

An interesting area of further exploration proposed by the authors is an extension of this method to k-component shared embeddings where k>2. Words probably share similar semantic meanings in more than two dimensions, and this extension could reduce network size even further. However, it could also further complicate the bootstrapping phase of training.

Since no assumptions were made about the structure of the words, one could seek uses of this algorithm outside the context of natural language processing.

Code for LightRNN can be found on GitHub:

Official Implementation(CNTK): https://github.com/Microsoft/CNTK/tree/master/Examples/Text/LightRNN

Tensorflow : https://github.com/YisenWang/LightRNN-NIPS2016-Tensorflow_code

= Reference =
Bengio, Y, Ducharme, R., & Vincent, P. (2001). A Neural Probabilistic Language Model. In Journal of Machine Learning Research (Vol. 3, pp. 932–938). https://doi.org/10.1162/153244303322533223

Bengio, Yoshua, Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb), 1137–1155.

Brownlee, J. (2017, September 20). 7 Applications of Deep Learning for Natural Language Processing. Retrieved October 27, 2017, from https://machinelearningmastery.com/applications-of-deep-learning-for-natural-language-processing/

Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261–266. https://doi.org/10.1126/science.aaa8685

Jurafsky, D. (2017, January). Language Modeling Introduction to N grams. Presented at the CS 124: From Languages to Information, Stanford University. Retrieved from https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf

Li, X., Qin, T., Yang, J., Hu, X., & Liu, T. (2016). LightRNN: Memory and Computation-Efficient Recurrent Neural Networks. Advances in Neural Information Processing Systems 29, 4385–4393.

Recurrent Neural Networks. (n.d.). Retrieved October 8, 2017, from https://www.tensorflow.org/tutorials/recurrent

Vector Representations of Words. (2017, August 17). Retrieved October 8, 2017, from https://www.tensorflow.org/tutorials/word2vec

Word2vec. (n.d.). Retrieved October 26, 2017, from https://deeplearning4j.org/word2vec.html

Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1393–1398).

Kneser–Ney smoothing. (n.d.). Retrieved from https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing and http://www.foldl.me/2014/kneser-ney-smoothing/

Boné R., Assaad M., Crucianu M. (2003) Boosting Recurrent Neural Networks for Time Series Prediction. In: Pearson D.W., Steele N.C., Albrecht R.F. (eds) Artificial Neural Nets and Genetic Algorithms. Springer, Vienna

Imagination-Augmented Agents for Deep Reinforcement Learning

2017-11-14T19:10:37Z

A2prasad: /* A Training and the rollout policy distribution details */

=Introduction=
An interesting research area in reinforcement learning is developing intelligent agents for playing video games. Before the introduction of deep learning, video game agents were commonly coded using Monte-Carlo Tree Search (MCTS) over pre-set rules. MCTS is used for making optimal decisions in artificial intelligence problems by focusing the analysis on the most promising moves. Its basic steps are selection, expansion, simulation, and backpropagation. Recent research has shown deep reinforcement learning to be very successful at playing video games such as those on the Atari 2600. Specifically, the method (Figure 1), called Deep Q-Network (DQN), learns the optimal actions based on current observations (raw pixels) [[#Reference|[Mnih et al., (2015)]]]. However, there are some complex games where DQN fails to learn: some games require solving a sub-problem without explicit reward, or contain irreversible situations in which actions can be catastrophic. A typical example of these games is [https://en.wikipedia.org/wiki/Sokoban Sokoban]. As with human players, an RL model needs planning and inference to play it. This kind of game poses a challenge to RL.

[[File:DQN.png|800px|center|thumb|Figure 1: Deep Q-Learning Architecture]]

In reinforcement learning, algorithms can be divided into two categories: '''model-free''' and '''model-based'''. Model-based reinforcement learning tries to infer a model of the environment and uses it to gain reward, while model-free reinforcement learning learns which action yields the best reward without using a model of the environment. More specifically, model-based methods learn the model of the environment (the reward function $R(s, s^{'})$ and the transition probability $P(s^{'} | s, a)$, where $s'$, $s$ and $a$ are the next state, current state and action respectively), while model-free methods never explicitly learn it. DQN, mentioned above (Figure 1), is a model-free method. It takes raw pixels as input and maps them to values or actions. As a drawback, a large amount of training data is required. In addition, the policies do not generalize to new tasks in the same environment. A model-based method tries to build a model of the environment. By querying the model, agents can avoid irreversible, poor decisions. As an approximation of the environment, the model can enable better generalization across states. However, this approach has only shown success in limited settings, where an exact transition model is given, or in simple domains. In complex environments, model-based methods suffer from model errors introduced by function approximation. These errors compound during planning, causing poor agent performance. Currently, there is no model-based method that is robust against such imperfections.

In this paper, the authors introduce a novel deep reinforcement learning architecture called Imagination-Augmented Agents (I2As). This method enables agents to learn to interpret predictions from a learned environment model in order to construct implicit plans. It combines model-free and model-based aspects. The advantage of this method is that it learns, in an end-to-end way, to extract information from model simulations without making any assumptions about the structure or the accuracy of the environment model.
As shown in the results, this method outperforms DQN on the games Sokoban and MiniPacman. In addition, the experiments show that I2A is able to successfully use imperfect models.

=Motivation=
The capability to "imagine" and reason about the future is an important property of an intelligent and sophisticated RL algorithm. Beyond that, such an algorithm must be able to construct a plan using this knowledge. In a model-based approach, an "internal model" is used to analyze how actions lead to future outcomes in order to reason and plan. These internal models work well when the provided environments are "perfect", meaning that they have clearly defined rules which allow outcomes to be predicted very accurately in almost every circumstance. But the real world is complex: rules are not so clearly defined and unpredictable problems often arise. Even for the most intelligent agents, imagining in these complex environments is a long and costly process. Hence this paper puts forward the idea of combining the model-free and model-based approaches, using imagination augmentation, so that the agent can work in complex situations. Although the structure of this method is complex, the motivation is intuitive: since the agent suffers from irreversible decisions, trying actions out in simulated states may be helpful. To reduce the expensive search of traditional MCTS methods, decisions from a policy network are added to cut down the number of search steps. In order to keep context information, rollout results are encoded by an LSTM encoder. The final output combines the results from the model-free network and the model-based network.

=Related Work=
Some works try to apply deep learning to model-based reinforcement learning. The popular approach is to learn a neural network model of the environment and use it in classical planning algorithms. These works cannot handle the mismatch between the learned model and the ground truth. [[#Reference|[Liu et al.(2017)]]] use context information from trajectories, but in the setting of imitation learning.

To deal with imperfect models, [[#Reference|[Deisenroth and Rasmussen(2011)]]] try to capture model uncertainty by applying computationally expensive Gaussian Process models. To develop such a policy search method, they use analytic gradients of an approximation to the expected return for indirect policy search. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, this policy search method can cope with very little data and facilitates learning from scratch in only a few trials.

Similar ideas can be found in a study by [[#Reference|[Hamrick et al.(2017)]]]: they present a neural network that queries expert models, but focus only on meta-control for continuous contextual bandit problems. Pascanu et al.(2017) extend this work by focusing on explicit planning in sequential environments.

This paper claims to build upon the work of [[#Reference|[Tamar et al. (2016)]]]. In that line of work, neural networks whose architectures mimic classical iterative planning algorithms are presented. Such models are trained by reinforcement learning or to predict user-defined, high-level features. The authors of those works did not define any explicit environment model.

=Approach=
The summary of the architecture of I2A can be seen in Figure 2.
[[File:i2a.png|800px|center|thumb|Figure 2: The Architecture of I2A]]
The observation $O_t$ (Figure 2 right) is fed into two paths. The model-free path is simply a common DQN, which predicts the best action given $O_t$, whereas the model-based path performs rollouts. The aggregator combines the $n$ encoded rollout outputs ($n$ equals the number of actions in the action space) and forwards the result to the next layer. Together, the two paths are used to generate a policy function $\pi$ that outputs an action. In each rollout operation, the imagination core is used to predict the future state and reward.

===Imagination Core===
The imagination-augmented agents adopt a concept called the "imagination encoder", a neural network that learns to extract the information relevant to the agent's future decisions and to ignore irrelevant information. In particular, these agents have the following features: (i) they learn to interpret their internal simulations, which capture the environmental dynamics, (ii) they adapt to the number of imagined trajectories, which makes the imagination more efficient, and finally (iii) they learn different strategies to construct plans by choosing the appropriate trajectory. The imagination core (Figure 2 left) plays the key role in the model-based path. It consists of two parts: an environment model and a rollout policy. The former is an approximation of the environment, and the latter is used to simulate imagined trajectories, which are interpreted by a neural network and provided as additional context to a policy network.

====environment model====
In order to augment agents with imagination, the method relies on environment models that, given current information, can be queried to make predictions about the future. In this work, the environment model is built from action-conditional next-step predictors, which receive the current observation and the current action as input and predict the next observation and the next reward (Figure 3).
[[File:environment model.png|800px|center|thumb|Figure 3: Environment Model]]

The authors can either pretrain the environment model before embedding it (with frozen weights) within the I2A architecture or jointly train it with the agent by adding $l_{model}$ to the total loss as an auxiliary loss. In practice, they found that pre-training the environment model led to faster runtime of the I2A architecture, so they adopted this strategy.

====rollout policy====
A rollout is a simulated trajectory. In this work, one rollout is performed for each possible action in the environment.

A rollout policy $\hat \pi$ is a function that takes the current observation $O$ and outputs an action $a$ that potentially leads to maximal reward. In this architecture, the rollout policy can be a DQN network. In the experiments, the rollout policy $\hat \pi$ is broadcast and shared across rollouts. After experimenting with several types of rollout policy (random, pre-trained), the authors found that an efficient strategy is to distill the imagination-augmented policy into a model-free policy. This consists of creating a small model-free network $\hat \pi(O_t)$ and adding to the total loss a cross-entropy auxiliary loss between the imagination-augmented policy $\pi(O_t)$, as computed on the current observation, and the policy $\hat \pi(O_t)$, as computed on the same observation.

$$
l_{dist} (\pi, \hat \pi)(O_t) = \lambda_{dist} \sum_a \pi(a|O_t)log(\hat \pi(a|O_t))
$$
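A small numerical sketch of this auxiliary loss is given below. It writes the cross entropy with its conventional leading minus sign, so that minimizing it pulls $\hat \pi$ toward $\pi$; that sign convention, the function name, the default value of <code>lambda_dist</code>, and the example distributions are all assumptions of this sketch rather than details taken from the paper.

```python
import math

def distillation_loss(pi, pi_hat, lambda_dist=1.0):
    """Scaled cross entropy between the I2A policy pi and the rollout policy pi_hat.

    Both arguments are action distributions for one observation (lists summing to 1).
    The leading minus sign is the usual cross-entropy convention (assumed here).
    """
    return -lambda_dist * sum(p * math.log(q) for p, q in zip(pi, pi_hat) if p > 0)

pi = [0.7, 0.2, 0.1]              # hypothetical imagination-augmented policy
matched = distillation_loss(pi, pi)             # rollout policy identical to pi
uniform = distillation_loss(pi, [1 / 3] * 3)    # rollout policy that ignores pi
print(matched, uniform)
```

The loss is smallest when the rollout policy matches the imagination-augmented policy, which is the behaviour the distillation term is meant to encourage.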

Together, as the imagination core, these two parts produce $n$ trajectories $\hat \tau_1,...,\hat \tau_n$. Each imagined trajectory $\hat \tau$ is a sequence of features $(\hat f_{t+1},...,\hat f_{t+\tau})$, where $t$ is the current time, $\tau$ the length of the rollout, and $\hat f_{t+i}$ the output of the environment model (the predicted observation and reward). In order to remain robust to model imperfections, the architecture does not assume the learned model to be perfect, so the output does not depend solely on the predicted reward.
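The way the two parts interact can be sketched as a short loop. Everything here is a toy stand-in: the environment model and rollout policy are hand-written functions on integer observations, and the convention that each rollout starts with a different action and then follows the shared rollout policy is an assumption of the sketch.

```python
def imagine_rollouts(obs, actions, env_model, rollout_policy, length):
    """Produce one imagined trajectory per initial action.

    env_model(o, a) -> (next_observation, reward) plays the role of the learned model;
    rollout_policy(o) -> action is shared by all rollouts.
    Each trajectory is a list of features (predicted observation, predicted reward).
    """
    trajectories = []
    for first_action in actions:
        o, a, features = obs, first_action, []
        for _ in range(length):
            o, r = env_model(o, a)      # query the model instead of the real environment
            features.append((o, r))
            a = rollout_policy(o)       # later actions come from the rollout policy
        trajectories.append(features)
    return trajectories

def toy_env_model(o, a):
    """Hypothetical model: the observation is a counter, reward 1.0 on reaching 3."""
    next_o = o + a
    return next_o, float(next_o == 3)

def toy_rollout_policy(o):
    """Hypothetical policy: keep incrementing until the counter reaches 3."""
    return 1 if o < 3 else 0

trajs = imagine_rollouts(0, actions=[0, 1], env_model=toy_env_model,
                         rollout_policy=toy_rollout_policy, length=3)
print(trajs)
```

With two actions and rollout length 3, the sketch returns two trajectories of three features each, which are what the rollout encoder would consume.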

===Trajectories Encoder===
To preserve the sequence information in the trajectories, the architecture uses a rollout encoder $\varepsilon$ that processes the imagined rollout as a whole and learns to interpret it (Figure 2 middle). Each trajectory is encoded as a rollout embedding $e_i=\varepsilon(\hat \tau_i)$. Then, the aggregator $A$ combines the rollout embeddings into a single imagination code $c_{ia}=A(e_1,...,e_n)$ by simply concatenating all the summaries.
In the experiments, the encoder is an LSTM that takes the predicted output of the environment model as input. One observation is that the order of the sequence $\hat f_{t+1}$ to $\hat f_{t+\tau}$ has relatively little impact on performance. The encoder mimics the Bellman-type backup operations in DQN.
An alternative would be to combine guided policy search with the linear quadratic regulator[7], which is also a joint model-free and model-based trajectory update mechanism for reinforcement learning.

===Model-Free Path===
The model-free path contains a network that takes only the current observation as input and outputs the potentially optimal action. This network can be the same as the one in the imagination core.

In conclusion, the I2A learns to combine information from the two paths. Without the model-based path, I2A simply reduces to a standard model-free network (such as A3C, more explanations [https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2 here]). The imperfect approximation results in a rollout policy with higher entropy, potentially striking a balance between exploration and exploitation.

=Experiments=
The following experiments were run on the Sokoban and MiniPacman games. All results are averages over the top three agents. These agents were trained over 32 to 64 workers, and the network was optimized with RMSprop.
For pre-training, the training data for the environment model was generated from trajectories of a partially trained standard model-free agent. This data is also counted against the budget: the total number of frames needed for pre-training is included in the reported totals. The authors also show that the environment model can be reused to solve multiple tasks in the same environment.

In Sokoban, the environment is a 10 x 10 grid world. All agents were trained directly on raw pixels (image size 80 x 80 with 3 channels). To make sure the network does not simply "memorize" all states, the game procedurally generates a new level each episode. Out of 40 million levels generated, less than 0.7% were repeated. Therefore, a good agent should be able to solve unseen levels as well.

The reward settings for reinforcement learning algorithms are as follows:
* Every time step, a penalty of -0.1 is applied to the agent (to encourage agents to finish levels faster).
* Whenever the agent pushes a box onto a target, it receives a reward of +1 (to encourage agents to push boxes onto targets).
* Whenever the agent pushes a box off a target, it receives a penalty of -1 (to avoid the artificial reward loop that would be induced by repeatedly pushing a box off and on a target).
* Finishing the level gives the agent a reward of +10 and the level terminates (to strongly reward solving a level).
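
The reward rules above can be written as a small function. This is a sketch under assumptions: the function name and its arguments (counts of boxes on targets before and after the move) are hypothetical, and whether the step penalty is also applied on the final step is not stated in this summary, so it is applied here for simplicity.

```python
STEP_PENALTY = -0.1
BOX_ON_TARGET = 1.0
BOX_OFF_TARGET = -1.0
LEVEL_SOLVED = 10.0

def sokoban_reward(on_target_before, on_target_after, num_boxes):
    """Reward for one time step and whether the level is finished.

    on_target_before / on_target_after: number of boxes on targets before
    and after the move (a single move changes this count by at most one).
    """
    reward = STEP_PENALTY                    # applied every step (assumed on the last step too)
    delta = on_target_after - on_target_before
    if delta > 0:
        reward += BOX_ON_TARGET * delta      # box pushed onto a target
    elif delta < 0:
        reward += BOX_OFF_TARGET * (-delta)  # box pushed off a target
    done = on_target_after == num_boxes
    if done:
        reward += LEVEL_SOLVED               # bonus for solving the level
    return reward, done

# Pushing a box on and then off a target nets a negative total,
# so the agent cannot farm reward by repeating the two moves.
loop_total = sokoban_reward(1, 2, 4)[0] + sokoban_reward(2, 1, 4)[0]
```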

To show the advantage of I2A, the authors use a standard model-free architecture as a baseline. The architecture is a multi-layer convolutional neural network (CNN), taking the current observation $O_t$ as input, followed by a fully connected (FC) hidden layer. This FC layer feeds into two heads: an FC layer with one output per action computing the policy logits $\log \pi(a_t|O_t, \theta)$, and another FC layer with a single output that computes the value function $V(O_t; \theta_v)$.
* for MiniPacman: the CNN has two layers, both with 3x3 kernels, 16 output channels and strides 1 and 2; the following FC layer has 256 units
* for Sokoban: the CNN has three layers with kernel sizes 8x8, 4x4, 3x3, strides of 4, 2, 1 and number of output channels 32, 64, 64; the following FC has 512 units
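
As a worked example, the spatial size of the Sokoban feature maps can be traced through the three convolutional layers. The calculation assumes no padding and the usual floor-division output-size formula; neither is stated in this summary, so the resulting numbers are illustrative.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output width/height of a convolution (assumes zero padding by default)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 80                                        # Sokoban frames are 80 x 80
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # kernel sizes and strides listed above
    size = conv_out(size, kernel, stride)

flat_features = size * size * 64                 # 64 output channels in the last layer
```

Under these assumptions the maps shrink from 80 to 19, 8 and finally 6 pixels per side, so the 512-unit FC layer would receive 6 x 6 x 64 = 2304 inputs.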

===Sokoban===

Sokoban is a video game classified as a transport puzzle. The player, seen from an overhead view, must move boxes to their target locations. The boxes can only be pushed, and many moves become irreversible if the player does not plan them properly, which may render the puzzle unsolvable. The player is confined to the board and may move horizontally or vertically onto empty squares (never through walls or boxes). The player can also move into a box, which pushes it into the square beyond. Boxes may not be pushed into other boxes or walls, and they cannot be pulled. The number of boxes is equal to the number of storage locations. The puzzle is solved when all boxes are at storage locations.
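
The movement rules can be sketched in a few lines. The representation (coordinate tuples and sets) and the function names are hypothetical choices for this example, not the implementation used in the paper.

```python
def try_move(player, boxes, walls, direction):
    """Apply one move and return the new (player, boxes).

    player: (row, col); boxes, walls: sets of (row, col); direction: (dr, dc).
    The state is returned unchanged if the move is illegal.
    """
    dr, dc = direction
    target = (player[0] + dr, player[1] + dc)
    if target in walls:
        return player, boxes                    # cannot walk through walls
    if target in boxes:
        beyond = (target[0] + dr, target[1] + dc)
        if beyond in walls or beyond in boxes:
            return player, boxes                # box is blocked by a wall or another box
        boxes = (boxes - {target}) | {beyond}   # push the box one square
    return target, boxes

def solved(boxes, targets):
    """The puzzle is solved when every box sits on a storage location."""
    return boxes == targets

walls = {(0, 3)}
player, boxes = try_move((0, 0), {(0, 1)}, walls, (0, 1))     # pushes the box to (0, 2)
player2, boxes2 = try_move(player, boxes, walls, (0, 1))      # blocked: wall behind the box
```

In the example the box ends up against a wall and, since boxes cannot be pulled, it can no longer be moved along that row. This is the kind of irreversible move the text refers to.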

The environment model for Sokoban is shown in Figure 4.
[[File:sokoban_em.png|400px|center|thumb|Figure 4: The Sokoban environment model]]

In addition, to separate the effect of the larger architecture from that of imagination, the authors set up a copy-model agent that uses the same architecture as I2A but replaces the environment model with the identity map. This agent can be regarded as an I2A agent without imagination.

[[File:sokoban_result.png|800px|center|thumb|Figure 5: Sokoban learning curves. Left: training curves of I2A and baselines. Right: I2A training curves for various values of imagination depth]]
The results are shown in Figure 5 (left). I2A agents solve many more levels than the standard model-free baseline. I2A also far outperforms the copy-model version, suggesting that the environment model is crucial. The authors also trained an I2A whose environment model predicted only observations and no rewards. This also performed worse. However, after much longer training (3e9 steps), these agents did recover the performance of the original I2A, which was never the case for the baseline agent even with that many steps. Hence, reward prediction is very helpful but not absolutely necessary in this task, and imagined observations alone are informative enough to obtain high performance on Sokoban. Note that this is in contrast to many classical planning and model-based reinforcement learning methods, which often rely on reward prediction.

====Length of Rollout====
A further experiment investigated how the length of individual rollouts affects performance, using a parameter search. Figure 5 (right) shows the influence of the rollout length. Using 3 rollout steps speeds up learning and improves performance significantly compared with 1 step, and 5 steps is the optimal setting. This implies that rollouts can be very helpful and informative: they allow the agent to identify moves it cannot recover from.

[[File:sokoban_noisy.png|800px|center|thumb|Figure 6: Experiments with a noisy environment model Left: each row shows an example 5-step rollout after conditioning on an environment observation. Errors accumulate and lead to various artifacts, including missing or duplicate sprites. Right: comparison of Monte-Carlo (MC) search and I2A when using either the accurate or the noisy model for rollouts.]]

====Imperfections====
To demonstrate that I2A can handle less reliable predictions, the authors ran an experiment in which the I2A used a poor environment model (with a smaller number of parameters), whose errors may accumulate across the rollout (Figure 6 left). The authors suggest that it is the learned rollout encoder that enables I2As to deal with imperfect model predictions. This can be compared with a setup without a rollout encoder. As shown in Figure 6 (right), even with a relatively poor environment model, the performance of I2A is stable, unlike traditional Monte-Carlo search, which explicitly estimates the value of each action from rollouts rather than learning an arbitrary encoding of the rollouts. An interesting result is that a rollout length of 5 no longer outperforms a length of 3, which matches intuition: longer rollouts accumulate more model error.
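
For contrast, the explicit estimation used by a plain Monte-Carlo search can be sketched as below. The callables <code>env_model</code> and <code>rollout_policy</code>, the discount <code>gamma</code> and the toy environment are all hypothetical; the sketch uses a single deterministic rollout per action and is not the exact baseline of the paper.

```python
def mc_action_values(obs, actions, env_model, rollout_policy, length, gamma=1.0):
    """Estimate the value of each action as the return of a rollout from the model.

    env_model(obs, action) -> (next_obs, reward) and rollout_policy(obs) -> action
    are hypothetical stand-ins; gamma is an assumed discount factor.
    """
    values = {}
    for first_action in actions:
        o, a, ret, discount = obs, first_action, 0.0, 1.0
        for _ in range(length):
            o, r = env_model(o, a)   # trust the model's predicted reward directly
            ret += discount * r
            discount *= gamma
            a = rollout_policy(o)
        values[first_action] = ret
    return values

def toy_model(o, a):
    nxt = o + a
    return nxt, (1.0 if nxt == 3 else 0.0)

def toy_policy(o):
    return 1  # always move right

values = mc_action_values(0, [-1, 0, 1], toy_model, toy_policy, length=3)
best_action = max(values, key=values.get)
```

Because the chosen action depends only on the summed predicted rewards, any error in the model's predictions feeds straight into the decision, whereas I2A passes the whole rollout through a learned encoder.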

====Perfections====
Since I2A is robust to the quality of the environment model, the authors also tested an I2A agent with a nearly perfect environment model; the results are in Table 1 and Table 2. Traditional Monte-Carlo Tree Search is tested as the baseline. According to the table, although MCTS is able to solve many levels, it requires a very large number of search steps. In contrast, I2A with the nearly perfect model achieves the same fraction of solved levels with far fewer steps.

====Generalization====
Lastly, the authors probe the generalization capabilities of I2As, beyond handling random level layouts in Sokoban. The agents were trained on levels with 4 boxes. Table 2 shows the performance of I2A when such an agent was tested on levels with different numbers of boxes, and that of the standard model-free agent for comparison. It turns out that I2As generalize well; at 7 boxes, the I2A agent is still able to solve more than half of the levels, nearly as many as the standard agent on 4 boxes.
[[File:i2a_table.png|800px|center|thumb]]

===MiniPacman===
MiniPacman is a game modified from the classical game PacMan. In the game (Figure 8, left), the player explores a maze that contains food while being chased by ghosts. The maze also contains power pills; when eaten, for a fixed number of steps, the player moves faster, and the ghosts run away and can be eaten. These dynamics are common to all tasks. Each task is defined by a vector $w_{rew} \in R^5$, associating a reward to each of the following five events: moving, eating food, eating a power pill, eating a ghost, and being eaten by a ghost. As such, the reward vector $w_{rew}$ can be interpreted as an ‘instruction’ about which task to solve in the same environment.
The goal of this part is to apply the same I2A model to different tasks. The five tasks are described as follows:
* Regular: level is cleared when all the food is eaten;
* Avoid: level is cleared after 128 steps;
* Hunt: level is cleared when all ghosts are eaten or after 80 steps;
* Ambush: level is cleared when all ghosts are eaten or after 80 steps;
* Rush: level is cleared when all power pills are eaten.
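
The role of the reward vector can be illustrated as a weighted sum over the five events. The event names and the weights in <code>w_example</code> are hypothetical placeholders, not the values of Table 3.

```python
EVENTS = ["move", "eat_food", "eat_power_pill", "eat_ghost", "eaten_by_ghost"]

def task_reward(event_counts, w_rew):
    """Reward for one step: each event that occurred is weighted by its entry in w_rew.

    event_counts: dict mapping event name -> number of occurrences this step.
    w_rew: list of five weights in the order of EVENTS.
    """
    return sum(w * event_counts.get(e, 0) for e, w in zip(EVENTS, w_rew))

# Hypothetical weights for illustration only (not taken from Table 3).
w_example = [0.0, 1.0, 2.0, 5.0, 0.0]
r = task_reward({"move": 1, "eat_food": 1}, w_example)
```

Since only the weights change between tasks, the same environment model can serve all five tasks.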

[[File:minipacman_reward.png|800px|center|thumb|Table 3: the reward settings in different tasks]]

Unlike in Sokoban, in order to capture long-range dependencies across pixels, the authors also made use of a layer called pool-and-inject, which applies global max-pooling over each feature map, broadcasts the resulting values as feature maps of the same size, and concatenates the result to the input. Pool-and-inject layers are therefore size-preserving layers which communicate the max-value of each layer globally to the next convolutional layer. The environment model for MiniPacman is shown in Figure 7.
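
A minimal sketch of the pool-and-inject operation, written with nested Python lists so that it runs without extra libraries. The function name and the channels-first layout are assumptions of this example.

```python
def pool_and_inject(x):
    """x: feature maps as a nested list of shape [channels][height][width].

    Returns 2 * channels maps: the original maps followed by one constant
    map per channel holding that channel's global maximum.
    """
    height, width = len(x[0]), len(x[0][0])
    injected = []
    for channel in x:
        m = max(max(row) for row in channel)                    # global max-pooling
        injected.append([[m] * width for _ in range(height)])   # broadcast to full size
    return x + injected                                         # concatenate along channels

maps = [[[0, 1], [2, 3]],
        [[5, 4], [0, 0]]]
out = pool_and_inject(maps)
```

The height and width are unchanged, while the number of channels doubles, so a following convolution sees each map's global maximum at every spatial position.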

[[File:minipacman_model.png|800px|center|thumb|Figure 7: The MiniPacman environment model]]

To illustrate the benefits of model-based methods in this multi-task setting, the authors trained a single environment model to predict both observations (frames) and events, where the environment model is effectively shared across all tasks. The results in Figure 8 (right) illustrate the benefit of the I2A architecture, which outperforms the standard agent in all tasks. Note that for tasks 4 & 5, the rewards are particularly sparse, and the anticipation of ghost dynamics is especially important. The I2A agent can leverage its environment and reward model to explore the environment much more effectively.

[[File:minipacman.png|800px|center|thumb|Figure 8: Minipacman environment Left: Two frames from a minipacman game: the player is green, dangerous ghosts red, food dark blue, empty corridors black, power pills in cyan. After eating a power pill (right frame), the player can eat the 4 weak ghosts (yellow). Right: Performance after 300 million environment steps for different agents and all tasks. Note I2A clearly outperforms the other two agents on all tasks with sparse rewards.]]

[[File:imagination-946.PNG]]

The training curves for the various experimental tasks described in this paper are provided in the figure above.

=Conclusion=
In this paper, the authors build on recent successes in CNNs and reinforcement learning and propose a novel approach that combines model-free and model-based methods, called Imagination-augmented RL. Unlike classical model-based RL and planning methods, I2A is able to successfully use imperfect models to support model-free decisions. This approach outperforms model-free baselines on MiniPacman and on the challenging, combinatorial domain of Sokoban. As the experiments suggest, this method is able to use imperfect models to interpret future states and rewards.

I2As trade off environment interactions for computation by pondering before acting; the imagination core is therefore essential in irreversible domains, where actions can have catastrophic outcomes. Compared to traditional Monte-Carlo search methods, the search space in I2A grows only linearly with the length of the rollouts, and I2As require far fewer function calls. This work may significantly broaden the applicability of model-based RL concepts and ideas.

=Insight=
This is a paper with very interesting ideas. However, the work seems really hard for an individual researcher to reproduce. Since the architecture works as a whole, it is very difficult to debug each part separately. Moreover, the training process is long, with up to 1e9 steps, which demands substantial computing resources.

In terms of the architecture itself, the design of the CNN for the tasks seems to be very empirical. The authors did not include the reasons or rules behind it. It is also unclear why the authors applied residual connections in this shallow network. According to the paper, even though the CNN is quite simple, some details of the LSTM encoder are omitted. Therefore, the backpropagation process across the whole model is not entirely clear.

Regarding the environment model, the authors used a pre-trained model instead of joint training. Would it be hard to train both models simultaneously?

Lastly, the authors introduced a new layer, the pool-and-inject layer, whose motivation and plausibility are not entirely clear. It would be better if the authors compared it with a common pooling layer.

In spite of some missing details, this is solid work with a novel idea and many tricks. In addition, the experimental settings are quite inspiring and something we can learn from.

Using memory networks instead of an LSTM could alleviate the problem of remembering long-term rewards. Performing inference over the memory could lead to more accurate insights from the internal simulations performed by the imagination-augmented agents.

=Reference=
# A commentary of the paper by the authors can be found on: https://www.youtube.com/watch?v=agXIYMCICcc
# Buesing, L., Badia, A.P., Battaglia, P.W., Guez, A., Heess, N., Li, Y., Pascanu, R., Racanière, S., Reichert, D.P., Rezende, D.J., Silver, D., Vinyals, O., Weber, T., & Wierstra, D. (2017). Imagination-Augmented Agents for Deep Reinforcement Learning. CoRR, abs/1707.06203.
# YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2017.
# Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
# Jessica B. Hamrick, Andy J. Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W. Battaglia. Metacontrol for adaptive imagination-based optimization. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
# Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, David Reichert, Theophane Weber, Sebastien Racaniere, Lars Buesing, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. arXiv preprint, 2017.
# Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
# Introduction to MCTS http://mcts.ai/about/index.html
# Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, Sergey Levine. "Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning". arxiv pre-print; arXiv:1703.03078 [cs.RO]
# Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.

= Appendix =
This paper provides a rich appendix that expounds upon the authors' implementation in much greater detail.
== A Training and the rollout policy distribution details ==
As in other reinforcement learning works, each agent used in the paper defines a stochastic policy. During training, to increase the probability of an action being taken, A3C applies an update $\Delta \theta$ to the parameters $\theta$ using the policy gradient $g(\theta)$:

$ g(\theta) = \nabla_{\theta} \log \pi(a_{t}|o_{t};\theta) A(o_{t}; \theta_{v})$,

where $A(o_{t}; \theta_{v})$ denotes an estimate of the advantage function. We learn a value function $V(o_t;\theta_v)$ and hence use it to compute the advantage.
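
As a small worked example, the gradient can be computed by hand for a softmax policy. Two simplifications are assumed here: the gradient is taken with respect to the policy logits rather than the full parameter vector $\theta$, and the advantage is estimated as an observed return minus the learned value, which is one common choice and not necessarily the exact estimator used in the paper.

```python
import math

def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_wrt_logits(logits, action, ret, value):
    """g = grad_logits log pi(action) * A, with A = ret - value (assumed estimator).

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    """
    pi = softmax(logits)
    advantage = ret - value
    return [((1.0 if i == action else 0.0) - p) * advantage
            for i, p in enumerate(pi)]

g = policy_gradient_wrt_logits([0.0, 0.0, 0.0], action=1, ret=2.0, value=0.5)
```

With a positive advantage, the component for the chosen action is positive and the others are negative, so a gradient ascent step raises the probability of that action.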

== C MiniPacman additional details ==
=== Task collection ===
[[File:task_collection.PNG]]

Imagination-Augmented Agents for Deep Reinforcement Learning

2017-11-14T19:05:09Z

A2prasad: /* Experiments */

=Introduction=
An interesting research area in reinforcement learning is developing intelligent agents for playing video games. Before the introduction of deep learning, video game agents were commonly built on Monte-Carlo Tree Search (MCTS) with pre-set rules. MCTS is used for making optimal decisions in artificial intelligence problems by focusing on the analysis of the most promising moves. The basic algorithm consists of four steps: selection, expansion, simulation, and backpropagation. Recent research has shown deep reinforcement learning to be very successful at playing video games such as those on the Atari 2600. Specifically, the method (Figure 1) is called Deep Q-Learning (DQN), which learns the optimal actions based on current observations (raw pixels) [[#Reference|[Mnih et al., (2015)]]]. However, there are some complex games where DQN fails to learn: some games require solving a sub-problem without explicit reward, or contain irreversible domains where actions can be catastrophic. A typical example of these games is [https://en.wikipedia.org/wiki/Sokoban Sokoban]. Similar to how humans play the game, an RL model needs planning and inference. This kind of game poses a challenge to RL.

[[File:DQN.png|800px|center|thumb|Figure 1: Deep Q-Learning Architecture]]

In reinforcement learning, algorithms can be divided into two categories: '''model-free''' and '''model-based'''. Model-based reinforcement learning tries to infer a model of the environment in order to gain reward, while model-free reinforcement learning does not use a model of the environment to learn the action that results in the best reward. More specifically, model-based methods learn the model of the environment (the reward function $R(s, s^{'})$ and the transition probability $P(s^{'} | s, a)$, where $s'$, $s$ and $a$ are the next state, current state and action respectively), while model-free methods never explicitly learn the model of the environment. DQN, mentioned above (Figure 1), is a model-free method. It takes raw pixels as input and maps them to values or actions. As a drawback, large amounts of training data are required. In addition, the policies do not generalize to new tasks in the same environment. A model-based method tries to build a model of the environment. By querying the model, agents can avoid irreversible, poor decisions. As an approximation of the environment, the model can enable better generalization across states. However, this approach has only shown success in limited settings, where an exact transition model is given, or in simple domains. In complex environments, model-based methods suffer from model errors arising from function approximation. These errors compound during planning, causing poor agent performance. Currently, there is no model-based method that is robust against imperfections.

In this paper, the authors introduce a novel deep reinforcement learning architecture called Imagination-Augmented Agents (I2As). This method enables agents to learn to interpret predictions from a learned environment model in order to construct implicit plans. It combines model-free and model-based aspects. The advantage of this method is that it learns end-to-end to extract information from model simulations without making any assumptions about the structure or the accuracy of the environment model.
As shown in the results, this method outperforms DQN on Sokoban and MiniPacman. In addition, the experiments show that I2A is able to successfully use imperfect models.

=Motivation=
The capability to "imagine" and reason about the future is an important property of intelligent and sophisticated RL algorithms. Beyond that, they must be able to construct a plan using this knowledge. In a model-based approach, an "internal model" is used to analyze how actions lead to future outcomes in order to reason and plan. These internal models work so well because the provided environments are generally "perfect": they have clearly defined rules which allow outcomes to be predicted very accurately in almost every circumstance. But the real world is complex, rules are not so clearly defined, and unpredictable problems often arise. Even for the most intelligent agents, imagining in these complex environments is a long and costly process. Hence this paper puts forward the idea of combining the model-free and model-based approaches, using imagination augmentation, so that the agent can work in complex situations. Although the structure of this method is complex, the motivation is intuitive: since the agent suffers from irreversible decisions, trying actions in simulated states may be helpful. To reduce the expensive search in traditional MCTS methods, decisions from a policy network are added, which reduces the number of search steps. In order to keep context information, rollout results are encoded by an LSTM encoder. The final output combines the results of the model-free network and the model-based network.

=Related Work=
Several works try to apply deep learning to model-based reinforcement learning. The popular approach is to learn a neural network model of the environment and apply the network in classical planning algorithms. These works cannot handle the mismatch between the learned model and the ground truth. [[#Reference|[Liu et al.(2017)]]] use context information from trajectories, but in the context of imitation learning.

To deal with imperfect models, [[#Reference|[Deisenroth and Rasmussen(2011)]]] try to capture model uncertainty by applying computationally expensive Gaussian process models. In order to develop such a policy search method, the authors of that work used analytic gradients of an approximation to the expected return for indirect policy search. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, this policy search method can cope with very little data and facilitates learning from scratch in only a few trials.

Similar ideas can be found in a study by [[#Reference|[Hamrick et al.(2017)]]]: they present a neural network that queries expert models, but focus only on meta-control for continuous contextual bandit problems. Pascanu et al. (2017) extend this work by focusing on explicit planning in sequential environments.

This paper claims to build upon the work of [[#Reference|[Tamar et al. (2016)]]]. In these works, neural networks whose architectures mimic classical iterative planning algorithms are presented. Such models are trained by reinforcement learning or to predict user-defined, high-level features. The authors did not define any explicit environment model.

=Approach=
The summary of the architecture of I2A can be seen in Figure 2.
[[File:i2a.png|800px|center|thumb|Figure 2: The Architecture of I2A]]
The observation $O_t$ (Figure 2 right) is fed into two paths. The model-free path is just a common DQN, which predicts the best action given $O_t$, whereas the model-based path performs a rollout strategy: the aggregator combines the $n$ encoded rollout outputs ($n$ equals the number of actions in the action space) and forwards the result to the next layer. Together, the two paths are used to generate a policy function $\pi$ that outputs an action. In each rollout operation, the imagination core is used to predict the future state and reward.

===Imagination Core===
The imagination-augmented agents adopt a concept called the "imagination encoder", a neural network which learns to extract relevant information that impacts the agent's future decisions and to ignore information that is irrelevant. In particular, these agents have the following features: (i) they have the ability to learn to interpret their internal simulations, which capture the environmental dynamics, (ii) they adapt to the number of imagined trajectories, which makes the imagination more efficient, and finally (iii) they have the ability to learn different strategies to construct plans by choosing the appropriate trajectory. The imagination core (Figure 2 left) plays the key role in the model-based path. It consists of two parts: the environment model and the rollout policy. The former is an approximation of the environment and the latter is used to simulate imagined trajectories, which are interpreted by a neural network and provided as additional context to a policy network.

====environment model====
In order to augment agents with imagination, the method relies on environment models that, given current information, can be queried to make predictions about the future. In this work, the environment model is built on action-conditional next-step predictors, which take the current observation and the current action as input and predict the next observation and the next reward (Figure 3).
[[File:environment model.png|800px|center|thumb|Figure 3: Environment Model]]

The authors can either pretrain the environment model before embedding it (with frozen weights) within the I2A architecture, or train it jointly with the agent by adding $l_{model}$ to the total loss as an auxiliary loss. In practice, they found that pre-training the environment model led to faster runtime of the I2A architecture, so they adopted this strategy.

====rollout policy====
A rollout is a simulated trajectory generated with the environment model. In this work, one rollout is performed for each possible action in the environment.

A rollout policy $\hat \pi$ is a function that takes the current observation $O$ and outputs an action $a$ that potentially leads to maximal reward. In this architecture, the rollout policy can be a DQN network. In the experiments, the rollout policy $\hat \pi$ is broadcast and shared. After experimenting with different types of rollout policies (random, pre-trained), the authors found that an efficient strategy is to distill the imagination-augmented policy into a model-free policy. This consists of creating a small model-free network $\hat \pi(O_t)$ and adding to the total loss a cross-entropy auxiliary loss between the imagination-augmented policy $\pi(O_t)$, as computed on the current observation, and the policy $\hat \pi(O_t)$, as computed on the same observation.

$$
l_{dist} (\pi, \hat \pi)(O_t) = \lambda_{dist} \sum_a \pi(a|O_t) \log \hat \pi(a|O_t)
$$

Together, these two parts form the imagination core and produce $n$ trajectories $\hat \tau_1,...,\hat \tau_n$. Each imagined trajectory $\hat \tau$ is a sequence of features $(\hat f_{t+1},...,\hat f_{t+\tau})$, where $t$ is the current time, $\tau$ is the length of the rollout, and $\hat f_{t+i}$ is the output of the environment model (the predicted observation and reward). To remain robust to model imperfections, the architecture does not assume the learned model to be perfect, so the output does not depend on the predicted reward alone.

===Trajectories Encoder===
To preserve the sequence information in the trajectories, the architecture uses a rollout encoder $\varepsilon$ that processes the imagined rollout as a whole and learns to interpret it (Figure 2 middle). Each trajectory is encoded as a rollout embedding $e_i=\varepsilon(\hat \tau_i)$. Then, the aggregator $A$ combines the rollout embeddings into a single imagination code $c_{ia}=A(e_1,...,e_n)$ by simply concatenating all the summaries.
In the experiments, the encoder is an LSTM that takes the predicted output of the environment model as input. One observation is that the order of the sequence $\hat f_{t+1}$ to $\hat f_{t+\tau}$ has relatively little impact on performance. The encoder mimics the Bellman-type backup operations in DQN.
An alternative would be to combine guided policy search with the linear quadratic regulator[7], which is also a joint model-free and model-based trajectory update mechanism for reinforcement learning.

===Model-Free Path===
The model-free path contains a network that takes only the current observation as input and outputs the potentially optimal action. This network can be the same as the one in the imagination core.

In conclusion, the I2A learns to combine information from the two paths. Without the model-based path, I2A simply reduces to a standard model-free network (such as A3C, more explanations [https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2 here]). The imperfect approximation results in a rollout policy with higher entropy, potentially striking a balance between exploration and exploitation.

=Experiments=
The following experiments were run on the Sokoban and MiniPacman games. All results are averages over the top three agents. These agents were trained over 32 to 64 workers, and the network was optimized with RMSprop.
For pre-training, the training data for the environment model was generated from trajectories of a partially trained standard model-free agent. This data is also counted against the budget: the total number of frames needed for pre-training is included in the reported totals. The authors also show that the environment model can be reused to solve multiple tasks in the same environment.

In Sokoban, the environment is a 10 x 10 grid world. All agents were trained directly on raw pixels (image size 80 x 80 with 3 channels). To make sure the network does not simply "memorize" all states, the game procedurally generates a new level each episode. Out of 40 million levels generated, less than 0.7% were repeated. Therefore, a good agent should be able to solve unseen levels as well.

The reward settings for reinforcement learning algorithms are as follows:
* At every time step, a penalty of -0.1 is applied to the agent (this encourages agents to finish levels faster).
* Whenever the agent pushes a box onto a target, it receives a reward of +1 (this encourages agents to push boxes onto targets).
* Whenever the agent pushes a box off a target, it receives a penalty of -1 (this avoids the artificial reward loop that would be induced by repeatedly pushing a box off and on a target).
* Finishing the level gives the agent a reward of +10 and the level terminates (this strongly rewards solving a level).
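
The reward settings above can be sketched as a small function. This is only an illustration: the event flags are hypothetical names, and the sketch assumes that rewards earned in the same time step simply add up, which the summary does not state explicitly.

```python
# Reward constants taken from the list above.
STEP_PENALTY = -0.1
BOX_ON_TARGET = 1.0
BOX_OFF_TARGET = -1.0
LEVEL_SOLVED = 10.0


def step_reward(pushed_on_target=False, pushed_off_target=False, solved=False):
    """Reward for one time step (flag names are hypothetical).

    Assumption: rewards for events in the same step are summed.
    """
    reward = STEP_PENALTY  # applied at every time step
    if pushed_on_target:
        reward += BOX_ON_TARGET
    if pushed_off_target:
        reward += BOX_OFF_TARGET
    if solved:
        reward += LEVEL_SOLVED
    return reward


def episode_return(events):
    """Undiscounted sum of step rewards for a list of event dictionaries."""
    return sum(step_reward(**event) for event in events)


# Pushing a box onto a target and then off again never pays off,
# because the +1 and -1 cancel and both steps incur the step penalty.
loop = [{"pushed_on_target": True}, {"pushed_off_target": True}]
print(episode_return(loop))
```

Under these assumptions, an on/off loop always has a negative return, which is the purpose of the -1 penalty.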

To show the advantage of I2A, the authors set a model-free standard architecture as a baseline. The architecture is a multi-layer convolutional neural network (CNN), taking the current observation $O_t$ as input, followed by a fully connected (FC) hidden layer. This FC layer feeds into two heads: into an FC layer with one output per action computing the policy logits $\log \pi(a_t|O_t, \theta)$; and into another FC layer with a single output that computes the value function $V(O_t; \theta_v)$.
* for MiniPacman: the CNN has two layers, both with 3x3 kernels, 16 output channels and strides 1 and 2; the following FC layer has 256 units
* for Sokoban: the CNN has three layers with kernel sizes 8x8, 4x4, 3x3, strides of 4, 2, 1 and number of output channels 32, 64, 64; the following FC has 512 units
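
As a rough check of the Sokoban baseline dimensions, the spatial size after each convolution can be computed from the kernel sizes and strides listed above. The sketch assumes no padding ("valid" convolutions), which is not stated in the summary; with a different padding scheme the numbers would change.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output width/height of a square convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(input_size, layers, padding=0):
    """Return (spatial size, channels, flattened length) after all layers.

    layers is a list of (kernel, stride, out_channels) tuples.
    """
    size, channels = input_size, None
    for kernel, stride, out_channels in layers:
        size = conv_out(size, kernel, stride, padding)
        channels = out_channels
    return size, channels, size * size * channels


# Sokoban baseline: kernels 8x8, 4x4, 3x3; strides 4, 2, 1; channels 32, 64, 64.
SOKOBAN_LAYERS = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

# 80 -> 19 -> 8 -> 6 under the no-padding assumption.
print(flattened_features(80, SOKOBAN_LAYERS))
```

Under this assumption the 512-unit FC layer would receive a 6 x 6 x 64 feature volume.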

===Sokoban===

Sokoban is a video game classified as a transport puzzle. The player, seen from an aerial view, moves boxes to their target locations. The boxes can only be pushed, and many moves become irreversible if the player doesn't plan them properly, which might render the puzzle unsolvable. The player is confined to the board and may move horizontally or vertically onto empty squares (never through walls or boxes). The player can also move into a box, which pushes it into the square beyond. Boxes may not be pushed into other boxes or walls, and they cannot be pulled. The number of boxes is equal to the number of storage locations. The puzzle is solved when all boxes are at storage locations.

The environment model for Sokoban is shown in Figure 4.
[[File:sokoban_em.png|400px|center|thumb|Figure 4: The Sokoban environment model]]

In addition, to isolate the influence of the larger architecture of I2A, the authors set up a copy-model agent that uses the same architecture as I2A but with the environment model replaced by an identity map. This agent can be regarded as an I2A agent without imagination.

[[File:sokoban_result.png|800px|center|thumb|Figure 5: Sokoban learning curves. Left: training curves of I2A and baselines. Right: I2A training curves for various values of imagination depth]]
The results are shown in Figure 5 (left). I2A agents solve many more levels than the standard model-free baseline. I2A also far outperforms the copy-model version, suggesting that the environment model is crucial. The authors also trained an I2A whose environment model predicted only observations and no rewards. This agent also performed worse. However, after much longer training (3e9 steps), these agents did recover the performance of the original I2A, which was never the case for the baseline agent even with that many steps. Hence, reward prediction is very helpful but not absolutely necessary in this task, and imagined observations alone are informative enough to obtain high performance on Sokoban. Note that this is in contrast to many classical planning and model-based reinforcement learning methods, which often rely on reward prediction.

====Length of Rollout====
A further experiment investigated how the length of individual rollouts affects performance, using a parameter search. Figure 5 (right) shows the influence of the rollout length. Using 3 rollout steps speeds up learning and improves performance significantly compared with 1 step, and 5 is the optimal number. This implies that rollouts can be very helpful and informative: they enable the agent to learn about moves it cannot recover from.

[[File:sokoban_noisy.png|800px|center|thumb|Figure 6: Experiments with a noisy environment model Left: each row shows an example 5-step rollout after conditioning on an environment observation. Errors accumulate and lead to various artifacts, including missing or duplicate sprites. Right: comparison of Monte-Carlo (MC) search and I2A when using either the accurate or the noisy model for rollouts.]]

====Imperfections====
To demonstrate that I2A can handle less reliable predictions, the authors ran an experiment in which the I2A used a poor environment model (with a smaller number of parameters), whose errors may accumulate across the rollout (Figure 6, left). The authors suggest that it is the learned rollout encoder that enables I2As to deal with imperfect model predictions. To test this, I2A can be compared with a setup without a rollout encoder. As shown in Figure 6 (right), even with a relatively poor environment model the performance of I2A is stable, unlike traditional Monte-Carlo search, which explicitly estimates the value of each action from rollouts rather than learning an arbitrary encoding of the rollouts. An interesting result is that, with the poor model, a rollout length of 5 no longer outperforms a length of 3, which is intuitive since longer rollouts accumulate more model error.

====Perfections====
Since I2A is robust to imperfect environment models, the authors also tested an I2A agent with a nearly perfect environment model; the results are in Table 1 and Table 2. Traditional Monte-Carlo Tree Search (MCTS) is tested as the baseline. The tables show that although MCTS is able to solve many levels, it requires a very large number of search steps. In contrast, I2A with the nearly perfect model can solve the same fraction of levels with far fewer steps.

====Generalization====
Lastly, the authors probe the generalization capabilities of I2As, beyond handling random level layouts in Sokoban. The agents were trained on levels with 4 boxes. Table 2 shows the performance of I2A when such an agent was tested on levels with different numbers of boxes, and that of the standard model-free agent for comparison. It turns out that I2As generalize well; at 7 boxes, the I2A agent is still able to solve more than half of the levels, nearly as many as the standard agent on 4 boxes.
[[File:i2a_table.png|800px|center|thumb]]

===MiniPacman===
MiniPacman is a game modified from the classic game Pac-Man. In the game (Figure 8, left), the player explores a maze that contains food while being chased by ghosts. The maze also contains power pills; when one is eaten, for a fixed number of steps the player moves faster, and the ghosts run away and can be eaten. These dynamics are common to all tasks. Each task is defined by a vector $w_{rew} \in R^5$, associating a reward to each of the following five events: moving, eating food, eating a power pill, eating a ghost, and being eaten by a ghost. As such, the reward vector $w_{rew}$ can be interpreted as an ‘instruction’ about which task to solve in the same environment.
The goal of this part is to apply the same I2A model to different tasks. The five tasks are described as follows:
* Regular: level is cleared when all the food is eaten;
* Avoid: level is cleared after 128 steps;
* Hunt: level is cleared when all ghosts are eaten or after 80 steps;
* Ambush: level is cleared when all ghosts are eaten or after 80 steps.
* Rush: level is cleared when all power pills are eaten.
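
One way to read the task vector $w_{rew}$ is as a dot product with the counts of the five events that occurred in a step. The sketch below illustrates this reading; the event names and the example weights are hypothetical and are not the values from Table 3.

```python
# The five events, in the order listed in the text (names are hypothetical).
EVENTS = ("move", "eat_food", "eat_power_pill", "eat_ghost", "eaten_by_ghost")


def task_reward(w_rew, event_counts):
    """Reward for one step: dot product of the task vector and event counts."""
    if len(w_rew) != len(EVENTS) or len(event_counts) != len(EVENTS):
        raise ValueError("expected one entry per event")
    return sum(w * c for w, c in zip(w_rew, event_counts))


# Illustrative weights only; the real values are given in Table 3.
w_example = (0.0, 1.0, 2.0, 5.0, 0.0)

# The player moved and ate one piece of food in this step.
print(task_reward(w_example, (1, 1, 0, 0, 0)))
```

Changing only `w_example` switches the task while the environment dynamics stay the same, which is why a single environment model can be shared across tasks.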

[[File:minipacman_reward.png|800px|center|thumb|Table 3: the reward settings in different tasks]]

Unlike in Sokoban, in order to capture long-range dependencies across pixels, the authors also made use of a layer called pool-and-inject, which applies global max-pooling over each feature map, broadcasts the resulting values as feature maps of the same size, and concatenates the result to the input. Pool-and-inject layers are therefore size-preserving layers which communicate the maximum value of each feature map globally to the next convolutional layer. The environment model for MiniPacman is shown in Figure 7.
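
The pool-and-inject operation described above can be sketched in plain Python on nested lists. This is a minimal illustration of the pooling, broadcasting and concatenation steps only, assuming a channels-first layout; it is not the authors' implementation.

```python
def pool_and_inject(feature_maps):
    """Apply pool-and-inject to a list of C feature maps (each H x W).

    For every channel, take the global maximum, tile it into an H x W map,
    and append the tiled maps to the original channels (C -> 2C channels).
    """
    injected = []
    for channel in feature_maps:
        peak = max(max(row) for row in channel)  # global max-pooling
        injected.append([[peak for _ in row] for row in channel])  # broadcast
    return feature_maps + injected  # concatenate along the channel axis


x = [
    [[1, 2], [3, 4]],      # channel 0
    [[0, -1], [-2, -3]],   # channel 1
]
out = pool_and_inject(x)
print(len(out), out[2], out[3])
```

The spatial size is unchanged, so the next convolution sees, at every position, both the local features and the global maximum of each input channel.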

[[File:minipacman_model.png|800px|center|thumb|Figure 7: The MiniPacman environment model]]

To illustrate the benefits of model-based methods in this multi-task setting, the authors trained a single environment model to predict both observations (frames) and events, where the environment model is effectively shared across all tasks. The results in Figure 8 (right) illustrate the benefit of the I2A architecture, which outperforms the standard agent in all tasks. Note that for tasks 4 & 5, the rewards are particularly sparse, and the anticipation of ghost dynamics is especially important. The I2A agent can leverage its environment and reward model to explore the environment much more effectively.

[[File:minipacman.png|800px|center|thumb|Figure 8: Minipacman environment Left: Two frames from a minipacman game: the player is green, dangerous ghosts red, food dark blue, empty corridors black, power pills in cyan. After eating a power pill (right frame), the player can eat the 4 weak ghosts (yellow). Right: Performance after 300 million environment steps for different agents and all tasks. Note I2A clearly outperforms the other two agents on all tasks with sparse rewards.]]

[[File:imagination-946.PNG]]

The training curves for the various experimental tasks described in this paper are provided in the figure above.

=Conclusion=
In this paper, the authors build on recent successes in CNNs and reinforcement learning and propose a novel approach that combines model-free and model-based methods, called Imagination-Augmented Agents (I2A). Unlike classical model-based RL and planning methods, I2A is able to successfully use imperfect models to support model-free decisions. This approach outperforms model-free baselines on MiniPacman and on the challenging, combinatorial domain of Sokoban. As the experiments suggest, this method is able to use imperfect models to interpret future states and rewards.

I2As trade off environment interactions for computation by pondering before acting; the imagination core is therefore essential in irreversible domains, where actions can have catastrophic outcomes. Compared to traditional Monte-Carlo search methods, the search space in I2A grows only linearly with the rollout length, so I2As require far fewer function calls. This work may significantly broaden the applicability of model-based RL concepts and ideas.

=Insight=
This is a paper with very interesting ideas. However, the work seems hard for an individual researcher to reproduce. Since the architecture works as a whole, it is very difficult to debug each part in isolation. Moreover, training is long, with up to 1e9 steps, which demands substantial computing resources.

In terms of the architecture itself, the design of the CNN for the tasks seems to be very empirical. The authors did not include the reasons or rules for this part. It is also unclear why the authors applied residual connections in such a shallow network. According to the paper, although the CNN is quite simple, some details of the LSTM encoder are omitted. Therefore, the backpropagation process across the whole model is not entirely clear.

Regarding the environment model, the authors used a pre-trained model instead of training it jointly with the agent. Would it be hard to train both models simultaneously?

Lastly, the authors introduced a new layer, the pool-and-inject layer, but its motivation and plausibility are not so clear. It would be better if the authors compared it with a common pooling layer.

In spite of some missing details, this is solid work with a novel idea and many tricks. In addition, the experimental settings are quite inspiring, and there is much to learn from them.

The use of memory networks instead of an LSTM could alleviate the problem of remembering long-term rewards. Performing inference over the memory could lead to more accurate insights from the internal simulations performed by the imagination-augmented agents.

=Reference=
# A commentary of the paper by the authors can be found on: https://www.youtube.com/watch?v=agXIYMCICcc
# Buesing, L., Badia, A.P., Battaglia, P.W., Guez, A., Heess, N., Li, Y., Pascanu, R., Racanière, S., Reichert, D.P., Rezende, D.J., Silver, D., Vinyals, O., Weber, T., & Wierstra, D. (2017). Imagination-Augmented Agents for Deep Reinforcement Learning. CoRR, abs/1707.06203.
# YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2017.
# Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
# Jessica B. Hamrick, Andy J. Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W. Battaglia. Metacontrol for adaptive imagination-based optimization. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
# Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, David Reichert, Theophane Weber, Sebastien Racaniere, Lars Buesing, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. arXiv preprint, 2017.
# Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
#Introduction to MCTS http://mcts.ai/about/index.html
#Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, Sergey Levine. "Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning". arxiv pre-print; arXiv:1703.03078 [cs.RO]
#Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.

= Appendix =
This paper provides a rich appendix that expounds upon the authors' implementation in much greater detail.
== A Training and the rollout policy distribution details ==
As in other reinforcement learning works, each agent used in the paper defines a stochastic policy. During training, to increase the probability of advantageous actions being taken, A3C applies an update $\Delta \theta$ to the parameters $\theta$ using the policy gradient $g(\theta)$:

$ g(\theta) = \nabla_{\theta} \log \pi(a_{t}|o_{t};\theta) \, A(o_{t}; \theta_{v})$
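
To make the update concrete, the sketch below evaluates this gradient for a softmax policy in the special case where the parameters $\theta$ are the action logits themselves. This simplification is an assumption made for illustration; in the paper the logits are produced by a network. For a softmax, the derivative of $\log \pi(a)$ with respect to logit $j$ is $1[j = a] - \pi_j$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_wrt_logits(logits, action, advantage):
    """g = grad_logits log pi(action) * advantage for a softmax policy.

    Assumption: the logits are treated directly as the parameters theta.
    """
    probs = softmax(logits)
    return [
        advantage * ((1.0 if j == action else 0.0) - p)
        for j, p in enumerate(probs)
    ]


# With a positive advantage, the logit of the taken action is pushed up
# and the logits of the other actions are pushed down.
g = policy_gradient_wrt_logits([0.0, 0.0, 0.0], action=1, advantage=2.0)
print(g)
```

A negative advantage flips the signs, lowering the probability of the action that was taken.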

== C MiniPacman additional details ==
=== Task collection ===
[[File:task_collection.PNG]]

File:imagination-946.PNG

2017-11-14T19:02:41Z

A2prasad:

STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:47:57Z

A2prasad: /* Matching Networks */

= Introduction =

The recent surge in the use of Deep Neural Networks (DNNs) has resulted in large leaps in prediction accuracy. DNNs are also being used to solve a variety of complex tasks at which earlier methodologies struggled to excel.

While it is encouraging to see such high accuracy from DNNs, we must begin to question why they perform so well. Representing the features/feature maps, or interpreting the meaning of the learnt values in a DNN's hidden layers, has become an interesting field of study. Currently we treat DNN models as black boxes whose tweakable parameters, such as the number of layers, the number of units in each layer, and the number and size of feature maps (in the case of CNNs), we tune empirically. The opacity created by the lack of an intuitive representation of the internal learnt parameters of DNNs hinders both basic research and the application of DNNs to real-world problems.

Recent pushes have aimed to better understand DNNs: tailor-made loss functions and architectures produce more interpretable features (Higgins et al., 2016; Raposo et al., 2017) while output-behavior analyses unveil previously opaque operations of these networks (Karpathy et al., 2015). Parallel to this work, neuroscience-inspired methods such as activation visualization (Li et al., 2015), ablation analysis (Zeiler & Fergus, 2014) and activation maximization (Yosinski et al., 2015) have also been applied.

This paper aims to provide another methodology for deciphering and better understanding how DNNs solve a particular task. The methodology is inspired by concepts from psychology and tests whether DNNs make predictions with biases similar to those of the human mind.

Research in developmental psychology shows that when learning new words, humans tend to assign the same name to similarly shaped items rather than to items with similar color, texture, or size. This bias tends to become ingrained, and humans then carry it forward to readily associate familiar shapes with new objects they have not seen before.

The authors of this paper test whether DNNs behave similarly in one-shot learning applications. They attempt to show that when state-of-the-art DNN models are used to learn objects from images, they exhibit a stronger shape bias than color bias. To emulate the human brain, they take the parameters of pre-trained DNN models and use them to perform one-shot learning on a new data set with different labels.

= Background =
== One Shot Learning ==
One-shot learning is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

The one-shot word learning task is to label a novel data example $\hat{x}$ (e.g. a novel probe image) with a novel class label $\hat{y}$ (e.g. a new word) after only a single example.

More specifically, given a support set $S = \{(x_i, y_i) , i \in [1, k]\}$ of images $x_i$ and their associated labels $y_i$, and an unlabeled probe image $\hat{x}$, the one-shot learning task is to identify the true label of the probe image, $\hat{y}$, from the support set labels $\{y_i , i \in [1, k]\}$:

$\displaystyle \hat{y} = \arg\max_{y} P(y | \hat{x}, S)$

We assume that the image labels $y_i$ are represented using a one-hot encoding and that $P(y|\hat{x}, S)$ is parameterised by a DNN, allowing us to leverage the ability of deep networks to learn powerful representations.

== Inception Networks ==

A probe image $\hat{x}$ is given the label of the nearest neighbour from the support set:

$\hat{y} = y$

$(x, y) = \displaystyle \arg\min_{(x_i,y_i) \in S} d(h(x_i), h(\hat{x})) $

where d is a distance function.

The function h is parameterized by Inception, one of the best performing ImageNet classification models. Specifically, h returns features from the last layer (the softmax input) of a pre-trained Inception classifier. With these features as input and cosine distance as the distance function, the classifier achieves 87.6% accuracy on one-shot classification on the ImageNet dataset (Vinyals et al., 2016). We call the Inception classifier together with the nearest-neighbor component the Inception Baseline (IB) model.
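
The nearest-neighbour rule above can be sketched as follows. The feature vectors stand in for $h(x)$; in the paper they come from a pre-trained Inception network, whereas here they are small hand-written lists used purely for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def one_shot_label(probe_features, support):
    """Return the label of the support item closest to the probe.

    support is a list of (features, label) pairs, i.e. (h(x_i), y_i).
    """
    _, label = min(
        support, key=lambda pair: cosine_distance(pair[0], probe_features)
    )
    return label


# Toy support set with made-up 2-d features.
support = [([1.0, 0.0], "cat"), ([0.0, 1.0], "dog")]
print(one_shot_label([0.9, 0.1], support))
```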

== Matching Networks ==

Matching Networks (MNs) (Vinyals et al., 2016) are neural network architectures with state-of-the-art one-shot learning performance on ImageNet (93.2% one-shot labelling accuracy).
MNs are trained to assign label $\hat{y}$ to probe image $\hat{x}$ using an attention mechanism a acting on image embeddings stored in the support set S:
\begin{align*}
a(\hat{x},x_i)=\frac{e^{d(f(\hat{x},S), g(x_i,S))}}{\sum_{j}e^{d(f(\hat{x},S), g(x_j,S))}},
\end{align*}

where d is a cosine distance and where f and g provide context-dependent embeddings of $\hat{x}$ and $x_i$ (with context S). The embedding $g(x_i, S)$ is a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) with the support set S provided as an input sequence. The embedding $f(\hat{x}, S)$ is an LSTM with a read-attention mechanism operating over the entire embedded support set. The input to the LSTM is given by the penultimate layer features of a pre-trained deep convolutional network, specifically Inception.
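
The attention weights can be sketched as a softmax over similarity scores between the probe embedding and each support embedding. Two assumptions are made here: the score is taken to be cosine similarity, so that closer embeddings receive more weight, and the predicted label distribution is the attention-weighted sum of the one-hot support labels, as in Vinyals et al. (2016); the latter is not spelled out in the text above. The embeddings are plain lists standing in for $f(\hat{x}, S)$ and $g(x_i, S)$.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def attention_weights(probe_emb, support_embs):
    """Softmax over similarity scores between the probe and each support item."""
    scores = [cosine(probe_emb, g) for g in support_embs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_distribution(probe_emb, support_embs, support_onehots):
    """Attention-weighted sum of one-hot support labels (assumed readout)."""
    weights = attention_weights(probe_emb, support_embs)
    n_classes = len(support_onehots[0])
    return [
        sum(w * y[c] for w, y in zip(weights, support_onehots))
        for c in range(n_classes)
    ]


embs = [[1.0, 0.0], [0.0, 1.0]]
onehots = [[1, 0], [0, 1]]
w = attention_weights([1.0, 0.0], embs)
dist = label_distribution([1.0, 0.0], embs, onehots)
print(w, dist)
```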

To train MNs we proceed as follows:

=== Training MN ===
* Step 1: At each step of training, the model is given a small support set of images and associated labels. In addition to the support set, the model is fed an unlabeled probe image $\hat{x}$.

* Step 2: The model parameters are then updated to improve classification accuracy of the probe image $\hat{x}$ given the support set. Parameters are updated using stochastic gradient descent with a learning rate of 0.1.

* Step 3: After each update, the labels $\{y_i, i \in [1, k]\}$ in the training set are randomly re-assigned to new image classes (the label indices are randomly permuted, but the image labels are not changed). This is a critical step. It prevents MNs from learning a consistent mapping between a category and a label. Usually, in classification, this is what we want, but in one-shot learning we want to train our model for classification after viewing a single in-class example from the support set.
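
The label re-assignment in Step 3 can be sketched as a random permutation of label indices applied per training step. The data layout (pairs of an image identifier and a class name) is hypothetical and chosen only to show that images of the same class keep sharing a label while the index itself changes from step to step.

```python
import random


def permute_labels(support, rng):
    """Randomly re-assign label indices to the classes in a support set.

    support is a list of (image_id, class_name) pairs (hypothetical layout).
    Returns the relabelled support set and the class-to-index mapping.
    """
    classes = sorted({class_name for _, class_name in support})
    indices = list(range(len(classes)))
    rng.shuffle(indices)  # new random labelling for this training step
    mapping = dict(zip(classes, indices))
    relabelled = [(image_id, mapping[c]) for image_id, c in support]
    return relabelled, mapping


rng = random.Random(0)
support = [("img0", "cat"), ("img1", "dog"), ("img2", "cat")]
relabelled, mapping = permute_labels(support, rng)
print(relabelled, mapping)
```

Because the index attached to a class is resampled at every step, the network cannot memorise a fixed class-to-label mapping and has to rely on the support set.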

The objective function used is:

\begin{align*}
L=E_{C\sim T}\biggr[E_{S \sim C, B \sim C}\bigr[\sum_{(x,y)\in B}\log P(y|x,S)\bigr]\biggr]
\end{align*}

where T is the set of all possible labelings of our classes, S is a support set sampled with a class labeling C ~ T and B is a batch of probe images and labels, also with the same randomly chosen class labeling as the support set.

== Cognitive Biases ==
Cognitive bias is a concept from developmental psychology which attempts to explain how children can extract meanings of words with very few examples, similar to the concept of one-shot learning discussed above. The theory, as explained by the authors, is that humans form biases that allow them to eliminate many potential hypotheses about word meaning where the amount of data available is insufficient for this purpose. These include:
* Whole object bias
* Taxonomic bias
* Mutual exclusivity bias
* Shape bias
A more complete list of cognitive biases is given by [[#References|(Bloom, 2000)]]. The bias the authors investigate in this paper is the shape bias, which denotes a tendency to assign the same name to similarly shaped items rather than to items with similar color, texture, or size.

= Methodology =
== Inductive Biases & Probe Data ==

Inductive biases are those criteria which are artificially selected or learnt by the network as a classifying/distinguishing property.
It has been observed that the biases that DNNs learn are complex composite features. Researchers can take advantage of this by constructing probe data sets that specifically target and expose a particular bias that a DNN might have.

* Step 1: Take a known composite feature towards which we suspect the DNN is biased
* Step 2: Train the target model with an appropriate dataset
* Step 3: Transfer Learning: Use the pre-trained model with a new data set which is curated to contain data to prove/disprove the existence of the bias
* Step 4: Model/Decide on a function which quantifies the bias under study
* Step 5: Measure the bias with the bias function

== Data Sets Used ==

* Training Set: ImageNet
* Test Set:
** The Cognitive Psychology Probe Data (CogPsyc data) that is used consists of 150 images of objects. The images are arranged in triples consisting of a probe image, a shape-match image (that matches the probe in shape but not colour), and a colour-match image (that matches the probe in colour but not shape). In the dataset there are 10 triples, each shown on 5 different backgrounds, giving a total of 50 triples. [[File:CogPsy.PNG|center|350px]]
** A real-world dataset consisting of 90 images of objects (30 triples) collected using Google Image Search. The images are arranged in triples consisting of a probe, a shape-match and a colour-match.

= Experiments =
== Evaluation Criteria ==

* For a given probe image $\hat{x}$, load the shape-match image $x_s$ and corresponding label $y_s$, along with the colour-match image $x_c$ and corresponding label $y_c$, into memory as the support set $S = \{(x_s, y_s), (x_c, y_c)\}$.
* Calculate $\hat{y}$.
* The model assigns either $y_c$ or $y_s$ to the probe image.
* To estimate the shape bias $B_s$, calculate the proportion of shape labels assigned to the probe: $B_s = E(\delta(\hat{y} - y_s))$, where $E$ is an expectation across probe images and $\delta$ is the Dirac delta function.
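
In practice the expectation is an average over probe images, so $B_s$ is simply the fraction of probes that received the shape-match label. A minimal sketch, with made-up predictions:

```python
def shape_bias(predictions):
    """Fraction of probes whose predicted label equals the shape-match label.

    predictions is a list of (y_hat, y_s) pairs, one per probe image.
    """
    matches = [1 if y_hat == y_s else 0 for y_hat, y_s in predictions]
    return sum(matches) / len(matches)


# Four hypothetical probes: three labelled by shape, one by colour.
example = [("s1", "s1"), ("c2", "s2"), ("s3", "s3"), ("s4", "s4")]
print(shape_bias(example))
```

A value of 0.5 would indicate no preference between shape and colour, and a value of 1 means every probe was labelled by shape.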

== Experiment 1: Shape bias statistics in Inception Baseline: ==
* The shape bias of IB on the CogPsyc dataset was found to be $B_s = 0.68$. Similarly, the shape bias of IB on the real-world dataset was $B_s = 0.97$. Together, these results strongly suggest that IB trained on ImageNet has a stronger bias towards shape than colour.

== Experiment 2: Shape bias statistics in Matching Network: ==
* They found that MNs have a shape bias of $B_s = 0.7$ using the CogPsyc dataset and a bias of $B_s = 1$ using the real-world dataset. Once again, these results suggest that MNs seeded from Inception and trained on ImageNet have a stronger bias towards shape than colour.

== Experiment 3: Shape bias statistics between and across models: ==

The authors extended the shape bias analysis to calculate the shape bias in a population of IB models and in a population of MN models with different random initializations.

=== Dependence on the initialization of parameters: ===

[[File:3.1.PNG|right|250px]] Strong variability in the shape bias was observed when the initial values of the parameters were varied. For the CogPsyc dataset, the average shape bias was $B_s = 0.628$ with standard deviation $\sigma_{B_s} = 0.049$ at the end of training, and for the real-world dataset the average shape bias was $B_s = 0.958$ with $\sigma_{B_s} = 0.037$.

=== Dependence of shape bias on model performance: ===

For the CogPsyc dataset, the correlation between bias and classification accuracy was $\rho = 0.15$, and for the real-world dataset it was $\rho = -0.06$. This is consistent with the observation that the accuracy of the models remained nearly constant as the initialization varied, whereas the shape bias varied considerably, highlighting the lack of correlation between the two.
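
Population statistics of this kind can be computed as sketched below. The sketch assumes the population standard deviation and the Pearson correlation coefficient; the summary does not say which estimators the authors used, and the numbers in the example are invented.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    """Population standard deviation (an assumption; could also be sample)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson(xs, ys):
    """Pearson correlation; assumes neither sequence is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented values for five differently initialised models.
bias = [0.60, 0.66, 0.58, 0.70, 0.62]
accuracy = [0.871, 0.870, 0.872, 0.871, 0.870]
print(mean(bias), std(bias), pearson(bias, accuracy))
```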

=== Emergence of shape bias during training: ===
The shape bias spiked to a large value very early.

=== Variation of shape bias within models & across models: ===
With different initialization parameters, the shape bias varied a lot within IB during training, while the shape bias did not fluctuate during the training of MN. It was found that the MN inherits the shape bias of the IB which seeded its embeddings and that, thereafter, the shape bias remained constant throughout training. It is important to note that the output of the penultimate layer of Inception was not fine-tuned when it was fed to the MN. This was to ensure that the MN properties were independent of the IB model properties. [[File:3.3.PNG|center|250px]] [[File:3.4.PNG|center|250px]]

= Learnings, Inferences & Implications =
* Both the Inception Baseline and the Matching Network exhibit a strong shape bias when trained on ImageNet. Researchers who use Inception and MN models can take this into account when applying pre-trained models to new datasets. If it is known beforehand that the new data set is best classified by colour, they may want to either refrain from using the pre-trained models or explore methods to decrease/remove the strong shape bias.

* There is high variability in the shape bias across different initialization parameters. This is an important finding, since it shows that the same architecture, with similar prediction accuracy, can display a range of shape biases just from different initialization parameters. Researchers can explore methods of tuning the random initialization such that the models start out with a low shape bias without compromising the accuracy of the model.

* MNs inherit the shape bias of the Inception network that provides their input embeddings. This is another fact about which researchers and practitioners should be careful. When using cascaded or pipelined heterogeneous architectures, the models downstream tend to inherit the properties/biases of the models upstream. This may be desirable or undesirable depending on the application, but it is important to be aware of its presence.

* The biases under consideration are the property of the collection of the architecture, the dataset and the optimization procedure. Hence in order to increase or decrease the effect of a particular bias, one or more of the mentioned factors must be adjusted/tuned/changed.

* The fact that a high shape bias emerged in the early epochs, with less variability in later epochs, can be thought of as analogous to the biases that humans develop in infancy and that become fortified as they age.

= Conclusion, Future Work and Open questions =

* Just as cognitive psychology exposes the shape bias observed in this experiment, we should try to uncover other biases as well using multiple approaches
* Study the underlying mechanisms which cause biases such as shape bias in DNNs
* Research into various methods of probing and creating probe data sets which can be used to test architectures for various biases
* Exploration into a research field called Artificial Cognitive Psychology which focuses on probing how DNN architectures can be understood further using known behaviors of the human brain

= References =

* Ritter, S., Barrett, D. G. T., Santoro, A., & Botvinick, M. M. (2017). Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study.

* Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

* Bloom, P. (2000). How children learn the meanings of words. The MIT Press.

* https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100

* https://deepmind.com/blog/cognitive-psychology/

* https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/

STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:46:24Z

A2prasad: /* Matching Networks */

= Introduction =

The recent burgeon on the use of Deep Neural Networks (DNNs) have resulted in giant leaps of accuracy in prediction. They are also being used to solve a variety of complex tasks which earlier methodologies have struggled to excel in.

While it is all good to see incredibly high accuracy as a result of the use of DNN, we must begin to question why they perform so well. It has become an interesting field of study to actually represent the features/feature maps or interpret the meaning of the learnt values in a DNN's hidden layers. Currently we treat models of DNNs as black boxes which we practically tune the tweakable parameters like number of layers, number of units in each layer, number & size of feature maps(in case of CNN) etc. The opacity created by the lack of an intuitive representation of the internal learnt parameters of DNNs hinders both basic research as well as its application to real world problems.

Recent pushes have aimed to better understand DNNs: tailor-made loss functions and architectures produce more interpretable features (Higgins et al., 2016; Raposo et al., 2017) while output-behavior analyses unveil previously opaque operations of these networks (Karpathy et al., 2015). Parallel to this work, neuroscience-inspired methods such as activation visualization (Li et al., 2015), ablation analysis (Zeiler & Fergus, 2014) and activation maximization (Yosinski et al., 2015) have also been applied.

This paper provides another methodology for deciphering and better understanding how DNNs solve a particular task. The methodology is inspired by cognitive psychology and tests whether DNNs make their predictions with biases similar to those the human mind exhibits.

Research in developmental psychology shows that when learning new words, humans tend to assign the same name to similarly shaped items rather than to items with similar color, texture, or size. This bias tends to become ingrained, and humans then carry it forward to readily associate familiar shapes with new objects they have not seen before.

The authors test whether DNNs behave similarly in one-shot learning applications. They set out to show that when state-of-the-art DNNs are used to learn objects from images, they exhibit a stronger shape bias than color bias. To emulate the human learner, they take the parameters of pre-trained DNN models and use them to perform one-shot learning on a new data set with different labels.

= Background =
== One Shot Learning ==
One-shot learning is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

The one-shot word learning task is to label a novel data example $\hat{x}$ (e.g. a novel probe image) with a novel class label $\hat{y}$ (e.g. a new word) after only a single example.

More specifically, given a support set $S = \{(x_i, y_i) , i \in [1, k]\}$, of images $x_i$, and their associated labels $y_i$, and an unlabeled probe image $\hat{x}$,
the one-shot learning task is to identify the true label of the probe image, $\hat{y}$, from the support set labels $ \{y_i , i \in [1, k]\} $:

$\displaystyle \hat{y} = \arg\max_{y} P(y | \hat{x}, S)$

We assume that the image labels $y_i$ are represented using a one-hot encoding and that $P(y|\hat{x}, S)$ is parameterised by a DNN, allowing us to leverage the ability of deep networks to learn powerful representations.

== Inception Networks ==

A probe image $\hat{x}$ is given the label of the nearest neighbour from the
support set:

$\hat{y} = y$

$(x, y) = \displaystyle \arg\min_{(x_i,y_i) \in S} d(h(x_i), h(\hat{x})) $

where d is a distance function.

The function $h$ is parameterized by Inception, one of the best-performing ImageNet classification models. Specifically, $h$ returns features from the last layer (the softmax input) of a pre-trained Inception classifier. With these features as input and cosine distance as the distance function, the classifier achieves 87.6% accuracy on one-shot classification on the ImageNet dataset (Vinyals et al., 2016). We call the Inception classifier together with the nearest-neighbor component the Inception Baseline (IB) model.
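
The nearest-neighbour rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature vectors are made-up stand-ins for $h(x)$ (in the paper they come from a pre-trained Inception network), and the names <code>cosine_distance</code> and <code>nearest_neighbour_label</code> are hypothetical helpers, not code from the authors.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbour_label(probe_features, support):
    """Return the label of the support item whose features are closest to the probe.

    `support` is a list of (features, label) pairs, i.e. (h(x_i), y_i).
    """
    _, label = min(support, key=lambda item: cosine_distance(item[0], probe_features))
    return label

# Toy features standing in for h(x); real features would come from Inception.
support = [([1.0, 0.0, 0.2], "dax"), ([0.0, 1.0, 0.1], "wug")]
probe = [0.9, 0.1, 0.3]
print(nearest_neighbour_label(probe, support))  # prints "dax"
```

The probe is assigned "dax" because its feature vector points in almost the same direction as the first support item.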

== Matching Networks ==

MNs (Vinyals et al., 2016) are neural network architectures with state-of-the-art one-shot learning performance on ImageNet (93.2% one-shot labelling accuracy).
MNs are trained to assign label $\hat{y}$ to probe image $\hat{x}$ using an attention mechanism $a$ acting on image embeddings stored in the support set $S$:
\begin{align*}
a(\hat{x},x_i)=\frac{e^{d(f(\hat{x},S),g(x_i,S))}}{\sum_{j}e^{d(f(\hat{x},S),g(x_j,S))}},
\end{align*}

[[File:MN1.PNG|centre|650px]]

where $d$ is a cosine distance and where $f$ and $g$ provide context-dependent embeddings of $\hat{x}$ and $x_i$ (with context $S$). The embedding $g(x_i, S)$ is a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) with the support set $S$ provided as an input sequence. The embedding $f(\hat{x}, S)$ is an LSTM with a read-attention mechanism operating over the entire embedded support set. The input to the LSTM is given by the penultimate layer features of a pre-trained deep convolutional network, specifically Inception.
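
As a rough sketch of the attention step, the snippet below computes the softmax weights $a(\hat{x}, x_i)$ and uses them to pick a label. Several things here are assumptions made for illustration: the embeddings are plain lists standing in for $f(\hat{x}, S)$ and $g(x_i, S)$ (no LSTMs are involved), the score inside the exponent is taken to be cosine similarity so that closer items receive more weight, and the prediction is formed as the attention-weighted sum of support labels as in Vinyals et al. (2016). The function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def attention_weights(probe_emb, support_embs):
    """Softmax over the scores between the probe embedding and each support embedding."""
    scores = [cosine_similarity(probe_emb, g) for g in support_embs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(probe_emb, support_embs, support_labels):
    """Accumulate attention mass per label and return the label with the most mass."""
    mass = {}
    for weight, label in zip(attention_weights(probe_emb, support_embs), support_labels):
        mass[label] = mass.get(label, 0.0) + weight
    return max(mass, key=mass.get)

# Made-up two-dimensional embeddings for a two-item support set.
support_embs = [[1.0, 0.0], [0.0, 1.0]]
support_labels = ["dax", "wug"]
probe_emb = [0.8, 0.2]
weights = attention_weights(probe_emb, support_embs)
print(weights, predict_label(probe_emb, support_embs, support_labels))
```

The weights always sum to one, so they can be read as a distribution over the support set.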

To train MNs we proceed as follows:

=== Training MN ===
* Step 1: At each step of training, the model is given a small support set of images and associated labels. In addition to the support set, the model is fed an unlabeled probe image $\hat{x}$.

* Step 2: The model parameters are then updated to improve classification accuracy of the probe image $\hat{x}$ given the support set. Parameters are updated using stochastic gradient descent with a learning rate of 0.1.

* Step 3: After each update, the labels $\{y_i, i \in [1, k]\}$ in the training set are randomly re-assigned to new image classes (the label indices are randomly permuted, but the image labels are not changed). This is a critical step: it prevents MNs from learning a consistent mapping between a category and a label. In ordinary classification such a mapping is exactly what we want, but in one-shot learning we want the model to classify after viewing a single in-class example from the support set.
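
The label re-assignment in Step 3 can be illustrated with a small episode sampler. This is a minimal sketch, not the authors' training code: the function name <code>sample_episode</code>, the dictionary layout of the data and the use of one support image per class are all assumptions made for the example.

```python
import random

def sample_episode(images_by_class, k, rng):
    """Sample a k-way one-shot episode with freshly permuted label indices.

    `images_by_class` maps a class name to a list of at least two image ids.
    Returns (support, probe): support is a list of (image, label_index) pairs
    and probe is an (image, label_index) pair drawn from one support class.
    """
    classes = rng.sample(sorted(images_by_class), k)
    label_indices = list(range(k))
    rng.shuffle(label_indices)  # a new class-to-index assignment every episode
    class_to_index = dict(zip(classes, label_indices))

    support = []
    held_out = {}
    for c in classes:
        # Two distinct images per class: one for the support set, one as a candidate probe.
        support_img, probe_img = rng.sample(images_by_class[c], 2)
        support.append((support_img, class_to_index[c]))
        held_out[c] = probe_img

    probe_class = rng.choice(classes)
    probe = (held_out[probe_class], class_to_index[probe_class])
    return support, probe

# Hypothetical toy data: three classes with three image ids each.
data = {
    "cat": ["cat_1", "cat_2", "cat_3"],
    "dog": ["dog_1", "dog_2", "dog_3"],
    "cup": ["cup_1", "cup_2", "cup_3"],
}
support_set, probe_item = sample_episode(data, 3, random.Random(0))
print(support_set, probe_item)
```

Because the index attached to a class changes from episode to episode, the only way to label the probe correctly is to compare it with the support images, which is the behaviour the training procedure is meant to encourage.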

The objective function used is:

\begin{align*}
L=E_{C\sim T}\biggl[E_{S \sim C, B \sim C}\bigl[\sum_{(x,y)\in B}\log P(y|x,S)\bigr]\biggr]
\end{align*}

where T is the set of all possible labelings of our classes, S is a support set sampled with a class labeling C ~ T and B is a batch of probe images and labels, also with the same randomly chosen class labeling as the support set.
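
In practice the two expectations in the objective are approximated by averaging over sampled episodes. The sketch below shows that Monte Carlo estimate under simplifying assumptions: each episode is represented only by the probabilities $P(y|x,S)$ that the model assigned to the correct label of each probe in the batch $B$, and the function names are hypothetical.

```python
import math

def batch_log_likelihood(batch_probs):
    """Sum of log P(y | x, S) over one batch B.

    `batch_probs` holds, for each probe in B, the probability the model
    assigned to that probe's correct label.
    """
    return sum(math.log(p) for p in batch_probs)

def estimate_objective(episodes):
    """Average the batch log-likelihood over the sampled episodes."""
    return sum(batch_log_likelihood(batch) for batch in episodes) / len(episodes)

# Two made-up episodes with two probes each.
episodes = [[0.9, 0.8], [0.6, 0.7]]
print(estimate_objective(episodes))
```

Training would then adjust the model parameters to increase this quantity.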

== Cognitive Biases ==
Cognitive bias is a concept from developmental psychology which attempts to explain how children can extract meanings of words with very few examples, similar to the concept of one-shot learning discussed above. The theory, as explained by the authors, is that humans form biases that allow them to eliminate many potential hypotheses about word meaning where the amount of data available is insufficient for this purpose. These include:
* Whole object bias
* Taxonomic bias
* Mutual exclusivity bias
* Shape bias
A more complete list of cognitive biases is given by [[#References|(Bloom, 2000)]]. The bias the authors investigate in this paper is the shape bias, which denotes a tendency to assign the same name to similarly shaped items rather than to items with similar color, texture, or size.

= Methodology =
== Inductive Biases & Probe Data ==

Inductive biases are the criteria, either built in by design or learnt by the network, that it uses as classifying or distinguishing properties.
It has been observed that the biases DNNs learn are complex composite features. As researchers, we can take advantage of this by constructing probe data sets that specifically target and expose a particular bias a DNN might have.

* Step 1: Take a known composite feature towards which we suspect the DNN is biased
* Step 2: Train the target model with an appropriate dataset
* Step 3: Transfer Learning: Use the pre-trained model with a new data set which is curated to contain data to prove/disprove the existence of the bias
* Step 4: Model/Decide on a function which quantifies the bias under study
* Step 5: Measure the bias with the bias function

== Data Sets Used ==

* Training Set: ImageNet
* Test Set:
** The Cognitive Psychology Probe Data (CogPsyc data) that is used consists of 150 images of objects. The images are arranged in triples consisting of a probe image, a shape-match image (that matches the probe in shape but not colour), and a colour-match image (that matches the probe in colour but not shape). In the dataset there are 10 triples, each shown on 5 different backgrounds, giving a total of 50 triples. [[File:CogPsy.PNG|center|350px]]
** A real-world dataset consisting of 90 images of objects (30 triples) collected using Google Image Search. The images are arranged in triples consisting of a probe, a shape-match and a colour-match.

= Experiments =
== Evaluation Criteria ==

* For a given probe image $\hat{x}$, we loaded the shape-match image $x_s$ and corresponding label $y_s$, along with the colour-match image $x_c$ and corresponding label $y_c$ into memory, as the support set $S = \{(x_s, y_s), (x_c, y_c)\}$
* Calculate $\hat{y}$
* The model assigns either $y_c$ or $y_s$ to the probe image.
* To estimate the shape bias $B_s$, calculate the proportion of shape labels assigned to the probe: $B_s = E(\delta(\hat{y} - y_s))$
where $E$ is an expectation across probe images and $\delta$ is the Dirac delta function.
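
The evaluation steps above amount to counting how often the model picks the shape-match label. A minimal sketch, assuming the predictions have already been computed and that labels can be compared for equality (so $\delta$ acts as an indicator of $\hat{y} = y_s$); the function name <code>shape_bias</code> and the example labels are hypothetical:

```python
def shape_bias(predictions, shape_labels):
    """Fraction of probes whose predicted label equals the shape-match label."""
    if len(predictions) != len(shape_labels):
        raise ValueError("one prediction per probe is required")
    matches = sum(1 for y_hat, y_s in zip(predictions, shape_labels) if y_hat == y_s)
    return matches / len(predictions)

# Four made-up probes: the model chose the shape label for three of them.
predictions = ["s", "s", "c", "s"]
shape_labels = ["s", "s", "s", "s"]
print(shape_bias(predictions, shape_labels))  # prints 0.75
```

A value of 0.5 would mean no preference between shape and colour, while values near 1 indicate a strong shape bias.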

== Experiment 1: Shape bias statistics in Inception Baseline: ==
* Using the CogPsyc dataset, the authors found the shape bias of IB to be $B_s = 0.68$. Similarly, the shape bias of IB using the real-world dataset was $B_s = 0.97$. Together, these results strongly suggest that IB trained on ImageNet has a stronger bias towards shape than colour.

== Experiment 2: Shape bias statistics in Matching Network: ==
* The authors found that MNs have a shape bias of $B_s = 0.7$ using the CogPsyc dataset and a bias of $B_s = 1$ using the real-world dataset. Once again, these results suggest that MNs seeded from Inception and trained on ImageNet have a stronger bias towards shape than colour.

== Experiment 3: Shape bias statistics between and across models: ==

The authors extended the shape bias analysis to calculate the shape bias in a population of IB models and in a population of MN models with different random initializations.

=== Dependence on the initialization of parameters: ===

[[File:3.1.PNG|right|250px]] Strong variability in shape bias was observed as the initial values of the parameters were varied. For the CogPsyc dataset, the average shape bias was $B_s = 0.628$ with standard deviation $\sigma_{B_s} = 0.049$ at the end of training, and for the real-world dataset the average shape bias was $B_s = 0.958$ with $\sigma_{B_s} = 0.037$.

=== Dependence of shape bias on model performance: ===

For the CogPsyc dataset, the correlation between bias and classification accuracy was $\rho = 0.15$, and for the real-world dataset it was $\rho = -0.06$. This is to be expected: the accuracy of the models remained nearly constant as the initialization varied, whereas the shape bias varied considerably, so the two are essentially uncorrelated.
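
The population statistics reported in this experiment (mean, standard deviation and correlation with accuracy) can be computed as in the sketch below. The numbers are invented for five hypothetical random seeds and are not results from the paper; the choice of the sample standard deviation and the helper name <code>pearson</code> are likewise assumptions.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for five hypothetical seeds, NOT results from the paper.
biases = [0.60, 0.66, 0.58, 0.70, 0.62]
accuracies = [0.871, 0.869, 0.872, 0.870, 0.868]

print("mean shape bias:", statistics.mean(biases))
print("std of shape bias:", statistics.stdev(biases))
print("correlation with accuracy:", pearson(biases, accuracies))
```

A correlation close to zero, as reported above, means that knowing a model's accuracy says little about how strongly it prefers shape over colour.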

=== Emergence of shape bias during training: ===
The shape bias spiked to a large value very early.

=== Variation of shape bias within models & across models: ===
With different initialization parameters, the shape bias varied considerably within IB during training, while the shape bias did not fluctuate during the training of MN. It was found that the MN inherits the shape bias of the IB which seeded its embeddings and that, thereafter, the shape bias remained constant throughout training. It is important to note that the output of the penultimate layer of Inception was not fine-tuned when it was pipelined to the MN. This was to ensure that the MN properties were independent of the IB model properties. [[File:3.3.PNG|center|250px]] [[File:3.4.PNG|center|250px]]

= Learnings, Inferences & Implications =
* Both the Inception Baseline and the Matching Network exhibit a strong shape bias when trained on ImageNet. Researchers who use Inception and MN models can take this into account when applying pre-trained models to new datasets. If it is known beforehand that the new data set is best classified by colour, they may want to avoid the pre-trained models or explore methods to reduce or remove the strong shape bias.

* The shape bias varies considerably with the initialization parameters. This is an important finding: instances of the same architecture that achieve similar prediction accuracy can display quite different shape biases merely because of different initializations. Researchers can explore methods of tuning the random initialization so that models start out with a low shape bias without compromising accuracy.

* MNs inherit the shape bias seeded by the Inception network's input embedding. Researchers and practitioners should be careful about this as well: in cascaded or pipelined heterogeneous architectures, downstream models tend to inherit the properties and biases of the upstream models. This may be desirable or undesirable depending on the application, but it is important to be aware of it.

* The biases under consideration are a property of the combination of architecture, dataset and optimization procedure. Hence, to increase or decrease the effect of a particular bias, one or more of these factors must be adjusted.

* The fact that a high shape bias emerged in the early epochs, with less variability in later epochs, can be thought of as analogous to the biases that humans develop in infancy and that become fortified as they age.

= Conclusion, Future Work and Open questions =

* Just as cognitive psychology exposes the shape bias observed in this experiment, we should try to uncover other biases as well using multiple approaches
* Study the underlying mechanisms which cause biases such as shape bias in DNNs
* Research into various methods of probing and creating probe data sets which can be used to test architectures for various biases
* Exploration into a research field called Artificial Cognitive Psychology which focuses on probing how DNN architectures can be understood further using known behaviors of the human brain

= References =

* Ritter, Samuel, Barrett, David G. T., Santoro, Adam, and Botvinick, Matt M. Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study. 2017.

* Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

* Bloom, P. (2000). How children learn the meanings of words. The MIT Press.

* https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100

* https://deepmind.com/blog/cognitive-psychology/

* https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/

STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:45:56Z

A2prasad: /* Matching Networks */


STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:42:06Z

A2prasad: /* Training MN */

= Introduction =

The recent burgeon on the use of Deep Neural Networks (DNNs) have resulted in giant leaps of accuracy in prediction. They are also being used to solve a variety of complex tasks which earlier methodologies have struggled to excel in.

While it is all good to see incredibly high accuracy as a result of the use of DNN, we must begin to question why they perform so well. It has become an interesting field of study to actually represent the features/feature maps or interpret the meaning of the learnt values in a DNN's hidden layers. Currently we treat models of DNNs as black boxes which we practically tune the tweakable parameters like number of layers, number of units in each layer, number & size of feature maps(in case of CNN) etc. The opacity created by the lack of an intuitive representation of the internal learnt parameters of DNNs hinders both basic research as well as its application to real world problems.

Recent pushes have aimed to better understand DNNs: tailor-made loss functions and architectures produce more interpretable features (Higgins et al., 2016; Raposo et al., 2017) while output-behavior analyses unveil previously opaque operations of these networks (Karpathy et al., 2015). Parallel to this work, neuroscience-inspired methods such as activation visualization (Li et al., 2015), ablation analysis (Zeiler & Fergus, 2014) and activation maximization (Yosinski et al., 2015) have also been applied

This paper provides another methodology for understanding how DNNs solve a particular task. The methodology is inspired by cognitive psychology: it tests whether DNNs make predictions with biases similar to those of the human mind.

Research in developmental psychology shows that when learning new words, humans tend to assign the same name to similarly shaped items rather than to items with similar color, texture, or size. This bias tends to become ingrained, and humans carry it forward to easily associate shapes with new objects they have not seen before.

The authors of this paper test whether DNNs behave similarly in one-shot learning applications. They aim to show that when state-of-the-art DNNs are used to learn objects from images, they exhibit a stronger shape bias than color bias. To emulate the human setting, they take the parameters of pre-trained DNN models and use them to perform one-shot learning on a new data set with different labels.

= Background =
== One Shot Learning ==
One-shot learning is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

The one-shot word learning task is to label a novel data example $\hat{x}$ (e.g. a novel probe image) with a novel class label $\hat{y}$ (e.g. a new word) after only a single example.

More specifically, given a support set $S = \{(x_i, y_i) , i \in [1, k]\}$ of images $x_i$ and their associated labels $y_i$, and an unlabeled probe image $\hat{x}$,
the one-shot learning task is to identify the true label of the probe image, $\hat{y}$, from the support set labels $\{y_i , i \in [1, k]\}$:

$\displaystyle \hat{y} = \arg\max_{y} P(y | \hat{x}, S)$

We assume that the image labels $y_i$ are represented using a one-hot encoding and that $P(y|\hat{x}, S)$ is parameterised by a DNN, allowing us to leverage the ability of deep networks to learn powerful representations.

== Inception Networks ==

A probe image $\hat{x}$ is given the label of the nearest neighbour from the
support set:

$\hat{y} = y$

$\displaystyle (x, y) = \arg\min_{(x_i,y_i) \in S} d(h(x_i), h(\hat{x}))$

where d is a distance function.

The function $h$ is parameterized by Inception, one of the best performing ImageNet classification models. Specifically, $h$ returns features from the last layer (the softmax input) of a pre-trained Inception classifier. With these features as input and cosine distance as the distance function, the classifier achieves 87.6% accuracy on one-shot classification on the ImageNet dataset (Vinyals et al., 2016). We call the Inception classifier together with the nearest-neighbor component the Inception Baseline (IB) model.
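The nearest-neighbour rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the short feature vectors stand in for the Inception features $h(x)$, and the names <code>cosine_distance</code>, <code>nearest_neighbour_label</code> and the labels "dax" and "wug" are hypothetical, not taken from the paper.

```python
import math

def cosine_distance(u, v):
    """Cosine distance d(u, v) = 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbour_label(probe_features, support):
    """Return the label of the support item closest to the probe.

    `support` is a list of (features, label) pairs; the features are
    placeholders for h(x_i), which in the IB model come from Inception.
    """
    _, label = min(support, key=lambda pair: cosine_distance(pair[0], probe_features))
    return label

# Toy support set with two made-up feature vectors and made-up labels.
support = [([1.0, 0.0, 0.2], "dax"), ([0.0, 1.0, 0.1], "wug")]
probe = [0.9, 0.1, 0.3]
predicted = nearest_neighbour_label(probe, support)
```

In this toy case the probe points in nearly the same direction as the first support vector, so it receives that item's label.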

== Matching Networks ==

Matching Networks (MNs) (Vinyals et al., 2016) are neural network architectures with state-of-the-art one-shot learning performance on ImageNet (93.2% one-shot labelling accuracy).
MNs are trained to assign label $\hat{y}$ to probe image $\hat{x}$ using an attention mechanism $a$ acting on image embeddings stored in the support set $S$:

[[File:MN1.PNG|centre|650px]]

where $d$ is a cosine distance and where $f$ and $g$ provide context-dependent embeddings of $\hat{x}$ and $x_i$ (with context $S$). The embedding $g(x_i, S)$ is a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) with the support set $S$ provided as an input sequence. The embedding $f(\hat{x}, S)$ is an LSTM with a read-attention mechanism operating over the entire embedded support set. The input to the LSTM is given by the penultimate layer features of a pre-trained deep convolutional network, specifically Inception.
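A rough sketch of the attention step follows. It assumes, following the usual description of Matching Networks in Vinyals et al. (2016), that the attention weights are a softmax over cosine similarities and that the predicted label distribution is the attention-weighted sum of the one-hot support labels. The fixed vectors below are hypothetical stand-ins for the LSTM embeddings $f(\hat{x}, S)$ and $g(x_i, S)$, which are not implemented here.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attention_weights(probe_emb, support_embs):
    """Softmax over cosine similarities between the probe and each support item."""
    sims = [cosine_similarity(probe_emb, emb) for emb in support_embs]
    max_sim = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_sim) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def label_distribution(probe_emb, support_embs, support_onehots):
    """Attention-weighted sum of one-hot support labels."""
    weights = attention_weights(probe_emb, support_embs)
    num_classes = len(support_onehots[0])
    return [
        sum(w * onehot[c] for w, onehot in zip(weights, support_onehots))
        for c in range(num_classes)
    ]

# Hypothetical embeddings; in the paper these come from f and g.
support_embs = [[1.0, 0.0], [0.0, 1.0]]
support_onehots = [[1, 0], [0, 1]]
probs = label_distribution([0.8, 0.2], support_embs, support_onehots)
```

Because the weights come from a softmax, the output is a valid probability distribution over the support labels, and the predicted label is its arg max.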

To train MNs we proceed as follows:

=== Training MN ===
* Step 1: At each step of training, the model is given a small support set of images and associated labels. In addition to the support set, the model is fed an unlabeled probe image $\hat{x}$.

* Step 2: The model parameters are then updated to improve classification accuracy of the probe image $\hat{x}$ given the support set. Parameters are updated using stochastic gradient descent with a learning rate of 0.1.

* Step 3: After each update, the labels $\{y_i, i \in [1, k]\}$ in the training set are randomly re-assigned to new image classes (the label indices are randomly permuted,
but the image labels are not changed). This is a critical step. It prevents MNs from learning a consistent mapping between a category and a label. In ordinary classification such a consistent mapping is exactly what we want, but in one-shot learning the model must learn to classify after viewing a single in-class example from the support set.
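The label re-assignment in Step 3 can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name <code>sample_episode</code>, the toy image identifiers and the choice of one support image and one probe image per class are all assumptions made for the example.

```python
import random

def sample_episode(images_by_class, k, rng):
    """Draw one training episode with freshly permuted label indices.

    Each class keeps its images, but the integer label index attached to
    the class is re-drawn every episode, so the model cannot memorise a
    fixed class-to-index mapping.
    """
    classes = rng.sample(sorted(images_by_class), k)
    label_ids = list(range(k))
    rng.shuffle(label_ids)  # random permutation of the label indices
    assignment = dict(zip(classes, label_ids))

    support, probes = [], []
    for cls in classes:
        # Two distinct images per class: one for the support set, one as a probe.
        support_img, probe_img = rng.sample(images_by_class[cls], 2)
        support.append((support_img, assignment[cls]))
        probes.append((probe_img, assignment[cls]))
    return support, probes

# Hypothetical toy data: image identifiers grouped by class name.
toy_data = {
    "cat": ["cat_0", "cat_1", "cat_2"],
    "dog": ["dog_0", "dog_1", "dog_2"],
    "cup": ["cup_0", "cup_1", "cup_2"],
}
rng = random.Random(0)
support, probes = sample_episode(toy_data, k=3, rng=rng)
```

Within an episode the support set and the probes share the same class-to-index assignment, which is what allows a probe to be classified from the support set alone.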

The objective function used is:

\begin{align*}
L=E_{C\sim T}\biggr[E_{S \sim C, B \sim C}\bigr[\sum_{(x,y)\in B}\log P(y|x,S)\bigr]\biggr]
\end{align*}

where T is the set of all possible labelings of our classes, S is a support set sampled with a class labeling C ~ T and B is a batch of probe images and labels, also with the same randomly chosen class labeling as the support set.
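As a small worked illustration of this objective, the inner sum can be computed from the model's predicted label distributions, and the outer expectations can be approximated by averaging over sampled episodes. The sketch below assumes the predicted distributions are already available as lists of probabilities; the function names and the toy numbers are hypothetical.

```python
import math

def episode_log_likelihood(predicted_distributions, true_labels):
    """Sum of log P(y | x, S) over the probe batch B of one episode."""
    return sum(
        math.log(dist[label])
        for dist, label in zip(predicted_distributions, true_labels)
    )

def estimate_objective(episodes):
    """Monte Carlo estimate of L: average the per-episode sums."""
    totals = [episode_log_likelihood(dists, labels) for dists, labels in episodes]
    return sum(totals) / len(totals)

# Two toy episodes, each with a single probe and two candidate labels.
episodes = [
    ([[0.5, 0.5]], [0]),  # model is undecided about the correct label
    ([[1.0, 0.0]], [0]),  # model is certain and correct
]
objective = estimate_objective(episodes)
```

Training maximises this quantity, so the estimate moves towards zero as the model assigns higher probability to the correct labels.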

== Cognitive Biases ==
Cognitive bias is a concept from developmental psychology which attempts to explain how children can extract meanings of words with very few examples, similar to the concept of one-shot learning discussed above. The theory, as explained by the authors, is that humans form biases that allow them to eliminate many potential hypotheses about word meaning where the amount of data available is insufficient for this purpose. These include:
* Whole object bias
* Taxonomic bias
* Mutual exclusivity bias
* Shape bias
A more complete list of cognitive biases is given by [[#References|(Bloom, 2000)]]. The bias the authors investigate in this paper is the shape bias, which denotes a tendency to assign the same name to similarly shaped items rather than to items with similar color, texture, or size.

= Methodology =
== Inductive Biases & Probe Data ==

Inductive biases are the criteria that a network is designed to use, or learns to use, as classifying or distinguishing properties.
It has been observed that the biases DNNs learn are complex composite features. Researchers can take advantage of this by constructing probe data sets that are designed to expose a particular bias that a DNN might have.

* Step 1: Take a known composite feature towards which we suspect the DNNs are biased
* Step 2: Train the target model with an appropriate dataset
* Step 3: Transfer Learning: Use the pre-trained model with a new data set which is curated to contain data to prove/disprove the existence of the bias
* Step 4: Model/Decide on a function which quantifies the bias under study
* Step 5: Measure the bias with the bias function

== Data Sets Used ==

* Training Set: ImageNet
* Test Set:
** The Cognitive Psychology Probe Data (CogPsyc data) consists of 150 images of objects. The images are arranged in triples consisting of a probe image, a shape-match image (that matches the probe in shape but not colour), and a colour-match image (that matches the probe in colour but not shape). In the dataset there are 10 triples, each shown on 5 different backgrounds, giving a total of 50 triples. [[File:CogPsy.PNG|center|350px]]
** A real-world dataset consisting of 90 images of objects (30 triples) collected using Google Image Search. The images are arranged in triples consisting of a probe, a shape-match and a colour-match.

= Experiments =
== Evaluation Criteria ==

* For a given probe image $\hat{x}$, the shape-match image $x_s$ and its label $y_s$, along with the colour-match image $x_c$ and its label $y_c$, are loaded into memory as the support set $S = \{(x_s, y_s), (x_c, y_c)\}$.
* Calculate $\hat{y}$.
* The model assigns either $y_c$ or $y_s$ to the probe image.
* To estimate the shape bias $B_s$, calculate the proportion of shape labels assigned to the probe: $B_s = E(\delta(\hat{y} - y_s))$
where $E$ is an expectation across probe images and $\delta$ is the Dirac delta function (here it acts as an indicator that equals 1 when $\hat{y} = y_s$ and 0 otherwise).
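The shape bias estimate amounts to counting how often the model picks the shape-match label. A minimal sketch, using made-up predictions rather than any results from the paper:

```python
def shape_bias(predictions, shape_labels):
    """Proportion of probes whose predicted label equals the shape-match label."""
    matches = [1 if pred == shape else 0
               for pred, shape in zip(predictions, shape_labels)]
    return sum(matches) / len(matches)

# Hypothetical outcome for 10 probes: 7 receive the shape-match label,
# 3 receive the colour-match label.
shape_labels = ["shape"] * 10
predictions = ["shape"] * 7 + ["colour"] * 3
bias = shape_bias(predictions, shape_labels)
```

A value of 0.5 would indicate no preference between shape and colour, while values near 1 indicate a strong shape bias.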

== Experiment 1: Shape bias statistics in Inception Baseline: ==
* Using the CogPsyc dataset, the shape bias of IB was found to be $B_s = 0.68$. Similarly, the shape bias of IB using the real-world dataset was $B_s = 0.97$. Together, these results strongly suggest that IB trained on ImageNet has a stronger bias towards shape than colour.

== Experiment 2: Shape bias statistics in Matching Network: ==
* The authors found that MNs have a shape bias of $B_s = 0.7$ using the CogPsyc dataset and a shape bias of $B_s = 1$ using the real-world dataset. Once again, these results suggest that MNs seeded from Inception and trained on ImageNet have a stronger bias towards shape than colour.

== Experiment 3: Shape bias statistics between and across models: ==

The authors extended the shape bias analysis to calculate the shape bias in a population of IB models and in a population of MN models with different random initializations.

=== Dependence on the initialization of parameters: ===

[[File:3.1.PNG|right|250px]] Strong variability in shape bias was observed when the initial values of the parameters were varied. For the CogPsyc dataset, the average shape bias was $B_s = 0.628$ with standard deviation $\sigma_{B_s} = 0.049$ at the end of training, and for the real-world dataset the average shape bias was $B_s = 0.958$ with $\sigma_{B_s} = 0.037$.

=== Dependence of shape bias on model performance: ===

For the CogPsyc dataset, the correlation between shape bias and classification accuracy was $\rho = 0.15$, and for the real-world dataset it was $\rho = -0.06$. These weak correlations are consistent with the observation that the accuracy of the models remained nearly constant across initializations while the shape bias varied considerably.
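The population statistics in this experiment (mean, standard deviation, and correlation with accuracy) can be computed as sketched below. The per-model numbers are invented purely for illustration and are not the paper's data; the sketch also assumes the sample standard deviation and the Pearson correlation coefficient, neither of which is specified in this summary.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical shape bias and accuracy for five differently initialised models.
shape_biases = [0.58, 0.66, 0.61, 0.70, 0.63]
accuracies = [0.871, 0.869, 0.873, 0.870, 0.872]

mean_bias = statistics.mean(shape_biases)
std_bias = statistics.stdev(shape_biases)  # sample standard deviation
rho = pearson(shape_biases, accuracies)
```

A correlation close to zero, as reported above, means that knowing a model's accuracy says little about how strong its shape bias is.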

=== Emergence of shape bias during training: ===
The shape bias spiked to a large value very early in training.

=== Variation of shape bias within models & across models: ===
With different initialization parameters, the shape bias varied considerably within IB models during training, whereas it did not fluctuate during the training of MNs. The MN inherits the shape bias of the IB model that seeded its embeddings, and thereafter the shape bias remained constant throughout training. It is important to note that the output of the penultimate layer of Inception was not fine-tuned when it was fed into the MN. This was to ensure that the MN properties were independent of the IB model properties. [[File:3.3.PNG|center|250px]] [[File:3.4.PNG|center|250px]]

= Learnings, Inferences & Implications =
* Both the Inception Baseline and the Matching Network exhibit a strong shape bias when trained on ImageNet. Researchers who use Inception and MN models can take this into account when applying pre-trained models to new datasets. If it is known beforehand that the new data set is best classified by colour, they may want to avoid the pre-trained models or explore methods to reduce or remove the strong shape bias.

* The shape bias varies strongly with the initialization parameters. This is an important finding, since it shows that models with the same architecture and similar prediction accuracy can display quite different shape biases merely because of different initializations. Researchers can explore methods of tuning the random initialization so that models start out with a low shape bias without compromising accuracy.

* MNs inherit the shape bias that is seeded by the Inception network's input embedding. Researchers and practitioners should be aware of this: when using cascaded or pipelined heterogeneous architectures, downstream models tend to inherit the properties and biases of the upstream models. This may be desirable or undesirable depending on the application, but it is important to be aware of it.

* The biases under consideration are a joint property of the architecture, the dataset and the optimization procedure. Hence, in order to increase or decrease the effect of a particular bias, one or more of these factors must be adjusted.

* The fact that a high shape bias emerged in the early epochs, with less variability in later epochs, can be seen as analogous to the biases that humans develop in infancy and that become reinforced as they age.

= Conclusion, Future Work and Open questions =

* Just as cognitive psychology exposes the shape bias observed in this experiment, we should try to uncover other biases as well using multiple approaches
* Study the underlying mechanisms which cause biases such as shape bias in DNNs
* Research into various methods of probing and creating probe data sets which can be used to test architectures for various biases
* Exploration into a research field called Artificial Cognitive Psychology which focuses on probing how DNN architectures can be understood further using known behaviors of the human brain

= References =

* Ritter, Samuel, Barrett, David G. T., Santoro, Adam, and Botvinick, Matt M. Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study. 2017.

* Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

* Bloom, P. (2000). How children learn the meanings of words. The MIT Press.

* https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100

* https://deepmind.com/blog/cognitive-psychology/

* https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/

STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:41:51Z

A2prasad: /* Training MN */


STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:41:38Z

A2prasad: /* Training MN */

= Introduction =

The recent surge in the use of Deep Neural Networks (DNNs) has produced large gains in prediction accuracy. DNNs are also being used to solve a variety of complex tasks on which earlier methods struggled.

High accuracy alone does not explain why DNNs perform so well. Interpreting the features, feature maps and learnt values in a DNN's hidden layers has therefore become an active field of study. Currently, DNNs are treated largely as black boxes whose hyperparameters are tuned empirically, such as the number of layers, the number of units in each layer, and the number and size of feature maps (in the case of CNNs). The opacity created by the lack of an intuitive representation of the internal learnt parameters hinders both basic research and the application of DNNs to real-world problems.

Recent efforts have aimed to better understand DNNs: tailor-made loss functions and architectures produce more interpretable features (Higgins et al., 2016; Raposo et al., 2017), while output-behavior analyses unveil previously opaque operations of these networks (Karpathy et al., 2015). Parallel to this work, neuroscience-inspired methods such as activation visualization (Li et al., 2015), ablation analysis (Zeiler & Fergus, 2014) and activation maximization (Yosinski et al., 2015) have also been applied.

This paper provides another methodology for understanding how DNNs solve a particular task. The methodology is inspired by cognitive psychology: it tests whether DNNs make predictions with biases similar to those of the human mind.

Research in developmental psychology shows that when learning new words, humans tend to assign the same name to similarly shaped items rather than to items with similar color, texture, or size. This bias tends to become ingrained, and humans carry it forward to easily associate shapes with new objects they have not seen before.

The authors of this paper test whether DNNs behave similarly in one-shot learning applications. They aim to show that when state-of-the-art DNNs are used to learn objects from images, they exhibit a stronger shape bias than color bias. To emulate the human setting, they take the parameters of pre-trained DNN models and use them to perform one-shot learning on a new data set with different labels.

= Background =
== One Shot Learning ==
One-shot learning is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

The one-shot word learning task is to label a novel data example $\hat{x}$ (e.g. a novel probe image) with a novel class label $\hat{y}$ (e.g. a new word) after only a single example.

More specifically, given a support set $S = \{(x_i, y_i), i \in [1, k]\}$ of images $x_i$ and their associated labels $y_i$, and an unlabeled probe image $\hat{x}$, the one-shot learning task is to identify the true label of the probe image, $\hat{y}$, from the support set labels $\{y_i, i \in [1, k]\}$:

$\displaystyle \hat{y} = \arg\max_{y} P(y | \hat{x}, S)$

We assume that the image labels $y_i$ are represented using a one-hot encoding and that $P(y|\hat{x}, S)$ is parameterised by a DNN, allowing us to leverage the ability of deep networks to learn powerful representations.

== Inception Networks ==

In the Inception baseline, a probe image $\hat{x}$ is given the label of its nearest neighbour in the support set:

$\hat{y} = y$

$(x, y) = \displaystyle \arg\min_{(x_i,y_i) \in S} d(h(x_i), h(\hat{x}))$

where $d$ is a distance function.

The function $h$ is parameterized by Inception, one of the best performing ImageNet classification models. Specifically, $h$ returns features from the last layer (the softmax input) of a pre-trained Inception classifier. With these features as input and cosine distance as the distance function, the classifier achieves 87.6% accuracy on one-shot classification on the ImageNet dataset (Vinyals et al., 2016). We call the Inception classifier together with the nearest-neighbor component the Inception Baseline (IB) model.
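The nearest-neighbour rule above can be sketched in a few lines. This is a minimal illustration only: the short feature vectors stand in for the Inception features $h(x)$, and the function names and the labels "dax" and "wug" are hypothetical, not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbour_label(probe_features, support):
    """Label the probe with the label of the closest support example.

    `support` is a list of (features, label) pairs; the features are
    assumed to be h(x_i), here replaced by toy vectors.
    """
    _, label = min(
        support, key=lambda pair: cosine_distance(pair[0], probe_features)
    )
    return label


# Toy support set with two made-up classes (hypothetical values).
support = [([1.0, 0.0, 0.2], "dax"), ([0.0, 1.0, 0.1], "wug")]
probe = [0.9, 0.1, 0.3]
predicted = nearest_neighbour_label(probe, support)
print(predicted)
```

In practice the toy vectors would be replaced by the softmax-input activations of a pre-trained network; the decision rule itself stays the same.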

== Matching Networks ==

Matching Networks (MNs) (Vinyals et al., 2016) are neural network architectures with state-of-the-art one-shot learning performance on ImageNet (93.2% one-shot labelling accuracy).
MNs are trained to assign label $\hat{y}$ to probe image $\hat{x}$ using an attention mechanism $a$ acting on image embeddings stored in the support set $S$:

[[File:MN1.PNG|centre|650px]]

where $d$ is a cosine distance and where $f$ and $g$ provide context-dependent embeddings of $\hat{x}$ and $x_i$ (with context $S$). The embedding $g(x_i, S)$ is a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) with the support set $S$ provided as an input sequence. The embedding $f(\hat{x}, S)$ is an LSTM with a read-attention mechanism operating over the entire embedded support set. The input to the LSTM is given by the penultimate layer features of a pre-trained deep convolutional network, specifically Inception.
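A rough sketch of the attention-based labelling may help. The formula itself is only available in the figure, so this example assumes the form used by Vinyals et al. (2016): the attention $a$ is a softmax over cosine similarities between the probe embedding and each support embedding, and the prediction is the attention-weighted sum of one-hot labels. The embeddings below are toy vectors standing in for $f(\hat{x}, S)$ and $g(x_i, S)$; all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attention_weights(probe_embedding, support_embeddings):
    """Softmax over cosine similarities (assumed form of the attention a)."""
    sims = [cosine_similarity(probe_embedding, g) for g in support_embeddings]
    shift = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def predict_distribution(probe_embedding, support_embeddings, one_hot_labels):
    """Attention-weighted sum of the one-hot support labels."""
    weights = attention_weights(probe_embedding, support_embeddings)
    num_classes = len(one_hot_labels[0])
    return [
        sum(w * y[c] for w, y in zip(weights, one_hot_labels))
        for c in range(num_classes)
    ]


# Toy embeddings standing in for g(x_i, S) and f(x_hat, S).
support_embeddings = [[1.0, 0.0], [0.0, 1.0]]
one_hot_labels = [[1, 0], [0, 1]]
probs = predict_distribution([0.8, 0.2], support_embeddings, one_hot_labels)
y_hat = max(range(len(probs)), key=probs.__getitem__)
print(probs, y_hat)
```

Because the weights sum to one and the labels are one-hot, the output is a valid distribution over the support-set labels, and taking its arg max recovers the predicted label.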

To train MNs we proceed as follows:

=== Training MN ===
* Step 1: At each step of training, the model is given a small support set of images and associated labels. In addition to the support set, the model is fed an unlabeled probe image $\hat{x}$.

* Step 2: The model parameters are then updated to improve classification accuracy of the probe image $\hat{x}$ given the support set. Parameters are updated using stochastic gradient descent with a learning rate of 0.1.

* Step 3: After each update, the labels $\{y_i, i \in [1, k]\}$ in the training set are randomly re-assigned to new image classes (the label indices are randomly permuted, but the image labels are not changed). This is a critical step: it prevents MNs from learning a consistent mapping between a category and a label. A consistent mapping is usually what we want in classification, but in one-shot learning we want to train the model to classify after viewing a single in-class example from the support set.

The objective function used is:

\begin{align*}
L = E_{C\sim T}\biggl[E_{S \sim C, B \sim C}\Bigl[\sum_{(x,y)\in B}\log P(y|x,S)\Bigr]\biggr]
\end{align*}

[[File:MN2.PNG|centre|650px]]

where $T$ is the set of all possible labelings of our classes, $S$ is a support set sampled with a class labeling $C \sim T$, and $B$ is a batch of probe images and labels with the same randomly chosen class labeling as the support set.
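The role of the random labeling $C \sim T$ can be illustrated with a small episode sampler. This is a sketch under stated assumptions, not the authors' procedure: the dictionary of image identifiers, the function name `sample_episode`, and the choice of one support image and one probe image per class are all hypothetical.

```python
import random


def sample_episode(images_by_class, k, rng):
    """Sample one training episode with a fresh random labeling.

    Picks k classes, assigns them a random permutation of the label
    indices 0..k-1 (the labeling C), then draws one support image and
    one distinct probe image per class, both carrying the same label.
    """
    classes = rng.sample(sorted(images_by_class), k)
    label_indices = list(range(k))
    rng.shuffle(label_indices)  # new class-to-label mapping each episode
    support, probes = [], []
    for cls, label in zip(classes, label_indices):
        support_image, probe_image = rng.sample(images_by_class[cls], 2)
        support.append((support_image, label))
        probes.append((probe_image, label))
    return support, probes


# Toy data: image identifiers grouped by class (made-up names).
toy_data = {
    "cat": ["cat_0", "cat_1", "cat_2"],
    "dog": ["dog_0", "dog_1", "dog_2"],
    "car": ["car_0", "car_1", "car_2"],
}
rng = random.Random(0)
support, probes = sample_episode(toy_data, k=3, rng=rng)
print(support)
print(probes)
```

Since the label index attached to a class changes from episode to episode, a model trained on such episodes cannot memorize a fixed class-to-label mapping and has to rely on the support set to label each probe.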

== Cognitive Biases ==
Cognitive bias is a concept from developmental psychology which attempts to explain how children can extract meanings of words with very few examples, similar to the concept of one-shot learning discussed above. The theory, as explained by the authors, is that humans form biases that allow them to eliminate many potential hypotheses about word meaning where the amount of data available is insufficient for this purpose. These include:
* Whole object bias
* Taxonomic bias
* Mutual exclusivity bias
* Shape bias
A more complete list of cognitive biases is given by [[#References|(Bloom, 2000)]]. The bias the authors investigate in this paper is the shape bias, which denotes a tendency to assign the same name to similarly shaped items rather than to items with similar color, texture, or size.

= Methodology =
== Inductive Biases & Probe Data ==

Inductive biases are the criteria, either built in by design or learnt by the network, that a model uses as classifying or distinguishing properties.
It has been observed that the biases DNNs learn are complex composite features. Researchers can take advantage of this by constructing probe data sets that specifically target and expose a particular bias that a DNN might have.

* Step 1: Take a known composite feature towards which we suspect the DNN is biased
* Step 2: Train the target model with an appropriate dataset
* Step 3: Transfer learning: use the pre-trained model with a new data set curated to confirm or refute the existence of the bias
* Step 4: Decide on a function which quantifies the bias under study
* Step 5: Measure the bias with the bias function

== Data Sets Used ==

* Training Set: ImageNet
* Test Set:
** The Cognitive Psychology Probe Data (CogPsyc data) consists of 150 images of objects. The images are arranged in triples consisting of a probe image, a shape-match image (that matches the probe in shape but not colour), and a colour-match image (that matches the probe in colour but not shape). In the dataset there are 10 triples, each shown on 5 different backgrounds, giving a total of 50 triples. [[File:CogPsy.PNG|center|350px]]
** A real-world dataset consisting of 90 images of objects (30 triples) collected using Google Image Search. The images are arranged in triples consisting of a probe, a shape-match and a colour-match.

= Experiments =
== Evaluation Criteria ==

* For a given probe image $\hat{x}$, load the shape-match image $x_s$ and corresponding label $y_s$, along with the colour-match image $x_c$ and corresponding label $y_c$, into memory as the support set $S = \{(x_s, y_s), (x_c, y_c)\}$.
* Calculate $\hat{y}$.
* The model assigns either $y_c$ or $y_s$ to the probe image.
* To estimate the shape bias $B_s$, calculate the proportion of shape labels assigned to the probe: $B_s = E(\delta(\hat{y} - y_s))$
where $E$ is an expectation across probe images and $\delta$ is the Dirac delta function (equal to 1 when $\hat{y} = y_s$ and 0 otherwise).
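The shape bias estimate amounts to a simple proportion, which the following sketch makes explicit. The function name and the toy predictions are hypothetical; in the actual experiments the predictions would come from the IB or MN model applied to each triple.

```python
def shape_bias(predicted_labels, shape_labels):
    """Fraction of probes whose predicted label equals the shape-match label."""
    matches = [
        1 if predicted == shape else 0
        for predicted, shape in zip(predicted_labels, shape_labels)
    ]
    return sum(matches) / len(matches)


# Toy example: four probes, three of which receive the shape-match label.
predicted_labels = ["y_s", "y_s", "y_c", "y_s"]
shape_labels = ["y_s", "y_s", "y_s", "y_s"]
b_s = shape_bias(predicted_labels, shape_labels)
print(b_s)  # 0.75
```

A value of 0.5 would indicate no preference between shape and colour, while values close to 1 indicate a strong shape bias.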

== Experiment 1: Shape bias statistics in Inception Baseline: ==
* Using the CogPsyc dataset, the authors found the shape bias of IB to be $B_s = 0.68$. Similarly, the shape bias of IB using the real-world dataset was $B_s = 0.97$. Together, these results strongly suggest that IB trained on ImageNet has a stronger bias towards shape than colour.

== Experiment 2: Shape bias statistics in Matching Network: ==
* The authors found that MNs have a shape bias of $B_s = 0.7$ using the CogPsyc dataset and a bias of $B_s = 1$ using the real-world dataset. Once again, these results suggest that MNs seeded from Inception and trained on ImageNet have a stronger bias towards shape than colour.

== Experiment 3: Shape bias statistics between and across models: ==

The authors extended the shape bias analysis to calculate the shape bias in a population of IB models and in a population of MN models with different random initializations.

=== Dependence on the initialization of parameters: ===

[[File:3.1.PNG|right|250px]] Strong variability in shape bias was observed across different initial values of the parameters. For the CogPsyc dataset, the average shape bias was $B_s = 0.628$ with standard deviation $\sigma_{B_s} = 0.049$ at the end of training, and for the real-world dataset the average shape bias was $B_s = 0.958$ with $\sigma_{B_s} = 0.037$.

=== Dependence of shape bias on model performance: ===

For the CogPsyc dataset, the correlation between bias and classification accuracy was $\rho = 0.15$, and for the real-world dataset the correlation between bias and classification accuracy was $\rho = -0.06$. This is consistent with the observation that the accuracy of the models remained nearly constant across initializations whereas the shape bias varied considerably, highlighting the lack of correlation between them.
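For readers who want to reproduce this kind of analysis, the correlation between shape bias and accuracy across a population of models can be computed as below. This sketch assumes $\rho$ is the Pearson correlation coefficient, which the summary does not state explicitly, and the per-model numbers are made up purely for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model measurements, one entry per random initialization.
shape_biases = [0.58, 0.66, 0.61, 0.70, 0.63]
accuracies = [0.871, 0.869, 0.873, 0.870, 0.872]
rho = pearson_correlation(shape_biases, accuracies)
print(rho)
```

A coefficient near zero, as reported above, means that knowing a model's accuracy says little about how strongly it prefers shape over colour.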

=== Emergence of shape bias during training: ===
The shape bias spiked to a large value very early.

=== Variation of shape bias within models & across models: ===
With different initializations, the shape bias varied considerably within IB during training, whereas it did not fluctuate during the training of MNs. It was found that the MN inherits the shape bias of the IB which seeded its embeddings and that, thereafter, the shape bias remained constant throughout training. It is important to note that the output of the penultimate layer of Inception was not fine-tuned when it was pipelined to the MN. This was to ensure that the MN properties were independent of the IB model properties. [[File:3.3.PNG|center|250px]] [[File:3.4.PNG|center|250px]]

= Learnings, Inferences & Implications =
* Both the Inception Baseline and the Matching Network exhibit a strong shape bias when trained on ImageNet. Researchers who use Inception and MN models can take this into account when applying pre-trained models to new datasets. If it is known beforehand that the new data set is best classified by colour, they may want either to avoid the pre-trained models or to explore methods that decrease or remove the strong shape bias.

* The shape bias varies considerably with the initialization of the parameters. This is an important finding: the same architecture, with similar prediction accuracy, can display a range of shape biases depending only on the initialization. Researchers can explore methods of tuning the random initialization such that models start out with a low shape bias without compromising accuracy.

* MNs inherit the shape bias seeded by the Inception network's input embedding. Researchers and practitioners should be careful about this as well. When using cascaded or pipelined heterogeneous architectures, the downstream models tend to inherit the properties and biases of the upstream models. This may be desirable or undesirable depending on the application, but it is important to be aware of it.

* The biases under consideration are a joint property of the architecture, the dataset and the optimization procedure. Hence, in order to increase or decrease the effect of a particular bias, one or more of these factors must be adjusted.

* The fact that a high shape bias emerged in the early epochs, with less variability in later epochs, can be thought of as analogous to the biases that humans develop in infancy and that are reinforced as they age.

= Conclusion, Future Work and Open questions =

* Just as cognitive psychology exposes the shape bias observed in this experiment, we should try to uncover other biases as well using multiple approaches
* Study the underlying mechanisms which cause biases such as shape bias in DNNs
* Research into various methods of probing and creating probe data sets which can be used to test architectures for various biases
* Exploration into a research field called Artificial Cognitive Psychology which focuses on probing how DNN architectures can be understood further using known behaviors of the human brain

= References =

* Ritter, Samuel, Barrett, David G. T., Santoro, Adam, and Botvinick, Matt M. Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study. 2017.

* Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

* Bloom, P. (2000). How children learn the meanings of words. The MIT Press.

* https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100

* https://deepmind.com/blog/cognitive-psychology/

* https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/

STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:41:18Z

A2prasad: /* Training MN */

= Introduction =

The recent surge in the use of Deep Neural Networks (DNNs) has resulted in giant leaps in prediction accuracy. DNNs are also being used to solve a variety of complex tasks at which earlier methodologies struggled to excel.

While the high accuracy achieved by DNNs is encouraging, we must begin to question why they perform so well. Representing the features or feature maps of a DNN, and interpreting the learnt values in its hidden layers, has become an active field of study. Currently, DNNs are treated as black boxes in which we empirically tune hyperparameters such as the number of layers, the number of units in each layer, and the number and size of feature maps (in the case of CNNs). The opacity created by the lack of an intuitive representation of the internal learnt parameters of DNNs hinders both basic research and the application of DNNs to real-world problems.

Recent pushes have aimed to better understand DNNs: tailor-made loss functions and architectures produce more interpretable features (Higgins et al., 2016; Raposo et al., 2017) while output-behavior analyses unveil previously opaque operations of these networks (Karpathy et al., 2015). Parallel to this work, neuroscience-inspired methods such as activation visualization (Li et al., 2015), ablation analysis (Zeiler & Fergus, 2014) and activation maximization (Yosinski et al., 2015) have also been applied.

This paper provides another methodology for deciphering and better understanding how DNNs solve a particular task. The methodology is inspired by cognitive psychology and tests whether DNNs make predictions with biases similar to those of the human mind.

Research in developmental psychology shows that when learning new words, humans tend to assign the same name to similarly shaped items rather than to items with similar color, texture, or size. This bias tends to become ingrained, and humans then rely on it to associate familiar shapes with new objects they have not seen before.

The authors investigate whether DNNs behave similarly in one-shot learning applications. They attempt to show that when state-of-the-art DNNs are used to learn objects from images, they exhibit a stronger shape bias than color bias. To emulate the human brain, they take the parameters of pre-trained DNN models and use them to perform one-shot learning on a new data set with different labels.

= Background =
== One Shot Learning ==
One-shot learning is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

The one-shot word learning task is to label a novel data example $\hat{x}$ (e.g. a novel probe image) with a novel class label $\hat{y}$ (e.g. a new word) after only a single example.

More specifically, given a support set $S = \{(x_i, y_i), i \in [1, k]\}$ of images $x_i$ and their associated labels $y_i$, and an unlabeled probe image $\hat{x}$, the one-shot learning task is to identify the true label of the probe image, $\hat{y}$, from the support set labels $\{y_i, i \in [1, k]\}$:

$\displaystyle \hat{y} = \arg\max_{y} P(y | \hat{x}, S)$

We assume that the image labels $y_i$ are represented using a one-hot encoding and that $P(y|\hat{x}, S)$ is parameterised by a DNN, allowing us to leverage the ability of deep networks to learn powerful representations.

== Inception Networks ==

In the Inception baseline, a probe image $\hat{x}$ is given the label of its nearest neighbour in the support set:

$\hat{y} = y$

$(x, y) = \displaystyle \arg\min_{(x_i,y_i) \in S} d(h(x_i), h(\hat{x}))$

where $d$ is a distance function.

The function $h$ is parameterized by Inception, one of the best performing ImageNet classification models. Specifically, $h$ returns features from the last layer (the softmax input) of a pre-trained Inception classifier. With these features as input and cosine distance as the distance function, the classifier achieves 87.6% accuracy on one-shot classification on the ImageNet dataset (Vinyals et al., 2016). We call the Inception classifier together with the nearest-neighbor component the Inception Baseline (IB) model.

== Matching Networks ==

Matching Networks (MNs) (Vinyals et al., 2016) are neural network architectures with state-of-the-art one-shot learning performance on ImageNet (93.2% one-shot labelling accuracy).
MNs are trained to assign label $\hat{y}$ to probe image $\hat{x}$ using an attention mechanism $a$ acting on image embeddings stored in the support set $S$:

[[File:MN1.PNG|centre|650px]]

where $d$ is a cosine distance and where $f$ and $g$ provide context-dependent embeddings of $\hat{x}$ and $x_i$ (with context $S$). The embedding $g(x_i, S)$ is a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) with the support set $S$ provided as an input sequence. The embedding $f(\hat{x}, S)$ is an LSTM with a read-attention mechanism operating over the entire embedded support set. The input to the LSTM is given by the penultimate layer features of a pre-trained deep convolutional network, specifically Inception.

To train MNs we proceed as follows:

=== Training MN ===
* Step 1: At each step of training, the model is given a small support set of images and associated labels. In addition to the support set, the model is fed an unlabeled probe image $\hat{x}$.

* Step 2: The model parameters are then updated to improve classification accuracy of the probe image $\hat{x}$ given the support set. Parameters are updated using stochastic gradient descent with a learning rate of 0.1.

* Step 3: After each update, the labels $\{y_i, i \in [1, k]\}$ in the training set are randomly re-assigned to new image classes (the label indices are randomly permuted, but the image labels are not changed). This is a critical step: it prevents MNs from learning a consistent mapping between a category and a label. A consistent mapping is usually what we want in classification, but in one-shot learning we want to train the model to classify after viewing a single in-class example from the support set.

The objective function used is:

\begin{align*}
L = E_{C\sim T}\biggl[E_{S \sim C, B \sim C}\Bigl[\sum_{(x,y)\in B}\log P(y|x,S)\Bigr]\biggr]
\end{align*}

[[File:MN2.PNG|centre|650px]]

where $T$ is the set of all possible labelings of our classes, $S$ is a support set sampled with a class labeling $C \sim T$, and $B$ is a batch of probe images and labels with the same randomly chosen class labeling as the support set.

== Cognitive Biases ==
Cognitive bias is a concept from developmental psychology which attempts to explain how children can extract meanings of words with very few examples, similar to the concept of one-shot learning discussed above. The theory, as explained by the authors, is that humans form biases that allow them to eliminate many potential hypotheses about word meaning where the amount of data available is insufficient for this purpose. These include:
* Whole object bias
* Taxonomic bias
* Mutual exclusivity bias
* Shape bias
A more complete list of cognitive biases is given by [[#References|(Bloom, 2000)]]. The bias the authors investigate in this paper is the shape bias, which denotes a tendency to assign the same name to similarly shaped items rather than to items with similar color, texture, or size.

= Methodology =
== Inductive Biases & Probe Data ==

Inductive biases are the criteria, either built in by design or learnt by the network, that a model uses as classifying or distinguishing properties.
It has been observed that the biases DNNs learn are complex composite features. Researchers can take advantage of this by constructing probe data sets that specifically target and expose a particular bias that a DNN might have.

* Step 1: Take a known composite feature towards which we suspect the DNN is biased
* Step 2: Train the target model with an appropriate dataset
* Step 3: Transfer learning: use the pre-trained model with a new data set curated to confirm or refute the existence of the bias
* Step 4: Decide on a function which quantifies the bias under study
* Step 5: Measure the bias with the bias function

== Data Sets Used ==

* Training Set: ImageNet
* Test Set:
** The Cognitive Psychology Probe Data (CogPsyc data) consists of 150 images of objects. The images are arranged in triples consisting of a probe image, a shape-match image (that matches the probe in shape but not colour), and a colour-match image (that matches the probe in colour but not shape). In the dataset there are 10 triples, each shown on 5 different backgrounds, giving a total of 50 triples. [[File:CogPsy.PNG|center|350px]]
** A real-world dataset consisting of 90 images of objects (30 triples) collected using Google Image Search. The images are arranged in triples consisting of a probe, a shape-match and a colour-match.

= Experiments =
== Evaluation Criteria ==

* For a given probe image $\hat{x}$, load the shape-match image $x_s$ and corresponding label $y_s$, along with the colour-match image $x_c$ and corresponding label $y_c$, into memory as the support set $S = \{(x_s, y_s), (x_c, y_c)\}$.
* Calculate $\hat{y}$.
* The model assigns either $y_c$ or $y_s$ to the probe image.
* To estimate the shape bias $B_s$, calculate the proportion of shape labels assigned to the probe: $B_s = E(\delta(\hat{y} - y_s))$
where $E$ is an expectation across probe images and $\delta$ is the Dirac delta function (equal to 1 when $\hat{y} = y_s$ and 0 otherwise).

== Experiment 1: Shape bias statistics in Inception Baseline: ==
* Using the CogPsyc dataset, the authors found the shape bias of IB to be $B_s = 0.68$. Similarly, the shape bias of IB using the real-world dataset was $B_s = 0.97$. Together, these results strongly suggest that IB trained on ImageNet has a stronger bias towards shape than colour.

== Experiment 2: Shape bias statistics in Matching Network: ==
* The authors found that MNs have a shape bias of $B_s = 0.7$ using the CogPsyc dataset and a bias of $B_s = 1$ using the real-world dataset. Once again, these results suggest that MNs seeded from Inception and trained on ImageNet have a stronger bias towards shape than colour.

== Experiment 3: Shape bias statistics between and across models: ==

The authors extended the shape bias analysis to calculate the shape bias in a population of IB models and in a population of MN models with different random initializations.

=== Dependence on the initialization of parameters: ===

[[File:3.1.PNG|right|250px]] Strong variability in shape bias was observed across different initial values of the parameters. For the CogPsyc dataset, the average shape bias was $B_s = 0.628$ with standard deviation $\sigma_{B_s} = 0.049$ at the end of training, and for the real-world dataset the average shape bias was $B_s = 0.958$ with $\sigma_{B_s} = 0.037$.

=== Dependence of shape bias on model performance: ===

For the CogPsyc dataset, the correlation between bias and classification accuracy was $\rho = 0.15$, and for the real-world dataset the correlation between bias and classification accuracy was $\rho = -0.06$. This is consistent with the observation that the accuracy of the models remained nearly constant across initializations whereas the shape bias varied considerably, highlighting the lack of correlation between them.

=== Emergence of shape bias during training: ===
The shape bias spiked to a large value very early.

=== Variation of shape bias within models & across models: ===
With different initializations, the shape bias varied considerably within IB during training, whereas it did not fluctuate during the training of MNs. It was found that the MN inherits the shape bias of the IB which seeded its embeddings and that, thereafter, the shape bias remained constant throughout training. It is important to note that the output of the penultimate layer of Inception was not fine-tuned when it was pipelined to the MN. This was to ensure that the MN properties were independent of the IB model properties. [[File:3.3.PNG|center|250px]] [[File:3.4.PNG|center|250px]]

= Learnings, Inferences & Implications =
* Both the Inception Baseline and the Matching Network exhibit a strong shape bias when trained on ImageNet. Researchers who use Inception and MN models can take this into account when applying pre-trained models to new datasets. If it is known beforehand that the new data set is best classified by colour, they may want either to avoid the pre-trained models or to explore methods that decrease or remove the strong shape bias.

* The shape bias varies considerably with the initialization of the parameters. This is an important finding: the same architecture, with similar prediction accuracy, can display a range of shape biases depending only on the initialization. Researchers can explore methods of tuning the random initialization such that models start out with a low shape bias without compromising accuracy.

* MNs inherit the shape bias seeded by the Inception network's input embedding. Researchers and practitioners should be careful about this as well. When using cascaded or pipelined heterogeneous architectures, the downstream models tend to inherit the properties and biases of the upstream models. This may be desirable or undesirable depending on the application, but it is important to be aware of it.

* The biases under consideration are a joint property of the architecture, the dataset and the optimization procedure. Hence, in order to increase or decrease the effect of a particular bias, one or more of these factors must be adjusted.

* The fact that a high shape bias emerged in the early epochs, with less variability in later epochs, can be thought of as analogous to the biases that humans develop in infancy and that are reinforced as they age.

= Conclusion, Future Work and Open questions =

* Just as cognitive psychology exposes the shape bias observed in this experiment, we should try to uncover other biases as well using multiple approaches
* Study the underlying mechanisms which cause biases such as shape bias in DNNs
* Research into various methods of probing and creating probe data sets which can be used to test architectures for various biases
* Exploration into a research field called Artificial Cognitive Psychology which focuses on probing how DNN architectures can be understood further using known behaviors of the human brain

= References =

* Ritter, Samuel, Barrett, David G. T., Santoro, Adam, and Botvinick, Matt M. Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study. 2017.

* Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

* Bloom, P. (2000). How children learn the meanings of words. The MIT Press.

* https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100

* https://deepmind.com/blog/cognitive-psychology/

* https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/

STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:41:01Z

A2prasad: /* Training MN */

= Introduction =

The recent surge in the use of Deep Neural Networks (DNNs) has resulted in giant leaps in prediction accuracy. DNNs are also being used to solve a variety of complex tasks at which earlier methodologies struggled to excel.

While the high accuracy achieved by DNNs is encouraging, we must begin to question why they perform so well. Representing the features or feature maps of a DNN, and interpreting the learnt values in its hidden layers, has become an active field of study. Currently, DNNs are treated as black boxes in which we empirically tune hyperparameters such as the number of layers, the number of units in each layer, and the number and size of feature maps (in the case of CNNs). The opacity created by the lack of an intuitive representation of the internal learnt parameters of DNNs hinders both basic research and the application of DNNs to real-world problems.

Recent pushes have aimed to better understand DNNs: tailor-made loss functions and architectures produce more interpretable features (Higgins et al., 2016; Raposo et al., 2017) while output-behavior analyses unveil previously opaque operations of these networks (Karpathy et al., 2015). Parallel to this work, neuroscience-inspired methods such as activation visualization (Li et al., 2015), ablation analysis (Zeiler & Fergus, 2014) and activation maximization (Yosinski et al., 2015) have also been applied.

This paper provides another methodology for deciphering and better understanding how DNNs solve a particular task. The methodology is inspired by cognitive psychology and tests whether DNNs make predictions with biases similar to those of the human mind.

Research in developmental psychology shows that when learning new words, humans tend to assign the same name to similarly shaped items rather than to items with similar color, texture, or size. This bias tends to become ingrained, and humans then rely on it to associate familiar shapes with new objects they have not seen before.

The authors investigate whether DNNs behave similarly in one-shot learning applications. They attempt to show that when state-of-the-art DNNs are used to learn objects from images, they exhibit a stronger shape bias than color bias. To emulate the human brain, they take the parameters of pre-trained DNN models and use them to perform one-shot learning on a new data set with different labels.

= Background =
== One Shot Learning ==
One-shot learning is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

The one-shot word learning task is to label a novel data example $\hat{x}$ (e.g. a novel probe image) with a novel class label $\hat{y}$ (e.g. a new word) after only a single example.

More specifically, given a support set $S = \{(x_i, y_i), i \in [1, k]\}$ of images $x_i$ and their associated labels $y_i$, and an unlabeled probe image $\hat{x}$, the one-shot learning task is to identify the true label of the probe image, $\hat{y}$, from the support set labels $\{y_i, i \in [1, k]\}$:

$\displaystyle \hat{y} = \arg\max_{y} P(y | \hat{x}, S)$

We assume that the image labels $y_i$ are represented using a one-hot encoding and that $P(y|\hat{x}, S)$ is parameterised by a DNN, allowing us to leverage the ability of deep networks to learn powerful representations.

== Inception Networks ==

A probe image $\hat{x}$ is given the label of the nearest neighbour from the
support set:

$\hat{y} = y$

$(x, y) = \displaystyle \arg\min_{(x_i,y_i) \in S} d(h(x_i), h(\hat{x}))$

where d is a distance function.

The function h is parameterized by Inception, one of the best performing ImageNet classification models. Specifically, h returns features from the last layer (the softmax input) of a pre-trained Inception classifier. With these features as input and cosine distance as the distance function, the classifier achieves 87.6% accuracy on one-shot classification on the ImageNet dataset (Vinyals et al., 2016). We call the Inception classifier together with the nearest-neighbor component the Inception Baseline (IB) model.
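The nearest-neighbour rule above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the toy feature vectors stand in for the Inception features $h(x)$, and the names <code>cosine_distance</code>, <code>nearest_neighbour_label</code> and the labels "dax" and "wug" are hypothetical, not taken from the paper.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance, i.e. 1 minus the cosine similarity of u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest_neighbour_label(probe_feat, support_feats, support_labels):
    """Return the label of the support item whose features are closest to the probe."""
    distances = [cosine_distance(f, probe_feat) for f in support_feats]
    return support_labels[int(np.argmin(distances))]

# Toy features standing in for h(x); real features would come from a
# pre-trained Inception network (assumption: any fixed feature extractor works here).
support_feats = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
support_labels = ["dax", "wug"]
probe_feat = np.array([0.9, 0.1, 0.0])

predicted = nearest_neighbour_label(probe_feat, support_feats, support_labels)
```

Here the probe is assigned the label of the first support item, since its feature vector points in nearly the same direction.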

== Matching Networks ==

Matching Networks (MNs) (Vinyals et al., 2016) are neural network architectures with state-of-the-art one-shot learning performance on ImageNet (93.2% one-shot labelling accuracy).
MNs are trained to assign label $\hat{y}$ to probe image $\hat{x}$ using an attention mechanism $a$ acting on image embeddings stored in the support set S:

[[File:MN1.PNG|centre|650px]]

where d is a cosine distance and where f and g provide context-dependent embeddings of $\hat{x}$ and $x_i$ (with context S). The embedding $g(x_i, S)$ is a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) with the support set S provided as an input sequence. The embedding $f(\hat{x}, S)$ is an LSTM with a read-attention mechanism operating over the entire embedded support set. The input to the LSTM is given by the penultimate layer features of a pre-trained deep convolutional network, specifically Inception.
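As a rough illustration of the attention step, the sketch below assumes that the attention weights are a softmax over cosine similarities between the probe embedding and each support embedding, and that the prediction is the attention-weighted sum of one-hot support labels, as described by Vinyals et al. (2016). The embeddings here are plain vectors; the LSTM-based context embeddings f and g are not modelled, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def mn_predict(probe_emb, support_embs, support_onehot):
    """Attention-weighted label distribution for one probe embedding."""
    sims = np.array([cosine_similarity(probe_emb, g) for g in support_embs])
    weights = np.exp(sims - sims.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ support_onehot       # weighted sum of one-hot labels

# Toy embeddings standing in for f(x_hat, S) and g(x_i, S).
support_embs = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
support_onehot = np.eye(2)                # one-hot labels y_i
probs = mn_predict(np.array([0.8, 0.2]), support_embs, support_onehot)
label_index = int(np.argmax(probs))
```

The output <code>probs</code> is a distribution over the support labels, and the predicted label is its arg max.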

To train MNs we proceed as follows:

=== Training MN ===
* Step 1: At each step of training, the model is given a small support set of images and associated labels. In addition to the support set, the model is fed an unlabeled probe image $\hat{x}$.

* Step 2: The model parameters are then updated to improve classification accuracy of the probe image $\hat{x}$ given the support set. Parameters are updated using stochastic gradient descent with a learning rate of 0.1.

* Step 3: After each update, the labels $\{y_i, i \in [1, k]\}$ in the training set are randomly re-assigned to new image classes (the label indices are randomly permuted, but the image labels are not changed). This is a critical step. It prevents MNs from learning a consistent mapping between a category and a label. Usually, in classification, such a mapping is what we want, but in one-shot learning we want to train our model to classify after viewing a single in-class example from the support set.

The objective function used is:

\begin{align*}
L=E_{C\sim T}\biggl[E_{S \sim C, B \sim C}\Bigl[\sum_{(x,y)\in B}\log P(y|x,S)\Bigr]\biggr]
\end{align*}

[[File:MN2.PNG|centre|650px]]

where T is the set of all possible labelings of our classes, S is a support set sampled with a class labeling C ~ T and B is a batch of probe images and labels, also with the same randomly chosen class labeling as the support set.
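The episodic sampling with label permutation described in the training steps can be sketched as follows. This is a simplified illustration rather than the authors' training code: <code>sample_episode</code> and the string "images" are hypothetical, each class contributes one support image, and a single probe is drawn instead of a batch B.

```python
import numpy as np

def sample_episode(images_by_class, k, rng):
    """Draw a k-way support set and one probe with freshly permuted label indices."""
    classes = rng.choice(len(images_by_class), size=k, replace=False)
    labels = rng.permutation(k)            # random labelling C ~ T for this episode
    support, candidates = [], []
    for c, y in zip(classes, labels):
        # Two distinct images from class c: one for the support set, one probe candidate.
        i, j = rng.choice(len(images_by_class[c]), size=2, replace=False)
        support.append((images_by_class[c][i], int(y)))
        candidates.append((images_by_class[c][j], int(y)))
    probe = candidates[rng.integers(k)]    # probe shares the episode's labelling
    return support, probe

rng = np.random.default_rng(0)
# Placeholder "images": strings naming the class and image index.
images_by_class = [[f"class{c}_img{i}" for i in range(3)] for c in range(5)]
support, probe = sample_episode(images_by_class, k=3, rng=rng)
```

Because the label indices are re-drawn for every episode, a given class is not tied to a fixed label, so the model has to rely on the support set to classify the probe.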

== Cognitive Biases ==
Cognitive bias is a concept from developmental psychology which attempts to explain how children can extract meanings of words with very few examples, similar to the concept of one-shot learning discussed above. The theory, as explained by the authors, is that humans form biases that allow them to eliminate many potential hypotheses about word meaning where the amount of data available is insufficient for this purpose. These include:
* Whole object bias
* Taxonomic bias
* Mutual exclusivity bias
* Shape bias
A more complete list of cognitive biases is given by [[#References|(Bloom, 2000)]]. The bias the authors investigate in this paper is the shape bias, which denotes a tendency to assign the same name to similarly shaped items rather than to items with similar color, texture, or size.

= Methodology =
== Inductive Biases & Probe Data ==

Inductive biases are the criteria, whether built in by design or learnt by the network, that a model uses as classifying or distinguishing properties.
It has been observed that the biases DNNs learn are complex composite features. Researchers can take advantage of this by constructing probe data sets designed to expose a particular bias that a DNN might have.

* Step 1: Take a known composite feature towards which we suspect the DNNs are biased
* Step 2: Train the target model with an appropriate dataset
* Step 3: Transfer Learning: Use the pre-trained model with a new data set which is curated to contain data to prove/disprove the existence of the bias
* Step 4: Model/Decide on a function which quantifies the bias under study
* Step 5: Measure the bias with the bias function

== Data Sets Used ==

* Training Set: ImageNet
* Test Set:
** The Cognitive Psychology Probe Data (CogPsyc data) consists of 150 images of objects. The images are arranged in triples consisting of a probe image, a shape-match image (that matches the probe in shape but not colour), and a colour-match image (that matches the probe in colour but not shape). The dataset contains 10 triples, each shown on 5 different backgrounds, giving a total of 50 triples. [[File:CogPsy.PNG|center|350px]]
** A real-world dataset consisting of 90 images of objects (30 triples) collected using Google Image Search. The images are arranged in triples consisting of a probe, a shape-match and a colour-match.

= Experiments =
== Evaluation Criteria ==

* For a given probe image $\hat{x}$, we loaded the shape-match image $x_s$ and corresponding label $y_s$, along with the colour-match image $x_c$ and corresponding label $y_c$ into memory, as the support set $S = \{(x_s, y_s), (x_c, y_c)\}$
* Calculate $\hat{y}$
* The model assigns either $y_c$ or $y_s$ to the probe image.
* To estimate the shape bias $B_s$, calculate the proportion of shape labels assigned to the probe: $B_s = E(\delta(\hat{y} - y_s))$
where E is an expectation across probe images and $\delta$ is the Dirac delta function.
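In practice the expectation reduces to the fraction of probes that receive the shape-match label. A minimal sketch, with hypothetical label strings and a hypothetical helper name <code>shape_bias</code>:

```python
import numpy as np

def shape_bias(predicted_labels, shape_labels):
    """Fraction of probes whose predicted label equals the shape-match label."""
    predicted = np.asarray(predicted_labels)
    shape = np.asarray(shape_labels)
    return float(np.mean(predicted == shape))

# Made-up predictions for four probes: "s" = shape-match label, "c" = colour-match label.
predicted = ["s", "s", "c", "s"]
shape = ["s", "s", "s", "s"]
b_s = shape_bias(predicted, shape)   # 3 of 4 probes get the shape label: 0.75
```

A value of $B_s = 0.5$ would indicate no preference between shape and colour, while $B_s = 1$ means the shape-match label is always chosen.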

== Experiment 1: Shape bias statistics in Inception Baseline: ==
* Using the CogPsyc dataset, the authors found the shape bias of IB to be $B_s = 0.68$. Similarly, the shape bias of IB using the real-world dataset was $B_s = 0.97$. Together, these results strongly suggest that IB trained on ImageNet has a stronger bias towards shape than colour.

== Experiment 2: Shape bias statistics in Matching Network: ==
* They found that MNs have a shape bias of $B_s = 0.7$ using the CogPsyc dataset and a bias of $B_s = 1$ using the real-world dataset. Once again, these results suggest that MNs seeded with Inception embeddings and trained on ImageNet have a stronger bias towards shape than colour.

== Experiment 3: Shape bias statistics between and across models: ==

The authors extended the shape bias analysis to calculate the shape bias in a population of IB models and in a population of MN models with different random initializations.

=== Dependence on the initialization of parameters: ===

[[File:3.1.PNG|right|250px]] The shape bias varied strongly with the initial values of the parameters. For the CogPsyc dataset, the average shape bias was $B_s = 0.628$ with standard deviation $\sigma_{B_s} = 0.049$ at the end of training, and for the real-world dataset the average shape bias was $B_s = 0.958$ with $\sigma_{B_s} = 0.037$.

=== Dependence of shape bias on model performance: ===

For the CogPsyc dataset, the correlation between shape bias and classification accuracy was $\rho = 0.15$, and for the real-world dataset it was $\rho = -0.06$. This is consistent with the observation that model accuracy remained nearly constant across initializations while the shape bias varied considerably, which indicates little correlation between the two.
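The population statistics reported in this experiment (mean, standard deviation, and correlation with accuracy) can be computed as sketched below. The numbers are made-up placeholders for a population of five models, not values from the paper, and the use of the sample standard deviation (<code>ddof=1</code>) is an assumption.

```python
import numpy as np

# Hypothetical per-model measurements (one entry per random initialization).
bias = np.array([0.60, 0.66, 0.58, 0.70, 0.62])            # shape bias B_s
accuracy = np.array([0.871, 0.874, 0.873, 0.872, 0.875])   # classification accuracy

mean_bias = bias.mean()                   # average shape bias across the population
std_bias = bias.std(ddof=1)               # sample standard deviation of the bias
rho = np.corrcoef(bias, accuracy)[0, 1]   # Pearson correlation between bias and accuracy
```

A correlation near zero, as the authors report, would mean that knowing a model's accuracy says little about how strong its shape bias is.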

=== Emergence of shape bias during training: ===
The shape bias spiked to a large value very early.

=== Variation of shape bias within models & across models: ===
Across different initializations, the shape bias fluctuated considerably within IB models during training, whereas it did not fluctuate during the training of MNs. The MN was found to inherit the shape bias of the IB model that seeded its embeddings, and the shape bias then remained constant throughout training. Note that the output of the penultimate layer of Inception was not fine-tuned when it was fed to the MN. This was to ensure that the MN properties were independent of the IB model properties. [[File:3.3.PNG|center|250px]] [[File:3.4.PNG|center|250px]]

= Learnings, Inferences & Implications =
* Both the Inception Baseline and the Matching Network exhibit a strong shape bias when trained on ImageNet. Researchers who use Inception and MN models should take this into account when applying pre-trained models to new datasets. If it is known beforehand that the new data set is best classified by colour, they may want to avoid the pre-trained models or explore methods to reduce or remove the strong shape bias.

* The shape bias varies considerably with the initialization parameters. This is an important finding: the same architecture, reaching similar prediction accuracy, can display very different shape biases purely because of different initializations. Researchers can explore ways of tuning the random initialization so that models start out with a low shape bias without compromising accuracy.

* MNs inherit the shape bias seeded by the Inception network's input embedding. Researchers and practitioners should keep this in mind: in cascaded or pipelined heterogeneous architectures, downstream models tend to inherit the properties and biases of the models upstream. This may be desirable or undesirable depending on the application, but it is important to be aware of it.

* The biases under consideration are a joint property of the architecture, the dataset and the optimization procedure. Hence, to increase or decrease the effect of a particular bias, one or more of these factors must be adjusted.

* The fact that a high shape bias emerged in the early epochs, with less variability in later epochs, can be seen as analogous to the biases that humans develop in infancy and that become fortified as they age.

= Conclusion, Future Work and Open questions =

* Just as cognitive psychology exposed the shape bias observed in this experiment, other biases should be uncovered using multiple approaches.
* Study the underlying mechanisms which cause biases such as the shape bias in DNNs.
* Research methods of probing and of creating probe data sets which can be used to test architectures for various biases.
* Explore a research field, called Artificial Cognitive Psychology, which focuses on understanding DNN architectures further using known behaviors of the human brain.

= References =

* Ritter, Samuel, Barrett, David G. T., Santoro, Adam, and Botvinick, Matt M. Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study. 2017.

* Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

* Bloom, P. (2000). How children learn the meanings of words. The MIT Press.

* https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100

* https://deepmind.com/blog/cognitive-psychology/

* https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/

STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:40:02Z

A2prasad: /* Training MN */


STAT946F17/Cognitive Psychology For Deep Neural Networks: A Shape Bias Case Study

2017-11-08T02:39:51Z

A2prasad: /* Training MN */

= Introduction =

The recent surge in the use of Deep Neural Networks (DNNs) has produced large gains in prediction accuracy. DNNs are also being used to solve a variety of complex tasks on which earlier methods struggled.

While such high accuracy is welcome, we must begin to ask why DNNs perform so well. Representing the features or feature maps of a DNN, or interpreting the values learnt in its hidden layers, has become an active field of study. Currently we treat DNNs as black boxes, tuning only exposed hyperparameters such as the number of layers, the number of units in each layer, and the number and size of feature maps (in the case of CNNs). The opacity created by the lack of an intuitive representation of the learnt internal parameters hinders both basic research and application to real-world problems.

Recent pushes have aimed to better understand DNNs: tailor-made loss functions and architectures produce more interpretable features (Higgins et al., 2016; Raposo et al., 2017) while output-behavior analyses unveil previously opaque operations of these networks (Karpathy et al., 2015). In parallel, neuroscience-inspired methods such as activation visualization (Li et al., 2015), ablation analysis (Zeiler & Fergus, 2014) and activation maximization (Yosinski et al., 2015) have also been applied.

This paper offers another methodology for understanding how DNNs solve a particular task. Inspired by concepts from psychology, it tests whether DNNs make accurate predictions with biases similar to those of the human mind.

Research in developmental psychology shows that when learning new words, humans tend to assign the same name to similarly shaped items rather than to items with similar color, texture, or size. This bias becomes ingrained, and humans carry it forward to associate familiar shapes with objects they have not seen before.

The authors test whether DNNs behave similarly in one-shot learning applications. They aim to show that state-of-the-art DNNs, when used to learn objects from images, exhibit a stronger shape bias than color bias. To emulate the human brain, they take the parameters of pre-trained DNN models and use them to perform one-shot learning on a new data set with different labels.

= Background =
== One Shot Learning ==
One-shot learning is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

The one-shot word learning task is to label a novel data example $\hat{x}$ (e.g. a novel probe image) with a novel class label $\hat{y}$ (e.g. a new word) after only a single example.

More specifically, given a support set $S = \{(x_i, y_i), i \in [1, k]\}$ of images $x_i$ and their associated labels $y_i$, and an unlabeled probe image $\hat{x}$,
the one-shot learning task is to identify the true label of the probe image, $\hat{y}$, from the support set labels $\{y_i, i \in [1, k]\}$:

$\displaystyle \hat{y} = \arg\max_{y} P(y | \hat{x}, S)$

We assume that the image labels $y_i$ are represented using a one-hot encoding and that $P(y|\hat{x}, S)$ is parameterised by a DNN, allowing us to leverage the ability of deep networks to learn powerful representations.

== Inception Networks ==

A probe image $\hat{x}$ is given the label of the nearest neighbour from the
support set:

$\hat{y} = y$

$(x, y) = \displaystyle \arg\min_{(x_i,y_i) \in S} d(h(x_i), h(\hat{x}))$

where d is a distance function.

The function h is parameterized by Inception, one of the best performing ImageNet classification models. Specifically, h returns features from the last layer (the softmax input) of a pre-trained Inception classifier. With these features as input and cosine distance as the distance function, the classifier achieves 87.6% accuracy on one-shot classification on the ImageNet dataset (Vinyals et al., 2016). We call the Inception classifier together with the nearest-neighbor component the Inception Baseline (IB) model.

== Matching Networks ==

Matching Networks (MNs) (Vinyals et al., 2016) are neural network architectures with state-of-the-art one-shot learning performance on ImageNet (93.2% one-shot labelling accuracy).
MNs are trained to assign label $\hat{y}$ to probe image $\hat{x}$ using an attention mechanism $a$ acting on image embeddings stored in the support set $S$:

[[File:MN1.PNG|centre|650px]]

where d is a cosine distance and where f and g provide context-dependent embeddings of $\hat{x}$ and $x_i$ (with context S). The embedding $g(x_i, S)$ is a bi-directional LSTM (Hochreiter & Schmidhuber, 1997) with the support set S provided as an input sequence. The embedding $f(\hat{x}, S)$ is an LSTM with a read-attention mechanism operating over the entire embedded support set. The input to the LSTM is given by the penultimate layer features of a pre-trained deep convolutional network, specifically Inception.
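As a rough sketch of the attention step, the following assumes the form used by Vinyals et al. (2016): a softmax over cosine similarities between the probe embedding and each support embedding, used to weight the one-hot support labels. The LSTM embeddings f and g are replaced here by precomputed vectors, which is a simplifying assumption, and all names are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def matching_prediction(probe_emb, support_embs, support_onehots):
    """Attention-weighted combination of the one-hot support labels."""
    sims = np.array([cosine_similarity(probe_emb, g) for g in support_embs])
    weights = np.exp(sims - sims.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ support_onehots      # distribution over the support labels

# Stand-ins for f(x_hat, S) and g(x_i, S); the paper uses LSTM-based embeddings.
probe_emb = np.array([1.0, 0.0])
support_embs = [np.array([1.0, 0.1]), np.array([0.0, 1.0])]
support_onehots = np.eye(2)

label_distribution = matching_prediction(probe_emb, support_embs, support_onehots)
predicted_index = int(np.argmax(label_distribution))
```

The predicted label is the class with the largest attention-weighted mass, so a support item that is close to the probe in embedding space dominates the prediction.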

To train MNs we proceed as follows:

=== Training MN ===
* Step 1: At each step of training, the model is given a small support set of images and associated labels. In addition to the support set, the model is fed an unlabeled probe image $\hat{x}$.

* Step 2: The model parameters are then updated to improve classification accuracy of the probe image $\hat{x}$ given the support set. Parameters are updated using stochastic gradient descent with a learning rate of 0.1.

* Step 3: After each update, the labels $\{y_i, i \in [1, k]\}$ in the training set are randomly re-assigned to new image classes (the label indices are randomly permuted, but the image labels are not changed). This is a critical step. It prevents MNs from learning a consistent mapping between a category and a label. Usually, in classification, this is what we want, but in one-shot learning we want to train our model for classification after viewing a single in-class example from the support set.
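The label re-assignment in Step 3 can be illustrated with a short sketch. It assumes integer label indices and applies one random permutation consistently to the support set and the probe batch of an episode; the function name and the toy episode are hypothetical.

```python
import numpy as np

def permute_episode_labels(support_labels, probe_labels, num_classes, rng):
    """Relabel one episode with a random permutation of the label indices."""
    permutation = rng.permutation(num_classes)
    # Index i is renamed to permutation[i] for both support and probe items,
    # so images of the same class still share a label within the episode.
    return permutation[support_labels], permutation[probe_labels], permutation

rng = np.random.default_rng(0)
support_labels = np.array([0, 1, 2, 3, 4])
probe_labels = np.array([2, 2, 4])

new_support, new_probe, permutation = permute_episode_labels(
    support_labels, probe_labels, num_classes=5, rng=rng
)
```

Since the index attached to a class changes from episode to episode, the only reliable way for the model to label a probe is to compare it against the current support set.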

The objective function used is:

\begin{align*}
L=E_{C\sim T}\left[E_{S \sim C, B \sim C}\left[\sum_{(x,y)\in B}\log P(y|x,S)\right]\right]
\end{align*}

[[File:MN2.PNG|centre|650px]]

where T is the set of all possible labelings of our classes, S is a support set sampled with a class labeling C ~ T and B is a batch of probe images and labels, also with the same randomly chosen class labeling as the support set.

== Cognitive Biases ==
Cognitive bias is a concept from developmental psychology which attempts to explain how children can extract meanings of words with very few examples, similar to the concept of one-shot learning discussed above. The theory, as explained by the authors, is that humans form biases that allow them to eliminate many potential hypotheses about word meaning where the amount of data available is insufficient for this purpose. These include:
* Whole object bias
* Taxonomic bias
* Mutual exclusivity bias
* Shape bias
A more complete list of cognitive biases is given by [[#References|(Bloom, 2000)]]. The bias the authors investigate in this paper is the shape bias, which denotes a tendency to assign the same name to similarly shaped items rather than to items with similar color, texture, or size.

= Methodology =
== Inductive Biases & Probe Data ==

Inductive biases are the criteria, either built in by design or learnt by the network, that it uses as classifying or distinguishing properties.
It has been observed that the biases DNNs learn are complex composite features. Researchers can take advantage of this by constructing probe data sets that specifically target and expose a particular bias a DNN might have.

* Step 1: Take a known composite feature towards which we suspect the DNNs are biased
* Step 2: Train the target model with an appropriate dataset
* Step 3: Transfer Learning: Use the pre-trained model with a new data set which is curated to contain data to prove/disprove the existence of the bias
* Step 4: Model/Decide on a function which quantifies the bias under study
* Step 5: Measure the bias with the bias function

== Data Sets Used ==

* Training Set: ImageNet
* Test Set:
** The Cognitive Psychology Probe Data (CogPsyc data) that is used consists of 150 images of objects. The images are arranged in triples consisting of a probe image, a shape-match image (that matches the probe in shape but not colour), and a colour-match image (that matches the probe in colour but not shape). In the dataset there are 10 triples, each shown on 5 different backgrounds, giving a total of 50 triples. [[File:CogPsy.PNG|center|350px]]
** A real-world dataset consisting of 90 images of objects (30 triples) collected using Google Image Search. The images are arranged in triples consisting of a probe, a shape-match and a colour-match.

= Experiments =
== Evaluation Criteria ==

* For a given probe image $\hat{x}$, we loaded the shape-match image $x_s$ and corresponding label $y_s$, along with the colour-match image $x_c$ and corresponding label $y_c$ into memory, as the support set $S = \{(x_s, y_s), (x_c, y_c)\}$.
* Calculate $\hat{y}$.
* The model assigns either $y_c$ or $y_s$ to the probe image.
* To estimate the shape bias $B_s$, calculate the proportion of shape labels assigned to the probe: $B_s = E(\delta(\hat{y} - y_s))$, where $E$ is an expectation across probe images and $\delta$ is the Dirac delta function.
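The shape bias estimate amounts to a simple proportion. The sketch below assumes the predicted labels and the shape-match labels are available as two equally long lists; the function name and the toy labels are hypothetical.

```python
def shape_bias(predicted_labels, shape_labels):
    """Fraction of probes whose predicted label equals the shape-match label."""
    if len(predicted_labels) != len(shape_labels):
        raise ValueError("one shape-match label is needed per probe")
    matches = [p == s for p, s in zip(predicted_labels, shape_labels)]
    return sum(matches) / len(matches)

# Four probes: three are given the shape-match label, one the colour-match label.
predicted = ["shape", "shape", "colour", "shape"]
shape_match = ["shape", "shape", "shape", "shape"]

bias = shape_bias(predicted, shape_match)
```

A value of 0.5 would indicate no preference between shape and colour, while values near 1 indicate a strong shape bias.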

== Experiment 1: Shape bias statistics in Inception Baseline: ==
* The shape bias of IB on the CogPsyc dataset was found to be $B_s = 0.68$. Similarly, the shape bias of IB using our real-world dataset was $B_s = 0.97$. Together, these results strongly suggest that IB trained on ImageNet has a stronger bias towards shape than colour.

== Experiment 2: Shape bias statistics in Matching Network: ==
* MNs were found to have a shape bias of $B_s = 0.7$ using the CogPsyc dataset and a bias of $B_s = 1$ using the real-world dataset. Once again, these results suggest that MNs seeded from Inception and trained on ImageNet have a stronger bias towards shape than colour.

== Experiment 3: Shape bias statistics between and across models: ==

The authors extended the shape bias analysis to calculate the shape bias in a population of IB models and in a population of MN models with different random initializations.

=== Dependence on the initialization of parameters: ===

[[File:3.1.PNG|right|250px]] Strong variability in the shape bias was observed when the initial values of the parameters were varied. For the CogPsyc dataset, the average shape bias was $B_s = 0.628$ with standard deviation $\sigma_{B_s} = 0.049$ at the end of training, and for the real-world dataset the average shape bias was $B_s = 0.958$ with $\sigma_{B_s} = 0.037$.

=== Dependence of shape bias on model performance: ===

For the CogPsyc dataset, the correlation between bias and classification accuracy was $\rho = 0.15$, and for the real-world dataset it was $\rho = -0.06$. This is consistent with the observation that the accuracy of the models remained nearly constant as the initialization varied, whereas the shape bias varied considerably, highlighting the lack of correlation between the two.

=== Emergence of shape bias during training: ===
The shape bias spiked to a large value very early.

=== Variation of shape bias within models & across models: ===
With different initialization parameters, the shape bias varied a lot within IB during training while the shape bias did not fluctuate during the training of MN. It was found that the MN inherits the shape bias of the IB which seeded its embeddings and thereafter, the shape bias remained constant throughout training. It is important to note that the output of the penultimate layer of the Inception was not fine tuned when it was pipelined to the MN. This was to ensure that the MN properties were independent of the IB model properties. [[File:3.3.PNG|center|250px]] [[File:3.4.PNG|center|250px]]

= Learnings, Inferences & Implications =
* Both the Inception Baseline and the Matching Network exhibit a strong shape bias when trained on ImageNet. Researchers who use Inception and MN models can take this into account when applying pre-trained models to new datasets. If it is known beforehand that the new data set is best classified by colour, they may want to either avoid the pre-trained models or explore methods to reduce or remove the strong shape bias.

* The shape bias varies strongly with the initialization of the parameters. This is an important finding, since it shows that the same architecture can reach similar prediction accuracy while displaying quite different shape bias, purely because of different initial parameters. Researchers can explore methods of tuning the random initialization such that the models start out with a low shape bias without compromising the accuracy of the model.

* MNs inherit the shape bias of the Inception network that provides their input embeddings. Researchers and practitioners should be careful about this: when using cascaded or pipelined heterogeneous architectures, downstream models tend to inherit the properties and biases of the models upstream. This may be desirable or undesirable depending on the application, but it is important to be aware of it.

* The biases under consideration are a joint property of the architecture, the dataset and the optimization procedure. Hence, in order to increase or decrease the effect of a particular bias, one or more of these factors must be adjusted.

* The fact that a high shape bias emerged in the early epochs, with less variability in later epochs, can be thought of as analogous to the biases that humans develop in infancy and that become fortified as they age.

= Conclusion, Future Work and Open questions =

* Just as cognitive psychology exposes the shape bias observed in this experiment, we should try to uncover other biases as well using multiple approaches
* Study the underlying mechanisms which cause biases such as shape bias in DNNs
* Research into various methods of probing and creating probe data sets which can be used to test architectures for various biases
* Exploration into a research field called Artificial Cognitive Psychology which focuses on probing how DNN architectures can be understood further using known behaviors of the human brain

= References =

* Ritter, Samuel & G. T. Barrett, David & Santoro, Adam & M. Botvinick, Matt. (2017). Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study

* Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

* Bloom, P. (2000). How children learn the meanings of words. The MIT Press.

* https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100

* https://deepmind.com/blog/cognitive-psychology/

* https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/

FeUdal Networks for Hierarchical Reinforcement Learning

2017-11-08T01:53:03Z

A2prasad: /* References */

= Introduction =

Even though deep reinforcement learning has been hugely successful in a variety of domains, it has struggled in environments with sparse reward signals. Such environments pose the major challenge of long-term credit assignment, where the agent is not able to attribute a reward to an action taken several timesteps back.

This paper proposes a hierarchical reinforcement learning (HRL) architecture called FeUdal Networks (FuN), inspired by Feudal Reinforcement Learning (FRL)[3]. It is a fully-differentiable neural network with two levels of hierarchy: a Manager module at the top level and a Worker module below. The Manager sets abstract goals, which are learned, at a lower temporal resolution in a latent state-space. The Worker operates at a higher temporal resolution and produces primitive actions at every tick of the environment, motivated by an intrinsic reward to follow the goals received from the Manager.

The key contributions of the authors in this paper are: (1) a consistent, end-to-end differentiable, FRL-inspired HRL architecture; (2) a novel, approximate transition policy gradient update for training the Manager; (3) the use of goals that are directional rather than absolute in nature; and (4) the dilated LSTM, a novel RNN design for the Manager that allows gradients to flow through large hops in time.

The experiments conducted on several tasks which involve sparse rewards show that FuN significantly outperforms a strong baseline agent on tasks that involve long-term credit assignment and memorization.

= Related Work =

Several hierarchical reinforcement learning models have been proposed to address this problem. The options framework [4] considers the problem with a two-level hierarchy, with options typically learned using sub-goals and ‘pseudo-rewards’ that are provided explicitly. In contrast, the option-critic architecture[1] uses the policy gradient theorem to learn options in an end-to-end fashion. A problem with learning options end-to-end is that they tend towards one of two trivial solutions: (i) only one option is active, which solves the whole task; or (ii) the policy-over-options changes options at every step, micro-managing the behavior. The authors state that the option-critic architecture is the only other end-to-end trainable system with sub-policies.

Non-hierarchical deep RL (non-HRL) methods using auxiliary losses and rewards such as pseudo count for exploration[2] have significantly improved results by stimulating agents to explore new parts of the state space. The UNREAL agent[9] is another non-HRL method that showed a strong improvement using unsupervised auxiliary tasks.

= Model =

[[File:feudal_network_model_diagram.png|frame]]

A high-level explanation of the model is as follows:

The Manager computes a latent state representation <math>s_t</math> and outputs a goal vector <math>g_t</math> . The Worker outputs actions based on the environment observation, its own state, and the Manager’s goal. A perceptual module computes intermediate representation, <math>z_t</math> of the environment observation <math>x_t</math>, and is shared as input by both Manager and Worker. The Manager’s goals <math>g_t</math> are trained using an approximate transition policy gradient. The Worker is then trained via intrinsic reward which stimulates it to output actions that will achieve the goals set by the Manager.

<center>
[[File:model_definition.png|500px]]
</center>

Manager and Worker are recurrent networks (<math>{h^M}</math> and <math>{h^W}</math> being their internal states). <math>\phi</math> is a linear transform that maps a goal <math>g_t</math> into an embedding vector <math>w_t \in {R^k}</math>, which is then combined with the matrix <math>U_t</math> (the Worker's output) via a matrix-vector product to produce the policy <math>\pi</math>, a vector of probabilities over primitive actions. The projection <math>\phi</math> is linear, with no biases, and is learnt with gradients coming from the Worker’s actions. Since <math>\phi</math> has no biases it can never produce a constant non-zero vector, which is the only way the setup could ignore the Manager’s input. This makes sure that the goal output by the Manager always influences the final policy.
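The way the goal enters the policy can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the softmax normalisation, the array shapes and all names are assumptions, and the sketch embeds a single goal vector rather than any pooled history of goals.

```python
import numpy as np

def worker_policy(U, goal, phi):
    """Combine the Worker output U (num_actions x k) with the embedded goal."""
    w = phi @ goal                      # linear projection, deliberately without a bias
    logits = U @ w                      # one score per primitive action
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()              # assumed softmax over actions

rng = np.random.default_rng(0)
num_actions, k, goal_dim = 3, 4, 5
U = rng.normal(size=(num_actions, k))   # stand-in for the Worker's output U_t
phi = rng.normal(size=(k, goal_dim))    # stand-in for the learned projection

policy = worker_policy(U, rng.normal(size=goal_dim), phi)
policy_zero_goal = worker_policy(U, np.zeros(goal_dim), phi)
```

With a zero goal the embedding is zero and the policy collapses to uniform whatever U is, which illustrates why the bias-free projection forces the Worker to rely on the Manager's goal.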

===Learning===
The learning considers a standard reinforcement learning setup where the goal of the agent is to maximize the discounted return <math>R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}</math>, where <math>\gamma \in [0,1]</math> is the discount factor and <math>r_t</math> is the reward from the environment for the action at timestep <math>t</math>. The agent's behavior is defined by its action-selection policy, <math>\pi</math>.
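For a finite episode the discounted return can be computed with a backward recursion. The sketch below is a minimal example; the function name and the toy reward sequence are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return R_t given rewards[k] = r_{t+k+1} for a finite episode."""
    total = 0.0
    # Work backwards so each step applies R = r + gamma * R_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# 1 + 0.5 * 0 + 0.25 * 2 = 1.5
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

With sparse rewards most entries of the reward sequence are zero, so the return at an early step depends almost entirely on rewards that arrive many steps later.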

Since FuN is fully differentiable, the authors could have trained it end-to-end using a policy gradient algorithm operating on the actions taken by the Worker, such that the outputs <math>g</math> of the Manager would be trained by gradients coming from the Worker. This, however, would deprive the Manager’s goals <math>g</math> of any semantic meaning, making them just internal latent variables of the model. So instead, the Manager is independently trained to predict advantageous directions (transitions) in state space and to intrinsically reward the Worker for following these directions.

Update rule for manager:

<center>
<math>\nabla g_t = A_t^M \nabla_\theta d_{cos}(s_{t+c} - s_t, g_t(\theta))</math>
</center>

In the above equation, <math>d_{cos}(\alpha, \beta) = \alpha^T \beta/(|\alpha||\beta|)</math> is the cosine similarity between two vectors and <math>A_t^M = R_t - V_t^M(x_t,\theta)</math> is the Manager’s advantage function, computed using a value function estimate <math>V_t^M(x_t,\theta)</math> from the internal critic. Here <math>c</math> is the horizon over which the Manager optimizes its direction. It is treated as a hyperparameter of the model and controls the temporal resolution of the Manager.

The intrinsic reward that encourages the Worker to follow the goals is defined as:

<center>
<math>r_t^I = 1/c \sum_{i=1}^c d_{cos}(s_t - s_{t-i}, g_{t-i})</math>
</center>
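The intrinsic reward above can be computed directly from a short history of latent states and goals. This is a minimal sketch: the small `eps` added to the denominator is an assumption to guard against zero-length vectors, and the function names and toy trajectory are hypothetical.

```python
import numpy as np

def d_cos(a, b, eps=1e-8):
    """Cosine similarity between two vectors (eps guards against zero norms)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def intrinsic_reward(states, goals, t, c):
    """r_t^I = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i}); requires t >= c."""
    if t < c:
        raise ValueError("need at least c earlier steps")
    total = sum(d_cos(states[t] - states[t - i], goals[t - i]) for i in range(1, c + 1))
    return total / c

# Toy trajectory: the latent state moves along the first axis.
states = [np.array([float(i), 0.0]) for i in range(5)]
aligned_goals = [np.array([1.0, 0.0]) for _ in range(5)]
opposed_goals = [np.array([-1.0, 0.0]) for _ in range(5)]

reward_aligned = intrinsic_reward(states, aligned_goals, t=4, c=3)
reward_opposed = intrinsic_reward(states, opposed_goals, t=4, c=3)
```

When the state moves in the direction of the goals the reward is close to 1, and when it moves in the opposite direction it is close to -1, so only the direction of movement matters, not its magnitude.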

In contrast to FRL[3], which advocated concealing the reward from lower levels of the hierarchy, the Worker in FuN is trained using an advantage actor-critic[5] to maximise a weighted sum <math>R_t + \alpha R_t^I</math>, where <math>\alpha</math> is a hyper-parameter that regulates the influence of the intrinsic reward:

<center>
<math>\nabla {\pi}_t = A_t^D \nabla_\theta \log \pi (a_t|x_t;\theta)</math>
</center>

The Advantage function <math>A_t^D = (R_t + \alpha R_t^I - V_t^D(x_t;\theta))</math> is calculated using an internal critic, which estimates the value functions for both rewards.

===Transition Policy Gradient===
The update rule for the Manager given above is a novel form of policy gradient with respect to a ''model'' of the Worker’s behavior. The Worker can follow a complex trajectory but it is not necessarily required to learn from these samples. If the trajectories can be predicted, by modeling the transitions, then the policy gradient of the predicted transition can be followed instead of the Worker's actual path. FuN assumes a particular form for the transition model: that the direction in state-space, <math>s_{t+c} - s_t</math>, follows a von Mises-Fisher distribution.

=Architecture=
The perceptual module <math>f^{percept}</math> is a convolutional network (CNN) followed by a fully connected layer. Each convolutional and fully-connected layer is followed by a rectifier non-linearity. <math>f^{Mspace}</math>, which is another fully connected layer followed by a rectifier non-linearity, is used to compute the state space, which the Manager uses to formulate goals. The Worker’s recurrent network <math>f^{Wrnn}</math> is a standard LSTM[6].

The Manager uses a novel architecture called a dilated LSTM (dLSTM), which operates at lower temporal resolution than the data stream. It is similar to dilated convolutional networks[7] and clockwork RNN. For a dilation radius r, the network is composed of r separate groups of sub-states or ‘cores’, denoted by <math>h = \{\hat{h}^i\}_{i=1}^r</math>. At time <math>t</math>, the network is governed by the following equations: <math>\hat{h}_t^{t\%r},g_t = LSTM(s_t, \hat{h}_{t-1}^{t\%r};\theta^{LSTM})</math> where % denotes the modulo operation and allows us to indicate which group of cores is currently being updated. At each time step, only the corresponding part of the state is updated and the output is pooled across the previous c outputs. This allows the r groups of cores inside the dLSTM to preserve the memories for long periods, yet the dLSTM as a whole is still able to process and learn from every input experience and is also able to update its output at every step.
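The core-selection mechanism of the dilated recurrence can be sketched as below. To keep the example short, a plain tanh recurrent cell stands in for the LSTM, and the output is sum-pooled over the most recent outputs; both choices, as well as the class and parameter names, are assumptions made for illustration.

```python
import numpy as np
from collections import deque

class DilatedRNN:
    """r independent cores; at time t only core t % r is updated."""

    def __init__(self, input_size, hidden_size, radius, pool, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.1, size=(hidden_size, input_size))
        self.W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        self.radius = radius
        self.cores = [np.zeros(hidden_size) for _ in range(radius)]
        self.recent = deque(maxlen=pool)  # last `pool` outputs, used for pooling
        self.t = 0

    def step(self, s_t):
        i = self.t % self.radius          # which core is active at this step
        # Simple tanh cell as a stand-in for the LSTM update of core i.
        self.cores[i] = np.tanh(self.W_in @ s_t + self.W_h @ self.cores[i])
        self.recent.append(self.cores[i])
        self.t += 1
        return np.sum(list(self.recent), axis=0)

net = DilatedRNN(input_size=3, hidden_size=4, radius=2, pool=2)
first_output = net.step(np.ones(3))
```

After the first step only core 0 has changed; core 1 keeps its state until the next odd step, which is how each core retains information for longer while the pooled output still changes at every step.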

=Experiments=
The baseline the authors are using is a recurrent LSTM[6] network on top of a representation learned by a CNN. The A3C method[5] is used for all reinforcement learning experiments. Backpropagation through time (BPTT)[8] is run after K forward passes of a network or if a terminal signal is received. For each method, 100 experiments were run. A training epoch is defined as one million observations.

==Montezuma’s Revenge==
Montezuma’s Revenge is a prime example of an environment with sparse rewards. FuN starts learning much earlier and achieves much higher scores. The LSTM baseline takes more than 300 epochs to reach a score of 400, which corresponds to solving the first room (take the key, open a door). FuN solves the first room in less than 200 epochs and immediately moves on to explore further, eventually visiting several other rooms and scoring up to 2600 points.

<center>
[[File:feudal_figure2.png|900px]]
</center>

==ATARI==
The experiment was run on a diverse set of ATARI games, some of which involve long-term credit assignment and some of which are more reactive. Enduro stands out, as all the LSTM agents completely fail at it. Frostbite is a hard game that requires both long-term credit assignment and good exploration. The best-performing Frostbite agent is FuN with 0.95 Manager discount, which outperforms the rest by a factor of 7. The other results can be seen in the figure.

<center>
[[File:feudal_figure4.png|900px]]
</center>

==Comparing the option-critic architecture==
FuN was run on the same games as Option-Critic (Asterix, Ms. Pacman, Seaquest, and Zaxxon). After 200 epochs it achieves a similar score on Seaquest, doubles the score on Ms. Pacman, more than triples it on Zaxxon, and achieves more than a 20x improvement on Asterix.

<center>
[[File:feudal_figure7.png]]
</center>

==Memory in Labyrinth==
DeepMind Lab (Beattie et al., 2016) is a first-person 3D game platform extended from OpenArena. The games on which the experiments were run include a Water maze, T-maze, and Non-match (which is a visual memorization task). FuN consistently outperforms the LSTM baseline: it learns faster and also reaches a higher final reward. Interestingly, the LSTM agent doesn’t appear to use its memory for the water maze task at all, always circling the maze at roughly the same radius.

<center>
[[File:feudal_figure5.png|800px]]
[[File:feudal_figure6.png|800px]]
</center>

==Ablative Analysis==
Empirical evaluation of the main contributions of this paper:

===Transition policy gradient===
Experiments were run on modified FuN networks in which: 1) the Manager's output g is trained with gradients coming directly from the Worker and no intrinsic reward is used; 2) g is learned using a standard policy gradient approach with the Manager emitting the mean of a Gaussian distribution from which goals are sampled; 3) g specifies absolute, rather than relative/directional, goals; and 4) the Worker is trained from the intrinsic reward alone, giving a purely feudal version of FuN. The experiments (Figure 8) reveal that, although the alternatives do work to some degree, their performance is significantly inferior.

<center>
[[File:feudal_figure8.png|900px]]
</center>

===Temporal resolution ablations===
To test the effectiveness of the dilated LSTM, FuN was compared with two baselines: 1) the Manager uses a vanilla LSTM with no dilation; 2) FuN with the Manager’s prediction horizon c = 1. The non-dilated LSTM fails catastrophically, most likely overwhelmed by the recurrent gradient. Reducing the horizon c to 1 did hurt the performance, although not by much, which means that even at high temporal resolution the Manager captures certain properties of the underlying MDP.

<center>
[[File:feudal_figure10.png|900px]]
</center>

===Intrinsic motivation weight===
This ablation evaluates the effect of the weight <math>\alpha</math>, which regulates the relative weight of the intrinsic reward. The figure below shows scatter plots of the agents' final score against the <math>\alpha</math> hyper-parameter; there is a clear improvement in score for high <math>\alpha</math> in some games.

<center>
[[File:feudal_figure11.png|900px]]
</center>

===Dilated LSTM agent baseline===
For this experiment, just the dLSTM is used in an agent on top of a CNN, without the rest of the FuN structure. The figure below plots the learning curves for FuN, LSTM, and dLSTM agents. dLSTM generally underperforms both LSTM and FuN.

<center>
[[File:feudal_figure12.png|900px]]
</center>

===ATARI action repeat transfer===
This experiment is to demonstrate that the transition policy can be transferred between agents with different embodiment, for example, across agents with different action repeat on ATARI. The figure below shows the corresponding learning curves. The transferred FuN agent (green curve) significantly outperforms every other method.

<center>
[[File:feudal_figure9.png|900px]]
</center>

=Conclusion=
FuN currently holds the state-of-the-art score among HRL methods on the Atari game Montezuma's Revenge. It is a novel approach to hierarchical reinforcement learning which separates the goal-setting behavior from the generation of action primitives. This creates a natural hierarchy that is stable, and the experiments clearly demonstrate that the FeUdal network makes long-term credit assignment and memorization more tractable.

Building deeper hierarchies by setting goals at multiple time scales is an avenue for further research. The modular structure looks promising for transfer and multitask learning as well.

An implementation of this paper can be found on : https://github.com/dmakian/feudal_networks

=References=
#Bacon, Pierre-Luc, Precup, Doina, and Harb, Jean. The option-critic architecture. In AAAI, 2017.
#Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In NIPS, 2016.
#Dayan, Peter and Hinton, Geoffrey E. Feudal reinforcement learning. In NIPS. Morgan Kaufmann Publishers, 1993.
#Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 1999.
#Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. ICML, 2016.
#Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 1997.
#Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.
#Mozer, Michael C. A focused back-propagation algorithm for temporal pattern recognition. Complex systems, 1989.
#Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
#A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
# https://www.quora.com/What-is-hierachical-reinforcement-learning
# Tutorial for Hierarchical Reinforcement Learning: https://www.youtube.com/watch?v=K5MlmO0UJtI
# Videos of FUN agent playing various Atari games can be found in supplementary file accessed through: http://proceedings.mlr.press/v70/vezhnevets17a.html

Learning the Number of Neurons in Deep Networks

2017-11-07T23:44:08Z

A2prasad: /* Related Work */

='''Introduction'''=

Due to the availability of large-scale datasets and powerful computation, '''Deep Learning''' has made huge breakthroughs in many areas, such as language modelling and computer vision. In deep neural networks, we need to determine the number of layers and the number of neurons in each layer, i.e., the number of parameters, or the complexity of the model. Currently, this is mostly achieved by manual trial and error, tuning these hyper-parameters using validation data, or by building very deep networks. However, building a very deep model is still challenging, especially for very large datasets, because it leads to a high memory cost and reduced speed.

In this paper, we used an approach to automatically select the number of neurons in each layer while learning the network. Our approach introduces a '''group sparsity regularizer''' on the parameters of the network, where each group acts on the parameters of one neuron, rather than training an initial network as a pre-processing step (e.g. training shallow or thin networks to mimic the behaviour of deep ones [Hinton et al., 2014, Romero et al., 2015]). Parameters in groups that are not useful are driven to zero, which cancels out the effect of the corresponding neuron. Therefore, our approach does not need to first successfully learn a redundant network and then reduce its parameters; instead, it learns the number of relevant neurons in each layer and the parameters of those neurons simultaneously.

In the experiments on several image recognition datasets, we showed the effectiveness of our approach, which reduces the number of parameters by up to 80% compared to the complete model, and has no recognition accuracy loss at the same time. Actually, our approach even yields more effective and faster networks, and occupies less memory.

='''Related Work'''=

Recent research tends to build very deep networks. Building very deep networks means learning more parameters, which leads to a significant cost in memory as well as speed. Even though automatic model selection has developed over the past years through constructive and destructive approaches, both have drawbacks. A '''constructive method''' starts from a very shallow architecture and then adds additional parameters [Bello, 1992]. A similar approach that adds new layers to an initial shallow network during learning was successfully employed in [Simonyan and Zisserman, 2014]. However, shallow networks have fewer parameters and cannot handle non-linearities as effectively as deep networks [Montufar et al., 2014], so they may easily get stuck in bad optima. Therefore, the drawback of this method is that these shallow networks may produce poor initializations for the later stages. The authors make this claim without ever providing any evidence for it. A '''destructive method''' starts from a deep network and removes a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015] while keeping its behaviour unchanged. Even though this technique has shown that removing redundant parameters [LeCun et al., 1990, Hassibi et al., 1993] or neurons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] has little influence on the output, it requires analysing each parameter and neuron via the network Hessian, which is very computationally expensive for large architectures. The main motivation of these works was to build a more compact network.

In particular, building compact networks is a research focus for '''Convolutional Neural Networks''' (CNNs). Some works have proposed decomposing the filters of a pre-trained network into low-rank filters, which reduces the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. The issue with this proposal is that an initial deep network must first be trained successfully, since the decomposition acts as a post-processing step. [Weigend et al., 1991] and [Collins and Kohl, 2014] developed regularizers that eliminate some of the parameters of the network directly during training. The problem is that the number of layers and the number of neurons in each layer are still determined manually. A very similar work using the group lasso method for CNNs was previously done in [Liu et al., 2015]. The big-picture idea appears to be very similar, but the two differ in the details of the methodology.

='''Model Training and Model Selection'''=

In general, a deep network has L layers containing linear operations on their inputs, interleaved with activation functions. The activation functions generally used are '''Rectified Linear Units (ReLU) or sigmoids'''. Suppose each layer $l$ has $N_{l}$ neurons. The network parameters are $\Theta=(\theta_{l})_{1\leqslant{l}\leqslant{L}}$, where $\theta_{l}=({\theta^n _{l}})_{1\leqslant{n}\leqslant{N_{l}}}$ and $\theta^n _{l}=[w_{l}^{n},b_{l}^{n}]$ collects the weights and bias of neuron $n$ in layer $l$. Given an input $x$, we obtain the output $\hat{y}=f(x,\Theta)$, where $f(\cdot)$ encodes the succession of linear, non-linear and pooling operations.

During training, we have $N$ input-output pairs $\{(x_{i},y_{i})\}_{1\leqslant{i}\leqslant{N}}$, and the loss function $\ell(y_{i},f(x_{i},\Theta))$ compares the predicted output with the ground-truth output. Generally, the logistic loss is chosen for classification and the square loss for regression. Learning the parameters of the network is then equivalent to solving the following optimization problem:
$$\displaystyle \min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},f(x_{i},\Theta))+\gamma(\Theta),$$ where $\gamma(\Theta)$ represents a regularizer on the network parameters. Common choices for the regularizer are the $\ell_{2}$-norm (i.e., weight decay) or the $\ell_{1}$-norm. The $\ell_{2}$-norm favours small parameter values, while the $\ell_{1}$-norm can zero out individual irrelevant parameters but not entire neurons. The goal in this paper is to automatically determine the number of neurons in each layer, and neither of these techniques achieves it. Here, we make use of '''group sparsity''' [Yuan and Lin, 2007], starting from an overcomplete network and cancelling the influence of some neurons. The regularizer can therefore be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2},$$ where $P_{l}$ is the size of the vector that contains the parameters of each neuron in layer $l$, and $\beta_{l}$ balances the influence of the penalty. In practice, we found it most effective to use a relatively small $\beta_{l}$ for the first few layers and a larger one for the remaining layers. A small weight prevents too many neurons from being deleted in the first few layers, so that enough information is retained for learning the remaining parameters. The original premise of this paper seemed to suggest a new method that was different from both the constructive and destructive methods described above. However, this approach of starting with an overcomplete network and training with group sparsity appears to be no different from destructive methods. The main contribution here is then a regularization function that acts on entire neurons, which is in fairness an interesting approach.
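As a rough illustration of how the group sparsity regularizer is evaluated, the sketch below computes $\gamma(\Theta)$ for a toy network in plain Python. The function name `group_sparsity_penalty`, the nested-list layout and the numeric values are assumptions made for this example; it also assumes every neuron in a layer has the same number of parameters $P_{l}$.

```python
import math


def group_sparsity_penalty(layers, betas):
    """Sum over layers of beta_l * sqrt(P_l) * sum_n ||theta_l^n||_2.

    layers: list of layers; each layer is a list of per-neuron parameter vectors.
    betas:  one penalty weight beta_l per layer.
    """
    total = 0.0
    for neurons, beta in zip(layers, betas):
        if not neurons:
            continue
        p_l = len(neurons[0])  # P_l: parameters per neuron in this layer
        sum_of_norms = sum(math.sqrt(sum(w * w for w in theta)) for theta in neurons)
        total += beta * math.sqrt(p_l) * sum_of_norms
    return total


# Hypothetical parameters: layer 1 has two neurons (one already zeroed out).
layers = [
    [[3.0, 4.0], [0.0, 0.0]],  # P_1 = 2, group norms 5 and 0
    [[1.0, 2.0, 2.0]],         # P_2 = 3, group norm 3
]
betas = [0.1, 0.2]
penalty = group_sparsity_penalty(layers, betas)
```

A neuron whose whole parameter vector is zero contributes nothing to the penalty, while a neuron with any non-zero parameter pays for its full Euclidean norm; this is what pushes entire groups, rather than scattered individual weights, towards zero.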

Group sparsity helps us remove some of the neurons effectively, while standard regularizers on the individual parameters are effective for generalization [Bartlett, 1996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. Following this idea, we introduce the '''sparse group Lasso''', a more general penalty that merges the L1 norm of the Lasso with the group lasso (i.e. "two-norm") penalty. This yields solutions that are sparse at both the individual and the group feature levels [1]. The regularizer can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\left((1-\alpha)\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2}+\alpha\beta_{l}||\theta_{l}||_{1}\right),$$ where $\alpha\in[0,1]$. If $\alpha=0$, this reduces to the group sparsity regularizer. In practice, we use both $\alpha=0$ and $\alpha=0.5$ in the experiments.
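The sparse group Lasso penalty can be sketched in the same style. The code below is an illustrative assumption about data layout (nested lists of per-neuron parameter vectors) and naming (`sparse_group_lasso_penalty`), not the authors' implementation; it shows how $\alpha$ interpolates between the pure group penalty and a pure $\ell_{1}$ penalty.

```python
import math


def sparse_group_lasso_penalty(layers, betas, alpha):
    """(1 - alpha) * group term + alpha * l1 term, summed over layers."""
    total = 0.0
    for neurons, beta in zip(layers, betas):
        if not neurons:
            continue
        p_l = len(neurons[0])  # P_l: parameters per neuron in this layer
        group_term = sum(math.sqrt(sum(w * w for w in theta)) for theta in neurons)
        l1_term = sum(abs(w) for theta in neurons for w in theta)  # ||theta_l||_1
        total += (1.0 - alpha) * beta * math.sqrt(p_l) * group_term
        total += alpha * beta * l1_term
    return total


# Hypothetical single layer with two neurons of two parameters each.
layers = [[[3.0, -4.0], [0.0, 0.0]]]
betas = [0.5]
pure_group = sparse_group_lasso_penalty(layers, betas, alpha=0.0)
mixed = sparse_group_lasso_penalty(layers, betas, alpha=0.5)
pure_l1 = sparse_group_lasso_penalty(layers, betas, alpha=1.0)
```

With $\alpha=0$ only the group norms matter, and with $\alpha=1$ the penalty is an ordinary $\ell_{1}$ norm over the layer; the experiments described here use $\alpha=0$ and $\alpha=0.5$.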

This is reminiscent of the relationship among Lasso regression, Ridge regression and Elastic Net regression. In lasso regression, the penalized residual sum of squares is the regular residual sum of squares plus an L1 regularizer. In ridge regression, it is the regular residual sum of squares plus an L2 regularizer. Finally, elastic net regression combines the lasso and ridge regularizers, so its objective function includes both the L1 and L2 norms.
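For reference, the three objectives mentioned above can be written side by side. This is the standard textbook formulation rather than anything taken from the paper; the symbols $X$, $y$, $w$, $\lambda$, $\lambda_{1}$ and $\lambda_{2}$ are notation assumed here for illustration.

$$\text{Lasso:}\quad \min_{w}\;||y-Xw||_{2}^{2}+\lambda||w||_{1}$$
$$\text{Ridge:}\quad \min_{w}\;||y-Xw||_{2}^{2}+\lambda||w||_{2}^{2}$$
$$\text{Elastic net:}\quad \min_{w}\;||y-Xw||_{2}^{2}+\lambda_{1}||w||_{1}+\lambda_{2}||w||_{2}^{2}$$

The sparse group Lasso plays an analogous combining role, except that its second ingredient is an unsquared $\ell_{2}$ norm applied per group, which is what allows whole groups to be set exactly to zero.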

To solve this optimization problem, the paper uses proximal gradient descent [Parikh and Boyd, 2014]. This approach iteratively takes a gradient step of size $t$ with respect to the loss and then applies the proximal operator of the regularizer. The algorithm is as follows:

We define the proximal operator of $f$ with step size $t$ as $$prox_{tf}(v)=\displaystyle \arg\min_{x}\left(\frac{1}{2t}||x-v||_{2}^{2}+f(x)\right)$$

Suppose we want to minimize $f(x)+g(x)$. The proximal gradient method is then given by $$x^{(k+1)}=prox_{t^{(k)}g}\left(x^{(k)}-t^{(k)}\nabla{f}(x^{(k)})\right),\quad k=1,2,3,\ldots$$
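The iteration can be tried on a one-dimensional toy problem. The sketch below minimizes $f(x)+g(x)$ with $f(x)=\frac{1}{2}(x-3)^{2}$ and $g(x)=\lambda|x|$, for which the proximal operator of $tg$ is soft-thresholding at $t\lambda$. The problem, the fixed step size and the function names are assumptions chosen for illustration only.

```python
def soft_threshold(a, b):
    """S(a, b) = sign(a) * max(|a| - b, 0), for b >= 0."""
    if a > b:
        return a - b
    if a < -b:
        return a + b
    return 0.0


def proximal_gradient(grad_f, lam, x0, step, iters):
    """Proximal gradient for f(x) + lam * |x| with a fixed step size."""
    x = x0
    for _ in range(iters):
        # Gradient step on the smooth part, then the prox of the l1 term.
        x = soft_threshold(x - step * grad_f(x), step * lam)
    return x


def grad_f(x):
    """Gradient of f(x) = 0.5 * (x - 3)^2."""
    return x - 3.0


x_small_penalty = proximal_gradient(grad_f, lam=1.0, x0=0.0, step=0.5, iters=60)
x_large_penalty = proximal_gradient(grad_f, lam=5.0, x0=0.0, step=0.5, iters=60)
```

For $\lambda=1$ the iterates approach $x=2$, the point where $x-3+\lambda=0$; for $\lambda=5$ the penalty dominates and the solution is exactly zero, which mirrors how a sufficiently strong penalty removes a parameter entirely.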

Therefore, we can update the parameters of each neuron by the above method as $$\tilde{\theta}_{l}^{n}=\displaystyle \arg\min_{\theta_{l}^{n}}\frac{1}{2t}||\theta_{l}^{n}-\hat{\theta}_{l}^{n}||_{2}^{2}+\gamma(\Theta),$$
where $\hat{\theta}_{l}^{n}$ is the solution obtained from the gradient step on the loss. Following the derivation of [Simon et al., 2013], this problem has a closed-form solution:
$$\tilde{\theta}_{l}^{n}=\left(1-\frac{t(1-\alpha)\beta_{l}\sqrt{P_{l}}}{||S(\hat{\theta}_{l}^{n},t\alpha\beta_{l})||_{2}}\right)_{+}S(\hat{\theta}_{l}^{n},t\alpha\beta_{l}),$$
where $(\cdot)_{+}$ refers to taking the maximum between the argument and 0, and $S(\cdot,\cdot)$ is the soft-thresholding operator, applied elementwise: $$S(a,b)=sign(a)(|a|-b)_{+}$$
In practice, we use stochastic gradient descent and work with mini-batches, and then update the variables of all the groups according to the closed-form of $\tilde{\theta}_{l}^{n}$. When the learning steps terminate, we remove the neurons whose parameters have gone to zero.
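The closed-form update can be sketched for a single neuron as follows. This is a minimal plain-Python illustration under stated assumptions: `sgl_prox` and its argument names are hypothetical, `theta_hat` stands for the neuron's parameters after the gradient step on the loss, and $P_{l}$ is taken to be the length of that vector.

```python
import math


def soft_threshold(a, b):
    """S(a, b) = sign(a) * max(|a| - b, 0), for b >= 0."""
    if a > b:
        return a - b
    if a < -b:
        return a + b
    return 0.0


def sgl_prox(theta_hat, t, alpha, beta):
    """Closed-form sparse group Lasso update for one neuron's parameters."""
    p_l = len(theta_hat)
    # Elementwise soft-thresholding handles the l1 part of the penalty.
    s = [soft_threshold(w, t * alpha * beta) for w in theta_hat]
    norm = math.sqrt(sum(v * v for v in s))
    if norm == 0.0:
        return [0.0] * p_l  # the whole neuron is already zero
    # Group shrinkage: scale the vector, or zero it if the norm is too small.
    scale = max(0.0, 1.0 - t * (1.0 - alpha) * beta * math.sqrt(p_l) / norm)
    return [scale * v for v in s]


strong = sgl_prox([3.0, 4.0], t=1.0, alpha=0.0, beta=1.0)   # shrunk, kept
weak = sgl_prox([0.3, 0.4], t=1.0, alpha=0.0, beta=1.0)     # zeroed out
mixed = sgl_prox([3.0, -0.2], t=1.0, alpha=0.5, beta=1.0)   # one weight zeroed
```

A neuron with a large parameter norm is only shrunk, while one whose norm falls below $t(1-\alpha)\beta_{l}\sqrt{P_{l}}$ is set exactly to zero as a whole; such neurons are the ones removed once training ends.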

='''Experiment'''=

==='''Set Up'''===

They use two large-scale image classification datasets, '''ImageNet''' [Russakovsky et al., 2015] and '''Places2-401''' [Zhou et al., 2015]. They also conducted additional experiments on the '''ICDAR''' character recognition dataset of [Jaderberg et al., 2014a].

For ImageNet, they used the subset which contains 1000 categories, with 1.2 million training images and 50000 validation images. Places2-401 has more than 10 million images in 401 unique scene categories, with 5000 to 30000 images per category. For both datasets, the architectures are based on the VGG-B network (BNet) [Simonyan and Zisserman, 2014] and on DecomposeMe8 ($Dec_{8}$) [Alvarez and Petersson, 2016]. BNet has 10 convolutional layers followed by 3 fully-connected layers. In the experiment, they remove the first 2 fully-connected layers and call the resulting network $BNet^{C}$. $Dec_{8}$ contains 16 convolutional layers with 1D kernels, which can model 8 2D convolutional layers. Both models were trained for a total of 55 epochs with 12000 batches per epoch, with a batch size of 48 for BNet and 180 for $Dec_{8}$. The learning rate was initialized to 0.01 and then multiplied by 0.1. They set $\beta_{l}$=0.102 for the first three layers and $\beta_{l}$=0.255 for the remaining ones.

The ICDAR dataset consists of 185639 training and 5198 test samples split into 36 categories. The architecture here starts with 6 1D convolutional layers with max-pooling (rather than 3 convolutional layers with a maxout layer [Goodfellow et al., 2013] after each convolution), followed by one fully-connected layer. They call their architecture $Dec_{3}$. The model was trained for a total of 45 epochs with a batch size of 256 and 1000 iterations per epoch. The learning rate was initialized to 0.1 and multiplied by 0.1 in the second, seventh and fifteenth epochs. They set $\beta_{l}$=5.1 for the first layer and $\beta_{l}$=10.2 for the remaining ones.
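The ICDAR learning-rate schedule described above can be expressed as a small step function. This is only a sketch: the function name `learning_rate` is hypothetical, and it assumes epochs are counted from 1 and that each decay takes effect at the start of the named epoch, which the summary does not state explicitly.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.1, milestones=(2, 7, 15)):
    """Step schedule: multiply by `decay` at each milestone epoch (1-indexed)."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** drops


# Learning rate for each of the 45 training epochs.
schedule = [learning_rate(epoch) for epoch in range(1, 46)]
```

Under these assumptions the rate is 0.1 for the first epoch only, then 0.01 up to the sixth epoch, 0.001 up to the fourteenth, and 0.0001 from the fifteenth epoch onwards.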

==='''Results'''===

[[File:imageNet.png]]

The above table shows the accuracy comparison between the original architectures and ours. For $Dec_{8}$ on the ImageNet dataset, we evaluated two additional models: $Dec_{8}-640$ with 640 neurons per layer and $Dec_{8}-768$ with 768 neurons per layer. $Dec_{8}-640_{SGL}$ denotes the sparse group Lasso regularizer with $\alpha=0.5$, and $Dec_{8}-640_{GS}$ denotes the group sparsity regularizer. Note that all our architectures yield an improvement over the original network except $Dec_{8}-768$. For instance, Ours-$Bnet_{GS}^{C}$ improves performance by 1.6% compared to $BNet^{C}$.

[[File:44.png]]

[[File:2.png]]

The above figures report the percentage of neurons and parameters removed by our approach for $BNet^{C}$ and $Dec_{8}$. For example, in the first figure, our approach reduces the number of neurons by over 12% and the number of parameters by around 14%, while improving generalization by 1.6% (as indicated by the accuracy gap). The left image in the first figure also shows that the reduction in the number of neurons is spread across all the layers, with the largest difference in L10. For $Dec_{8}$, in the second figure, we find that the benefits of our approach become more significant as the number of neurons in each layer increases. For instance, $Dec_{8}-640$ with the group sparsity regularizer reduces the number of neurons by 10% and the number of parameters by 12.48%. The left image in the second figure also shows that the reduction in the number of neurons is spread across all the layers.

[[File:ICDA.png]]

Finally, the above figure shows the experimental results for the ICDAR dataset. Here, we used the $Dec_{3}$ architecture, where the last two layers initially contain 512 neurons. The accuracy for $MaxPool_{2Dneurons}$ is 83.8% and the accuracy for $Dec_{3}$ is 89.3%, which suggests that on this dataset the network with 1D filters performs better than the network with 2D kernels. With the group sparsity regularizer, our model removes 38.64% of the neurons and up to 80% of the parameters on this dataset.

The above results indicate that our algorithm effectively reduces the number of parameters and, in most cases, increases model accuracy. Automatic model selection with this algorithm therefore performs well on the classification task.

='''Analysis on Testing'''=

Our algorithm does not remove neurons during training. Instead, the zeroed-out neurons are removed after training, which yields a smaller network at test time. This not only reduces the number of parameters of the network, but also decreases the memory cost and increases the speed.
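The post-training removal step can be sketched for two consecutive fully-connected layers. The function below is a hypothetical illustration, not the authors' code: it assumes plain nested lists for the weights, and it assumes a ReLU activation, so that a neuron whose weights and bias are all zero always outputs zero and can be dropped together with the matching input column of the next layer without changing the network's output.

```python
def prune_zero_neurons(w1, b1, w2):
    """Drop neurons of layer l whose parameters are all zero.

    w1: rows are the weight vectors of the neurons in layer l.
    b1: biases of the neurons in layer l.
    w2: rows are neurons of layer l+1; column i multiplies the output of
        neuron i in layer l.
    Returns the pruned (w1, b1, w2) and the indices of the kept neurons.
    """
    keep = [
        i for i, (row, b) in enumerate(zip(w1, b1))
        if b != 0.0 or any(v != 0.0 for v in row)
    ]
    w1_pruned = [w1[i] for i in keep]
    b1_pruned = [b1[i] for i in keep]
    # Remove the inputs of the next layer that came from deleted neurons.
    w2_pruned = [[row[i] for i in keep] for row in w2]
    return w1_pruned, b1_pruned, w2_pruned, keep


# Hypothetical trained parameters: the middle neuron has been zeroed out.
w1 = [[1.0, 2.0], [0.0, 0.0], [3.0, 0.0]]
b1 = [0.5, 0.0, 0.0]
w2 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w1_p, b1_p, w2_p, kept = prune_zero_neurons(w1, b1, w2)
```

In this toy case the first layer shrinks from three neurons to two and the second layer loses one input per neuron, so both the parameter count and the work per forward pass go down. With an activation that is non-zero at zero, such as a sigmoid, the constant output of a removed neuron would have to be folded into the next layer's biases instead.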

[[File:table2.png]]

The above table reports the runtime, memory, and percentage of parameters removed after eliminating the zeroed-out neurons. BNet and $Dec_{8}$ were tested on ImageNet, while $Dec_{3-GS}$ was tested on ICDAR. From the table, we find that all the models for ImageNet and ICDAR run faster. For example, $Dec_{8}-768_{GS}$ on ImageNet speeds up the runtime by nearly 16% at a batch size of 8, and $Dec_{3}$ on ICDAR speeds up by nearly 50% at a batch size of 16. In terms of parameters, BNet, $Dec_{8}-640_{GS}$ and $Dec_{8}-768_{GS}$ are reduced by 12.06%, 26.51%, and 46.73% respectively. More significantly, $Dec_{3-GS}$ removes 82.35% of the parameters. All of these changes show the benefits at test time.

='''Conclusion'''=

In this paper, we have introduced an approach that relies on a group sparsity regularizer. This approach automatically determines the number of neurons in each layer of a deep network. In the experiments, our approach not only reduces the number of parameters in the model, but also saves memory and increases speed at test time. However, a limitation of the approach is that the number of layers in the network remains fixed.

='''Critique'''=
The authors of the paper state that "...we assume that the parameters of each neuron in layer $l$ are grouped in a vector of size $P_{l}$ and where $\lambda_{l}$ sets the influence of the penalty. Note that, in the general case, this weight can be different for each layer $l$. In practice, however, we found most effective to have two different weights: a relatively small one for the first few layers, and a larger weight for the remaining ones. This effectively prevents killing too many neurons in the first few layers, and thus retains enough information for the remaining ones." However, the authors fail to present any guidance as to what gets counted as "the first few layers" and what the relative sizes for the two weights should be even after we have chosen the "first few layers". Indeed, such choice seems to be an unaccounted component of tuning the model but this receives scant attention in the current paper.

The experiments could have included better baseline models to compare against. For example, how do we know the original model was not overly complex to begin with? It might have been a good idea for the authors to compare their group sparse lasso method against the naive method of (blindly) reducing the number of neurons in each layer by 10-20% just for a very preliminary check.

='''References'''=

P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.

M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864–875, Nov 1992.

Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.

I. J. Goodfellow, D. Warde-farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.

G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv, 2014.

M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014a.

M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014b.

N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.

H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.

Group LASSO - https://pdfs.semanticscholar.org/f677/a011b2a912e3c5c604f6872b9716cc0b8aa0.pdf

Liu, Baoyuan, et al. "Sparse convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

Derivation & Motivation of the Soft Thresholding Operator (Proximal Operator):
# http://www.onmyphd.com/?p=proximal.operator
# https://math.stackexchange.com/questions/471339/derivation-of-soft-thresholding-operator

Learning the Number of Neurons in Deep Networks

2017-11-07T23:40:50Z

A2prasad: /* References */

='''Introduction'''=

Due to the availability of large-scale datasets and powerful computation, '''Deep Learning''' has made huge breakthroughs in many areas, such as Language Models and Computer Vision. In deep neural networks, we need to determine the number of layers and the number of neurons in each layer, i.e., the number of parameters, or the complexity, of the model. Typically, this is done manually by trial and error. Currently, it is mostly achieved by manually tuning these hyper-parameters using validation data or by building very deep networks. However, building a very deep model is still challenging, especially for very large datasets, and leads to a high memory cost and reduced speed.

In this paper, we used an approach that automatically selects the number of neurons in each layer while the network is being learned. Our approach introduces a '''group sparsity regularizer''' on the parameters of the network, where each group acts on the parameters of one neuron, rather than training an initial network as a pre-processing step (e.g. training shallow or thin networks to mimic the behaviour of deep ones [Hinton et al., 2014, Romero et al., 2015]). Setting the parameters in a group to zero cancels out the effect of that particular neuron. Therefore, our approach does not need to successfully learn a redundant network and then reduce its parameters; instead, it learns the number of relevant neurons in each layer and the parameters of those neurons simultaneously.

In the experiments on several image recognition datasets, we showed the effectiveness of our approach, which reduces the number of parameters by up to 80% compared to the complete model, and has no recognition accuracy loss at the same time. Actually, our approach even yields more effective and faster networks, and occupies less memory.

='''Related Work'''=

Recent research tends to build very deep networks. A very deep network has more parameters to learn, which incurs a significant cost in both memory and speed. Automatic model selection has been studied in past years through constructive and destructive approaches, but both have drawbacks. A '''constructive method''' starts from a very shallow architecture and then adds parameters [Bello, 1992]. A similar approach, which adds new layers to an initial shallow network during learning, was successfully employed in [Simonyan and Zisserman, 2014]. However, shallow networks have fewer parameters and cannot handle non-linearities as effectively as deep networks [Montufar et al., 2014], so they may easily get stuck in bad optima. The drawback of this method is therefore that these shallow networks may provide poor initializations for the later stages. The authors make this claim without providing any evidence for it. A '''destructive method''' starts from a deep network and removes a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015] while keeping its behaviour unchanged. Although it has been shown that removing redundant parameters [LeCun et al., 1990, Hassibi et al., 1993] or neurons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] has little influence on the output, doing so requires analysing each parameter and neuron via the network Hessian, which is very computationally expensive for large architectures. The main motivation of these works was to build a more compact network.

In particular, building compact networks is a research focus for '''Convolutional Neural Networks''' (CNNs). Some works have proposed decomposing the filters of a pre-trained network into low-rank filters, which reduces the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. The issue with this proposal is that an initial deep network must first be trained successfully, since the decomposition acts as a post-processing step. [Weigend et al., 1991] and [Collins and Kohli, 2014] used direct training to develop regularizers that eliminate some of the parameters of the network. The problem is that the number of layers and the number of neurons in each layer are still determined manually.

='''Model Training and Model Selection'''=

In general, a deep network has $L$ layers containing linear operations on their inputs, intertwined with activation functions. The activation functions generally used are '''Rectified Linear Units (ReLU) or sigmoids'''. Suppose layer $l$ has $N_{l}$ neurons. The network parameters are $\Theta=(\theta_{l})_{1\leqslant{l}\leqslant{L}}$, where $\theta_{l}=({\theta^n _{l}})_{1\leqslant{n}\leqslant{N_{l}}}$ collects the parameters of layer $l$ and $\theta^n _{l}=[w_{l}^{n},b_{l}^{n}]$ holds the parameters of neuron $n$ in that layer. Given an input $x$, we obtain the output $\hat{y}=f(x,\Theta)$, where $f(\cdot)$ encodes the succession of linear, non-linear and pooling operations.

During training, we have $N$ input-output pairs $\{(x_{i},y_{i})\}_{1\leqslant{i}\leqslant{N}}$, and the loss function $\ell(y_{i},f(x_{i},\Theta))$ compares the predicted output with the ground-truth output. Generally, the logistic loss is chosen for classification and the square loss for regression. Learning the parameters of the network is then equivalent to solving the following optimization problem:
$$\displaystyle \min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},f(x_{i},\Theta))+\gamma(\Theta),$$ where $\gamma(\Theta)$ represents a regularizer on the network parameters. Common choices for the regularizer are the $\ell_{2}$-norm (i.e., weight decay) or the $\ell_{1}$-norm. The $\ell_{2}$-norm favours small parameter values, while the $\ell_{1}$-norm can zero out individual irrelevant parameters but not entire neurons. The goal in this paper is to automatically determine the number of neurons in each layer, and neither of these techniques achieves it. Here, we make use of '''group sparsity''' [Yuan and Lin, 2007], starting from an overcomplete network and cancelling the influence of some neurons. The regularizer can therefore be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2},$$ where $P_{l}$ is the size of the vector that contains the parameters of each neuron in layer $l$, and $\beta_{l}$ balances the influence of the penalty. In practice, we found it most effective to use a relatively small $\beta_{l}$ for the first few layers and a larger one for the remaining layers. A small weight prevents too many neurons from being deleted in the first few layers, so that enough information is retained for learning the remaining parameters. The original premise of this paper seemed to suggest a new method that was different from both the constructive and destructive methods described above. However, this approach of starting with an overcomplete network and training with group sparsity appears to be no different from destructive methods. The main contribution here is then a regularization function that acts on entire neurons, which is in fairness an interesting approach.

Group sparsity helps us remove some of the neurons effectively, while standard regularizers on the individual parameters are effective for generalization [Bartlett, 1996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. Following this idea, we introduce the '''sparse group Lasso''', a more general penalty that merges the L1 norm of the Lasso with the group lasso (i.e. "two-norm") penalty. This yields solutions that are sparse at both the individual and the group feature levels [1]. The regularizer can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\left((1-\alpha)\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2}+\alpha\beta_{l}||\theta_{l}||_{1}\right),$$ where $\alpha\in[0,1]$. If $\alpha=0$, this reduces to the group sparsity regularizer. In practice, we use both $\alpha=0$ and $\alpha=0.5$ in the experiments.

This reminds me of the relationships among Lasso regression, Ridge regression and Elastic Net regression. In lasso regression, the penalized residual sum of squares is composed of the regular residual sum of squared plus a L1 regularizer. In ridge regression, its penalized residual sum of squares is composed of the regular residual sum of squared plus a L2 regularizer. Finally, an elastic net regression is a combination of lasso regularizer and ridge regularizer, where its objective function is to optimize parameters by including both L1 and L2 norms.

To solve this optimization problem, the paper uses proximal gradient descent [Parikh and Boyd, 2014]. This approach iteratively takes a gradient step of size $t$ with respect to the loss and then applies the proximal operator of the regularizer. The algorithm is as follows:

We define the proximal operator of $f$ with step size $t$ as $$prox_{tf}(v)=\displaystyle \arg\min_{x}\left(\frac{1}{2t}||x-v||_{2}^{2}+f(x)\right)$$

Suppose we want to minimize $f(x)+g(x)$. The proximal gradient method is then given by $$x^{(k+1)}=prox_{t^{(k)}g}\left(x^{(k)}-t^{(k)}\nabla{f}(x^{(k)})\right),\quad k=1,2,3,\ldots$$

Therefore, we can update the parameters of each neuron by the above method as $$\tilde{\theta}_{l}^{n}=\displaystyle \arg\min_{\theta_{l}^{n}}\frac{1}{2t}||\theta_{l}^{n}-\hat{\theta}_{l}^{n}||_{2}^{2}+\gamma(\Theta),$$
where $\hat{\theta}_{l}^{n}$ is the solution obtained from the gradient step on the loss. Following the derivation of [Simon et al., 2013], this problem has a closed-form solution:
$$\tilde{\theta}_{l}^{n}=\left(1-\frac{t(1-\alpha)\beta_{l}\sqrt{P_{l}}}{||S(\hat{\theta}_{l}^{n},t\alpha\beta_{l})||_{2}}\right)_{+}S(\hat{\theta}_{l}^{n},t\alpha\beta_{l}),$$
where $(\cdot)_{+}$ refers to taking the maximum between the argument and 0, and $S(\cdot,\cdot)$ is the soft-thresholding operator, applied elementwise: $$S(a,b)=sign(a)(|a|-b)_{+}$$
In practice, we use stochastic gradient descent and work with mini-batches, and then update the variables of all the groups according to the closed-form of $\tilde{\theta}_{l}^{n}$. When the learning steps terminate, we remove the neurons whose parameters have gone to zero.

='''Experiment'''=

==='''Set Up'''===

They use two large-scale image classification datasets, '''ImageNet''' [Russakovsky et al., 2015] and '''Places2-401''' [Zhou et al., 2015]. They also conducted additional experiments on the '''ICDAR''' character recognition dataset of [Jaderberg et al., 2014a].

For ImageNet, they used the subset which contains 1000 categories, with 1.2 million training images and 50000 validation images. Places2-401 has more than 10 million images in 401 unique scene categories, with 5000 to 30000 images per category. For both datasets, the architectures are based on the VGG-B network (BNet) [Simonyan and Zisserman, 2014] and on DecomposeMe8 ($Dec_{8}$) [Alvarez and Petersson, 2016]. BNet has 10 convolutional layers followed by 3 fully-connected layers. In the experiment, they remove the first 2 fully-connected layers and call the resulting network $BNet^{C}$. $Dec_{8}$ contains 16 convolutional layers with 1D kernels, which can model 8 2D convolutional layers. Both models were trained for a total of 55 epochs with 12000 batches per epoch, with a batch size of 48 for BNet and 180 for $Dec_{8}$. The learning rate was initialized to 0.01 and then multiplied by 0.1. They set $\beta_{l}$=0.102 for the first three layers and $\beta_{l}$=0.255 for the remaining ones.

The ICDAR dataset consists of 185639 training and 5198 test samples split into 36 categories. The architecture here starts with 6 1D convolutional layers with max-pooling (rather than 3 convolutional layers with a maxout layer [Goodfellow et al., 2013] after each convolution), followed by one fully-connected layer. They call their architecture $Dec_{3}$. The model was trained for a total of 45 epochs with a batch size of 256 and 1000 iterations per epoch. The learning rate was initialized to 0.1 and multiplied by 0.1 in the second, seventh and fifteenth epochs. They set $\beta_{l}$=5.1 for the first layer and $\beta_{l}$=10.2 for the remaining ones.

==='''Results'''===

[[File:imageNet.png]]

The above table shows the accuracy comparison between the original architectures and ours. For $Dec_{8}$ on the ImageNet dataset, we evaluated two additional models: $Dec_{8}-640$ with 640 neurons per layer and $Dec_{8}-768$ with 768 neurons per layer. $Dec_{8}-640_{SGL}$ denotes the sparse group Lasso regularizer with $\alpha=0.5$, and $Dec_{8}-640_{GS}$ denotes the group sparsity regularizer. Note that all our architectures yield an improvement over the original network except $Dec_{8}-768$. For instance, Ours-$Bnet_{GS}^{C}$ improves performance by 1.6% compared to $BNet^{C}$.

[[File:44.png]]

[[File:2.png]]

The above figures report the percentage of neurons and parameters removed by our approach for $BNet^{C}$ and $Dec_{8}$. For example, in the first figure, our approach reduces the number of neurons by over 12% and the number of parameters by around 14%, while improving generalization by 1.6% (as indicated by the accuracy gap). The left image in the first figure also shows that the reduction in the number of neurons is spread across all the layers, with the largest difference in L10. For $Dec_{8}$, in the second figure, we find that the benefits of our approach become more significant as the number of neurons in each layer increases. For instance, $Dec_{8}-640$ with the group sparsity regularizer reduces the number of neurons by 10% and the number of parameters by 12.48%. The left image in the second figure also shows that the reduction in the number of neurons is spread across all the layers.

[[File:ICDA.png]]

Finally, the above figure shows the experimental results for the ICDAR dataset. Here, we used the $Dec_{3}$ architecture, where the last two layers initially contain 512 neurons. The accuracy for $MaxPool_{2Dneurons}$ is 83.8% and the accuracy for $Dec_{3}$ is 89.3%, which suggests that on this dataset the network with 1D filters performs better than the network with 2D kernels. With the group sparsity regularizer, our model removes 38.64% of the neurons and up to 80% of the parameters on this dataset.

All the above results evidence that our algorithm effectively removes the number of parameters and increases the model accuracy. Our algorithm of automatic model selection effectively performs on the classification task.

='''Analysis on Testing'''=

Our algorithm does not remove neurons during the training time, however, we remove those neurons after training, which yields a smaller network at test time. This improvement not only reduces the number of parameters of the network, but also decreases the computational memory cost and increases the speed.

[[File:table2.png]]

The above table reports the runtime, memory, and percentage of parameters removed after eliminating the zeroed-out neurons. BNet and $Dec_{8}$ were tested on ImageNet, while $Dec_{3-GS}$ was tested on ICDAR. From the table, we find that all the models for ImageNet and ICDAR run faster. For example, $Dec_{8}-768_{GS}$ on ImageNet speeds up the runtime by nearly 16% at a batch size of 8, and $Dec_{3}$ on ICDAR speeds up by nearly 50% at a batch size of 16. In terms of parameters, BNet, $Dec_{8}-640_{GS}$ and $Dec_{8}-768_{GS}$ are reduced by 12.06%, 26.51%, and 46.73% respectively. More significantly, $Dec_{3-GS}$ removes 82.35% of the parameters. All of these changes show the benefits at test time.

='''Conclusion'''=

In this paper, we have introduced an approach that relies on group sparsity regularizer. This approach automatically determines the number of neurons in each layer of a deep network. From the experiments, we found our approach not only reduces the number of parameters in our model, but also saves the computation memory and increases the speed at test time. However, the limitation of our approach is that the number of layers in the network remains fixed.

='''Critique'''=
The authors of the paper state that ``...we assume that the parameters of each neuron in layer $l$ are grouped in a vector of size $P_{l}$ and where $\lambda_{l}$ sets the influence of the penalty. Note that, in the general case, this weight can be different for each layer $l$. In practice, however, we found most effective to have
two different weights: a relatively small one for the first few layers, and a larger weight for the
remaining ones. This effectively prevents killing too many neurons in the first few layers, and thus
retains enough information for the remaining ones.`` However, the authors fail to present any guidance as to what gets counted as ``the first few layers`` and what the relative sizes for the two weights should be even after we have chosen the ``first few layers``. Indeed, such choice seems to be an unaccounted component of tuning the model but this receives scant attention in the current paper.

The experiments could have included better baseline models to compare against. For example, how do we know the original model was not overly complex to begin with? It might have been a good idea for the authors to compare their group sparse lasso method against the naive method of (blindly) reducing the number of neurons in each layer by 10-20% just for a very preliminary check.

='''References'''=

P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.

M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864–875, Nov 1992.

Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.

I. J. Goodfellow, D. Warde-farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.

G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv, 2014.

M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014a.

M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014b.

N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.

H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.

Group LASSO - https://pdfs.semanticscholar.org/f677/a011b2a912e3c5c604f6872b9716cc0b8aa0.pdf

Liu, Baoyuan, et al. "Sparse convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

Derivation & Motivation of the Soft Thresholding Operator (Proximal Operator):
# http://www.onmyphd.com/?p=proximal.operator
# https://math.stackexchange.com/questions/471339/derivation-of-soft-thresholding-operator

Learning the Number of Neurons in Deep Networks

2017-11-07T23:40:31Z

A2prasad: /* References */

='''Introduction'''=

Due to the availability of large-scale datasets and powerful computation, '''Deep Learning''' has made huge breakthroughs in many areas, such as language modelling and computer vision. In deep neural networks, we need to determine the number of layers and the number of neurons in each layer, i.e., the number of parameters, or complexity, of the model. Currently, this is mostly done by trial and error: manually tuning these hyper-parameters on validation data, or simply building very deep networks. However, building a very deep model is still challenging, especially for very large datasets, because it leads to high memory cost and reduced speed.

In this paper, we use an approach that automatically selects the number of neurons in each layer while the network is being learned. Our approach introduces a '''group sparsity regularizer''' on the parameters of the network, where each group acts on the parameters of one neuron. This differs from methods that train an initial network as a pre-processing step (for example, training shallow or thin networks to mimic the behaviour of deep ones [Hinton et al., 2014, Romero et al., 2015]). The regularizer drives the parameters of unneeded neurons to zero, which cancels out the effect of those neurons. Therefore, our approach does not need to first learn a redundant network successfully and then reduce its parameters; instead, it learns the number of relevant neurons in each layer and the parameters of those neurons simultaneously.

In experiments on several image recognition datasets, we show the effectiveness of our approach, which reduces the number of parameters by up to 80% compared to the complete model with no loss in recognition accuracy. In fact, our approach can even yield more effective and faster networks that occupy less memory.

='''Related Work'''=

Recent research tends to build very deep networks. Building very deep networks means learning more parameters, which leads to a significant cost in memory as well as speed. Automatic model selection has been developed in past years through constructive and destructive approaches, but both have drawbacks. A '''constructive method''' starts from a very shallow architecture and then adds additional parameters [Bello, 1992]. A similar approach, which adds new layers to an initial shallow network during learning, was successfully employed in [Simonyan and Zisserman, 2014]. However, shallow networks have fewer parameters and cannot handle non-linearities as effectively as deep networks [Montufar et al., 2014], so they may easily get stuck in bad optima. Therefore, the drawback of this method is that these shallow networks may produce poor initializations for the later stages. The authors make this claim without ever providing any evidence for it. A '''destructive method''' starts from a deep network and removes a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015] while keeping its behaviour unchanged. Even though this technique has shown that removing redundant parameters [LeCun et al., 1990, Hassibi et al., 1993] or neurons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] has little influence on the output, it requires analysing each parameter and neuron via the network Hessian, which is very computationally expensive for large architectures. The main motivation of these works was to build a more compact network.

Building compact networks is a particular research focus for '''Convolutional Neural Networks''' (CNNs). Some works have proposed to decompose the filters of a pre-trained network into low-rank filters, which reduces the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. The issue with this proposal is that an initial deep network must first be trained successfully, since the decomposition acts as a post-processing step. [Weigend et al., 1991] and [Collins and Kohli, 2014] instead developed regularizers that eliminate some of the parameters of the network directly during training. The problem is that the number of layers and the number of neurons in each layer are still determined manually.

='''Model Training and Model Selection'''=

In general, a deep network has $L$ layers containing linear operations on their inputs, interleaved with activation functions. The activation functions generally used are '''Rectified Linear Units (ReLU) or sigmoids'''. Suppose each layer $l$ has $N_{l}$ neurons. The network parameters are $\Theta=(\theta_{l})_{1\leqslant{l}\leqslant{L}}$, where $\theta_{l}=({\theta^n _{l}})_{1\leqslant{n}\leqslant{N_{l}}}$ collects the parameters of the neurons in layer $l$ and $\theta^n _{l}=[w_{l}^{n},b_{l}^{n}]$ holds the weights and bias of neuron $n$. Given an input $x$, we obtain the output $\hat{y}=f(x,\Theta)$, where $f(\cdot)$ encodes the succession of linear, non-linear and pooling operations.

During training, we have $N$ input-output pairs $\{(x_{i},y_{i})\}_{1\leqslant{i}\leqslant{N}}$, and the loss function is given by $\ell(y_{i},f(x_{i},\Theta))$, which compares the predicted output with the ground-truth output. Generally, we choose the logistic loss for classification and the square loss for regression. Therefore, learning the parameters of the network is equivalent to solving the following optimization problem:
$$\displaystyle \min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},f(x_{i},\Theta))+\gamma(\Theta),$$ where $\gamma(\Theta)$ represents a regularizer on the network parameters. Common choices for the regularizer are the $\ell_{2}$-norm (i.e., weight decay) or the $\ell_{1}$-norm. The $\ell_{2}$-norm favours small parameter values, and the $\ell_{1}$-norm can zero out individual irrelevant parameters, but not whole neurons. The goal in this paper is to automatically determine the number of neurons in each layer, and neither of these techniques achieves this goal. Here, we make use of '''group sparsity''' [Yuan and Lin, 2007] (starting from an overcomplete network and cancelling the influence of some neurons). The regularizer can therefore be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2},$$ where $P_{l}$ is the size of the vector that contains the parameters of each neuron in layer $l$, and $\beta_{l}$ sets the influence of the penalty for layer $l$. In practice, we found it most effective to use a relatively small $\beta_{l}$ for the first few layers and a larger one for the remaining layers. The small weight prevents deleting too many neurons in the first few layers, so that enough information is retained for learning the remaining parameters. The original premise of this paper seemed to suggest a new method that was different from both the constructive and destructive methods described above. However, this approach of starting with an overcomplete network and training with group sparsity appears to be no different from destructive methods. The main contribution here is then the regularization function that acts on entire neurons, which is in fairness an interesting approach.
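
As a rough illustration of how the group sparsity regularizer above is evaluated, the following sketch computes $\gamma(\Theta)$ for a toy network. The function name <code>group_sparsity_penalty</code>, the nested-list layout (one flat parameter list per neuron) and the toy values are assumptions made for this example only; they are not taken from the paper.

```python
import math

def group_sparsity_penalty(layers, betas):
    """Evaluate sum_l beta_l * sqrt(P_l) * sum_n ||theta_l^n||_2.

    layers[l] is a list of neurons; each neuron is a flat list holding
    its weights and bias (hypothetical layout chosen for this sketch).
    betas[l] is the penalty weight of layer l.
    """
    total = 0.0
    for neurons, beta in zip(layers, betas):
        if not neurons:
            continue
        p_l = len(neurons[0])  # P_l: number of parameters per neuron
        group_norms = sum(math.sqrt(sum(w * w for w in theta))
                          for theta in neurons)
        total += beta * math.sqrt(p_l) * group_norms
    return total

# Toy network: layer 1 has two neurons with 2 parameters each
# (the second one is already zeroed out), layer 2 has one neuron
# with 4 parameters.
layers = [
    [[3.0, 4.0], [0.0, 0.0]],
    [[1.0, 2.0, 2.0, 0.0]],
]
betas = [0.1, 0.2]  # smaller weight for the first layer
penalty = group_sparsity_penalty(layers, betas)
```

A zeroed-out neuron contributes nothing to the penalty, which is why the regularizer rewards switching off whole neurons rather than scattered individual weights.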

Group sparsity helps us effectively remove some of the neurons, while standard regularizers on the individual parameters are effective for generalization [Bartlett, 1996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. Following this idea, we introduce the '''sparse group Lasso''', a more general penalty that merges the L1 norm of the Lasso with the group lasso (i.e. "two-norm") penalty. This yields a penalty whose solutions are sparse at both the individual and the group feature levels [1]. The regularizer can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\left((1-\alpha)\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2}+\alpha\beta_{l}||\theta_{l}||_{1}\right),$$ where $\alpha\in[0,1]$. If $\alpha=0$, we recover the group sparsity regularizer. In practice, we use both $\alpha=0$ and $\alpha=0.5$ in the experiments.
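
The sparse group Lasso penalty can be sketched in the same way. The code below is a minimal illustration, assuming the same hypothetical nested-list layout (one flat parameter list per neuron); the function name and the toy values are not from the paper.

```python
import math

def sparse_group_lasso_penalty(layers, betas, alpha):
    """Evaluate the sparse group Lasso regularizer for a toy network.

    For each layer l the penalty is
      (1 - alpha) * beta_l * sqrt(P_l) * sum_n ||theta_l^n||_2
      + alpha * beta_l * ||theta_l||_1.
    """
    total = 0.0
    for neurons, beta in zip(layers, betas):
        if not neurons:
            continue
        p_l = len(neurons[0])  # P_l: number of parameters per neuron
        group_term = sum(math.sqrt(sum(w * w for w in theta))
                         for theta in neurons)
        l1_term = sum(abs(w) for theta in neurons for w in theta)
        total += ((1.0 - alpha) * beta * math.sqrt(p_l) * group_term
                  + alpha * beta * l1_term)
    return total

# One layer with two neurons of two parameters each.
layers = [[[3.0, -4.0], [0.0, 0.0]]]
betas = [1.0]
group_only = sparse_group_lasso_penalty(layers, betas, alpha=0.0)
mixed = sparse_group_lasso_penalty(layers, betas, alpha=0.5)
l1_only = sparse_group_lasso_penalty(layers, betas, alpha=1.0)
```

With $\alpha=0$ only the group term remains, and with $\alpha=1$ the penalty reduces to a plain L1 penalty scaled by $\beta_{l}$.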

This is reminiscent of the relationships among Lasso regression, Ridge regression and Elastic Net regression. In lasso regression, the penalized residual sum of squares is composed of the regular residual sum of squares plus an L1 regularizer. In ridge regression, the penalized residual sum of squares is composed of the regular residual sum of squares plus an L2 regularizer. Finally, elastic net regression combines the lasso and ridge regularizers, so its objective function includes both the L1 and L2 norms.
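
For reference, the three objectives in this analogy can be written side by side. The symbols used here ($X$ for the design matrix, $y$ for the responses, $w$ for the coefficients, and $\lambda_{1}$, $\lambda_{2}$ for the penalty weights) are notation assumed for this comparison only and do not appear in the paper.

$$\text{Lasso:}\quad \min_{w}\ ||y-Xw||_{2}^{2}+\lambda_{1}||w||_{1}$$

$$\text{Ridge:}\quad \min_{w}\ ||y-Xw||_{2}^{2}+\lambda_{2}||w||_{2}^{2}$$

$$\text{Elastic net:}\quad \min_{w}\ ||y-Xw||_{2}^{2}+\lambda_{1}||w||_{1}+\lambda_{2}||w||_{2}^{2}$$

One difference is worth noting: ridge and elastic net use the squared $\ell_{2}$ norm, which shrinks coefficients without setting them exactly to zero, whereas the group term of the sparse group Lasso uses the unsquared $\ell_{2}$ norm of each group, which is what allows an entire group to be zeroed out.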

To solve this optimization problem, the paper uses proximal gradient descent [Parikh and Boyd, 2014]. This approach iteratively takes a gradient step of size $t$ with respect to the loss and then applies the proximal operator of the regularizer. The algorithm is as follows:

We define the proximal operator of $f$ with step size $t$ as $$prox_{tf}(v)=\displaystyle \arg\min_{x}\left(\frac{1}{2t}||x-v||_{2}^{2}+f(x)\right)$$

Suppose we want to minimize $f(x)+g(x)$, where $f$ is differentiable. The proximal gradient method is then given by $$x^{(k+1)}=prox_{t^{(k)}g}(x^{(k)}-t^{(k)}\nabla{f}(x^{(k)})), \quad k=1,2,3,\ldots$$
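
The iteration above can be sketched on a small problem with a known answer. The example below is an illustration only: it assumes $f(x)=\frac{1}{2}||x-c||_{2}^{2}$ and $g(x)=\lambda||x||_{1}$, for which the proximal operator of $tg$ is elementwise soft-thresholding with threshold $t\lambda$. The function names, the target vector <code>c</code> and the constant step size are choices made for this sketch, not taken from the paper.

```python
def soft_threshold(a, b):
    """Shrink the scalar a toward zero by b (proximal operator of b*|.|)."""
    if a > b:
        return a - b
    if a < -b:
        return a + b
    return 0.0

def proximal_gradient(grad_f, prox_g, x0, step, iterations):
    """Repeat: gradient step on f, then proximal step on g."""
    x = list(x0)
    for _ in range(iterations):
        gradient = grad_f(x)
        v = [xi - step * gi for xi, gi in zip(x, gradient)]
        x = prox_g(v, step)
    return x

c = [3.0, 0.5, -2.0]   # hypothetical target vector
lam = 1.0              # hypothetical L1 weight

def grad_f(x):
    # Gradient of f(x) = 0.5 * ||x - c||^2
    return [xi - ci for xi, ci in zip(x, c)]

def prox_g(v, step):
    # Proximal operator of step * lam * ||.||_1
    return [soft_threshold(vi, step * lam) for vi in v]

solution = proximal_gradient(grad_f, prox_g, [0.0, 0.0, 0.0],
                             step=0.5, iterations=60)
```

For this toy problem the minimizer is the soft-thresholded target, so the iterates approach $(2, 0, -1)$; the middle coordinate is set exactly to zero, which is the behaviour the paper relies on to switch off parameters.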

Therefore, we can update our parameters by the above method as $$\tilde{\theta}_{l}^{n}=\displaystyle \arg\min_{\theta_{l}^{n}}\frac{1}{2t}||\theta_{l}^{n}-\hat{\theta}_{l}^{n}||_{2}^{2}+\gamma(\Theta),$$
where $\hat{\theta}_{l}^{n}$ is the solution obtained from the gradient step on the loss. Following the derivation of [Simon et al., 2013], this problem has a closed-form solution:
$$\tilde{\theta}_{l}^{n}=\left(1-\frac{t(1-\alpha)\beta_{l}\sqrt{P_{l}}}{||S(\hat{\theta}_{l}^{n},t\alpha\beta_{l})||_{2}}\right)_{+}S(\hat{\theta}_{l}^{n},t\alpha\beta_{l}),$$
where $(\cdot)_{+}$ refers to taking the maximum between the argument and 0, and $S(\cdot,\cdot)$ is the soft-thresholding operator, applied elementwise: $$S(a,b)=sign(a)(|a|-b)_{+}$$
In practice, we use stochastic gradient descent and work with mini-batches, and then update the variables of all the groups according to the closed-form of $\tilde{\theta}_{l}^{n}$. When the learning steps terminate, we remove the neurons whose parameters have gone to zero.
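
The closed-form update and the final pruning step can be sketched as follows. This is a minimal illustration of the formula for $\tilde{\theta}_{l}^{n}$, not the authors' implementation: the function names, the toy layer and the values of $t$, $\alpha$ and $\beta_{l}$ are assumptions, and $P_{l}$ is taken to be the length of the neuron's parameter list.

```python
import math

def soft_threshold(a, b):
    """S(a, b) = sign(a) * max(|a| - b, 0) for a scalar a."""
    if a > b:
        return a - b
    if a < -b:
        return a + b
    return 0.0

def prox_sparse_group_lasso(theta_hat, t, alpha, beta):
    """Closed-form proximal update for the parameters of one neuron.

    theta_hat is the neuron's parameter list after the gradient step.
    """
    p_l = len(theta_hat)
    s = [soft_threshold(w, t * alpha * beta) for w in theta_hat]
    norm = math.sqrt(sum(w * w for w in s))
    if norm == 0.0:
        # Every entry was thresholded to zero: the neuron is switched off.
        return [0.0] * p_l
    scale = max(1.0 - t * (1.0 - alpha) * beta * math.sqrt(p_l) / norm, 0.0)
    return [scale * w for w in s]

# Hypothetical layer with three neurons after a gradient step.
layer = [[3.0, 4.0], [0.3, -0.4], [-1.0, 2.0]]
updated = [prox_sparse_group_lasso(theta, t=1.0, alpha=0.5, beta=1.0)
           for theta in layer]

# After training, keep only neurons with at least one non-zero parameter.
surviving = [theta for theta in updated if any(w != 0.0 for w in theta)]
```

In this toy layer the second neuron has small parameters and is driven exactly to zero, so it is dropped, while the other two neurons are only shrunk.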

='''Experiment'''=

==='''Set Up'''===

They use two large-scale image classification datasets, '''ImageNet''' [Russakovsky et al., 2015] and '''Places2-401''' [Zhou et al., 2015]. They also conducted additional experiments on the '''ICDAR''' character recognition dataset of [Jaderberg et al., 2014a].

For ImageNet, they used the subset which contains 1000 categories, with 1.2 million training images and 50000 validation images. Places2-401 has more than 10 million images from 401 unique scene categories, with 5000 to 30000 images per category. The architectures for both datasets are based on the VGG-B network (BNet) [Simonyan and Zisserman, 2014] and on DecomposeMe8 ($Dec_{8}$) [Alvarez and Petersson, 2016]. BNet has 10 convolutional layers followed by 3 fully-connected layers. In the experiment, they remove the first 2 fully-connected layers and call the resulting network $BNet^{C}$. $Dec_{8}$ contains 16 convolutional layers with 1D kernels, which can model 8 2D convolutional layers. Both models were trained for a total of 55 epochs with 12000 batches per epoch and a batch size of 48 and 180 for BNet and $Dec_{8}$, respectively. The learning rate was initialized to 0.01 and then multiplied by 0.1. They set $\beta_{l}$=0.102 for the first three layers and $\beta_{l}$=0.255 for the remaining ones.

The ICDAR dataset consists of 185639 training and 5198 test samples split into 36 categories. The architecture here starts with 6 1D convolutional layers with max-pooling, rather than 3 convolutional layers with a maxout layer [Goodfellow et al., 2013] after each convolution, followed by one fully-connected layer. They call this architecture $Dec_{3}$. The model was trained for a total of 45 epochs with a batch size of 256 and 1000 iterations per epoch. The learning rate was initialized to 0.1 and multiplied by 0.1 in the second, seventh and fifteenth epochs. They set $\beta_{l}$=5.1 for the first layer and $\beta_{l}$=10.2 for the remaining ones.
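
The ICDAR learning-rate schedule described above can be sketched as a simple step function. This is an interpretation rather than the authors' code: it assumes that epochs are counted from 1 and that each decay takes effect from the start of the listed epoch. The function name and argument names are hypothetical.

```python
def step_learning_rate(epoch, base_lr=0.1, decay=0.1, milestones=(2, 7, 15)):
    """Learning rate for a 1-indexed epoch under a step schedule.

    The rate starts at base_lr and is multiplied by decay once for
    every milestone epoch that has been reached (assumed convention).
    """
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * decay ** drops

# Learning rate for each of the 45 training epochs.
schedule = [step_learning_rate(epoch) for epoch in range(1, 46)]
```

Under these assumptions the rate is 0.1 in the first epoch, 0.01 from the second, 0.001 from the seventh, and 0.0001 from the fifteenth epoch onward.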

==='''Results'''===

[[File:imageNet.png]]

The above table shows the accuracy comparisons between the original architectures and ours. For $Dec_{8}$ on the ImageNet dataset, we evaluated two additional models: $Dec_{8}-640$ with 640 neurons per layer and $Dec_{8}-768$ with 768 neurons per layer. $Dec_{8}-640_{SGL}$ denotes the sparse group Lasso regularizer with $\alpha=0.5$ and $Dec_{8}-640_{GS}$ denotes the group sparsity regularizer. Note that all our architectures yield an improvement over the original network except $Dec_{8}-768$. For instance, Ours-$BNet_{GS}^{C}$ improves accuracy by 1.6% compared to $BNet^{C}$.

[[File:44.png]]

[[File:2.png]]

The above figures report the percentage of neurons and parameters removed by our approach for $BNet^{C}$ and $Dec_{8}$. For example, in the first figure, our approach reduces the number of neurons by over 12% and the number of parameters by around 14%, while improving generalization by 1.6% (as indicated by the accuracy gap). The left image in the first figure also shows that the reduction in the number of neurons is spread across all the layers, with the largest difference in L10. For $Dec_{8}$, in the second figure, we find that as we increase the number of neurons in each layer, the benefits of our approach become more significant. For instance, $Dec_{8}-640$ with the group sparsity regularizer reduces the number of neurons by 10% and the number of parameters by 12.48%. The left image in the second figure also shows that the reduction in the number of neurons is spread across all the layers.

[[File:ICDA.png]]

Finally, the above figure shows the experimental results for the ICDAR dataset. Here, we used the $Dec_{3}$ architecture, where the last two layers initially contain 512 neurons. The accuracy for $MaxPool_{2Dneurons}$ is 83.8%, and the accuracy for $Dec_{3}$ is 89.3%, which means the network with 1D filters performs better than the network with 2D kernels. With the group sparsity regularizer, our model removes 38.64% of the neurons and up to 80% of the parameters on this dataset.

All the above results show that our algorithm effectively reduces the number of parameters and increases model accuracy. Our automatic model selection algorithm performs effectively on the classification task.

='''Analysis on Testing'''=

Our algorithm does not remove neurons during training; instead, we remove the zeroed-out neurons after training, which yields a smaller network at test time. This not only reduces the number of parameters of the network, but also decreases the memory cost and increases the speed.
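
To make the post-training removal concrete, the sketch below prunes zeroed-out neurons from a pair of fully-connected layers and checks that the output is unchanged. It is an illustration under stated assumptions, not the authors' code: the layers are plain nested lists, the activation is ReLU (so a neuron whose weights and bias are all zero outputs exactly 0), and all names and values are hypothetical. With a sigmoid activation a zeroed neuron would output 0.5, so its removal would not be exact without adjusting the next layer's biases.

```python
def relu(values):
    return [max(v, 0.0) for v in values]

def dense(weights, biases, inputs):
    """Fully-connected layer; each row of `weights` belongs to one neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def prune_zeroed_neurons(w1, b1, w2):
    """Drop neurons of layer 1 whose parameters are all zero.

    The matching input columns of the next layer (w2) are dropped too,
    so the shapes of the two layers stay consistent.
    """
    keep = [i for i, (row, b) in enumerate(zip(w1, b1))
            if b != 0.0 or any(w != 0.0 for w in row)]
    w1_small = [w1[i] for i in keep]
    b1_small = [b1[i] for i in keep]
    w2_small = [[row[i] for i in keep] for row in w2]
    return w1_small, b1_small, w2_small

# Toy two-layer network; the middle neuron of layer 1 was zeroed out.
w1 = [[0.5, -1.0], [0.0, 0.0], [2.0, 1.0]]
b1 = [0.1, 0.0, -0.3]
w2 = [[1.0, 5.0, -1.0], [0.5, -2.0, 0.25]]
b2 = [0.0, 1.0]
x = [1.0, 2.0]

full_output = dense(w2, b2, relu(dense(w1, b1, x)))
w1_s, b1_s, w2_s = prune_zeroed_neurons(w1, b1, w2)
small_output = dense(w2_s, b2, relu(dense(w1_s, b1_s, x)))
```

The pruned network has fewer rows in the first layer and fewer columns in the second, which is where the savings in parameters, memory and runtime at test time come from.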

[[File:table2.png]]

The above table reports the runtime, memory, and percentage of parameters removed after dropping the zeroed-out neurons. BNet and $Dec_{8}$ were tested on ImageNet, while $Dec_{3-GS}$ was tested on ICDAR. From the table, we find that all the models for ImageNet and ICDAR run faster. For example, $Dec_{8}-768_{GS}$ on ImageNet speeds up the runtime by nearly 16% at a batch size of 8, and $Dec_{3}$ on ICDAR speeds up by nearly 50% at a batch size of 16. In terms of the percentage of parameters removed, BNet, $Dec_{8}-640_{GS}$ and $Dec_{8}-768_{GS}$ are reduced by 12.06%, 26.51%, and 46.73% respectively. More significantly, $Dec_{3-GS}$ removes 82.35% of the parameters. All of these changes show the benefits at test time.

='''Conclusion'''=

In this paper, we have introduced an approach that relies on a group sparsity regularizer. This approach automatically determines the number of neurons in each layer of a deep network. In the experiments, our approach not only reduced the number of parameters in the model, but also lowered memory use and increased speed at test time. A limitation of our approach is that the number of layers in the network remains fixed.

='''Critique'''=
The authors of the paper state that "...we assume that the parameters of each neuron in layer $l$ are grouped in a vector of size $P_{l}$ and where $\lambda_{l}$ sets the influence of the penalty. Note that, in the general case, this weight can be different for each layer $l$. In practice, however, we found most effective to have two different weights: a relatively small one for the first few layers, and a larger weight for the remaining ones. This effectively prevents killing too many neurons in the first few layers, and thus retains enough information for the remaining ones." (The $\lambda_{l}$ in this quotation is the layer weight written $\beta_{l}$ in the summary above.) However, the authors give no guidance on what counts as "the first few layers", nor on the relative sizes of the two weights once those layers have been chosen. This choice is effectively an additional tuning step for the model, yet it receives scant attention in the paper.

The experiments could have included better baseline models to compare against. For example, how do we know the original model was not overly complex to begin with? It might have been a good idea for the authors to compare their group sparse lasso method against the naive method of (blindly) reducing the number of neurons in each layer by 10-20% just for a very preliminary check.

='''References'''=

P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.

M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864–875, Nov 1992.

Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.

I. J. Goodfellow, D. Warde-farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.

G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv, 2014.

M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014a.

M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014b.

N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.

H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.

Group LASSO - https://pdfs.semanticscholar.org/f677/a011b2a912e3c5c604f6872b9716cc0b8aa0.pdf

Liu, Baoyuan, et al. "Sparse convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

Derivation & Motivation of the Soft Thresholding Operator (Proximal Operator):
# http://www.onmyphd.com/?p=proximal.operator
# https://math.stackexchange.com/questions/471339/derivation-of-soft-thresholding-operator