http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=D35kumar&feedformat=atomstatwiki - User contributions [US]2021-05-09T14:44:40ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fix_your_classifier:_the_marginal_value_of_training_the_last_weight_layer&diff=42025Fix your classifier: the marginal value of training the last weight layer2018-11-30T06:26:39Z<p>D35kumar: /* Shortcomings of the Final Classification Layer and its Solution */ Editorial</p>
<hr />
<div><br />
The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.<br />
<br />
=Introduction=<br />
<br />
Deep neural networks have become widely used for machine learning, achieving state-of-the-art results on many tasks. One of the most common tasks they are used for is classification. For example, convolutional neural networks (CNNs) are used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.<br />
<br />
=Brief Overview=<br />
<br />
In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (upto a global scale constant). They argue that with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix the inference could be made faster as well.<br />
<br />
=Previous Work=<br />
<br />
Training NN models and using them for inference requires large amounts of memory and computational resources; thus, extensive amount of research has been done lately to reduce the size of networks which are as follows:<br />
<br />
* Weight sharing and specification (Han et al., 2015)<br />
<br />
* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)<br />
<br />
* Low-rank approximations to speed up CNN (Tai et al., 2015)<br />
<br />
* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)<br />
<br />
Some of the past works have also put forward the fact that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many of the classification tasks. However, the authors' proposal in the current paper is quite reversed.<br />
<br />
=Background=<br />
<br />
Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at later stage of the network, presumably to allow classification based on global features of an image.<br />
<br />
== Shortcomings of the Final Classification Layer and its Solution ==<br />
<br />
Zeiler & Fergus, 2014 showed that despite the enormous number of trainable parameters these layers add to the model, they are known to have a rather marginal impact on the final performance of the network.<br />
<br />
It has been shown previously that these layers could be easily compressed and reduced after a model was trained by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized with the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), that lead to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works showed that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer was introduced and the objective regarded only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, where they kept the original initialization of the classification layer fixed with no negative impact on performance on the CIFAR-10 dataset.<br />
<br />
=Proposed Method=<br />
<br />
The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors have suggested that parameters used for the final classification transform are completely redundant, and can be replaced with a predetermined linear transform. This holds for even in large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).<br />
<br />
==Using a Fixed Classifier==<br />
<br />
Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math> where <math>F</math> is assumed to be a deep neural network with input z and parameters θ, e.g., a convolutional network, trained by backpropagation.<br />
<br />
In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math> ,where <math>W</math> and <math>b</math> are also trained by back-propagation.<br />
<br />
For input <math>x</math> of <math>N</math> length, and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of <math>N ×<br />
C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation<br />
<br />
<math><br />
v_i = \frac{e^{y_i}}{\sum_{j}^{C}{e^{y_j}}}, i &isin; </math> { <math> {1, . . . , C} </math> }<br />
<br />
and reducing the expected negative log likelihood with respect to ground-truth target <math> t &isin; </math> { <math> {1, . . . , C} </math> },<br />
by minimizing the loss function:<br />
<br />
<math><br />
L(x, t) = −\text{log}\ {v_t} = −{w_t}·{x} − b_t + \text{log} ({\sum_{j}^{C}e^{w_j . x + b_j}})<br />
</math><br />
<br />
where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.<br />
<br />
==Choosing the Projection Matrix==<br />
<br />
To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math> Q &isin; R^{N×C} </math>, such that <math> &forall; i &ne; j : q_i · q_j = 0 </math> and <math> || q_i ||_{2} = 1 </math>, where <math>q_i</math> is the <math>i</math>th column of <math>Q</math>. This is ensured by a simple random sampling and singular-value decomposition<br />
<br />
As the rows of classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, we find it beneficial<br />
to also restrict the representation of <math>x</math> by normalizing it to reside on the <math>n</math>-dimensional sphere:<br />
<br />
<center><math><br />
\hat{x} = \frac{x}{||x||_{2}}<br />
</math></center><br />
<br />
This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, it has now an issue that <math>q_i · \hat{x} </math> is bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive, and the network is affected by the inability to re-scale its input. This issue is amended with a fixed scale <math>T</math> applied to softmax inputs <math>f(y) = softmax(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. So, the authors propose to introduce a single scalar parameter <math>\alpha</math> to learn the softmax scale, effectively functioning as an inverse of the softmax temperature <math>\frac{1}{T}</math>. The normalized weights and an additional scale coefficient are also used, specially using a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b &isin; \mathbb{R}^{C}</math> is kept the same and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:<br />
<br />
<center><br />
<math><br />
v_i=\frac{e^{\alpha q_i &middot; \hat{x} + b_i}}{\sum_{j}^{C} e^{\alpha q_j &middot; \hat{x} + b_j}}, i &isin; </math> { <math> {1,...,C} </math>}<br />
</center><br />
<br />
and the loss to be minimized is:<br />
<br />
<center><math><br />
L(x, t) = -\alpha q_t &middot; \frac{x}{||x||_{2}} + b_t + \text{log} (\sum_{i=1}^{C} \text{exp}((\alpha q_i &middot; \frac{x}{||x||_{2}} + b_i)))<br />
</math></center><br />
<br />
where <math>x</math> is the final representation obtained by the network for a specific sample, and <math> t &isin; </math> { <math> {1, . . . , C} </math> } is the ground-truth label for that sample. The behaviour of the parameter <math> \alpha </math> over time, which is logarithmic in nature and has the same behavior exhibited by the norm of a learned classifier, is shown in<br />
[[Media: figure1_log_behave.png| Figure 1]].<br />
<br />
<center>[[File:figure1_log_behave.png]]</center><br />
<br />
When <math> -1 \le q_i · \hat{x} \le 1 </math>, a possible cosine angle loss is <br />
<br />
<center>[[File:caloss.png]]</center><br />
<br />
But its final validation accuracy has slight decrease, compared to original models.<br />
<br />
==Using a Hadmard Matrix==<br />
<br />
To recall, Hadmard matrix (Hedayat et al., 1978) <math> H </math> is an <math> n × n </math> matrix, where all of its entries are either +1 or −1.<br />
Furthermore, <math> H </math> is orthogonal, such that <math> HH^{T} = nI_n </math> where <math>I_n</math> is the identity matrix. Instead of using the entire Hadmard matrix <math>H</math>, a truncated version, <math> \hat{H} &isin; </math> {<math> {-1, 1}</math>}<math>^{C \times N}</math> where all <math>C</math> rows are orthogonal as the final classification layer is such that:<br />
<br />
<center><math><br />
y = \hat{H} \hat{x} + b<br />
</math></center><br />
<br />
This usage allows two main benefits:<br />
* A deterministic, low-memory and easily generated matrix that can be used for classification.<br />
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.<br />
<br />
Here, <math>n</math> must be a multiple of 4, but it can be easily truncated to fit normally defined networks. Also, as the classifier weights are fixed to need only 1-bit precision, it is now possible to focus our attention on the features preceding it.<br />
<br />
=Experimental Results=<br />
<br />
The authors have evaluated their proposed model on the following datasets:<br />
<br />
==CIFAR-10/100==<br />
<br />
===About the Dataset===<br />
<br />
CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of same size, but contains 100 different classes.<br />
<br />
===Training Details===<br />
<br />
The authors trained a residual network ( He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].<br />
<br />
<center>[[File: figure1_resnet_cifar10.png]]</center><br />
<br />
These results demonstrate that although the training error is considerably lower for the network with learned classifier, both models achieve the same classification accuracy on the validation set. The authors' conjecture is that with the new fixed parameterization, the network can no longer increase the norm of a given sample’s representation - thus learning its label requires more effort. As this may happen for specific seen samples - it affects only training error.<br />
<br />
The authors also compared using a fixed scale variable <math>\alpha </math> at different values vs. the learned parameter. Results for <math> \alpha = </math> {0.1, 1, 10} are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error and as can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha </math>= 1 or 10 will suffice) at the expense of another hyper-parameter to seek. In all the further experiments the scaling parameter <math> \alpha </math> was regularized with the same weight decay coefficient used on original classifier. Although learning the scale is not necessary, but it will help convergence during training.<br />
<br />
<center>[[File: figure3_alpha_resnet_cifar.png]]</center><br />
<br />
The authors then train the model on CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with depth of 100 layers and k = 12. The higher number of classes caused the number of parameters to grow and encompassed about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained equally good as the original model, and the same training curve was observed as earlier.<br />
<br />
==IMAGENET==<br />
<br />
===About the Dataset===<br />
<br />
The Imagenet dataset introduced by Deng et al. (2009) spans over 1000 visual classes, and over 1.2 million samples. This is supposedly a more challenging dataset to work on as compared to CIFAR-10/100.<br />
<br />
===Experiment Details===<br />
<br />
The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and Densenet169 model (Huang et al., 2017) as described in the original work. Using a fixed classifier removed approximately 2-million parameters were from the model, accounting for about 8% and 12 % of the model parameters respectively. The experiments revealed similar trends as observed on CIFAR-10.<br />
<br />
For a more stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed to be used in low memory and limited computing platforms and has parameters making up the majority of the model. They were able to reduce the parameters to 0.86 million as compared to 0.96 million parameters in the final layer of the original model. Again, the proposed modification in the original model gave similar convergence results on validation accuracy. Interestingly, this method allowed Imagenet training in an under-specified regime, where there are<br />
more training samples than number of parameters. This is an unconventional regime for modern deep networks, which are usually over-specified to have many more parameters than training samples (Zhang et al., 2017a).<br />
<br />
The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].<br />
<br />
<center>[[File: table1_fixed_results.png]]</center><br />
<br />
==Language Modelling==<br />
<br />
Recent works have empirically found that using the same weights for both word embedding and classifier can yield equal or better results than using a separate pair of weights. So the authors experimented with fix-classifiers on language modelling as it also requires classification of all possible tokens available in the task vocabulary. They trained a recurrent model with 2-layers of LSTM (Hochreiter & Schmidhuber, 1997) and embedding + hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using same settings as in Merity et al. (2017). WikiText2 dataset contains about 33K different words, so the number of parameters expected in the embedding and classifier layer was about 34-million. This number is about 89% of the total number of parameters used for the whole model which is 38-million. However, using a random orthogonal transform yielded poor results compared to learned embedding. This was suspected to be due to semantic relationships captured in the embedding layer of language models, which is not the case in image classification task. The intuition was further confirmed by the much better results when pre-trained embeddings using word2vec algorithm by Mikolov et al. (2013) or PMI factorization as suggested by Levy & Goldberg (2014), were used.<br />
<br />
<center>[[File: language.png]]</center><br />
<br />
=Discussion=<br />
<br />
==Implications and Use Cases==<br />
<br />
With the increasing number of classes in the benchmark datasets, computational demands for the final classifier will increase as well. In order to understand the problem better, the authors observe the work by Sun et al. (2017), which introduced JFT-300M - an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016), with a 2048 sized representation led to a model with over 36M parameters meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty in distributing so many parameters over the training servers involving a non-trivial overhead during synchronization of the model for update. The authors claim that the fixed-classifier would help considerably in this kind of scenario - where using a fixed classifier removes the need to do any gradient synchronization for the final layer. Furthermore, introduction of Hadamard matrix removes the need to save the transformation altogether, thereby, making it more efficient and allowing considerable memory and computational savings.<br />
<br />
==Possible Caveats==<br />
<br />
The good performance of fixed-classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when when the ratio between learned features and number of classes is small – that is, when <math> C > N</math>. However, they tested their method in such cases and their model performed well and provided good results.<br />
Another factor that can affect the performance of their model using a fixed classifier is when the classes are highly correlated. In that case, the fixed classifier actually cannot support correlated classes and thus, the network could have some difficulty to learn. For a language model, word classes tend to have highly correlated instances, which also lead to difficult learning process.<br />
<br />
Also, this proposed approach will only eliminate the computation of the classifier weights, so when the classes are fewer, the computation saving effect will not be readily apparent.<br />
<br />
==Future Work==<br />
<br />
<br />
The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there is no need in a per-sample normalization.<br />
<br />
Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.<br />
<br />
A related paper was published that claims that fixing most of the parameters of the neural network achieves comparable results with learning all of them [A. Rosenfeld and J. K. Tsotsos]<br />
<br />
=Conclusion=<br />
<br />
In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change lead to little or almost no decline in the performance of the architecture. Furthermore, using a Hadmard matrix as classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on large amount of transformation coefficients.<br />
<br />
Another possible scope of research that could be pointed out for future could be to find new efficient methods to create pre-defined word embeddings, which require huge amount of parameters that can possibly be avoided when learning a new task. Therefore, more emphasis should be given to the representations learned by the non-linear parts of the neural networks - upto the final classifier, as it seems highly redundant.<br />
<br />
=Critique=<br />
<br />
The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).<br />
<br />
Moreover, one of the main intuitions of the paper has introduced to be computational cost but it has left out to compare a fixed and learned classifier based on the computational cost and then investigate whether it worth the drop in performance or not considering the fact that not always the output can be degraded because of need for speed! At least a discussion on this issue is expected.<br />
<br />
On the other hand, the computational cost and performance change after fixation of classifier could be related to dataset and the nature and complexity of it. Mostly, having 1000 classes makes the classification more crucial than 2 classes. An evaluation on this topic is also needed.<br />
<br />
=References=<br />
<br />
Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.<br />
<br />
Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.<br />
<br />
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a” siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.<br />
<br />
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.<br />
<br />
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.<br />
<br />
Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.<br />
<br />
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.<br />
<br />
Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.<br />
<br />
A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6<br />
(6):1184–1238, 1978.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural computation, 9(8): 1735–1780, 1997.<br />
<br />
Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.<br />
<br />
Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.<br />
<br />
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.<br />
<br />
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.<br />
<br />
Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.<br />
<br />
Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.<br />
<br />
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.<br />
<br />
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.<br />
<br />
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.<br />
<br />
Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to ´ document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.<br />
<br />
Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.<br />
<br />
Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.<br />
<br />
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.<br />
<br />
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.<br />
<br />
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.<br />
<br />
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed tations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.<br />
<br />
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.<br />
Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.<br />
<br />
Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017,<br />
pp. 157, 2017.<br />
<br />
Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.<br />
<br />
Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.<br />
<br />
Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.<br />
<br />
Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.<br />
<br />
Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.<br />
<br />
Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.<br />
<br />
Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.<br />
<br />
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.<br />
<br />
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.<br />
<br />
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.<br />
<br />
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.<br />
<br />
Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with lowrank regularization. arXiv preprint arXiv:1511.06067, 2015.<br />
<br />
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.<br />
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.<br />
<br />
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.<br />
<br />
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.<br />
<br />
Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.<br />
<br />
A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fix_your_classifier:_the_marginal_value_of_training_the_last_weight_layer&diff=42023Fix your classifier: the marginal value of training the last weight layer2018-11-30T06:23:31Z<p>D35kumar: /* Introduction */ (Editorial)</p>
<hr />
<div><br />
The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.<br />
<br />
=Introduction=<br />
<br />
Deep neural networks have become widely used for machine learning, achieving state-of-the-art results on many tasks. One of the most common tasks they are used for is classification. For example, convolutional neural networks (CNNs) are used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.<br />
<br />
=Brief Overview=<br />
<br />
In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (upto a global scale constant). They argue that with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix the inference could be made faster as well.<br />
<br />
=Previous Work=<br />
<br />
Training NN models and using them for inference requires large amounts of memory and computational resources; thus, extensive amount of research has been done lately to reduce the size of networks which are as follows:<br />
<br />
* Weight sharing and specification (Han et al., 2015)<br />
<br />
* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)<br />
<br />
* Low-rank approximations to speed up CNN (Tai et al., 2015)<br />
<br />
* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)<br />
<br />
Some of the past works have also put forward the fact that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many of the classification tasks. However, the authors' proposal in the current paper is quite reversed.<br />
<br />
=Background=<br />
<br />
Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at later stage of the network, presumably to allow classification based on global features of an image.<br />
<br />
== Shortcomings of the Final Classification Layer and its Solution ==<br />
<br />
Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).<br />
<br />
It has been shown previously that these layers could be easily compressed and reduced after a model was trained by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized with the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), that lead to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works showed that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer was introduced and the objective regarded only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, where they kept the original initialization of the classification layer fixed with no negative impact on performance on the CIFAR-10 dataset.<br />
<br />
=Proposed Method=<br />
<br />
The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors have suggested that parameters used for the final classification transform are completely redundant, and can be replaced with a predetermined linear transform. This holds for even in large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).<br />
<br />
==Using a Fixed Classifier==<br />
<br />
Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math> where <math>F</math> is assumed to be a deep neural network with input z and parameters θ, e.g., a convolutional network, trained by backpropagation.<br />
<br />
In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math> ,where <math>W</math> and <math>b</math> are also trained by back-propagation.<br />
<br />
For input <math>x</math> of <math>N</math> length, and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of <math>N ×<br />
C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation<br />
<br />
<math><br />
v_i = \frac{e^{y_i}}{\sum_{j}^{C}{e^{y_j}}}, i &isin; </math> { <math> {1, . . . , C} </math> }<br />
<br />
and reducing the expected negative log likelihood with respect to ground-truth target <math> t &isin; </math> { <math> {1, . . . , C} </math> },<br />
by minimizing the loss function:<br />
<br />
<math><br />
L(x, t) = −\text{log}\ {v_t} = −{w_t}·{x} − b_t + \text{log} ({\sum_{j}^{C}e^{w_j . x + b_j}})<br />
</math><br />
<br />
where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.<br />
<br />
==Choosing the Projection Matrix==<br />
<br />
To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math> Q &isin; R^{N×C} </math>, such that <math> &forall; i &ne; j : q_i · q_j = 0 </math> and <math> || q_i ||_{2} = 1 </math>, where <math>q_i</math> is the <math>i</math>th column of <math>Q</math>. This is ensured by a simple random sampling and singular-value decomposition<br />
<br />
As the rows of classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, we find it beneficial<br />
to also restrict the representation of <math>x</math> by normalizing it to reside on the <math>n</math>-dimensional sphere:<br />
<br />
<center><math><br />
\hat{x} = \frac{x}{||x||_{2}}<br />
</math></center><br />
<br />
This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, it has now an issue that <math>q_i · \hat{x} </math> is bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive, and the network is affected by the inability to re-scale its input. This issue is amended with a fixed scale <math>T</math> applied to softmax inputs <math>f(y) = softmax(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. So, the authors propose to introduce a single scalar parameter <math>\alpha</math> to learn the softmax scale, effectively functioning as an inverse of the softmax temperature <math>\frac{1}{T}</math>. The normalized weights and an additional scale coefficient are also used, specially using a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b &isin; \mathbb{R}^{C}</math> is kept the same and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:<br />
<br />
<center><br />
<math><br />
v_i=\frac{e^{\alpha q_i &middot; \hat{x} + b_i}}{\sum_{j}^{C} e^{\alpha q_j &middot; \hat{x} + b_j}}, i &isin; </math> { <math> {1,...,C} </math>}<br />
</center><br />
<br />
and the loss to be minimized is:<br />
<br />
<center><math><br />
L(x, t) = -\alpha q_t &middot; \frac{x}{||x||_{2}} + b_t + \text{log} (\sum_{i=1}^{C} \text{exp}((\alpha q_i &middot; \frac{x}{||x||_{2}} + b_i)))<br />
</math></center><br />
<br />
where <math>x</math> is the final representation obtained by the network for a specific sample, and <math> t &isin; </math> { <math> {1, . . . , C} </math> } is the ground-truth label for that sample. The behaviour of the parameter <math> \alpha </math> over time, which is logarithmic in nature and has the same behavior exhibited by the norm of a learned classifier, is shown in<br />
[[Media: figure1_log_behave.png| Figure 1]].<br />
<br />
<center>[[File:figure1_log_behave.png]]</center><br />
<br />
When <math> -1 \le q_i · \hat{x} \le 1 </math>, a possible cosine angle loss is <br />
<br />
<center>[[File:caloss.png]]</center><br />
<br />
But its final validation accuracy has slight decrease, compared to original models.<br />
<br />
==Using a Hadmard Matrix==<br />
<br />
To recall, Hadmard matrix (Hedayat et al., 1978) <math> H </math> is an <math> n × n </math> matrix, where all of its entries are either +1 or −1.<br />
Furthermore, <math> H </math> is orthogonal, such that <math> HH^{T} = nI_n </math> where <math>I_n</math> is the identity matrix. Instead of using the entire Hadmard matrix <math>H</math>, a truncated version, <math> \hat{H} &isin; </math> {<math> {-1, 1}</math>}<math>^{C \times N}</math> where all <math>C</math> rows are orthogonal as the final classification layer is such that:<br />
<br />
<center><math><br />
y = \hat{H} \hat{x} + b<br />
</math></center><br />
<br />
This usage allows two main benefits:<br />
* A deterministic, low-memory and easily generated matrix that can be used for classification.<br />
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.<br />
<br />
Here, <math>n</math> must be a multiple of 4, but it can be easily truncated to fit normally defined networks. Also, as the classifier weights are fixed to need only 1-bit precision, it is now possible to focus our attention on the features preceding it.<br />
<br />
=Experimental Results=<br />
<br />
The authors have evaluated their proposed model on the following datasets:<br />
<br />
==CIFAR-10/100==<br />
<br />
===About the Dataset===<br />
<br />
CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of same size, but contains 100 different classes.<br />
<br />
===Training Details===<br />
<br />
The authors trained a residual network ( He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].<br />
<br />
<center>[[File: figure1_resnet_cifar10.png]]</center><br />
<br />
These results demonstrate that although the training error is considerably lower for the network with learned classifier, both models achieve the same classification accuracy on the validation set. The authors' conjecture is that with the new fixed parameterization, the network can no longer increase the norm of a given sample’s representation - thus learning its label requires more effort. As this may happen for specific seen samples - it affects only training error.<br />
<br />
The authors also compared using a fixed scale variable <math>\alpha </math> at different values vs. the learned parameter. Results for <math> \alpha = </math> {0.1, 1, 10} are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error and as can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha </math>= 1 or 10 will suffice) at the expense of another hyper-parameter to seek. In all the further experiments the scaling parameter <math> \alpha </math> was regularized with the same weight decay coefficient used on original classifier. Although learning the scale is not necessary, but it will help convergence during training.<br />
<br />
<center>[[File: figure3_alpha_resnet_cifar.png]]</center><br />
<br />
The authors then train the model on CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with depth of 100 layers and k = 12. The higher number of classes caused the number of parameters to grow and encompassed about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained equally good as the original model, and the same training curve was observed as earlier.<br />
<br />
==IMAGENET==<br />
<br />
===About the Dataset===<br />
<br />
The Imagenet dataset introduced by Deng et al. (2009) spans over 1000 visual classes, and over 1.2 million samples. This is supposedly a more challenging dataset to work on as compared to CIFAR-10/100.<br />
<br />
===Experiment Details===<br />
<br />
The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and Densenet169 model (Huang et al., 2017) as described in the original work. Using a fixed classifier removed approximately 2-million parameters were from the model, accounting for about 8% and 12 % of the model parameters respectively. The experiments revealed similar trends as observed on CIFAR-10.<br />
<br />
For a more stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed to be used in low memory and limited computing platforms and has parameters making up the majority of the model. They were able to reduce the parameters to 0.86 million as compared to 0.96 million parameters in the final layer of the original model. Again, the proposed modification in the original model gave similar convergence results on validation accuracy. Interestingly, this method allowed Imagenet training in an under-specified regime, where there are<br />
more training samples than number of parameters. This is an unconventional regime for modern deep networks, which are usually over-specified to have many more parameters than training samples (Zhang et al., 2017a).<br />
<br />
The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].<br />
<br />
<center>[[File: table1_fixed_results.png]]</center><br />
<br />
==Language Modelling==<br />
<br />
Recent works have empirically found that using the same weights for both word embedding and classifier can yield equal or better results than using a separate pair of weights. So the authors experimented with fix-classifiers on language modelling as it also requires classification of all possible tokens available in the task vocabulary. They trained a recurrent model with 2-layers of LSTM (Hochreiter & Schmidhuber, 1997) and embedding + hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using same settings as in Merity et al. (2017). WikiText2 dataset contains about 33K different words, so the number of parameters expected in the embedding and classifier layer was about 34-million. This number is about 89% of the total number of parameters used for the whole model which is 38-million. However, using a random orthogonal transform yielded poor results compared to learned embedding. This was suspected to be due to semantic relationships captured in the embedding layer of language models, which is not the case in image classification task. The intuition was further confirmed by the much better results when pre-trained embeddings using word2vec algorithm by Mikolov et al. (2013) or PMI factorization as suggested by Levy & Goldberg (2014), were used.<br />
<br />
<center>[[File: language.png]]</center><br />
<br />
=Discussion=<br />
<br />
==Implications and Use Cases==<br />
<br />
With the increasing number of classes in the benchmark datasets, computational demands for the final classifier will increase as well. In order to understand the problem better, the authors observe the work by Sun et al. (2017), which introduced JFT-300M - an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016), with a 2048 sized representation led to a model with over 36M parameters meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty in distributing so many parameters over the training servers involving a non-trivial overhead during synchronization of the model for update. The authors claim that the fixed-classifier would help considerably in this kind of scenario - where using a fixed classifier removes the need to do any gradient synchronization for the final layer. Furthermore, introduction of Hadamard matrix removes the need to save the transformation altogether, thereby, making it more efficient and allowing considerable memory and computational savings.<br />
<br />
==Possible Caveats==<br />
<br />
The good performance of fixed-classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when when the ratio between learned features and number of classes is small – that is, when <math> C > N</math>. However, they tested their method in such cases and their model performed well and provided good results.<br />
Another factor that can affect the performance of their model using a fixed classifier is when the classes are highly correlated. In that case, the fixed classifier actually cannot support correlated classes and thus, the network could have some difficulty to learn. For a language model, word classes tend to have highly correlated instances, which also lead to difficult learning process.<br />
<br />
Also, this proposed approach will only eliminate the computation of the classifier weights, so when the classes are fewer, the computation saving effect will not be readily apparent.<br />
<br />
==Future Work==<br />
<br />
<br />
The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there is no need in a per-sample normalization.<br />
<br />
Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.<br />
<br />
A related paper was published that claims that fixing most of the parameters of the neural network achieves comparable results with learning all of them [A. Rosenfeld and J. K. Tsotsos]<br />
<br />
=Conclusion=<br />
<br />
In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change lead to little or almost no decline in the performance of the architecture. Furthermore, using a Hadmard matrix as classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on large amount of transformation coefficients.<br />
<br />
Another possible scope of research that could be pointed out for future could be to find new efficient methods to create pre-defined word embeddings, which require huge amount of parameters that can possibly be avoided when learning a new task. Therefore, more emphasis should be given to the representations learned by the non-linear parts of the neural networks - upto the final classifier, as it seems highly redundant.<br />
<br />
=Critique=<br />
<br />
The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).<br />
<br />
Moreover, one of the main intuitions of the paper has introduced to be computational cost but it has left out to compare a fixed and learned classifier based on the computational cost and then investigate whether it worth the drop in performance or not considering the fact that not always the output can be degraded because of need for speed! At least a discussion on this issue is expected.<br />
<br />
=References=<br />
<br />
Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.<br />
<br />
Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.<br />
<br />
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a” siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.<br />
<br />
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.<br />
<br />
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.<br />
<br />
Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.<br />
<br />
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.<br />
<br />
Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.<br />
<br />
A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6<br />
(6):1184–1238, 1978.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural computation, 9(8): 1735–1780, 1997.<br />
<br />
Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.<br />
<br />
Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.<br />
<br />
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.<br />
<br />
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.<br />
<br />
Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.<br />
<br />
Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.<br />
<br />
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.<br />
<br />
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.<br />
<br />
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.<br />
<br />
Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to ´ document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.<br />
<br />
Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.<br />
<br />
Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.<br />
<br />
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.<br />
<br />
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.<br />
<br />
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.<br />
<br />
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed tations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.<br />
<br />
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.<br />
Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.<br />
<br />
Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017,<br />
pp. 157, 2017.<br />
<br />
Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.<br />
<br />
Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.<br />
<br />
Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.<br />
<br />
Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.<br />
<br />
Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.<br />
<br />
Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.<br />
<br />
Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.<br />
<br />
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.<br />
<br />
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.<br />
<br />
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.<br />
<br />
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.<br />
<br />
Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with lowrank regularization. arXiv preprint arXiv:1511.06067, 2015.<br />
<br />
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.<br />
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.<br />
<br />
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.<br />
<br />
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.<br />
<br />
Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.<br />
<br />
A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs&diff=41663stat946F18/Beyond Word Importance Contextual Decomposition to Extract Interactions from LSTMs2018-11-27T23:33:21Z<p>D35kumar: /* Critique/Future Work */ Removing the incorrect critique which it was added by someone else</p>
<hr />
<div>== Introduction ==<br />
The main reason behind the recent success of Long Short-Term Memory Networks (LSTM) and deep neural networks has been their ability to model complex and non-linear interactions. Our inability to fully comprehend these relationships has led to these state-of-the-art models being regarded as black-boxes. It is not always possible to know how the prediction was made, where it came from and how to understand the workings underneath. The paper "Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs" by W. James Murdoch, Peter J. Liu, and Bin Yu propose an interpretation algorithm called Contextual Decomposition (CD) for analyzing individual predictions made by the LSTMs without any change to the underlying original model. The problem of sentiment analysis is chosen for the evaluation of the model, with a core focus of this work being the explainability of a prediction. <br />
<br />
<br />
Contextual Decomposition is the method introduced in this paper. It extracts information about which words contributed the maximum and minimum amounts towards LSTM prediction, and also how they were combined in order to yield the final prediction. The LSTM output is mathematically decomposed and the contributions are disambiguated at each step by different parts of the sentence. In the application domain, this paper shows how the contextual decomposition method is used to successfully extract positive and negative negations from an LSTM. This paper also shows that the prior interpretation methods have document-level<br />
information built into them in complex, unspecified ways. For example, in the prior work, strongly negative phrases contained within positive reviews are viewed as neutral, or even positive.<br />
<br />
==Intuition of the paper==<br />
<br />
If we consider a sentence in an Amazon review stating "This bag was good" it is very clearly a positive sentence, where the word "good" contributes maximum towards the positivity. On the other hand, if the review reads "This bag was not good" this becomes a negative review. Note that the negative review is not highly influenced by any individual word; most of the influence stems from an interaction between two words "not" and "good". This interaction modleled by the authors gives them an extra degree of freedom. Thus, they can have a better interpretation of the model. In this paper, the authors focus on this interaction between words for studying model efficiency and explainability. (Reference, author's talk in: https://www.youtube.com/watch?v=GjpGAyJenCM).<br />
<br />
==Overview of previous work==<br />
<br />
There has been research towards developing methods to understand the evaluations provided by LSTMs. Some of them are in line with the work done in this particular paper while others have followed some different approaches.<br />
<br />
Approaches similar to the one provided in this paper - All of the approaches in this category have tried to look into computing just the word-level importance scores with varying evaluation methods.<br />
#Murdoch & Szlam (2017): introduced a decomposition of the LSTM's output embedding and learned the importance score of certain words and phrases from those words (sum of the importance of words). A classifier is then built that searches for these learned phrases that are important and predicts the associated class which is then compared with the output of the LSTMs for validation.<br />
#Li et al. (2016): (Leave one out) They observed the change in log probability of a function by replacing a word vector (with zero) and studying the change in the prediction of the LSTM. It is completely anecdotal. Additionally provided an RL model to find the minimal set of words that must be erased to change the model's decision (although this is irrelevant to interpretability).<br />
#Sundararajan et al. 2017 (Integrated Gradients): a general gradient based technique to learn importance evaluated theoretically and empirically. Built up as an improvement to methods which were trying to do something quite similar. It is tested on image, text and chemistry models.<br />
<br />
Decomposition-based approaches for CNN:<br />
#Bach et al. 2015: proposed a solution to the problem of understanding classification decisions of CNNs by pixel-wise decomposition. Pixel contributions are visualized as heat maps.<br />
#Shrikumar et al. 2016 (DeepLift): an algorithm based on a method similar to backpropagation to learn the importance score for the inputs for a given output. The algorithm to learn these input scores is not dependent on the gradient therefore learning can also happen if the gradient is zero during backpropagation.<br />
<br />
Focussing on analyzing the gate activations:<br />
#Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. For eg. a cell being activated for keeping track of open parathesis or quotes.<br />
#Strobelt et al. 2016: built a visual tool for understanding and analyzing raw gate activations.<br />
<br />
Attention-based models:<br />
#Bahdanau et al. (2014): These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. But the problem with it is twofold. Firstly, it is only an indirect indicator of importance and does not provide directionality. Secondly, they have not been evaluated empirically or otherwise as an interpretation technique. Although they have been used in multiple other applications and architectures for solving a variety of problems.<br />
<br />
==Long Short-Term Memory Networks==<br />
Over the past few years, LSTM has become a core component of neural NLP systems and sequence modeling systems in general. LSTMs are a special kind of Recurrent Neural Network(RNNs) which in many cases work better than the standard RNN by solving the vanishing gradient problem. To put it simply they are much more efficient in learning long-term dependencies. Like a standard RNN, LSTMs are made up of chains of repeating modules. The difference is that the modules are little more complicated. Instead of having a single tanh layer like in an RNN, they have four (called gates), interacting in a special way. Additionally, they have a cell state which runs through the entire chain of the network. It helps in managing the information from the previous cells in the chain.<br />
<br />
Let's now define it more formally and mathematically. Given a sequence of word embeddings <math>x_1, ..., x_T \in R^{d_1}</math>, a cell and state vector <math>c_t, h_t \in R^{d_2}</math> are computed for each element by iteratively applying the below equations, with initializations <math>h_0 = c_0 = 0</math>.<br />
<br />
\begin{align}<br />
o_t = \sigma(W_ox_t + V_oh_{t−1} + b_o)<br />
\end{align}<br />
\begin{align}<br />
f_t = \sigma(W_fx_t + V_fh_{t−1} + b_f)<br />
\end{align}<br />
\begin{align}<br />
i_t = \sigma(W_ix_t + V_ih_{t−1} + b_i)<br />
\end{align}<br />
\begin{align}<br />
g_t = tanh(W_gx_t + V_gh_{t−1} + b_g)<br />
\end{align}<br />
\begin{align}<br />
c_t = f_t \odot c_{t−1} + i_t \odot g_t<br />
\end{align}<br />
\begin{align}<br />
h_t = o_t \odot tanh(c_t)<br />
\end{align}<br />
<br />
Where <math>W_o, W_i, W_f , W_g \in R^{{d_1}×{d_2}} , V_o, V_f , V_i , V_g \in R^{{d_2}×{d_2}}, b_o, b_g, b_i, b_g \in R^{d_2} </math> and <math> \odot </math> denotes element-wise multiplication. <math> o_t, f_t </math> and <math> i_t </math> are often referred to as output, forget and input gates, respectively, due to the fact that their values are bounded between 0 and 1, and that they are used in element-wise multiplication. Intuitively we can think of the forget gate as how much previous memory(information) do we want to forget; input gate as controlling whether or not to let new input in; g gate controlling what do we want to add and finally the output gate as controlling how much the current information(at current time step) should flow out.<br />
<br />
A visualization of an LSTM can be seen below (Reference, Nvidia Corporation: Long Short-Term Memory (LSTM)). Note that both the input and recurrent elements are seen 'entering' the cell in various locations. The output, and recurrent elements, can be seen 'exiting' the cell at the top of the image.<br />
<br />
[[File: WF_SecCont_01_lstm.png|600px|center]]<br />
<br />
After processing the full sequence of words, the final state <math>h_T</math> is fed to a multinomial logistic regression, to return a probability distribution over C classes.<br />
<br />
\begin{align}<br />
p_j = SoftMax(Wh_T)_j = \frac{\exp(W_jh_T)}{\sum_{k=1}^C\exp(W_kh_T) }<br />
\end{align}<br />
<br />
==Contextual Decomposition(CD) of LSTM==<br />
CD decomposes the output of the LSTM into a sum of two contributions:<br />
# those resulting solely from the given phrase<br />
# those involving other factors<br />
<br />
One important thing that is crucial to understand is that this method does not affect the architecture or the predictive accuracy of the model in any way. It just takes the trained model and tries to break it down into the two components mentioned above. It takes in a particular phrase that the user wants to understand or the entire sentence and returns the vectors with the contributions.<br />
<br />
Now let's define this more formally. Let the arbitrary input phrase be <math>x_q, ..., x_r</math>, where <math>1 \leq q \leq r \leq T </math>, where T represents the length of the sentence. CD decomposes the output and cell state (<math>c_t, h_t </math>) of each cell into a sum of 2 contributions as shown in the equations below.<br />
<br />
\begin{align}<br />
h_t = \beta_t + \gamma_t<br />
\end{align}<br />
\begin{align}<br />
c_t = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
In the decomposition <math>\beta_t </math> corresponds to the contributions given to <math> h_t </math> solely from the given phrase while <math> \gamma_t </math> denotes contribution atleast in part from the other factors. Similarly, <math>\beta_t^c </math> and <math> \gamma_t^c </math> represents the contributions given to <math> c_t </math> solely from the given phrase and atleast in part from the other factors respectively.<br />
<br />
Using this decomposition the softmax function can be represented as follows<br />
\begin{align}<br />
p = SoftMax(W\beta_T + W\gamma_T)<br />
\end{align}<br />
<br />
As this score corresponds to the input to a logistic regression, it may be interpreted in the same way as a standard logistic regression coefficient.<br />
<br />
===Disambiguation Interaction between gates===<br />
<br />
In the equations for the calculation of <math>i_t </math> and <math>g_t </math> in the LSTM, we use the contribution at that time step, <math>x_t</math> as well the output of the previous state <math>h_t</math>. Therefore when the <math>i_t \odot g_t</math> is calculated, the contributions made by <math>x_t</math> to <math>i_t</math> interact with contributions made by <math>h_t</math> to <math>g_t</math> and vice versa. This insight is used to construct the decomposition.<br />
<br />
At this stage we need to make an assumption that the non-linear operations at the gate can be represented in a linear fashion. How this is done will be explained in a later part of the summary. Therefore writing equations 1 as a linear sum of contributions from the inputs we have<br />
\begin{align}<br />
i_t &= \sigma(W_ix_t + V_ih_{t−1} + b_i) \\<br />
& = L_\sigma(W_ix_t) + L_\sigma(V_ih_{t−1}) + L_\sigma(b_i)<br />
\end{align}<br />
<br />
The important thing to notice now is that after using this linearization, the products between gates also become linear sums of contributions from the 2 factors mentioned above. To expand we can learn whether they resulted solely from the phrase (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\beta_{t-1})</math>), solely from the other factors (<math>L_\sigma(b_i) \odot L_{tanh}(V_g\gamma_{t-1})</math>) or as an interaction between the phrase and other factors (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1})</math>).<br />
<br />
Since we are able to calculate gradients values recursively in LSTMs, we would use the same procedure to recursively compute the decompositions with the initializations <math>\beta_0 = \beta_0^c = \gamma_0 = \gamma_0^c = 0</math>. The derivations can vary a little depending on cases whether the current time step is contained within the phrase (<math> q \leq t \leq r </math>) or not(<math> t < q, t > r</math>). In this summary, we will derive the equations for the former case. A very important thing to understand is that given any word/phrase within a sentence this algorithm would make a full pass over the LSTM to compute the 2 different contributions.<br />
<br />
So essentially what we are going to do now is linearize each of the gates, then expand the product of sums of these gates and then finally group the terms we get depending on which type of interaction they represent (solely from phase, solely from other factors and a combination of both).<br />
<br />
Terms are determined to be derived solely from the specific phrase if they involve products from some combination of <math>\beta_{t-1}, \beta_{t-1}^c, x_t</math> and <math> b_i </math> or <math> b_g </math>(but not both). In the other case when t is not within the phrase, products involving <math> x_t </math> are treated as not deriving from the phrase. This(the other case) can be observed by seeing the equations for this specific case in the appendix of the original paper.<br />
<br />
\begin{align}<br />
f_t\odot c_{t-1} &= (L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_f\gamma_{t-1}) + L_\sigma(b_f)) \odot (\beta_{t-1}^c + \gamma_{t-1}^c) \\<br />
& = ([L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(b_f)] \odot \beta_{t-1}^c) + (L_\sigma(V_f\gamma_{t-1}) \odot \beta_{t-1}^c + f_t \odot \gamma_{t-1}^c) \\<br />
& = \beta_t^f + \gamma_t^f<br />
\end{align}<br />
<br />
Similarly<br />
<br />
\begin{align}<br />
i_t\odot g_t &= [(L_\sigma(W_ix_t) + L_\sigma(V_i\beta_{t-1}) + L_\sigma(V_i\gamma_{t-1}) + L_\sigma(b_i))] \\<br />
& \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(V_g\gamma_{t-1}) + L_{tanh}(b_g))] \\<br />
& = ([L_\sigma(W_ix_t) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g))] \\ <br />
&+ L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_i\beta_{t-1}) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g))] \\ <br />
&+ L_\sigma(b_i) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1})] \\ <br />
&+ [L_\sigma(V_i\gamma_{t-1}) \odot g_t + i_t \odot L_{tanh}(V_g\gamma_{t-1}) - L_\sigma(V_i\gamma_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1}) \\ <br />
&+ L_\sigma(b_i) \odot L_{tanh}(b_g)] \\<br />
& = \beta_t^u + \gamma_t^u<br />
\end{align}<br />
<br />
Thus we can represent <math>c_t</math> as<br />
<br />
\begin{align}<br />
c_t &= \beta_t^f + \gamma_t^f + \beta_t^u + \gamma_t^u \\<br />
& = \beta_t^f + \beta_t^u + \gamma_t^f + \gamma_t^u \\<br />
& = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
So once we have the decomposition of <math> c_t </math>, then we can rather simply calculate the transformation of <math> h_t </math> by linearizing the <math> tanh</math> function. Again at this point, we just assume that a linearizing function for <math> tanh </math> exists. Similar to the decomposition of the forget and input gate we can decompose the output gate as well but empirically it was found that it did not produce improved results. So finally <math> h_t </math> can be written as<br />
<br />
\begin{align}<br />
h_t &= o_t \odot tanh(c_t) \\<br />
& = o_t \odot [L_{tanh}(\beta_t^c) + L_{tanh}(\gamma_t^c)] \\<br />
& = o_t \odot L_{tanh}(\beta_t^c) + o_t \odot L_{tanh}(\gamma_t^c) \\<br />
& = \beta_t + \gamma_t<br />
\end{align}<br />
<br />
===Linearizing activation functions ===<br />
<br />
This section will explain the big assumption that we took earlier about the linearizing functions <math> L_{\sigma} </math> and <math> L_{tanh} </math>. For arbitrary { <math> y_1, ..., y_N </math> } <math> \in R </math>, the problem that we intend to solve is essentially<br />
\begin{align}<br />
tanh(\sum_{i=1}^Ny_i) = \sum_{i=1}^NL_{tanh}(y_i)<br />
\end{align}<br />
<br />
In cases where {<math> y_i </math>} follow a natural ordering, work in Murdoch & Szlam, 2017 where the difference of partial sums is utilized as a linearization technique could be used. This could be shown by the equation below<br />
\begin{align}<br />
L^{'}_{tanh}(y_k) = tanh(\sum_{j=1}^ky_j) - tanh(\sum_{j=1}^{k-1}y_j)<br />
\end{align}<br />
<br />
But in our case the terms do not follow any particular ordering, for e.g. while calculating <math> i_t </math> we could write it as a sum of <math> W_ix_t, V_ih_{t−1}, b_i </math> or <math> b_i, V_ih_{t−1}, W_ix_t </math>. Thus, we average over all the possible orderings. Let <math>\pi_i, ..., \pi_{M_n} </math> denote the set of all permutations of <math>1, ..., N</math>, then the score could be given as below<br />
<br />
\begin{align}<br />
L_{tanh}(y_k) = \frac{1}{M_N}\sum_{i=1}^{M_N}[tanh(\sum_{j=1}^{\pi_i^{-1}(k)}y_{\pi_i(j)}) - tanh(\sum_{j=1}^{\pi_i^{-1}(k) - 1}y_{\pi_i(j)})]<br />
\end{align}<br />
<br />
We can similarly derive <math> L_{\sigma} </math>. An important empirical observation to note here is that in the case when one of the terms of the decomposition is a bias, improvements were seen when restricting to permutations where the bias was the first term.<br />
In our case, the value of N only ranges from 2 to 4, which makes the linearization take very simple forms. An example of a case where N=2 is shown below.<br />
<br />
\begin{align}<br />
L_{tanh}(y_1) = \frac{1}{2}([tanh(y_1) - tanh(0)] + [tanh(y_2 + y_1) - tanh(y_1)])<br />
\end{align}<br />
<br />
==Experiments==<br />
As mentioned earlier, the empirical validation of CD is done on the task of sentiment analysis. The paper verifies the following 3 tasks with the experiments:<br />
# It should work on the standard problem of word-level importance scores<br />
# It should behave for words as well as phrases especially in situations involving compositionality.<br />
# It should be able to extract instances of positive and negative negation.<br />
<br />
An important fact worth mentioning again is that the primary objective of the paper is to produce meaningful interpretations on a pre-trained LSTM model rather than achieving the state-of-the-art results on the task of sentiment analysis. Therefore standard practices are used for tuning the models. The models are implemented in Torch using default parameters for weight initializations. The code can be found at "https://github.com/jamie-murdoch/ContextualDecomposition". The model was trained using Adam with the learning rate of 0.001 and using early stopping on the validation set. Additionally, a bag of words linear model was used.<br />
<br />
All the experiments were performed on the Stanford Sentiment Treebank(SST) [Socher et al., 2013] dataset and the Yelp Polarity(YP) [Zhang et al., 2015] dataset. SST is a standard NLP benchmark which consists of movie reviews ranging from 2 to 52 words long. It is important to note that the SST dataset has one key feature that is perfect for this task, which is in addition to review-level labels, <br />
it also provides labels for each phrase in the binarized constituency parse tree. This enables us to examine that if the model can identify negative phrases out of a positive review, or vice versa. The word embedding used in LSTM is pretrained Glove vectors with length equal to 300, and the hidden representations of the LSTM is set to be 168. The LSTM model attained an accuracy of 87.2% whereas the logistic regression model with the bag of words features attained an accuracy of 83.2%. In the case of YP, the task is to perform a binary sentiment classification task. The reviews considered were only which were of at most 40 words. The LSTM model attained a 4.6% error as compared the 5.7% error for the regression model.<br />
<br />
<br />
===Baselines===<br />
The interpretations are compared with 4 state-of-the-art baselines for interpretability.<br />
# Cell Decomposition(Murdoch & Szlam, 2017), <br />
# Integrated Gradients (Sundararajan et al., 2017),<br />
# Leave One Out (Li et al., 2016),<br />
# Gradient times input [gradient of the output probability with respect to the word embeddings is computed which is finally reported as a dot product with the word vector]<br />
<br />
To obtain phrase scores for word-based baselines integrated gradients, cell decomposition, and gradients, the paper sums the scores of the words contained within the phrase.<br />
<br />
<br />
===Unigram(Word) Scores===<br />
Logistic regression(LR) coefficients while being sufficiently accurate for prediction are considered the gold standard for interpretability. For the task of sentiment analysis, the importance of the words is given by their coefficient values. Thus we would expect the CD scores extracted from an LSTM, to have meaningful relationships and comparison with the logistic regression coefficients. This comparison is done using scatter plots(Fig 4) which measures the Pearson correlation coefficient between the importance scores extracted by LR coefficients and LSTM. This is done for multiple words which are represented as a point in the scatter plots. For SST, CD and Integrated Gradients, with correlations of 0.76 and 0.72, respectively, are substantially better than other methods, which have correlations of at most 0.51. On Yelp, the gap is not as big, but CD is still very competitive, having correlation 0.52 with other methods ranging from 0.34 to 0.56. The complete results are shown in Table 4.<br />
[[File:Dhruv Table4.png|600px|centre]]<br />
<br />
===Benefits===<br />
Having verified reasonably strong results in the base case, the paper then proceeds to show the benefits of CD.<br />
====Identifying Dissenting Subphrases====<br />
First, the paper shows that the existing methods are not able to recognize sub-phrases in a phrase(a phrase is considered to be of at most 5 words) with different sentiments. For example, consider the phrase "used to be my favorite". The word "favorite" is strongly positive which is also shown by it having a high linear regression coefficient. Nonetheless, the existing methods identify "favorite" as being highly negative or neutral in this context. However, as shown in table 1 CD is able to correctly identify it being strongly positive, and the subphrase "used to be" as highly negative. This particular identification is itself the main reason for using the LSTMs over other methods in text comprehension. Thus, it is quite important that an interpretation algorithm is able to properly uncover how these interactions are being handled. A search across the datasets is done to find similar cases where a negative phrase contains a positive sub-phrase and vice versa. Phrases are scored using the logistic regression over n-gram features and included if their overall score is over 1.5.<br />
[[File:Dhruv Table1.png|600px|centre]]<br />
It is to be noted that for an efficient interpretation algorithm the distribution of scores for these positive and negative dissenting subphrases should be significantly separate with the positive subphrases having positive scores and vice-versa. However, as shown in figure 2, this is not the case with the previous interpretation algorithms.<br />
<br />
====Examining High-Level Compositionality====<br />
The paper now studies the cases where a sizable portion of a review(between one and two-thirds) has a different polarity from the final sentiment. An example is shown in Table 2. SST contains phrase-level sentiment labels too. Therefore the authors conduct a search in SST where a sizable phrase in a review is of the opposite polarity than the review-level label. The figure shows the distribution of the resulting positive and negative phrases for different attribution methods. We should note that a successful interpretation method would have a sizeable gap between these two distributions. Notice that the previous methods fail to satisfy this criterion. The paper additionally provides a two-sample Kolmogorov-Smirnov one-sided test statistic, to quantify this difference in performance. This statistic is a common difference measure for the difference of distributions with values ranging from 0 to 1. As shown in Figure 3 CD gets a score of 0.74 while the other models achieve a score of 0(Cell decomposition), 0.33(Integrated Gradients), 0.58(Leave One Out) and 0.61(gradient). The methods leave one out and gradient perform relatively better than the other 2 baselines but they were the weakest performers in the unigram scores. This inconsistency in other methods performance further strengthens the superiority of CD.<br />
[[File:Dhruv Table2.png|600px|centre]]<br />
<br />
====Capturing Negation====<br />
The paper also shows a way to empirically show how the LSTMs capture negations in a sentence. To search for negations, the following list of negation words were used: not, n’t, lacks, nobody, nor, nothing, neither, never, none, nowhere, remotely. Again using the phrase labels present in SST, the authors search over the training set for instances of negation. Both the positive as well as negative negations are identified. For a given negation phrase, they extract a negation interaction by computing the CD score of the entire phrase and subtracting the CD scores of the phrase being negated and the negation term itself. The resulting score can be interpreted as an n-gram feature. Apart from CD, only leave one out is capable of producing such interaction scores. The distribution of extracted scores is presented in Figure 1. For CD there is a clear distinction between positive and negative negations. Leave one out is able to capture some of the interactions, but has a noticeable overlap between positive and negative negations around zero, indicating a high rate of false negatives.<br />
<br />
====Identifying Similar Phrases====<br />
A key aspect of the CD algorithm is that it helps us learn the value of <math> \beta_t </math> which is essentially a dense embedding vector for a word or a phrase. The way the authors did in the paper is to calculate the AVG(<math> \beta_t </math> ), and then, using similarity measures in the embedding space(eg. cosine similarity) we can easily find similar phrases/words given a phrase/word. The results as shown in Table 3 are qualitatively sensible for 3 different kinds of interactions: positive negation, negative negation and modification, as well as positive and negative words.<br />
[[File:Dhruv Table3.png|600px|centre]]<br />
<br />
==Conclusions==<br />
The paper provides an algorithm called Contextual Decomposition(CD) to interpret predictions made by LSTMs without modifying their architecture. It takes in a trained LSTM and breaks it down in components and quantifies the interpretability of its decision. In both NLP and in general applications CD produces importance scores for words (single variables in general), phrases (several variables together) and word interactions (variable interactions). It also compares the algorithm with state-of-the-art baselines and shows that it performs favorably. It also shows that CD is capable of identifying phrases of varying sentiment and extracting meaningful word (or variable) interactions. It shows the shortcomings of the traditional word-based interpretability approaches for understanding LSTMs and advances the state-of-the-art.<br />
<br />
==Critique/Future Work==<br />
While the method itself is novel in the sense that it moves past the traditional approach of looking just at word level importance scores; it only looks at one specific architecture which is applied to a very simple problem. The authors don't talk about any future directions for this work in the paper itself but a discussion about it happened during the Oral presentation of the paper at ICLR 2018. Following are the important points:<br />
#We could look at interpreting a more complex model, for example, say seq2seq. The author pointed out that he was affirmative that this model could be extended for such purposes although the computational complexity would increase since we would be predicting multiple outputs in this case.<br />
#We could also look at whether this approach could be generalized to completely different architectures like CNN. A later related approach attempted to interpret Neural Networks With Nearest Neighbors to provide a metric that helps to create feature importance values (Reference, Eric Wallace, Shi Feng, Jordan Boyd-Graber: Interpreting Neural Networks With Nearest Neighbors). As of now given a new model, we need to manually work out the math for the specific model. Could we develop some general approach towards this? Although the author pointed out that they are working towards using this approach to interpret CNNs.<br />
* It would be an exciting prospect for future work to compare the output of the algorithms with human given scores on a small subset of words.<br />
<br />
==References==<br />
W. James Murdoch, Peter J. Liu, Bin Yu. Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs. ICLR 2018<br />
<br />
Sebastian Bach, Alexander Binder, Gregoire Montavon, Frederick Klauschen, Klaus-Robert Muller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural Computation, 9(8): 1735–1780, 1997.<br />
<br />
Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.<br />
<br />
Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016. URL http://arxiv.org/abs/1612.08220.<br />
<br />
W James Murdoch and Arthur Szlam. Automatic rule extraction from long short-term memory networks. ICLR, 2017.<br />
<br />
Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.<br />
<br />
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017<br />
<br />
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.<br />
<br />
Hendrik Strobelt, Sebastian Gehrmann, Bernd Huber, Hanspeter Pfister, and Alexander M Rush. Visual analysis of hidden state dynamics in recurrent neural networks. arXiv preprint arXiv:1606.07461, 2016.<br />
<br />
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. CoRR, abs/1703.01365, 2017. URL http://arxiv.org/abs/1703.01365<br />
<br />
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657, 2015<br />
<br />
Jamie Murdoch, Beyond Word Importance: Contextual Decomposition for Interpreting LSTMs. https://www.youtube.com/watch?v=GjpGAyJenCM<br />
<br />
Eric Wallace, Shi Feng, Jordan Boyd-Graber: Interpreting Neural Networks With Nearest Neighbors, URL https://arxiv.org/abs/1809.02847<br />
<br />
Nvidia Corporation: Long Short-Term Memory (LSTM), URL https://developer.nvidia.com/discover/lstm, Accessed: October 21, 2018<br />
<br />
==Appendix==<br />
<br />
[[File:Dhruv Figure4.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure2.png|600px|right]]<br />
<br />
<br />
[[File:Dhruv Figure3.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure1.png|600px|right]]</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs&diff=41662stat946F18/Beyond Word Importance Contextual Decomposition to Extract Interactions from LSTMs2018-11-27T23:31:52Z<p>D35kumar: Undo revision 41661 by D35kumar (talk)</p>
<hr />
<div>== Introduction ==<br />
The main reason behind the recent success of Long Short-Term Memory Networks (LSTM) and deep neural networks has been their ability to model complex and non-linear interactions. Our inability to fully comprehend these relationships has led to these state-of-the-art models being regarded as black-boxes. It is not always possible to know how the prediction was made, where it came from and how to understand the workings underneath. The paper "Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs" by W. James Murdoch, Peter J. Liu, and Bin Yu propose an interpretation algorithm called Contextual Decomposition (CD) for analyzing individual predictions made by the LSTMs without any change to the underlying original model. The problem of sentiment analysis is chosen for the evaluation of the model, with a core focus of this work being the explainability of a prediction. <br />
<br />
<br />
Contextual Decomposition is the method introduced in this paper. It extracts information about which words contributed the maximum and minimum amounts towards LSTM prediction, and also how they were combined in order to yield the final prediction. The LSTM output is mathematically decomposed and the contributions are disambiguated at each step by different parts of the sentence. In the application domain, this paper shows how the contextual decomposition method is used to successfully extract positive and negative negations from an LSTM. This paper also shows that the prior interpretation methods have document-level<br />
information built into them in complex, unspecified ways. For example, in the prior work, strongly negative phrases contained within positive reviews are viewed as neutral, or even positive.<br />
<br />
==Intuition of the paper==<br />
<br />
If we consider a sentence in an Amazon review stating "This bag was good" it is very clearly a positive sentence, where the word "good" contributes maximum towards the positivity. On the other hand, if the review reads "This bag was not good" this becomes a negative review. Note that the negative review is not highly influenced by any individual word; most of the influence stems from an interaction between two words "not" and "good". This interaction modleled by the authors gives them an extra degree of freedom. Thus, they can have a better interpretation of the model. In this paper, the authors focus on this interaction between words for studying model efficiency and explainability. (Reference, author's talk in: https://www.youtube.com/watch?v=GjpGAyJenCM).<br />
<br />
==Overview of previous work==<br />
<br />
There has been research towards developing methods to understand the evaluations provided by LSTMs. Some of them are in line with the work done in this particular paper while others have followed some different approaches.<br />
<br />
Approaches similar to the one provided in this paper - All of the approaches in this category have tried to look into computing just the word-level importance scores with varying evaluation methods.<br />
#Murdoch & Szlam (2017): introduced a decomposition of the LSTM's output embedding and learned the importance score of certain words and phrases from those words (sum of the importance of words). A classifier is then built that searches for these learned phrases that are important and predicts the associated class which is then compared with the output of the LSTMs for validation.<br />
#Li et al. (2016): (Leave one out) They observed the change in log probability of a function by replacing a word vector (with zero) and studying the change in the prediction of the LSTM. It is completely anecdotal. Additionally provided an RL model to find the minimal set of words that must be erased to change the model's decision (although this is irrelevant to interpretability).<br />
#Sundararajan et al. 2017 (Integrated Gradients): a general gradient based technique to learn importance evaluated theoretically and empirically. Built up as an improvement to methods which were trying to do something quite similar. It is tested on image, text and chemistry models.<br />
<br />
Decomposition-based approaches for CNN:<br />
#Bach et al. 2015: proposed a solution to the problem of understanding classification decisions of CNNs by pixel-wise decomposition. Pixel contributions are visualized as heat maps.<br />
#Shrikumar et al. 2016 (DeepLift): an algorithm based on a method similar to backpropagation to learn the importance score for the inputs for a given output. The algorithm to learn these input scores is not dependent on the gradient therefore learning can also happen if the gradient is zero during backpropagation.<br />
<br />
Focussing on analyzing the gate activations:<br />
#Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. For eg. a cell being activated for keeping track of open parathesis or quotes.<br />
#Strobelt et al. 2016: built a visual tool for understanding and analyzing raw gate activations.<br />
<br />
Attention-based models:<br />
#Bahdanau et al. (2014): These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. But the problem with it is twofold. Firstly, it is only an indirect indicator of importance and does not provide directionality. Secondly, they have not been evaluated empirically or otherwise as an interpretation technique. Although they have been used in multiple other applications and architectures for solving a variety of problems.<br />
<br />
==Long Short-Term Memory Networks==<br />
Over the past few years, LSTM has become a core component of neural NLP systems and sequence modeling systems in general. LSTMs are a special kind of Recurrent Neural Network(RNNs) which in many cases work better than the standard RNN by solving the vanishing gradient problem. To put it simply they are much more efficient in learning long-term dependencies. Like a standard RNN, LSTMs are made up of chains of repeating modules. The difference is that the modules are little more complicated. Instead of having a single tanh layer like in an RNN, they have four (called gates), interacting in a special way. Additionally, they have a cell state which runs through the entire chain of the network. It helps in managing the information from the previous cells in the chain.<br />
<br />
Let's now define it more formally and mathematically. Given a sequence of word embeddings <math>x_1, ..., x_T \in R^{d_1}</math>, a cell and state vector <math>c_t, h_t \in R^{d_2}</math> are computed for each element by iteratively applying the below equations, with initializations <math>h_0 = c_0 = 0</math>.<br />
<br />
\begin{align}<br />
o_t = \sigma(W_ox_t + V_oh_{t−1} + b_o)<br />
\end{align}<br />
\begin{align}<br />
f_t = \sigma(W_fx_t + V_fh_{t−1} + b_f)<br />
\end{align}<br />
\begin{align}<br />
i_t = \sigma(W_ix_t + V_ih_{t−1} + b_i)<br />
\end{align}<br />
\begin{align}<br />
g_t = tanh(W_gx_t + V_gh_{t−1} + b_g)<br />
\end{align}<br />
\begin{align}<br />
c_t = f_t \odot c_{t−1} + i_t \odot g_t<br />
\end{align}<br />
\begin{align}<br />
h_t = o_t \odot tanh(c_t)<br />
\end{align}<br />
<br />
Where <math>W_o, W_i, W_f , W_g \in R^{{d_1}×{d_2}} , V_o, V_f , V_i , V_g \in R^{{d_2}×{d_2}}, b_o, b_g, b_i, b_g \in R^{d_2} </math> and <math> \odot </math> denotes element-wise multiplication. <math> o_t, f_t </math> and <math> i_t </math> are often referred to as output, forget and input gates, respectively, due to the fact that their values are bounded between 0 and 1, and that they are used in element-wise multiplication. Intuitively we can think of the forget gate as how much previous memory(information) do we want to forget; input gate as controlling whether or not to let new input in; g gate controlling what do we want to add and finally the output gate as controlling how much the current information(at current time step) should flow out.<br />
<br />
A visualization of an LSTM can be seen below (Reference, Nvidia Corporation: Long Short-Term Memory (LSTM)). Note that both the input and recurrent elements are seen 'entering' the cell in various locations. The output, and recurrent elements, can be seen 'exiting' the cell at the top of the image.<br />
<br />
[[File: WF_SecCont_01_lstm.png|600px|center]]<br />
<br />
After processing the full sequence of words, the final state <math>h_T</math> is fed to a multinomial logistic regression, to return a probability distribution over C classes.<br />
<br />
\begin{align}<br />
p_j = SoftMax(Wh_T)_j = \frac{\exp(W_jh_T)}{\sum_{k=1}^C\exp(W_kh_T) }<br />
\end{align}<br />
<br />
==Contextual Decomposition(CD) of LSTM==<br />
CD decomposes the output of the LSTM into a sum of two contributions:<br />
# those resulting solely from the given phrase<br />
# those involving other factors<br />
<br />
One important thing that is crucial to understand is that this method does not affect the architecture or the predictive accuracy of the model in any way. It just takes the trained model and tries to break it down into the two components mentioned above. It takes in a particular phrase that the user wants to understand or the entire sentence and returns the vectors with the contributions.<br />
<br />
Now let's define this more formally. Let the arbitrary input phrase be <math>x_q, ..., x_r</math>, where <math>1 \leq q \leq r \leq T </math>, where T represents the length of the sentence. CD decomposes the output and cell state (<math>c_t, h_t </math>) of each cell into a sum of 2 contributions as shown in the equations below.<br />
<br />
\begin{align}<br />
h_t = \beta_t + \gamma_t<br />
\end{align}<br />
\begin{align}<br />
c_t = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
In the decomposition <math>\beta_t </math> corresponds to the contributions given to <math> h_t </math> solely from the given phrase while <math> \gamma_t </math> denotes contribution atleast in part from the other factors. Similarly, <math>\beta_t^c </math> and <math> \gamma_t^c </math> represents the contributions given to <math> c_t </math> solely from the given phrase and atleast in part from the other factors respectively.<br />
<br />
Using this decomposition the softmax function can be represented as follows<br />
\begin{align}<br />
p = SoftMax(W\beta_T + W\gamma_T)<br />
\end{align}<br />
<br />
As this score corresponds to the input to a logistic regression, it may be interpreted in the same way as a standard logistic regression coefficient.<br />
<br />
===Disambiguation Interaction between gates===<br />
<br />
In the equations for the calculation of <math>i_t </math> and <math>g_t </math> in the LSTM, we use the contribution at that time step, <math>x_t</math> as well the output of the previous state <math>h_t</math>. Therefore when the <math>i_t \odot g_t</math> is calculated, the contributions made by <math>x_t</math> to <math>i_t</math> interact with contributions made by <math>h_t</math> to <math>g_t</math> and vice versa. This insight is used to construct the decomposition.<br />
<br />
At this stage we need to make an assumption that the non-linear operations at the gate can be represented in a linear fashion. How this is done will be explained in a later part of the summary. Therefore writing equations 1 as a linear sum of contributions from the inputs we have<br />
\begin{align}<br />
i_t &= \sigma(W_ix_t + V_ih_{t−1} + b_i) \\<br />
& = L_\sigma(W_ix_t) + L_\sigma(V_ih_{t−1}) + L_\sigma(b_i)<br />
\end{align}<br />
<br />
The important thing to notice now is that after using this linearization, the products between gates also become linear sums of contributions from the 2 factors mentioned above. To expand we can learn whether they resulted solely from the phrase (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\beta_{t-1})</math>), solely from the other factors (<math>L_\sigma(b_i) \odot L_{tanh}(V_g\gamma_{t-1})</math>) or as an interaction between the phrase and other factors (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1})</math>).<br />
<br />
Since we are able to calculate gradients values recursively in LSTMs, we would use the same procedure to recursively compute the decompositions with the initializations <math>\beta_0 = \beta_0^c = \gamma_0 = \gamma_0^c = 0</math>. The derivations can vary a little depending on cases whether the current time step is contained within the phrase (<math> q \leq t \leq r </math>) or not(<math> t < q, t > r</math>). In this summary, we will derive the equations for the former case. A very important thing to understand is that given any word/phrase within a sentence this algorithm would make a full pass over the LSTM to compute the 2 different contributions.<br />
<br />
So essentially what we are going to do now is linearize each of the gates, then expand the product of sums of these gates and then finally group the terms we get depending on which type of interaction they represent (solely from phase, solely from other factors and a combination of both).<br />
<br />
Terms are determined to be derived solely from the specific phrase if they involve products from some combination of <math>\beta_{t-1}, \beta_{t-1}^c, x_t</math> and <math> b_i </math> or <math> b_g </math>(but not both). In the other case when t is not within the phrase, products involving <math> x_t </math> are treated as not deriving from the phrase. This(the other case) can be observed by seeing the equations for this specific case in the appendix of the original paper.<br />
<br />
\begin{align}<br />
f_t\odot c_{t-1} &= (L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_f\gamma_{t-1}) + L_\sigma(b_f)) \odot (\beta_{t-1}^c + \gamma_{t-1}^c) \\<br />
& = ([L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(b_f)] \odot \beta_{t-1}^c) + (L_\sigma(V_f\gamma_{t-1}) \odot \beta_{t-1}^c + f_t \odot \gamma_{t-1}^c) \\<br />
& = \beta_t^f + \gamma_t^f<br />
\end{align}<br />
<br />
Similarly<br />
<br />
\begin{align}<br />
i_t\odot g_t &= [(L_\sigma(W_ix_t) + L_\sigma(V_i\beta_{t-1}) + L_\sigma(V_i\gamma_{t-1}) + L_\sigma(b_i))] \\<br />
& \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(V_g\gamma_{t-1}) + L_{tanh}(b_g))] \\<br />
& = ([L_\sigma(W_ix_t) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g))] \\ <br />
&+ L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_i\beta_{t-1}) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g))] \\ <br />
&+ L_\sigma(b_i) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1})] \\ <br />
&+ [L_\sigma(V_i\gamma_{t-1}) \odot g_t + i_t \odot L_{tanh}(V_g\gamma_{t-1}) - L_\sigma(V_i\gamma_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1}) \\ <br />
&+ L_\sigma(b_i) \odot L_{tanh}(b_g)] \\<br />
& = \beta_t^u + \gamma_t^u<br />
\end{align}<br />
<br />
Thus we can represent <math>c_t</math> as<br />
<br />
\begin{align}<br />
c_t &= \beta_t^f + \gamma_t^f + \beta_t^u + \gamma_t^u \\<br />
& = \beta_t^f + \beta_t^u + \gamma_t^f + \gamma_t^u \\<br />
& = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
So once we have the decomposition of <math> c_t </math>, then we can rather simply calculate the transformation of <math> h_t </math> by linearizing the <math> tanh</math> function. Again at this point, we just assume that a linearizing function for <math> tanh </math> exists. Similar to the decomposition of the forget and input gate we can decompose the output gate as well but empirically it was found that it did not produce improved results. So finally <math> h_t </math> can be written as<br />
<br />
\begin{align}<br />
h_t &= o_t \odot tanh(c_t) \\<br />
& = o_t \odot [L_{tanh}(\beta_t^c) + L_{tanh}(\gamma_t^c)] \\<br />
& = o_t \odot L_{tanh}(\beta_t^c) + o_t \odot L_{tanh}(\gamma_t^c) \\<br />
& = \beta_t + \gamma_t<br />
\end{align}<br />
<br />
===Linearizing activation functions ===<br />
<br />
This section will explain the big assumption that we took earlier about the linearizing functions <math> L_{\sigma} </math> and <math> L_{tanh} </math>. For arbitrary { <math> y_1, ..., y_N </math> } <math> \in R </math>, the problem that we intend to solve is essentially<br />
\begin{align}<br />
tanh(\sum_{i=1}^Ny_i) = \sum_{i=1}^NL_{tanh}(y_i)<br />
\end{align}<br />
<br />
In cases where {<math> y_i </math>} follow a natural ordering, work in Murdoch & Szlam, 2017 where the difference of partial sums is utilized as a linearization technique could be used. This could be shown by the equation below<br />
\begin{align}<br />
L^{'}_{tanh}(y_k) = tanh(\sum_{j=1}^ky_j) - tanh(\sum_{j=1}^{k-1}y_j)<br />
\end{align}<br />
<br />
But in our case the terms do not follow any particular ordering, for e.g. while calculating <math> i_t </math> we could write it as a sum of <math> W_ix_t, V_ih_{t−1}, b_i </math> or <math> b_i, V_ih_{t−1}, W_ix_t </math>. Thus, we average over all the possible orderings. Let <math>\pi_i, ..., \pi_{M_n} </math> denote the set of all permutations of <math>1, ..., N</math>, then the score could be given as below<br />
<br />
\begin{align}<br />
L_{tanh}(y_k) = \frac{1}{M_N}\sum_{i=1}^{M_N}[tanh(\sum_{j=1}^{\pi_i^{-1}(k)}y_{\pi_i(j)}) - tanh(\sum_{j=1}^{\pi_i^{-1}(k) - 1}y_{\pi_i(j)})]<br />
\end{align}<br />
<br />
We can similarly derive <math> L_{\sigma} </math>. An important empirical observation to note here is that in the case when one of the terms of the decomposition is a bias, improvements were seen when restricting to permutations where the bias was the first term.<br />
In our case, the value of N only ranges from 2 to 4, which makes the linearization take very simple forms. An example of a case where N=2 is shown below.<br />
<br />
\begin{align}<br />
L_{tanh}(y_1) = \frac{1}{2}([tanh(y_1) - tanh(0)] + [tanh(y_2 + y_1) - tanh(y_1)])<br />
\end{align}<br />
<br />
==Experiments==<br />
As mentioned earlier, the empirical validation of CD is done on the task of sentiment analysis. The paper verifies the following 3 tasks with the experiments:<br />
# It should work on the standard problem of word-level importance scores<br />
# It should behave for words as well as phrases especially in situations involving compositionality.<br />
# It should be able to extract instances of positive and negative negation.<br />
<br />
An important fact worth mentioning again is that the primary objective of the paper is to produce meaningful interpretations on a pre-trained LSTM model rather than achieving the state-of-the-art results on the task of sentiment analysis. Therefore standard practices are used for tuning the models. The models are implemented in Torch using default parameters for weight initializations. The code can be found at "https://github.com/jamie-murdoch/ContextualDecomposition". The model was trained using Adam with the learning rate of 0.001 and using early stopping on the validation set. Additionally, a bag of words linear model was used.<br />
<br />
All the experiments were performed on the Stanford Sentiment Treebank(SST) [Socher et al., 2013] dataset and the Yelp Polarity(YP) [Zhang et al., 2015] dataset. SST is a standard NLP benchmark which consists of movie reviews ranging from 2 to 52 words long. It is important to note that the SST dataset has one key feature that is perfect for this task, which is in addition to review-level labels, <br />
it also provides labels for each phrase in the binarized constituency parse tree. This enables us to examine that if the model can identify negative phrases out of a positive review, or vice versa. The word embedding used in LSTM is pretrained Glove vectors with length equal to 300, and the hidden representations of the LSTM is set to be 168. The LSTM model attained an accuracy of 87.2% whereas the logistic regression model with the bag of words features attained an accuracy of 83.2%. In the case of YP, the task is to perform a binary sentiment classification task. The reviews considered were only which were of at most 40 words. The LSTM model attained a 4.6% error as compared the 5.7% error for the regression model.<br />
<br />
<br />
===Baselines===<br />
The interpretations are compared with 4 state-of-the-art baselines for interpretability.<br />
# Cell Decomposition(Murdoch & Szlam, 2017), <br />
# Integrated Gradients (Sundararajan et al., 2017),<br />
# Leave One Out (Li et al., 2016),<br />
# Gradient times input [gradient of the output probability with respect to the word embeddings is computed which is finally reported as a dot product with the word vector]<br />
<br />
To obtain phrase scores for word-based baselines integrated gradients, cell decomposition, and gradients, the paper sums the scores of the words contained within the phrase.<br />
<br />
<br />
===Unigram(Word) Scores===<br />
Logistic regression(LR) coefficients while being sufficiently accurate for prediction are considered the gold standard for interpretability. For the task of sentiment analysis, the importance of the words is given by their coefficient values. Thus we would expect the CD scores extracted from an LSTM, to have meaningful relationships and comparison with the logistic regression coefficients. This comparison is done using scatter plots(Fig 4) which measures the Pearson correlation coefficient between the importance scores extracted by LR coefficients and LSTM. This is done for multiple words which are represented as a point in the scatter plots. For SST, CD and Integrated Gradients, with correlations of 0.76 and 0.72, respectively, are substantially better than other methods, which have correlations of at most 0.51. On Yelp, the gap is not as big, but CD is still very competitive, having correlation 0.52 with other methods ranging from 0.34 to 0.56. The complete results are shown in Table 4.<br />
[[File:Dhruv Table4.png|600px|centre]]<br />
<br />
===Benefits===<br />
Having verified reasonably strong results in the base case, the paper then proceeds to show the benefits of CD.<br />
====Identifying Dissenting Subphrases====<br />
First, the paper shows that the existing methods are not able to recognize sub-phrases in a phrase(a phrase is considered to be of at most 5 words) with different sentiments. For example, consider the phrase "used to be my favorite". The word "favorite" is strongly positive which is also shown by it having a high linear regression coefficient. Nonetheless, the existing methods identify "favorite" as being highly negative or neutral in this context. However, as shown in table 1 CD is able to correctly identify it being strongly positive, and the subphrase "used to be" as highly negative. This particular identification is itself the main reason for using the LSTMs over other methods in text comprehension. Thus, it is quite important that an interpretation algorithm is able to properly uncover how these interactions are being handled. A search across the datasets is done to find similar cases where a negative phrase contains a positive sub-phrase and vice versa. Phrases are scored using the logistic regression over n-gram features and included if their overall score is over 1.5.<br />
[[File:Dhruv Table1.png|600px|centre]]<br />
It is to be noted that for an efficient interpretation algorithm the distribution of scores for these positive and negative dissenting subphrases should be significantly separate with the positive subphrases having positive scores and vice-versa. However, as shown in figure 2, this is not the case with the previous interpretation algorithms.<br />
<br />
====Examining High-Level Compositionality====<br />
The paper now studies the cases where a sizable portion of a review(between one and two-thirds) has a different polarity from the final sentiment. An example is shown in Table 2. SST contains phrase-level sentiment labels too. Therefore the authors conduct a search in SST where a sizable phrase in a review is of the opposite polarity than the review-level label. The figure shows the distribution of the resulting positive and negative phrases for different attribution methods. We should note that a successful interpretation method would have a sizeable gap between these two distributions. Notice that the previous methods fail to satisfy this criterion. The paper additionally provides a two-sample Kolmogorov-Smirnov one-sided test statistic, to quantify this difference in performance. This statistic is a common difference measure for the difference of distributions with values ranging from 0 to 1. As shown in Figure 3 CD gets a score of 0.74 while the other models achieve a score of 0(Cell decomposition), 0.33(Integrated Gradients), 0.58(Leave One Out) and 0.61(gradient). The methods leave one out and gradient perform relatively better than the other 2 baselines but they were the weakest performers in the unigram scores. This inconsistency in other methods performance further strengthens the superiority of CD.<br />
[[File:Dhruv Table2.png|600px|centre]]<br />
<br />
====Capturing Negation====<br />
The paper also shows a way to empirically show how the LSTMs capture negations in a sentence. To search for negations, the following list of negation words were used: not, n’t, lacks, nobody, nor, nothing, neither, never, none, nowhere, remotely. Again using the phrase labels present in SST, the authors search over the training set for instances of negation. Both the positive as well as negative negations are identified. For a given negation phrase, they extract a negation interaction by computing the CD score of the entire phrase and subtracting the CD scores of the phrase being negated and the negation term itself. The resulting score can be interpreted as an n-gram feature. Apart from CD, only leave one out is capable of producing such interaction scores. The distribution of extracted scores is presented in Figure 1. For CD there is a clear distinction between positive and negative negations. Leave one out is able to capture some of the interactions, but has a noticeable overlap between positive and negative negations around zero, indicating a high rate of false negatives.<br />
<br />
====Identifying Similar Phrases====<br />
A key aspect of the CD algorithm is that it helps us learn the value of <math> \beta_t </math> which is essentially a dense embedding vector for a word or a phrase. The way the authors did in the paper is to calculate the AVG(<math> \beta_t </math> ), and then, using similarity measures in the embedding space(eg. cosine similarity) we can easily find similar phrases/words given a phrase/word. The results as shown in Table 3 are qualitatively sensible for 3 different kinds of interactions: positive negation, negative negation and modification, as well as positive and negative words.<br />
[[File:Dhruv Table3.png|600px|centre]]<br />
<br />
==Conclusions==<br />
The paper provides an algorithm called Contextual Decomposition(CD) to interpret predictions made by LSTMs without modifying their architecture. It takes in a trained LSTM and breaks it down in components and quantifies the interpretability of its decision. In both NLP and in general applications CD produces importance scores for words (single variables in general), phrases (several variables together) and word interactions (variable interactions). It also compares the algorithm with state-of-the-art baselines and shows that it performs favorably. It also shows that CD is capable of identifying phrases of varying sentiment and extracting meaningful word (or variable) interactions. It shows the shortcomings of the traditional word-based interpretability approaches for understanding LSTMs and advances the state-of-the-art.<br />
<br />
==Critique/Future Work==<br />
While the method itself is novel in the sense that it moves past the traditional approach of looking just at word level importance scores; it only looks at one specific architecture which is applied to a very simple problem. The authors don't talk about any future directions for this work in the paper itself but a discussion about it happened during the Oral presentation of the paper at ICLR 2018. Following are the important points:<br />
#We could look at interpreting a more complex model, for example, say seq2seq. The author pointed out that he was affirmative that this model could be extended for such purposes although the computational complexity would increase since we would be predicting multiple outputs in this case.<br />
#We could also look at whether this approach could be generalized to completely different architectures like CNN. A later related approach attempted to interpret Neural Networks With Nearest Neighbors to provide a metric that helps to create feature importance values (Reference, Eric Wallace, Shi Feng, Jordan Boyd-Graber: Interpreting Neural Networks With Nearest Neighbors). As of now given a new model, we need to manually work out the math for the specific model. Could we develop some general approach towards this? Although the author pointed out that they are working towards using this approach to interpret CNNs.<br />
# The big assumption in DCN networks is that a non-linear function of multiple inputs can be broken down in sum of the non-linear functions applied to the inputs separately. This clearly does not hold for the 1D case, <math>\tanh(0.3+0.7) = 0.7615941559557649 \neq (\tanh(0.3)+ \tanh(0.7))=0.8956803895687545</math> . The authors do not provide justification for why this hold for the high dimensional case. In this paper, the DCN+ model is presented which further builds on this assumption. If the multiple inputs are from different sources (input and previous states), ordering of the terms is important. To combat this, the authors consider all permutations of the inputs and take an average. However, when the model is finalized, the relative positions of the different input sources do not change during execution time. Hence it is not clear why this averaging is needed. Again, the authors do not provide the reasoning for this assumption.<br />
* It would be an exciting prospect for future work to compare the output of the algorithms with human given scores on a small subset of words.<br />
<br />
==References==<br />
W. James Murdoch, Peter J. Liu, Bin Yu. Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs. ICLR 2018<br />
<br />
Sebastian Bach, Alexander Binder, Gregoire Montavon, Frederick Klauschen, Klaus-Robert Muller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural Computation, 9(8): 1735–1780, 1997.<br />
<br />
Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.<br />
<br />
Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016. URL http://arxiv.org/abs/1612.08220.<br />
<br />
W James Murdoch and Arthur Szlam. Automatic rule extraction from long short-term memory networks. ICLR, 2017.<br />
<br />
Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.<br />
<br />
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017<br />
<br />
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.<br />
<br />
Hendrik Strobelt, Sebastian Gehrmann, Bernd Huber, Hanspeter Pfister, and Alexander M Rush. Visual analysis of hidden state dynamics in recurrent neural networks. arXiv preprint arXiv:1606.07461, 2016.<br />
<br />
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. CoRR, abs/1703.01365, 2017. URL http://arxiv.org/abs/1703.01365<br />
<br />
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657, 2015<br />
<br />
Jamie Murdoch, Beyond Word Importance: Contextual Decomposition for Interpreting LSTMs. https://www.youtube.com/watch?v=GjpGAyJenCM<br />
<br />
Eric Wallace, Shi Feng, Jordan Boyd-Graber: Interpreting Neural Networks With Nearest Neighbors, URL https://arxiv.org/abs/1809.02847<br />
<br />
Nvidia Corporation: Long Short-Term Memory (LSTM), URL https://developer.nvidia.com/discover/lstm, Accessed: October 21, 2018<br />
<br />
==Appendix==<br />
<br />
[[File:Dhruv Figure4.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure2.png|600px|right]]<br />
<br />
<br />
[[File:Dhruv Figure3.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure1.png|600px|right]]</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs&diff=41661stat946F18/Beyond Word Importance Contextual Decomposition to Extract Interactions from LSTMs2018-11-27T23:30:03Z<p>D35kumar: /* Critique/Future Work */ Removing the incorrect critique</p>
<hr />
<div>== Introduction ==<br />
The main reason behind the recent success of Long Short-Term Memory Networks (LSTM) and deep neural networks has been their ability to model complex and non-linear interactions. Our inability to fully comprehend these relationships has led to these state-of-the-art models being regarded as black-boxes. It is not always possible to know how the prediction was made, where it came from and how to understand the workings underneath. The paper "Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs" by W. James Murdoch, Peter J. Liu, and Bin Yu propose an interpretation algorithm called Contextual Decomposition (CD) for analyzing individual predictions made by the LSTMs without any change to the underlying original model. The problem of sentiment analysis is chosen for the evaluation of the model, with a core focus of this work being the explainability of a prediction. <br />
<br />
<br />
Contextual Decomposition is the method introduced in this paper. It extracts information about which words contributed the maximum and minimum amounts towards LSTM prediction, and also how they were combined in order to yield the final prediction. The LSTM output is mathematically decomposed and the contributions are disambiguated at each step by different parts of the sentence. In the application domain, this paper shows how the contextual decomposition method is used to successfully extract positive and negative negations from an LSTM. This paper also shows that the prior interpretation methods have document-level<br />
information built into them in complex, unspecified ways. For example, in the prior work, strongly negative phrases contained within positive reviews are viewed as neutral, or even positive.<br />
<br />
==Intuition of the paper==<br />
<br />
If we consider a sentence in an Amazon review stating "This bag was good" it is very clearly a positive sentence, where the word "good" contributes maximum towards the positivity. On the other hand, if the review reads "This bag was not good" this becomes a negative review. Note that the negative review is not highly influenced by any individual word; most of the influence stems from an interaction between two words "not" and "good". This interaction modleled by the authors gives them an extra degree of freedom. Thus, they can have a better interpretation of the model. In this paper, the authors focus on this interaction between words for studying model efficiency and explainability. (Reference, author's talk in: https://www.youtube.com/watch?v=GjpGAyJenCM).<br />
<br />
==Overview of previous work==<br />
<br />
There has been research towards developing methods to understand the evaluations provided by LSTMs. Some of them are in line with the work done in this particular paper while others have followed some different approaches.<br />
<br />
Approaches similar to the one provided in this paper - All of the approaches in this category have tried to look into computing just the word-level importance scores with varying evaluation methods.<br />
#Murdoch & Szlam (2017): introduced a decomposition of the LSTM's output embedding and learned the importance score of certain words and phrases from those words (sum of the importance of words). A classifier is then built that searches for these learned phrases that are important and predicts the associated class which is then compared with the output of the LSTMs for validation.<br />
#Li et al. (2016): (Leave one out) They observed the change in log probability of a function by replacing a word vector (with zero) and studying the change in the prediction of the LSTM. It is completely anecdotal. Additionally provided an RL model to find the minimal set of words that must be erased to change the model's decision (although this is irrelevant to interpretability).<br />
#Sundararajan et al. 2017 (Integrated Gradients): a general gradient based technique to learn importance evaluated theoretically and empirically. Built up as an improvement to methods which were trying to do something quite similar. It is tested on image, text and chemistry models.<br />
<br />
Decomposition-based approaches for CNN:<br />
#Bach et al. 2015: proposed a solution to the problem of understanding classification decisions of CNNs by pixel-wise decomposition. Pixel contributions are visualized as heat maps.<br />
#Shrikumar et al. 2016 (DeepLift): an algorithm based on a method similar to backpropagation to learn the importance score for the inputs for a given output. The algorithm to learn these input scores is not dependent on the gradient therefore learning can also happen if the gradient is zero during backpropagation.<br />
<br />
Focussing on analyzing the gate activations:<br />
#Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. For eg. a cell being activated for keeping track of open parathesis or quotes.<br />
#Strobelt et al. 2016: built a visual tool for understanding and analyzing raw gate activations.<br />
<br />
Attention-based models:<br />
#Bahdanau et al. (2014): These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. But the problem with it is twofold. Firstly, it is only an indirect indicator of importance and does not provide directionality. Secondly, they have not been evaluated empirically or otherwise as an interpretation technique. Although they have been used in multiple other applications and architectures for solving a variety of problems.<br />
<br />
==Long Short-Term Memory Networks==<br />
Over the past few years, LSTM has become a core component of neural NLP systems and sequence modeling systems in general. LSTMs are a special kind of Recurrent Neural Network(RNNs) which in many cases work better than the standard RNN by solving the vanishing gradient problem. To put it simply they are much more efficient in learning long-term dependencies. Like a standard RNN, LSTMs are made up of chains of repeating modules. The difference is that the modules are little more complicated. Instead of having a single tanh layer like in an RNN, they have four (called gates), interacting in a special way. Additionally, they have a cell state which runs through the entire chain of the network. It helps in managing the information from the previous cells in the chain.<br />
<br />
Let's now define it more formally and mathematically. Given a sequence of word embeddings <math>x_1, ..., x_T \in R^{d_1}</math>, a cell and state vector <math>c_t, h_t \in R^{d_2}</math> are computed for each element by iteratively applying the below equations, with initializations <math>h_0 = c_0 = 0</math>.<br />
<br />
\begin{align}<br />
o_t = \sigma(W_ox_t + V_oh_{t−1} + b_o)<br />
\end{align}<br />
\begin{align}<br />
f_t = \sigma(W_fx_t + V_fh_{t−1} + b_f)<br />
\end{align}<br />
\begin{align}<br />
i_t = \sigma(W_ix_t + V_ih_{t−1} + b_i)<br />
\end{align}<br />
\begin{align}<br />
g_t = tanh(W_gx_t + V_gh_{t−1} + b_g)<br />
\end{align}<br />
\begin{align}<br />
c_t = f_t \odot c_{t−1} + i_t \odot g_t<br />
\end{align}<br />
\begin{align}<br />
h_t = o_t \odot tanh(c_t)<br />
\end{align}<br />
<br />
Where <math>W_o, W_i, W_f , W_g \in R^{{d_1}×{d_2}} , V_o, V_f , V_i , V_g \in R^{{d_2}×{d_2}}, b_o, b_g, b_i, b_g \in R^{d_2} </math> and <math> \odot </math> denotes element-wise multiplication. <math> o_t, f_t </math> and <math> i_t </math> are often referred to as output, forget and input gates, respectively, due to the fact that their values are bounded between 0 and 1, and that they are used in element-wise multiplication. Intuitively we can think of the forget gate as how much previous memory(information) do we want to forget; input gate as controlling whether or not to let new input in; g gate controlling what do we want to add and finally the output gate as controlling how much the current information(at current time step) should flow out.<br />
<br />
A visualization of an LSTM can be seen below (Reference, Nvidia Corporation: Long Short-Term Memory (LSTM)). Note that both the input and recurrent elements are seen 'entering' the cell in various locations. The output, and recurrent elements, can be seen 'exiting' the cell at the top of the image.<br />
<br />
[[File: WF_SecCont_01_lstm.png|600px|center]]<br />
<br />
After processing the full sequence of words, the final state <math>h_T</math> is fed to a multinomial logistic regression, to return a probability distribution over C classes.<br />
<br />
\begin{align}<br />
p_j = SoftMax(Wh_T)_j = \frac{\exp(W_jh_T)}{\sum_{k=1}^C\exp(W_kh_T) }<br />
\end{align}<br />
<br />
==Contextual Decomposition(CD) of LSTM==<br />
CD decomposes the output of the LSTM into a sum of two contributions:<br />
# those resulting solely from the given phrase<br />
# those involving other factors<br />
<br />
One important thing that is crucial to understand is that this method does not affect the architecture or the predictive accuracy of the model in any way. It just takes the trained model and tries to break it down into the two components mentioned above. It takes in a particular phrase that the user wants to understand or the entire sentence and returns the vectors with the contributions.<br />
<br />
Now let's define this more formally. Let the arbitrary input phrase be <math>x_q, ..., x_r</math>, where <math>1 \leq q \leq r \leq T </math>, where T represents the length of the sentence. CD decomposes the output and cell state (<math>c_t, h_t </math>) of each cell into a sum of 2 contributions as shown in the equations below.<br />
<br />
\begin{align}<br />
h_t = \beta_t + \gamma_t<br />
\end{align}<br />
\begin{align}<br />
c_t = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
In the decomposition <math>\beta_t </math> corresponds to the contributions given to <math> h_t </math> solely from the given phrase while <math> \gamma_t </math> denotes contribution atleast in part from the other factors. Similarly, <math>\beta_t^c </math> and <math> \gamma_t^c </math> represents the contributions given to <math> c_t </math> solely from the given phrase and atleast in part from the other factors respectively.<br />
<br />
Using this decomposition the softmax function can be represented as follows<br />
\begin{align}<br />
p = SoftMax(W\beta_T + W\gamma_T)<br />
\end{align}<br />
<br />
As this score corresponds to the input to a logistic regression, it may be interpreted in the same way as a standard logistic regression coefficient.<br />
<br />
===Disambiguation Interaction between gates===<br />
<br />
In the equations for the calculation of <math>i_t </math> and <math>g_t </math> in the LSTM, we use the contribution at that time step, <math>x_t</math> as well the output of the previous state <math>h_t</math>. Therefore when the <math>i_t \odot g_t</math> is calculated, the contributions made by <math>x_t</math> to <math>i_t</math> interact with contributions made by <math>h_t</math> to <math>g_t</math> and vice versa. This insight is used to construct the decomposition.<br />
<br />
At this stage we need to make an assumption that the non-linear operations at the gate can be represented in a linear fashion. How this is done will be explained in a later part of the summary. Therefore writing equations 1 as a linear sum of contributions from the inputs we have<br />
\begin{align}<br />
i_t &= \sigma(W_ix_t + V_ih_{t−1} + b_i) \\<br />
& = L_\sigma(W_ix_t) + L_\sigma(V_ih_{t−1}) + L_\sigma(b_i)<br />
\end{align}<br />
<br />
The important thing to notice now is that after using this linearization, the products between gates also become linear sums of contributions from the 2 factors mentioned above. To expand we can learn whether they resulted solely from the phrase (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\beta_{t-1})</math>), solely from the other factors (<math>L_\sigma(b_i) \odot L_{tanh}(V_g\gamma_{t-1})</math>) or as an interaction between the phrase and other factors (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1})</math>).<br />
<br />
Since we are able to calculate gradients values recursively in LSTMs, we would use the same procedure to recursively compute the decompositions with the initializations <math>\beta_0 = \beta_0^c = \gamma_0 = \gamma_0^c = 0</math>. The derivations can vary a little depending on cases whether the current time step is contained within the phrase (<math> q \leq t \leq r </math>) or not(<math> t < q, t > r</math>). In this summary, we will derive the equations for the former case. A very important thing to understand is that given any word/phrase within a sentence this algorithm would make a full pass over the LSTM to compute the 2 different contributions.<br />
<br />
So essentially what we are going to do now is linearize each of the gates, then expand the product of sums of these gates and then finally group the terms we get depending on which type of interaction they represent (solely from phase, solely from other factors and a combination of both).<br />
<br />
Terms are determined to be derived solely from the specific phrase if they involve products from some combination of <math>\beta_{t-1}, \beta_{t-1}^c, x_t</math> and <math> b_i </math> or <math> b_g </math>(but not both). In the other case when t is not within the phrase, products involving <math> x_t </math> are treated as not deriving from the phrase. This(the other case) can be observed by seeing the equations for this specific case in the appendix of the original paper.<br />
<br />
\begin{align}<br />
f_t\odot c_{t-1} &= (L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_f\gamma_{t-1}) + L_\sigma(b_f)) \odot (\beta_{t-1}^c + \gamma_{t-1}^c) \\<br />
& = ([L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(b_f)] \odot \beta_{t-1}^c) + (L_\sigma(V_f\gamma_{t-1}) \odot \beta_{t-1}^c + f_t \odot \gamma_{t-1}^c) \\<br />
& = \beta_t^f + \gamma_t^f<br />
\end{align}<br />
<br />
Similarly<br />
<br />
\begin{align}<br />
i_t\odot g_t &= [(L_\sigma(W_ix_t) + L_\sigma(V_i\beta_{t-1}) + L_\sigma(V_i\gamma_{t-1}) + L_\sigma(b_i))] \\<br />
& \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(V_g\gamma_{t-1}) + L_{tanh}(b_g))] \\<br />
& = ([L_\sigma(W_ix_t) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g))] \\ <br />
&+ L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_i\beta_{t-1}) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g))] \\ <br />
&+ L_\sigma(b_i) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1})] \\ <br />
&+ [L_\sigma(V_i\gamma_{t-1}) \odot g_t + i_t \odot L_{tanh}(V_g\gamma_{t-1}) - L_\sigma(V_i\gamma_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1}) \\ <br />
&+ L_\sigma(b_i) \odot L_{tanh}(b_g)] \\<br />
& = \beta_t^u + \gamma_t^u<br />
\end{align}<br />
<br />
Thus we can represent <math>c_t</math> as<br />
<br />
\begin{align}<br />
c_t &= \beta_t^f + \gamma_t^f + \beta_t^u + \gamma_t^u \\<br />
& = \beta_t^f + \beta_t^u + \gamma_t^f + \gamma_t^u \\<br />
& = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
So once we have the decomposition of <math> c_t </math>, then we can rather simply calculate the transformation of <math> h_t </math> by linearizing the <math> tanh</math> function. Again at this point, we just assume that a linearizing function for <math> tanh </math> exists. Similar to the decomposition of the forget and input gate we can decompose the output gate as well but empirically it was found that it did not produce improved results. So finally <math> h_t </math> can be written as<br />
<br />
\begin{align}<br />
h_t &= o_t \odot tanh(c_t) \\<br />
& = o_t \odot [L_{tanh}(\beta_t^c) + L_{tanh}(\gamma_t^c)] \\<br />
& = o_t \odot L_{tanh}(\beta_t^c) + o_t \odot L_{tanh}(\gamma_t^c) \\<br />
& = \beta_t + \gamma_t<br />
\end{align}<br />
<br />
===Linearizing activation functions ===<br />
<br />
This section will explain the big assumption that we took earlier about the linearizing functions <math> L_{\sigma} </math> and <math> L_{tanh} </math>. For arbitrary { <math> y_1, ..., y_N </math> } <math> \in R </math>, the problem that we intend to solve is essentially<br />
\begin{align}<br />
tanh(\sum_{i=1}^Ny_i) = \sum_{i=1}^NL_{tanh}(y_i)<br />
\end{align}<br />
<br />
In cases where {<math> y_i </math>} follow a natural ordering, work in Murdoch & Szlam, 2017 where the difference of partial sums is utilized as a linearization technique could be used. This could be shown by the equation below<br />
\begin{align}<br />
L^{'}_{tanh}(y_k) = tanh(\sum_{j=1}^ky_j) - tanh(\sum_{j=1}^{k-1}y_j)<br />
\end{align}<br />
<br />
But in our case the terms do not follow any particular ordering, for e.g. while calculating <math> i_t </math> we could write it as a sum of <math> W_ix_t, V_ih_{t−1}, b_i </math> or <math> b_i, V_ih_{t−1}, W_ix_t </math>. Thus, we average over all the possible orderings. Let <math>\pi_i, ..., \pi_{M_n} </math> denote the set of all permutations of <math>1, ..., N</math>, then the score could be given as below<br />
<br />
\begin{align}<br />
L_{tanh}(y_k) = \frac{1}{M_N}\sum_{i=1}^{M_N}[tanh(\sum_{j=1}^{\pi_i^{-1}(k)}y_{\pi_i(j)}) - tanh(\sum_{j=1}^{\pi_i^{-1}(k) - 1}y_{\pi_i(j)})]<br />
\end{align}<br />
<br />
We can similarly derive <math> L_{\sigma} </math>. An important empirical observation to note here is that in the case when one of the terms of the decomposition is a bias, improvements were seen when restricting to permutations where the bias was the first term.<br />
In our case, the value of N only ranges from 2 to 4, which makes the linearization take very simple forms. An example of a case where N=2 is shown below.<br />
<br />
\begin{align}<br />
L_{tanh}(y_1) = \frac{1}{2}([tanh(y_1) - tanh(0)] + [tanh(y_2 + y_1) - tanh(y_1)])<br />
\end{align}<br />
<br />
==Experiments==<br />
As mentioned earlier, the empirical validation of CD is done on the task of sentiment analysis. The paper verifies the following 3 tasks with the experiments:<br />
# It should work on the standard problem of word-level importance scores<br />
# It should behave for words as well as phrases especially in situations involving compositionality.<br />
# It should be able to extract instances of positive and negative negation.<br />
<br />
An important fact worth mentioning again is that the primary objective of the paper is to produce meaningful interpretations on a pre-trained LSTM model rather than achieving the state-of-the-art results on the task of sentiment analysis. Therefore standard practices are used for tuning the models. The models are implemented in Torch using default parameters for weight initializations. The code can be found at "https://github.com/jamie-murdoch/ContextualDecomposition". The model was trained using Adam with the learning rate of 0.001 and using early stopping on the validation set. Additionally, a bag of words linear model was used.<br />
<br />
All the experiments were performed on the Stanford Sentiment Treebank(SST) [Socher et al., 2013] dataset and the Yelp Polarity(YP) [Zhang et al., 2015] dataset. SST is a standard NLP benchmark which consists of movie reviews ranging from 2 to 52 words long. It is important to note that the SST dataset has one key feature that is perfect for this task, which is in addition to review-level labels, <br />
it also provides labels for each phrase in the binarized constituency parse tree. This enables us to examine that if the model can identify negative phrases out of a positive review, or vice versa. The word embedding used in LSTM is pretrained Glove vectors with length equal to 300, and the hidden representations of the LSTM is set to be 168. The LSTM model attained an accuracy of 87.2% whereas the logistic regression model with the bag of words features attained an accuracy of 83.2%. In the case of YP, the task is to perform a binary sentiment classification task. The reviews considered were only which were of at most 40 words. The LSTM model attained a 4.6% error as compared the 5.7% error for the regression model.<br />
<br />
<br />
===Baselines===<br />
The interpretations are compared with 4 state-of-the-art baselines for interpretability.<br />
# Cell Decomposition(Murdoch & Szlam, 2017), <br />
# Integrated Gradients (Sundararajan et al., 2017),<br />
# Leave One Out (Li et al., 2016),<br />
# Gradient times input [gradient of the output probability with respect to the word embeddings is computed which is finally reported as a dot product with the word vector]<br />
<br />
To obtain phrase scores for word-based baselines integrated gradients, cell decomposition, and gradients, the paper sums the scores of the words contained within the phrase.<br />
<br />
<br />
===Unigram(Word) Scores===<br />
Logistic regression(LR) coefficients while being sufficiently accurate for prediction are considered the gold standard for interpretability. For the task of sentiment analysis, the importance of the words is given by their coefficient values. Thus we would expect the CD scores extracted from an LSTM, to have meaningful relationships and comparison with the logistic regression coefficients. This comparison is done using scatter plots(Fig 4) which measures the Pearson correlation coefficient between the importance scores extracted by LR coefficients and LSTM. This is done for multiple words which are represented as a point in the scatter plots. For SST, CD and Integrated Gradients, with correlations of 0.76 and 0.72, respectively, are substantially better than other methods, which have correlations of at most 0.51. On Yelp, the gap is not as big, but CD is still very competitive, having correlation 0.52 with other methods ranging from 0.34 to 0.56. The complete results are shown in Table 4.<br />
[[File:Dhruv Table4.png|600px|centre]]<br />
<br />
===Benefits===<br />
Having verified reasonably strong results in the base case, the paper then proceeds to show the benefits of CD.<br />
====Identifying Dissenting Subphrases====<br />
First, the paper shows that the existing methods are not able to recognize sub-phrases in a phrase(a phrase is considered to be of at most 5 words) with different sentiments. For example, consider the phrase "used to be my favorite". The word "favorite" is strongly positive which is also shown by it having a high linear regression coefficient. Nonetheless, the existing methods identify "favorite" as being highly negative or neutral in this context. However, as shown in table 1 CD is able to correctly identify it being strongly positive, and the subphrase "used to be" as highly negative. This particular identification is itself the main reason for using the LSTMs over other methods in text comprehension. Thus, it is quite important that an interpretation algorithm is able to properly uncover how these interactions are being handled. A search across the datasets is done to find similar cases where a negative phrase contains a positive sub-phrase and vice versa. Phrases are scored using the logistic regression over n-gram features and included if their overall score is over 1.5.<br />
[[File:Dhruv Table1.png|600px|centre]]<br />
It is to be noted that for an efficient interpretation algorithm the distribution of scores for these positive and negative dissenting subphrases should be significantly separate with the positive subphrases having positive scores and vice-versa. However, as shown in figure 2, this is not the case with the previous interpretation algorithms.<br />
<br />
====Examining High-Level Compositionality====<br />
The paper now studies the cases where a sizable portion of a review(between one and two-thirds) has a different polarity from the final sentiment. An example is shown in Table 2. SST contains phrase-level sentiment labels too. Therefore the authors conduct a search in SST where a sizable phrase in a review is of the opposite polarity than the review-level label. The figure shows the distribution of the resulting positive and negative phrases for different attribution methods. We should note that a successful interpretation method would have a sizeable gap between these two distributions. Notice that the previous methods fail to satisfy this criterion. The paper additionally provides a two-sample Kolmogorov-Smirnov one-sided test statistic, to quantify this difference in performance. This statistic is a common difference measure for the difference of distributions with values ranging from 0 to 1. As shown in Figure 3 CD gets a score of 0.74 while the other models achieve a score of 0(Cell decomposition), 0.33(Integrated Gradients), 0.58(Leave One Out) and 0.61(gradient). The methods leave one out and gradient perform relatively better than the other 2 baselines but they were the weakest performers in the unigram scores. This inconsistency in other methods performance further strengthens the superiority of CD.<br />
[[File:Dhruv Table2.png|600px|centre]]<br />
<br />
====Capturing Negation====<br />
The paper also shows a way to empirically show how the LSTMs capture negations in a sentence. To search for negations, the following list of negation words were used: not, n’t, lacks, nobody, nor, nothing, neither, never, none, nowhere, remotely. Again using the phrase labels present in SST, the authors search over the training set for instances of negation. Both the positive as well as negative negations are identified. For a given negation phrase, they extract a negation interaction by computing the CD score of the entire phrase and subtracting the CD scores of the phrase being negated and the negation term itself. The resulting score can be interpreted as an n-gram feature. Apart from CD, only leave one out is capable of producing such interaction scores. The distribution of extracted scores is presented in Figure 1. For CD there is a clear distinction between positive and negative negations. Leave one out is able to capture some of the interactions, but has a noticeable overlap between positive and negative negations around zero, indicating a high rate of false negatives.<br />
<br />
====Identifying Similar Phrases====<br />
A key aspect of the CD algorithm is that it helps us learn the value of <math> \beta_t </math> which is essentially a dense embedding vector for a word or a phrase. The way the authors did in the paper is to calculate the AVG(<math> \beta_t </math> ), and then, using similarity measures in the embedding space(eg. cosine similarity) we can easily find similar phrases/words given a phrase/word. The results as shown in Table 3 are qualitatively sensible for 3 different kinds of interactions: positive negation, negative negation and modification, as well as positive and negative words.<br />
[[File:Dhruv Table3.png|600px|centre]]<br />
<br />
==Conclusions==<br />
The paper provides an algorithm called Contextual Decomposition(CD) to interpret predictions made by LSTMs without modifying their architecture. It takes in a trained LSTM and breaks it down in components and quantifies the interpretability of its decision. In both NLP and in general applications CD produces importance scores for words (single variables in general), phrases (several variables together) and word interactions (variable interactions). It also compares the algorithm with state-of-the-art baselines and shows that it performs favorably. It also shows that CD is capable of identifying phrases of varying sentiment and extracting meaningful word (or variable) interactions. It shows the shortcomings of the traditional word-based interpretability approaches for understanding LSTMs and advances the state-of-the-art.<br />
<br />
==Critique/Future Work==<br />
While the method itself is novel in the sense that it moves past the traditional approach of looking just at word level importance scores; it only looks at one specific architecture which is applied to a very simple problem. The authors don't talk about any future directions for this work in the paper itself but a discussion about it happened during the Oral presentation of the paper at ICLR 2018. Following are the important points:<br />
#We could look at interpreting a more complex model, for example, say seq2seq. The author pointed out that he was affirmative that this model could be extended for such purposes although the computational complexity would increase since we would be predicting multiple outputs in this case.<br />
#We could also look at whether this approach could be generalized to completely different architectures like CNN. A later related approach attempted to interpret Neural Networks With Nearest Neighbors to provide a metric that helps to create feature importance values (Reference, Eric Wallace, Shi Feng, Jordan Boyd-Graber: Interpreting Neural Networks With Nearest Neighbors). As of now given a new model, we need to manually work out the math for the specific model. Could we develop some general approach towards this? Although the author pointed out that they are working towards using this approach to interpret CNNs.<br />
* It would be an exciting prospect for future work to compare the output of the algorithms with human given scores on a small subset of words.<br />
<br />
==References==<br />
W. James Murdoch, Peter J. Liu, Bin Yu. Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs. ICLR 2018<br />
<br />
Sebastian Bach, Alexander Binder, Gregoire Montavon, Frederick Klauschen, Klaus-Robert Muller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural Computation, 9(8): 1735–1780, 1997.<br />
<br />
Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.<br />
<br />
Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016. URL http://arxiv.org/abs/1612.08220.<br />
<br />
W James Murdoch and Arthur Szlam. Automatic rule extraction from long short-term memory networks. ICLR, 2017.<br />
<br />
Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.<br />
<br />
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017<br />
<br />
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.<br />
<br />
Hendrik Strobelt, Sebastian Gehrmann, Bernd Huber, Hanspeter Pfister, and Alexander M Rush. Visual analysis of hidden state dynamics in recurrent neural networks. arXiv preprint arXiv:1606.07461, 2016.<br />
<br />
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. CoRR, abs/1703.01365, 2017. URL http://arxiv.org/abs/1703.01365<br />
<br />
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657, 2015<br />
<br />
Jamie Murdoch, Beyond Word Importance: Contextual Decomposition for Interpreting LSTMs. https://www.youtube.com/watch?v=GjpGAyJenCM<br />
<br />
Eric Wallace, Shi Feng, Jordan Boyd-Graber: Interpreting Neural Networks With Nearest Neighbors, URL https://arxiv.org/abs/1809.02847<br />
<br />
Nvidia Corporation: Long Short-Term Memory (LSTM), URL https://developer.nvidia.com/discover/lstm, Accessed: October 21, 2018<br />
<br />
==Appendix==<br />
<br />
[[File:Dhruv Figure4.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure2.png|600px|right]]<br />
<br />
<br />
[[File:Dhruv Figure3.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure1.png|600px|right]]</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Autoregressive_Convolutional_Neural_Networks_for_Asynchronous_Time_Series&diff=41549stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series2018-11-27T09:44:15Z<p>D35kumar: /* Introduction */ Adding details (Technical, Editorial)</p>
<hr />
<div>This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].<br />
<br />
=Introduction=<br />
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive(AR) models and gating systems used in recurrent neural networks and is evaluated on datasets such as hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series, and a UCI household electricity consumption dataset. This paper focusses on time series with multivariate and noisy signals, especially the financial data. Financial time series is challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. price of a stock) is obtained from different sources (e.g. financial news, an investment bank, financial analyst etc.) asynchronously. Each source may have a different bias or noise. (Figure 1) The investment bank with more clients can update their information more precisely than the investment bank with fewer clients, then the significance of each past observations may depend on other factors that changes in time. Therefore, the traditional econometric models such as AR, VAR, VARMA[1] might not be sufficient. However, their relatively good performance could allow us to combine such linear econometric models with deep neural networks that can learn highly nonlinear relationships. This model is inspired by the gating mechanism which is successful in RNNs and Highway Networks.<br />
<br />
The time series forecasting problem can be expressed as a conditional probability distribution below,<br />
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div><br />
Thus, we focus on modeling the predictors of future values of time series given their past values. <br />
The predictability of financial dataset still remains an open problem and is discussed in various publications. ([2])<br />
<br />
[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same CDS2 throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]<br />
<br />
The paper also provides empirical evidence that their model which combines linear models with deep learning models could perform better than just DL models like CNN, LSTMs and Phased LSTMs.<br />
<br />
=Related Work=<br />
===Time series forecasting===<br />
From recent proceedings in main machine learning venues i.e. ICML, NIPS, AISTATS, UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. For this paper, the authors use coupling AR models and neural networks to achieve such combined models.<br />
<br />
Although deep neural networks have been applied into many fields and produced satisfactory results, there still are little literature on deep learning for time series forecasting. More recently, the papers include Sirignano (2016)[7] that used 4-layer perceptrons in modeling price change distributions in Limit Order Books, and Borovykh et al. (2017)[8] who applied more recent WaveNet architecture to several short univariate and bivariate time-series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented augmentation of LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through time gate. <br />
<br />
In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and phase LSTM) on AR-like artificial asynchronous and noisy time series, household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.<br />
<br />
===Gating and weighting mechanisms===<br />
Gating mechanisms for neural networks has ability to overcome the problem of vanishing gradient, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. This composition of functions may lead to popular recurrent architecture such as LSTM and GRU[11].<br />
<br />
The idea of the gating system is aimed to weight outputs of the intermediate layers within neural networks, and is most closely related to softmax gating used in MuFuRu(Multi-Function Recurrent Unit)[12], i.e.<br />
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x), p(x)=softmax(\widehat{p}(x)), </math>, where <math>(f^l)_{l=1}^L </math>are candidate outputs(composition operators in MuFuRu), <math>(\widehat{p}^l)_{l=1}^L </math>are linear functions of inputs. <br />
<br />
This idea is also successfully used in attention networks[13] such as image captioning and machine translation. In this paper, the method is similar as this. The difference is that modelling the functions as multi-layer CNNs. Another difference is that not using recurrent layers, which can enable the network to remember the parts of the sentence/image already translated/described.<br />
<br />
=Motivation=<br />
There are mainly five motivations they stated in the paper:<br />
#The forecasting problem in this paper has done almost independently by econometrics and machine learning communities. Unlike in machine learning, research in econometrics are more likely to explain variables rather than improving out-of-sample prediction power. These models tend to 'over-fit' on financial time series, their parameters are unstable and have poor performance on out-of-sample prediction.<br />
#Although Gaussian processes provide useful theoretical framework that is able to handle asynchronous data, they often follow heavy-tailed distribution for financial datasets.<br />
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series with higher order and have more past observations, the expectation of it <math display="inline">\mathbb{E}[X(t)|{X(t-m), m=1,...,M}]</math> may involve more complicated functions that in general may not allow closed-form expression.<br />
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously, such series at fixed frequency may lead to lose information or enlarge the dataset, which is shown in Figure 2(a). Therefore, the core of proposed architecture SOCNN represents separate dimensions as a single one with dimension and duration indicators as additional features(Figure 2(b)).<br />
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that LSTM may memorize the input values in each step and weight them at the output according to the durations, but this approach may lead to imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.<br />
<br />
[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keep- ing all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]<br />
<br />
<br />
=Model Architecture=<br />
Suppose there's a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>, we want to predict the conditional future values of a subset of elements of <math>x_n</math><br />
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | {x_{n-m}, m=1,2,...}], </math></div><br />
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.<br />
Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>. The estimator of <math>y_n</math> can be expressed as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m ,</math></div><br />
This is summation of the columns of the matrix in bracket, where<br />
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks. S is a fully convolutional network which is composed of convolutional layers only. <math>F</math> is in the form of<br />
<math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [off(x_{n-m}) + x_{n-m}^I)]_{m=1}^M </math> where <math> W \in \mathbb{R}^{d_I \times M}</math> and <math> off: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.<br />
#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T,...,a_{d_I}^T)^T)=(\sigma(a_1)^T,...\sigma(a_{d_I})^T)^T </math><br />
# <math>\otimes</math> is element-wise matrix multiplication.<br />
#<math>A.,_m</math> denotes the m-th column of a matrix A, and <math>\sum_{m=1}^M A.,_m=A(1,1,...,1)^T</math>.<br />
Since <math>\sum_{m=1}^M W.,_m=W(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S.,_m=S(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (off(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div><br />
This is the proposed network, Significance-Offset Convolutional Neural Network, <math>off</math> and <math>S</math> in the equation are corresponding to Offset and Significance in the name respectively.<br />
Figure 3 shows the scheme of network.<br />
<br />
[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]<br />
<br />
The form of <math>\hat{y}_n</math> forced to separate the temporal dependence (obtained in weights <math>W_m</math>). S is determined by its filters which capture local dependencies and are independent of the relative position in time, the predictors <math>off(x_{n-m})</math> are completely independent of position in time. An adjusted single regressor for the target variable is provided by each past observation through the offset network. Since in asynchronous sampling procedure, consecutive values of x come from different signals, and might be heterogenous, therefore adjustment of offset network is important.In addition, significance network provides data-dependent weight for each regressor and sums them up in an autoregressive manner.<br />
<br />
===Relation to asynchronous data===<br />
One common problem of time series is that durations are varying between consecutive observations, the paper states two ways to solve this problem<br />
#Data preprocessing: aligning the observations at some fixed frequency e.g. duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach will tend to loss of information and enlarge the size of the dataset and model complexity.<br />
#Add additional features: Treating duration or time of the observations as additional features, it is the core of SOCNN, which is shown in Figure 2(b).<br />
<br />
===Loss function===<br />
The output of the offset network is series of separate predictors of changes between corresponding observations <math>x_{n-m}^I</math> and the target value<math>y_n</math>, this is the reason why we use auxiliary loss function, which equals to mean squared error of such intermediate predictions:<br />
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||off(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div><br />
The total loss for the sample <math> \textbf{x}_n^{-M},y_n) </math> is then given by:<br />
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div><br />
where <math>\widehat{y}_n</math> was mentioned before, <math>\alpha \geq 0</math> is a constant.<br />
<br />
=Experiments=<br />
The paper evaluated SOCNN architecture on three datasets: artificial generated datasets, [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes sent by several market participants active in the credit derivatives market. Comparing its performance with simple CNN, single and multiplayer LSTM and 25-layer ResNet. Apart from evaluation of the SOCNN architecture the paper also discusses the impact of network components such as: such as auxiliary<br />
loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here]<br />
<br />
==Datasets==<br />
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.<br />
<br />
Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of 7 features, while durations between consecutive observations are ranged from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.<br />
<br />
Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction we want to predict the next quoted price from this given source and direction considering the last 60 quotes.<br />
<br />
==Training details==<br />
They applied grid search on some hyperparameters in order to get the significance of its components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For offset sub-network's depth, they use 1, 10,1 for artificial, electricity and quotes dataset respectively; and they compared the values of <math>\alpha</math> in {0,0.1,0.01}.<br />
<br />
They chose LeakyReLU as activation function for all networks:<br />
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div><br />
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.<br />
<br />
Table 1 presents the configuration of network hyperparameters used in comparison<br />
<br />
[[File:Junyi4.png | 400px|center|]]<br />
<br />
===Network Training===<br />
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio 3 to 1. The remaining 20% of data was used as a test set.<br />
<br />
All models were trained using Adam optimizer, because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.<br />
<br />
They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization in between each convolution and the following activation. <br />
<br />
At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.<br />
<br />
Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]<br />
<br />
The authors carried out the experiments on Tensorflow and Keras and used different GPU to optimize the model for different datasets.<br />
<br />
==Results==<br />
Table 2 shows all results performed from all datasets.<br />
[[File:Junyi5.png | 600px|center|]]<br />
We can see that SOCNN outperforms in all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN almost has the same results with LSTM. Phased LSTM and ResNet have performed really bad on artificial asynchronous dataset and quotes dataset respectively. Notice that having more than one layer of offset network would have negative impact on results. Also, the higher weights of auxiliary loss(<math>\alpha</math>considerably improved the test error on asynchronous dataset, see Table 3. However, for other datasets, its impact was negligible.<br />
[[File:Junyi6.png | 400px|center|]]<br />
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.<br />
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]<br />
<br />
Finally, we want to test the robustness of the proposed model SOCNN, adding noise terms to asynchronous 16 dataset and check how these networks perform. The result is shown in Figure 6.<br />
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]<br />
From Figure 6, the purple line and green line seems staying at the same position in training and testing process. SOCNN and single-layer LSTM are most robust compared to other networks, and least prone to overfitting.<br />
<br />
=Conclusion and Discussion=<br />
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines AR-like weighting mechanism and convolutional neural network. This new architecture is designed for high-noise asynchronous time series, and achieves outperformance in forecasting several asynchronous time series compared to popular convolutional and recurrent networks. <br />
<br />
The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension but needs further empirical studies is that we consider not just <math>1 \times 1</math> convolutional kernels on the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.<br />
<br />
=Critiques=<br />
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.<br />
#The quote data cannot be reached, only two datasets available.<br />
#The 'Significance' network was described as critical to the model in paper, but they did not show how the performance of SOCNN with respect to the significance network.<br />
#The transform of the original data to asynchronous data is not clear.<br />
#The experiments on the main application are not reproducible because the data is proprietary.<br />
#The way that train and test data were splitted is unclear. This could be important in the case of the financial data set.<br />
#Although the auxiliary loss function was mentioned as an important part, the advantages of it was not too clear in the paper. Maybe it is better that the paper describes a little more about its effectiveness.<br />
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.<br />
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.<br />
#The paper uses financial/economic datas as one of its testing data set. Instead of comparing neural network models such as CNN which is known to work badly on time series data, it would be much better if the author compared to well-known econometric time series models such as GARCH and VAR.<br />
<br />
=References=<br />
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994. <br />
<br />
[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.<br />
<br />
[3] Petelin, D., Sˇindela ́ˇr, J., Pˇrikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.<br />
<br />
[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.<br />
<br />
[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.<br />
<br />
[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.<br />
<br />
[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.<br />
<br />
[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Condi- tional time series forecasting with convolutional neural networks, March 2017.<br />
<br />
[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learn- ing in finance, February 2016.<br />
<br />
[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Acceler- ating recurrent network training for long or event-based sequences. In Advances In Neural Information Process- ing Systems, pp. 3882–3890, 2016.<br />
<br />
[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Em- pirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.<br />
<br />
[12] Weissenborn, D. and Rockta ̈schel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.<br />
<br />
[13] Cho, K., Courville, A., and Bengio, Y. Describing multi- media content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.<br />
<br />
[14] Glorot, X. and Bengio, Y. Understanding the dif- ficulty of training deep feedforward neural net- works. In In Proceedings of the International Con- ference on Artificial Intelligence and Statistics (AIS- TATSaˆ10). Society for Artificial Intelligence and Statistics, 2010.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=41548DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-11-27T09:24:47Z<p>D35kumar: /* Related Work */ Adding Related work (Technical)</p>
<hr />
<div>=Introduction=<br />
With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. However, within several areas, like eg: computation social science, interpretability is utmost. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.<br />
<br />
Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
*) Conduct individual tests for all features' combination such as ANOVA and Additive Groves.<br />
*) Define all interaction forms of interest, then later finds the important ones.<br />
- The paper's goal is to detect interactions without compromising the functional forms.<br />
<br />
2. Interpretability: A lot of work has also been done in this particular area and it can be divided it the following broad categories:<br />
*) Feature Importance through Decomposition: Methods like Input Gradient(Sundararajan et al., 2017) learns the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al(2017), Murdoch(2017) and Murdoch(2018) study interpretability of LSTMs by looking at phrase and word level importance scores. Bach et al. 2015 and Shrikumar et al. 2016 (DeepLift) study pixel importance in CNNs.<br />
*) Studying Visualizations in Models - Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. (Yosinski et al., 2015 studies feature map visualizations. <br />
*) Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. Looking at the results of these type of model an indirect sense of interpretability can be gauged.<br />
<br />
The approach in this paper is to extract interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.<br />
<br />
1. Vector: Vectors are defined with bold-lowercases, '''v, w'''<br />
<br />
2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''<br />
<br />
3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}<br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.<br />
<br />
Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.<br />
<br />
One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.<br />
<br />
The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.<br />
}<br />
Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space.<br />
<br />
[[File:network1.PNG|500px|center]]<br />
<br />
==Measuring influence in hidden layers==<br />
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.<br />
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.<br />
<br />
==Quantifying influence==<br />
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.<br />
<br />
Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
=Cut off Model=<br />
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut off model which is a generalized additive model (GAM) as below,<br />
<br />
[[File:gam1.PNG|900px|center]]<br />
<br />
From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.<br />
<br />
=Experiment=<br />
For the experiment, we are going to compare three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
For the experiment, we are going to test on 10 synthetic functions.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
And the author also reported the results of comparisons between the models. As you can see, neural network based models are performing better in average. Compare to the traditional methods liek ANOVA, MLP and MLP-M method shows 20% increases in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly catch the most influential pair-wise interactions.<br />
<br />
=Limitations=<br />
Even though for the above synthetic experiment MLP methods showed superior performances, the method still have some limitations. For example, fir the function like, <math>x_1x_2 + x_2x_3 + x_1x_3</math>, neural network fails to distinguish between interlinked interactions to single higher order interaction. Moreoever, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are presented most of interaction detection algorithms. <br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets that are too small or too noisy, which often occurs in practical settings. <br />
<br />
=Conclusion=<br />
Here we presented the method of detecting interactions using MLP. Compared to other state-of-the-art methods like Additive Groves (AG), the performances are competitive yet computational powers required is far less. Therefore, it is safe to claim that the method will be extremly useful for practitioners with (comparably) less computational powers. Moreover, the NIP algorithm successfully reduced the computation sizes. After all, the most important aspect of this algorithm is that now users of nueral networks can impose interpretability in the model usage, which will change the level of usability to another level for most of practitioners outside of those working in machine learning and deep learning areas.<br />
<br />
=Critique=<br />
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
=Reference=<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=41547DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-11-27T09:06:34Z<p>D35kumar: /* Introduction */ (Technical and Editorial)</p>
<hr />
<div>=Introduction=<br />
With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. However, within several areas, like eg: computation social science, interpretability is utmost. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.<br />
<br />
Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
a) Conduct individual tests for all features' combination such as ANOVA and Additive Groves.<br />
B) Define all interaction forms of interest, then later finds the important ones.<br />
- The paper's goal is to detect interactions without compromising the functional forms.<br />
<br />
2. Interpretability<br />
- Identifying the feature importance from input gradients (Hechtlinger, 2016; Ross et al., 2017; Sundararajan et al., 2017)<br />
- Feature map visualization (Yosinski et al., 2015)<br />
- The authors' approach is to extract interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.<br />
<br />
1. Vector: Vectors are defined with bold-lowercases, '''v, w'''<br />
<br />
2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''<br />
<br />
3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}<br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.<br />
<br />
Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.<br />
<br />
One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.<br />
<br />
The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.<br />
}<br />
Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space.<br />
<br />
[[File:network1.PNG|500px|center]]<br />
<br />
==Measuring influence in hidden layers==<br />
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.<br />
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.<br />
<br />
==Quantifying influence==<br />
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.<br />
<br />
Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
=Cut off Model=<br />
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut off model which is a generalized additive model (GAM) as below,<br />
<br />
[[File:gam1.PNG|900px|center]]<br />
<br />
From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.<br />
<br />
=Experiment=<br />
For the experiment, we are going to compare three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
For the experiment, we are going to test on 10 synthetic functions.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
And the author also reported the results of comparisons between the models. As you can see, neural network based models are performing better in average. Compare to the traditional methods liek ANOVA, MLP and MLP-M method shows 20% increases in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly catch the most influential pair-wise interactions.<br />
<br />
=Limitations=<br />
Even though for the above synthetic experiment MLP methods showed superior performances, the method still have some limitations. For example, fir the function like, <math>x_1x_2 + x_2x_3 + x_1x_3</math>, neural network fails to distinguish between interlinked interactions to single higher order interaction. Moreoever, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are presented most of interaction detection algorithms. <br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets that are too small or too noisy, which often occurs in practical settings. <br />
<br />
=Conclusion=<br />
Here we presented the method of detecting interactions using MLP. Compared to other state-of-the-art methods like Additive Groves (AG), the performances are competitive yet computational powers required is far less. Therefore, it is safe to claim that the method will be extremly useful for practitioners with (comparably) less computational powers. Moreover, the NIP algorithm successfully reduced the computation sizes. After all, the most important aspect of this algorithm is that now users of nueral networks can impose interpretability in the model usage, which will change the level of usability to another level for most of practitioners outside of those working in machine learning and deep learning areas.<br />
<br />
=Critique=<br />
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
=Reference=<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41546DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-27T08:42:39Z<p>D35kumar: Adding Intuition {Technical}</p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size ''' <br />
<br />
Link: [[https://arxiv.org/pdf/1711.00489.pdf]]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
Nowadays, it is a common practice to not have a singular steady learning rate for the learning phase of the neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minima it is beneficial for us to take large steps towards it as it would require a lesser number of steps to reach but as we approach it our step size should decrease otherwise we may just keep oscillating around the minima. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the parameter updates required to reach the minima, thus leading to greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used in deep learning training process due to finding minima that generalizes well(Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow and takes lots of time. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to try to speed up this optimization process by taking bigger steps, and hence reduce the number of parameter updates in training a model by using large batch training, which can be divided across many machines. <br />
<br />
However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) believed that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N number of training samples, and B batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and optimum fluctuation scale g for a maximum test set accuracy.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate with the same number of training epochs in decreasing the scale of random fluctuations, but with remarkably less number of parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math> or even more reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math> although the later decreases the test accuracy. This has been demonstrated by several experiments on the ImageNet and CIFAR-10 datasets using ResNet-50 and Inception-ResNet-V2 architectures respectively.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function:<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum <br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.<br />
<br />
To change the batch size, Smith and Le (2017) proposed to interpret SGD as integrating this stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents cost function, w represents the parameters, and η represents the Gaussian random noise. Furthermore, they proved that noise scale g controls the random fluctuations by this formula: <math> g = \epsilon (\frac{N}{B}-1) </math>. As we usually have <math> B \ll N </math>, we can define <math> g \approx \epsilon \frac{N}{B} </math>. This explains why when the learning rate decreases, g decreases. In addition, increasing the batch size, has the same effect and makes g decays. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, then the conventional way of decaying the learning rate is followed.<br />
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.<br />
<br />
Smith and Le (2017) found that there is an optimal batch size which corresponds to optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> that corresponds to maximum test set accuracy. This means that gradient noise is helpful as it makes SGD escape sharp minima, which does not generalize well. <br />
<br />
Simulated Annealing is a famous technique in non-convex optimization. Starting with noise in the training process helps us to discover a wide range of parameters then once we are near the optimum value, noise is reduced to fine tune our final parameters. For instance, in physical sciences, decaying the temperature in discrete steps can make the system stuck in a local minimum while slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps to converge to the global minimum.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' <math> \epsilon_eff = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) included momentum to the equation of the vanilla SGD noise scale that was defined above to be: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which is the same as the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze momentum update equations below:<br />
<br />
<math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
<br />
We can see that the Accumulation variable A, which is initially set to 0, then increases exponentially to reach its steady state value during <math> \frac{B}{N(1-m)} </math> training epochs while <math> \Delta w </math> is suppressed. Moreover, at high momentum, we have three challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients. <br />
<br />
3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Schedules used as in the below figure:''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the beginning, the learning rate is constant and batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule that will be used if there is a hardware limit affecting a maximum batch size limit.<br />
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the below figure: in the left figure (2a), we can observe that for the training set, the three learning curves are exactly the same while in figure 2b, increasing the batch size has a huge advantage of reducing the number of parameter updates.<br />
This concludes that noise scale is the one that needs to be decayed and not the learning rate itself<br />
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To make sure that these results are the same for the test set as well, in figure 3, we can see that the three learning curves are exactly the same for SGD with momentum, and Nesterov momentum<br />
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check for other optimizers as well. the below figure shows the same experiment as in figure 3, which is the three learning curves for test set, but for vanilla SGD and Adam, and showing <br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increase during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increase during training, and an increased momentum coefficient of 0.98.<br />
<br />
The results of all training schedules, which are presented in the below figure, are documented in the following table:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The below table shows the number of parameter updates and accuracy for different set of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while modifying the number of TPU devices, batch size, learning rate, and then calculating the time to complete 90 epochs, and measuring the accuracy, and performed the following experiments below:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted Stochastic gradient descent as stochastic differential equation, which the paper built on this idea to include decaying learning rate.<br />
<br />
- Mandt et al. (2017) analyzed how SGD perform in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017) and successfully trained ResNet-50 on ImageNet in one hour after discovering the proportionality relationship between batch size and learning rate.<br />
<br />
- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is appling different learning rates to train ImageNet in 14 minutes and 74.9% accuracy. <br />
<br />
- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing batch size during training has the same benefits of decaying the learning rate in addition to reducing the number of parameter updates, which corresponds to faster training time. Experiments were performed on different image datasets and various optimizers with different training schedules to prove this result. The paper proposed to increase increase the learning rate and momentum parameter m, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates, but slightly less test set accuracy as mentioned in details in the experiments’ section. In summary, on ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the great findings of this paper is that literature parameters were used, and no hyper parameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- Had several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results does not seem to be too surprising in comparison with other recent papers in the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas. However, it lacks of mathematical and theoretical analysis beyond the idea. Since the experiment is primary on image dataset and it does not provide sufficient theories, the paper itself presents limited applicability to other types. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates.arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and J¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and DavidMBlei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=41540policy optimization with demonstrations2018-11-27T07:40:01Z<p>D35kumar: /* Introduction */ Adding point wise contribution [Technical]</p>
<hr />
<div>= Introduction =<br />
<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policies to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore states that have never been seen. 2) Guide the agent to imitate a demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method from demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. To summarize, the authors bring forth this idea through the following techniques:<br />
1) A demonstration guided exploration term measuring the divergence between current and the expert policy is added to the policy optimization objective, increasing the similarity to expert-like exploration <br />
2) They say that for better learning from demonstrations and getting an optimization friendly lower bound, the proposed objective could be defined on an occupancy measure as in [14].<br />
3) Finally, they show that the optimization can move towards optimizing the derived lower bound and the generative adversarial training.<br />
The authors also evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explore new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshaped in terms of the native rewards in RL. At present the state of the art exploration in Reinforcement learning is simply epsilon greedy which just makes random moves for a small percentage of times to explore unexplored moves. This is very naive and is one of the main reasons for the high sample complexity in RL. On the other hand, if there is an expert demonstrator who can guide exploration, the agent can make more guided and accurate exploratory moves.<br />
<br />
=Related Work =<br />
There are some related works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD (Deep Q-learning from Demonstrations) [2] which are applied into the discrete action spaces and DDPGfD (Deep<br />
Deterministic Policy Gradient from Demonstrations) [3] which extends this idea to the continuous spaces. But both of them under-utilize the demonstration data.<br />
# There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
<br />
Both of the above methods are effective for imitation learning, but cannot leverage the valuable feedback given by the environments and usually suffer from bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
There is also another idea in which an agent learns using hybrid imitation learning and reinforcement learning reward[23, 24]. However, unlike this paper, they did not provide some theoretical support for their method and only explained some intuitive explanations.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma⟩ </math>, where <math>\mathcal{S}</math> is the state space, <math>\mathcal{A} </math> is the action space, <math>\mathcal{P}(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is the discount factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action probabilities, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
Generally, RL tasks and environments do not provide a comprehensive reward and instead rely on sparse feedback indicating whether the goal is reached.<br />
<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where the i-th trajectory <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. This is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages.<br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy <math>\pi_{old}</math> in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. POfD imposes an additional regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> between <math>\pi_\theta</math> and <math>\pi_{E}</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
In fact, POfD brings another factor, <math>D_{J S}^{max}(\pi_{i}, \pi_{E})</math>, that would fully use the advantage <math>{\hat \delta}</math>and add improvements with a margin over pure policy gradient methods.<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of the Jensen-Shannon divergence instead of directly optimizing the difficult Jensen-Shannon divergence. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>：<br />
[[File:them2.png|450px|center]]<br />
Thus, the occupancy measure matching objective can be written as:<br />
[[File:eqnlm.png|450px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: \mathcal{S}\times \mathcal{A} \rightarrow (0,1)</math> is an arbitrary mapping function followed by a sigmoid activation function used for scaling, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math> as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem closely resembles the minimax problem related to the Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|450px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco (Multi-Joint dynamics with Contact) [5] (Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball. MuJoCo is a physics engine aiming to facilitate research and development in robotics, biomechanics, graphics and animation, and other areas where fast and accurate simulation is needed. In order to get familiar with OpenAI Gym and Mujoco environment, you can watch these videos, respectively: [http://www.mujoco.org/image/home/mujocodemo.mp4 Mujoco], [https://gym.openai.com/v2018-02-21/videos/SpaceInvaders-v0-4184afb3-1223-4ac6-b52b-8e863cbe24a5/original.mp4 OpenAI Gym]). Due to the uniqueness of the environments, the authors introduce 4 ways to sparsify their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisel 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO (Trust Region Policy Optimization) in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converges to the imperfect demonstration in MountainCar [22].<br />
<br />
[[File:pofdf2.png|500px|center]]<br />
<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of accumulated rewards and surpasses other strong baselines training the policy with TRPO. By watching the learning process of different methods, we can see that TRPO consistently fails to explore the environments when the feedback is sparse, except for HalfCheetah. This may be because there is no terminal state in HalfCheetah, thus a random agent can perform reasonably well as long as the time horizon is sufficiently long. This is shown in Figure3 where the improvement of TRPO begins to show after 400 iterations. DDPGfD and GAIL have common drawback: during training process, they both converge to the imperfect demonstration data. For HalfCheetah, GAIL fails to converge and DDPGfD converges to an even worse point. This situation is expected because the policy and value networks tend to over-fit when having few data, so the training process of GAIL and DDPGfD is severely biased by the imperfect data. Finally, our proposed method can effectively explore the environment with the help of demonstration-based intrinsic reward reshaping, and succeeds consistently across different tasks both in terms of learning stability and convergence speed.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
The authors also implement a locomotion task <math>Humanoid</math>, which teaches a human-like robot to walk. The state space of dimension is 376, which is very hard to render. As a result, POfD still outperformed all three baselike methods, as they failed to learn policies in such a sparse reward environment.<br />
<br />
The reacher environment is a task that the target is to control a robot arm to touch an object. the location of the object is random for each instantiation. The environment reward is sparse: every time the arm reaches the ball and holds for a while (e.g., 5 time steps), it receives a reward of +1; otherwise it gets zero reward. The authors select 15 random trajectories as demonstration data, and the performance of POfD is much better than the expert, while all other baseline methods failed.<br />
<br />
=Conclusion=<br />
In this paper, a method, POfD, is proposed that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivness of POfD in encouraging the agent to explore around the nearby region of the expert policy and learn better policies. The key contribution is that POfD helps the agent work with few and imperfect demonstrations in an environment with sparse rewards.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behaviour when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
# The performance of this algorithm hinges on the assumption that expert demonstrations are near optimal in the action space. As seen in figure 3, there appears to be an upper bound to performance near (or just above) the expert accuracy -- this may be an indication of a performance ceiling. In games where near-optimal policies can differ greatly (e.g.; offensive or defensive strategies in chess), the success of the model will depend on the selection of expert demonstrations that are closest to a truly optimal policy (i.e.; just because a policy is the current expert, it does not mean it resembles the true optimal policy).<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.<br />
<br />
[23] Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramar, J., Hadsell, R., de Freitas, N., et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564, 2018.<br />
<br />
[24] Li, Y., Song, J., and Ermon, S. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3815–3825, 2017.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings&diff=41112a neural representation of sketch drawings2018-11-23T04:36:08Z<p>D35kumar: /* Applications and Future Work */ (Technical)</p>
<hr />
<div><br />
== Introduction ==<br />
In this paper, The authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.<br />
<br />
Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do. <br />
<br />
The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).<br />
<br />
=== Terminology ===<br />
Pixel images, also referred to as raster or bitmap images are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, .bmp. <br />
<br />
Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images. <br />
<br />
For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images. <br />
<br />
For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation. <br />
<br />
== Related Work ==<br />
There are some works in the history that used a similar approach to generate images such as Portrait Drawing by Paul the Robot and some reinforcement learning approaches. They work more like a mimic of digitized photographs. There are also some Neural networks based approaches, but those are mostly dealing with pixel images. Little work is done on vector images generation. There are models that use Hidden Markov Models or Mixture Density Networks to generate human sketches, continuous data points or vectorized Kanji characters.<br />
<br />
The model also allows us to explore the latent space representation of vector images. There are previous works that achieved similar functions as well, such as combining Sequence-to-Sequence models with Variational Autoencoder to model sentences into latent space and using probabilistic program induction to model Omniglot dataset.<br />
<br />
The dataset they use contains 50 million vector sketches. Before this paper, there is a Sketch data with 20k vector sketches, a Sketchy dataset with 70k vector sketches along with pixel images, and a ShadowDraw system that used 30k raster images along with extracted vectorized features. They are all comparatively small.<br />
<br />
== Methodology ==<br />
=== Dataset ===<br />
QuickDraw is a dataset with 50 million vector drawings collected by an online game Quick Draw! where the players are required to draw objects belonging to a particular object class in less than 20 seconds. It contains hundreds of classes, each class has 70k training samples, 2.5k validation samples and 2.5k test samples.<br />
<br />
The data format of each sample is a representation of a pen stroke action event. The Origin is the initial coordinate of the drawing. The sketches are points in a list. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math> where x and y are the offset distance in x and y directions from the previous point. The parameters <math>p_{1}, p_{2}, p_{3}</math> represent three possible states in binary one-hot representation where <math>p_{1}</math> indicates the pen is touching the paper, <math>p_{2}</math> indicates the pen will be lifted from here, and <math>p_{3}</math> represents the drawing has ended.<br />
<br />
=== Sketch-RNN ===<br />
[[File:sketchfig2.png|700px|center]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder(VAE). The encoder is a bidirectional RNN, the input is a sketch sequence denoted by <math>S</math> and a reversed sketch sequence denoted by <math>S_{reverse}</math>, so there will be two final hidden states. The output is a size <math>N_{z}</math> latent vector.<br />
<br />
\begin{align*}<br />
h_{ \rightarrow} = encode_{ \rightarrow }(S), <br />
h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), <br />
h = [h_{\rightarrow}; h_{\leftarrow}].<br />
\end{align*}<br />
<br />
Then the authors project <math>h</math> into to <math>\mu</math> and <math>\hat{\sigma}</math>. Both <math>\mu</math> and <math>\hat{\sigma}</math> are vectors of size <math>N_{z}</math>. The projection is performed using a fully connected layer. Then, using the exponential function the authors convert <math>\hat{\sigma}</math> into a non-negative standard deviation parameter denoted by <math>\sigma</math>. The Authors then use <math>\mu</math> and <math>\sigma</math> with <math>\mathcal{N}(0,I)</math> to construct a random vector <math>z\in\mathbb{R}^{N_{z}}</math>.<br />
<br />
\begin{align*}<br />
\mu = W_\mu h + b_\mu, <br />
\hat \sigma = W_\sigma h + b_\sigma, <br />
\sigma = exp( \frac{\hat \sigma}{2}), <br />
z = \mu + \sigma \odot \mathcal{N}(0,I).<br />
\end{align*}<br />
<br />
<br />
Note that <math>z</math> is not deterministic but a conditioned random vector.<br />
<br />
The decoder is an autoregressive RNN. The initial hidden states are generated using <math>[h_0;c_0] = \tanh(W_z z+b_z)</math> where <math>c_0</math> is utilized if applicable. <math>S_0</math> is defined as <math>(0,0,1,0,0)</math>. For each step i in the decoder, the input <math>x_i</math> is the concatenation of previous point <math>S_{i-1}</math> and latent vector <math>z</math>. The output are probability distribution parameters for the next data point <math>S_i</math>. The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model the ground truth data <math>(p_1, p_2, p_3)</math> as categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2\ \text{and}\ q_3</math> sum up to 1. The generated sequence is conditioned from the latent vector <math>z</math> that is sampled from the encoder, which is end-to-end trained together with the decoder.<br />
<br />
\begin{align*}<br />
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi_j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), where \sum_{j=1}^{M}\Pi_j = 1<br />
\end{align*}<br />
<br />
Here the <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is the probability distribution function for <math>x,y</math>. Each of the M bivariate normal distributions being summed have five parameters: <math>(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho_{xy})</math>, where <math>\mu_x, \mu_y</math> are means, <math>\sigma_x, \sigma_y</math> are the standard deviations, and <math>\rho_{xy}</math>is the correlation parameter for these bivariate normal distribution. The <math>\Pi</math> is a length M categorical distribution vector are the mixture weights of the Gaussian mixture model.<br />
<br />
The output vector <math>y_i</math> is generated using a fully-connected forward propagation in the hidden state of the RNN.<br />
<br />
\begin{align*}<br />
x_i = [S_{i-1}; z], <br />
[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), <br />
y_i = W_y h_i + b_y,<br />
y_i \in \mathbb{R}^{6M+3}.<br />
\end{align*}<br />
<br />
The output consists the probability distribution of the next data point.<br />
<br />
\begin{align*}<br />
[(\hat\Pi_1\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_1\ (\hat\Pi_1\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_2\ ...\ (\hat\Pi_1\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_M\ (\hat{q_1}\ \hat{q_2}\ \hat{q_3})] = y_i<br />
\end{align*}<br />
<br />
<math>\exp</math> and <math>\tanh</math> operations are applied to ensure that the standard deviations are non-negative and the correlation value is between -1 and 1.<br />
<br />
\begin{align*}<br />
\sigma_x = \exp (\hat \sigma_x),\ <br />
\sigma_y = \exp (\hat \sigma_y),\ <br />
\rho_{xy} = \tanh(\hat \rho_{xy}). <br />
\end{align*}<br />
<br />
Categorical distribution probabilities for <math>(p_1, p_2, p_3)</math> using <math>(q_1, q_2, q_3)</math> can be obtained as :<br />
<br />
\begin{align*}<br />
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},<br />
k \in \left\{1,2,3\right\}, <br />
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},<br />
k \in \left\{1,...,M\right\}.<br />
\end{align*}<br />
<br />
It is hard for the model to decide when to stop drawing because the probabilities of the three events <math>(p_1, p_2, p_3)</math> are very unbalanced. Researchers in the past have used different weights for each pen event probability, but the authors found this approach lacking elegance and inadequate. They define a hyperparameter representing the max length of the longest sketch in the training set denoted by <math>N_{max}</math>, and set the <math>S_i</math> to be <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>.<br />
<br />
The outcome sample <math>S_i^{'}</math> can be generated in each time step during sample process and fed as input for the next time step. The process will stop when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but conditioned random sequences. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.<br />
<br />
\begin{align*}<br />
\hat q_k \rightarrow \frac{\hat q_k}{\tau}, <br />
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau}, <br />
\sigma_x^2 \rightarrow \sigma_x^2\tau, <br />
\sigma_y^2 \rightarrow \sigma_y^2\tau. <br />
\end{align*}<br />
<br />
The <math>\tau</math> ranges from 0 to 1. When <math>\tau = 0</math> the output will be deterministic as the sample will consist of the points on the peak of the probability density function.<br />
<br />
=== Unconditional Generation ===<br />
There is a special case that only the decoder RNN module is trained. The decoder RNN could work as a standalone autoregressive model without latent variables. In this case, initial states are 0, the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. In the Figure 3, generating sketches unconditionally from the temperature parameter τ = 0.2 at the top in blue, to τ = 0.9 at the bottom in red.<br />
<br />
[[File:sketchfig3.png|700px|center]]<br />
<br />
=== Training ===<br />
The training process is the same as a Variational Autoencoder. The loss function is the sum of Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> can be obtained with generated parameters of pdf and training data <math>S</math>. It is the sum of the <math>L_s</math> and <math>L_p</math>, which are the log loss of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>.<br />
<br />
\begin{align*}<br />
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log(\sum_{i = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x,\Delta y | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})), <br />
\end{align*}<br />
\begin{align*}<br />
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), <br />
L_R = L_s + L_p.<br />
\end{align*}<br />
<br />
<br />
Both terms are normalized by <math>N_{max}</math>.<br />
<br />
<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an IID Gaussian vector with zero mean and unit variance.<br />
<br />
\begin{align*}<br />
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))<br />
\end{align*}<br />
<br />
The overall loss is weighted as:<br />
<br />
\begin{align*}<br />
Loss = L_R + w_{KL} L_{KL}<br />
\end{align*}<br />
<br />
When <math>w_{KL} = 0</math>, the model becomes a standalone unconditional generator. Specially, there will be no <math>L_{KL} </math> term as we only optimize for <math>L_{R} </math>. By removing the <math>L_{KL} </math> term the model approaches a pure autoencoder, meaning it sacrifices the ability to enforce a prior over the latent space and gains better reconstruction loss metrics.<br />
<br />
== Experiments ==<br />
The authors experiment with the sketch-rnn model using different settings and recorded both losses. They used a Long Short-Term Memory(LSTM) model as an encoder and a HyperLSTM as a decoder. They also conduct multi-class datasets. The result is as follows.<br />
<br />
[[File:sketchtable1.png|700px]]<br />
<br />
We could see the trade-off between <math>L_R</math> and <math>L_{KL}</math> in this table clearly.<br />
<br />
=== Conditional Reconstruction ===<br />
The authors assess the reconstructed sketch with a given sketch with different <math>\tau</math> values. We could see that with high <math>\tau</math> value on the right, the reconstructed sketches are more random.<br />
<br />
[[File:sketchfig5.png|700px|center]]<br />
<br />
They also experiment on inputting a sketch from a different class. The output will still keep some features from the class that the model is trained on.<br />
<br />
=== Latent Space Interpolation ===<br />
The authors visualize the reconstruction sketches while interpolating between latent vectors using different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are more coherently interpolated.<br />
<br />
[[File:sketchfig6.png|700px|center]]<br />
<br />
=== Sketch Drawing Analogies ===<br />
Since the latent vector <math>z</math> encode conceptual features of a sketch, those features can also be used to augment other sketches that do not have these features. This is possible when models are trained with low <math>L_{KL}</math> values. The authors are able to perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches base on these latent spaces.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches === <br />
This model is able to predict an incomplete sketch by encoding the sketch into hidden state <math>h</math> using the decoder and then using <math>h</math> as an initial hidden state to generate the remaining sketch. The authors train on individual classes by using decoder-only models and set τ = 0.8 to complete samples. Figure 7 shows the results.<br />
<br />
[[File:sketchfig7.png|700px|center]]<br />
<br />
== Applications and Future Work ==<br />
The authors believe this model can be found useful in many creative applications. For example, the decoder only model could assist artists by suggesting how to finish a sketch, helping them expand their imagination whereas the conditional model can help them find interesting intersections between different drawings or objects, or generating a lot of similar but different designs.<br />
<br />
This model may also find its place on teaching students how to draw. When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetical sketch. Latent vector augmentation could also help to create a better drawing by inputting user-rating data during training processes.<br />
<br />
The authors conclude by providing the following future directions to this work:<br />
# Investigate using user-rating data to augmenting the latent vector in the direction that maximizes the aesthetics of the drawing.<br />
# Look into combining variations of sequence-generation models with unsupervised, cross-domain pixel image generation models.<br />
<br />
It's exciting that they manage to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches.<br />
<br />
== Conclusion ==<br />
This paper introduced an interesting model sketch-rnn that can encode and decode sketches, generate and complete unfinished sketches. The authors demonstrated how to interpolate between latent spaces from a different class and how to use it to augment sketches or generate similar looking sketches. They also showed that it's important to enforce a prior distribution on latent vector while interpolating coherent sketch generations. Finally, they created a large sketch drawings dataset to be used in future research.<br />
<br />
== Critique ==<br />
* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing the generated sketches, it is clear and straightforward, however, not very efficient. It would be great if the authors could present a way, or a metric to evaluate how well the sketches are generated rather than printing them out and evaluate with human judgment.<br />
<br />
* Same problem as the output, the authors didn't present an evaluation for the algorithms either. They provided <math>L_R</math> and <math>L_{KL}</math> for reference, however, a lower loss doesn't represent a better performance.<br />
<br />
* I understand that using strokes as inputs is a novel and innovative move, however, the paper does not provide a baseline or any comparison with other methods or algorithms. Some other researches were mentioned in the paper, using similar and smaller datasets. It would be great if the authors could use some basic or existing methods a baseline and compare with the new algorithm.<br />
<br />
* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.<br />
<br />
* The authors proposed a few future applications for the model, however, the current output seems somehow not very close to their descriptions. But I do believe that this is a very good beginning, with the release of the sketch dataset, it must attract more scholars to research and improve with it!<br />
<br />
* ([https://openreview.net/forum?id=Hy6GHpkCW]) The paper presents both a novel large dataset of sketches and a new RNN architecture to generate new sketches.<br />
<br />
+ new and large dataset<br />
<br />
+ novel algorithm<br />
<br />
+ well written<br />
<br />
- no evaluation of dataset<br />
<br />
- virtually no evaluation of the algorithm<br />
<br />
- no baselines or comparison<br />
<br />
== References == <br />
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.<br />
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.<br />
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.<br />
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.<br />
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.<br />
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.<br />
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.<br />
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.<br />
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.<br />
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.<br />
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.<br />
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.<br />
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. https://quickdraw.withgoogle.com/, 2016. URL https: //quickdraw.withgoogle.com/.<br />
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.<br />
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.<br />
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.<br />
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.<br />
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.<br />
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.<br />
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.<br />
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.<br />
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.<br />
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.<br />
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.<br />
# T. White. Sampling Generative Networks. ArXiv e-prints, September 2016.<br />
#Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.<br />
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-encoders&diff=41107Wasserstein Auto-encoders2018-11-23T04:18:53Z<p>D35kumar: /* Introduction */</p>
<hr />
<div>The first version of this work was published in 2017 and this version (which is the third revision) is presented in ICLR 2018. Source code for the first version is available [https://github.com/tolstikhin/wae here]<br />
<br />
=Introduction=<br />
Early successes in the field of representation learning were based on supervised approaches, which used large labeled datasets to achieve impressive results. On the other hand, popular unsupervised generative modeling methods mainly consisted of probabilistic approaches focusing on low dimensional data. In recent years, there have been models proposed which try to combine these two approaches. One such popular method is called variational auto-encoders (VAEs). VAEs are theoretically elegant but have a major drawback of generating blurry sample images when used for modeling natural images. In comparison, generative adversarial networks (GANs) produce much sharper sample images but have their own list of problems which includes a lack of encoder, harder to train, and the "mode collapse" problem. Mode collapse problem refers to the inability of the model to capture all the variability in the true data distribution. Currently, there has been a lot of activity around finding and evaluating numerous GANs architectures and combining VAEs and GANs but a model which combines the best of both GANs and VAEs is yet to be discovered.<br />
<br />
The work done in this paper builds upon the theoretical work done in [4]. The authors tackle generative modeling using optimal transport (OT). The OT cost is defined as the measure of distance between probability distributions. One of the features of OT cost which is beneficial is that it provides much weaker topology when compared to other costs including f-divergences which are associated with the original GAN algorithms. <br />
This particular feature is crucial in applications where the data is usually supported on low dimensional manifolds in the input space. This results in a problem with the stronger notions of distances such as f-divergences as they often max out and provide no useful gradients for training. In comparison, the OT cost has been claimed to behave much more nicely [5, 8]. Despite the preceding claim, the implementation, which is similar to GANs, still requires the addition of a constraint or a regularization term into the objective function.<br />
<br />
==Original Contributions==<br />
Let <math>P_X</math> be the true but unknown data distribution, <math>P_G</math> be the latent variable model specified by the prior distribution <math>P_Z</math> of latent codes <math>Z \in \mathcal{Z}</math> and the generative model <math>P_G(X|Z)</math> of the data points <math>X \in \mathcal{X}</math> given <math>Z</math>. The goal in this paper is to minimize <math>OT\ W_c(P_X, P_G)</math>.<br />
<br />
The main contributions are given below:<br />
<br />
* A new class of auto-encoders called Wasserstein Auto-Encoders (WAE). WAEs minimize the optimal transport <math>W_c(P_X, P_G)</math> for any cost function <math>c</math>. As is the case with VAEs, WAE objective function is also made up of two terms: the c-reconstruction cost and a regularizer term <math>\mathcal{D}_Z(P_Z, Q_Z)</math> which penalizes the discrepancy between two distributions in <math>\mathcal{Z}: P_Z\ and\ Q_Z</math>. <math>Q_Z</math> is a distribution of encoded points, i.e. <math>Q_Z := \mathbb{E}_{P_X}[Q(Z|X)]</math>. Note that when <math>c</math> is the squared cost and the regularizer term is the GAN objective, WAE is equivalent to the adversarial auto-encoders described in [2].<br />
<br />
* Experimental results of using WAE on MNIST and CelebA datasets with squared cost <math>c(x, y) = ||x - y||_2^2</math>. The results of these experiments show that WAEs have the good features of VAEs such as stable training, encoder-decoder architecture, and a nice latent manifold structure while simultaneously improving the quality of the generated samples.<br />
<br />
* Two different regularizers. One based on GANs and adversarial training in the latent space <math>\mathcal{Z}</math>. The other one is based on something called "Maximum Mean Discrepancy" which known to have high performance when matching high dimensional standard normal distributions. The second regularizer also makes the problem fully adversary-free min-min optimization problem.<br />
<br />
* The final contribution is the mathematical analysis used to derive the WAE objective function. In particular, the mathematical analysis shows that in the case of generative models, the primal form of <math>W_c(P_X, P_G)</math> is equivalent to a problem which deals with the optimization of a probabilistic encoder <math>Q(Z|X)</math><br />
<br />
=Proposed Method=<br />
The method proposed by the authors uses a novel auto-encoder architecture to minimize the optimal transport cost <math>W_c(P_X, P_G)</math>. In the optimization problem that follows, the decoder tries to accurately reconstruct the data points as measured by the cost function <math>c</math>. The encoder tries to achieve the following two conflicting goals at the same time: (1) try to match the distribution of the encoded data points <math>Q_Z := \mathbb{E}_{P_X}[Q(Z|X)]</math> to the prior distribution <math>P_Z</math> as measured by the divergence <math>\mathcal{D}_Z(P_Z, Q_Z)</math> and, (2) make sure that the latent space vectors encoded contain enough information so that the reconstruction of the data points are of high quality. The figure below illustrates this:<br />
<br />
[[File:ka2khan_figure_1.png|800px|thumb|center|Figure 1]]<br />
<br />
Figure 1: Both VAE and WAE have objectives which are composed of two terms. The two terms are the reconstruction cost and the regularizer term which penalizes the divergence between <math>P_Z</math> and <math>Q_Z</math>. VAE forces <math>Q(Z|X = x)</math> to match <math>P_Z</math> for the the different training examples drawn from <math>P_X</math>. As shown in the figure above, every red ball representing <math>Q_z</math> is forced to match <math>P_Z</math> depicted as whitish triangles. This causes intersection among red balls and results in reconstruction problems. On the other hand, WAE coerces the mixture <math>Q_Z := \int{Q(Z|X)\ dP_X}</math> to match <math>P_Z</math> as shown in the figure above. This provides a better chance of the encoded latent codes to have more distance between them. As a consequence of this, higher reconstruction quality is achieved.<br />
<br />
==Preliminaries and Notations==<br />
Authors use calligraphic letters to denote sets (for example, <math>\mathcal{X}</math>), capital letters for random variables (for example, <math>X</math>), and lower case letters for the values (for example, <math>x</math>). Probability distributions are are also denoted with capital letters (for example, <math>P(X)</math>) and the corresponding densities are denoted with lowercase letter (for example, <math>p(x)</math>).<br />
<br />
Several measure of difference between probability distributions are also used by the authors. These include f-divergences given by <math>D_f(p_X||p_G) := \int{f(\frac{p_X(x)}{p_G(x)})p_G(x)}dx\ \text{where}\ f:(0, \infty) &rarr; \mathcal{R}</math> is any convex function satisfying <math>f(1) = 0</math>. Other divergences used include KL divergence (<math>D_{KL}</math>) and Jensen-Shannon (<math>D_{JS}</math>) divergences.<br />
<br />
==Optimal Transport and its Dual Formations==<br />
<br />
A rich class of measure of distances between probability distributions is motivated by the optimal transport problem. One such formulation of the optimal transport problem is the Kantovorich's formulation given by:<br />
<br />
<math><br />
W_c(P_X, P_G) := \underset{\Gamma \in \mathcal{P}(X \sim P_X ,Y \sim P_G)}{inf} \mathbb{E}_{(X,Y) \sim \Gamma}[c(X,Y)],<br />
\text{where} \ c(x, y): \mathcal{X} \times \mathcal{X} &rarr; \mathcal{R_{+}}<br />
</math><br />
<br />
is any measurable cost function and <math>\mathcal{P}(X \sim P_X, Y \sim P_G)</math> is a set of all joint distributions of (X, Y) with marginals <math>P_X\ \text{and}\ P_G</math> respectively.<br />
<br />
A particularly interesting case is when <math>(\mathcal{X}, d)</math> is metric space and <math>c(x, y) = d^p(x, y)\ \text{for}\ p &ge; 1</math>. In this case <math>W_p</math>, the <math>p-th</math> root of <math>W_c</math>, is called the p-Wasserstein distance.<br />
<br />
When <math>c(x, y) = d(x, y)</math> the following Kantorovich-Rubinstein duality holds:<br />
<br />
<math>W_1(P_X, P_G) = \underset{f \in \mathcal{F}_L}{sup} \mathbb{E}_{X \sim P_x}[f(X)] = \mathbb{E}_{Y \sim P_G}[f(Y)]</math><br />
where <math>\mathcal{F}_L</math> is the class of all bounded 1-Lipschitz functions on <math>(\mathcal{X}, d)</math>.<br />
<br />
==Application to Generative Models: Wasserstein auto-encoders==<br />
The intuition behind modern generative models like VAEs and GANs is that they try to minimize specific distance measures between the data distribution <math>P_X</math> and the model <math>P_G</math>. Unfortunately, with the current knowledge and tools, it is usually really hard or even impossible to calculate most of the standard discrepancy measures especially when <math>P_X</math> is not known and <math>P_G</math> is parametrized by deep neural networks. Having said that, there are certain tricks available which can be employed to get around that difficulty.<br />
<br />
For KL-divergence <math>D_{KL}(P_X, P_G)</math> minimization, or equivalently the marginal log-likelihood <math>E_{P_X}[log_{P_G}(X)]</math> maximization, one can use the famous variational lower bound which provides a theoretically grounded framework. This has been used quite successfully by the VAEs. In the general case of minimizing f-divergence <math>D_f(P_X, P_G)</math>, using its dual formulation along with f-GANs and adversarial training is viable. Finally, OT cost <math>W_c(P_X, P_G)</math> can be minimized by using the Kantorovich-Rubinstein duality expressed as an adversarial objective. The Wasserstein-GAN implement this idea.<br />
<br />
In this paper, the authors focus on the latent variable models <math>P_G</math> given by a two step procedure. First, a code <math>Z</math> is sampled from a fixed distribution <math>P_Z</math> on a latent space </math>\mathcal{Z}</math>. Second step is to map <math>Z</math> to the image <math>X \in \mathcal{X} = \mathcal{R}^d</math> with a (possibly random) transformation. This gives us a density of the form<br />
<br />
<math><br />
p_G(x) := \int\limits_{\mathcal{Z}}{p_G(x|z)p_z(z)}dz,\ \forall x \in \mathcal{X}, <br />
</math><br />
<br />
provided all the probablities involved are properly defined. In order to keep things simple, the authors focus on non-random decoders, i.e., the generative models <math>P_G(X|Z)</math> deterministically map <math>Z</math> to <math>X = G(Z)</math> using a fixed map <math>G: \mathcal{Z} &rarr; \mathcal{X}</math>. Similar results hold for the random decoders as shown by the authors in the appendix B.1.<br />
<br />
Working under the model defined in the preceding paragraph, the authors find that OT cost takes a much simpler form as the transportation plan factors through the map <math>G:</math> instead of finding a coupling <math>\Gamma</math> between two random variables in the <math>\mathcal{X}</math> space, one given by the distribution <math>P_X</math> and the other by the the distribution <math>P_G</math>, it is enough to find a conditional distribution <math>Q(Z|X)</math> such that its <math>Z</math> marginal, <math>Q_Z)Z) := \mathbb{E}_{X \sim P_X}[Q(Z|X)]</math> is the same as the prior distribution <math>P_Z</math>. This is formalized by the theorem given below. The theorem given below was proven in [4] by the authors.<br />
<br />
'''Theorem 1.''' For <math>P_G</math> defined as above with deterministic <math>P_G(X|Z)</math> and any function <math>G:\mathcal{Z} &rarr; \mathcal{X}</math><br />
<br />
<math><br />
\underset{\Gamma \in \mathcal{P}(X \sim P_X ,Y \sim P_G)}{inf} \mathbb{E}_{(X,Y) \sim \Gamma}[c(X,Y)] = \underset{Q: Q_Z = P_Z}{inf} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))]<br />
</math><br />
<br />
where <math>Q_Z</math> is the marginal distribution of <math>Z</math> when <math>X \sim P_X</math> and <math>Z \sim Q(Z|X)</math>.<br />
<br />
According to the authors, the result above allows optimization over random encoders <math>Q(Z|X)</math> instead of optimizing overall couplings of <math>X</math> and <math>Y</math>. Both problems are still constrained. To find a numerical solution, the authors relax the constraints on <math>Q_Z</math> by adding a regularizer term to the objective. This gives them the WAE objective:<br />
<br />
<math><br />
D_{WAE}(P_X, P_G) := \underset{Q(Z|X) \in \mathcal{Q}}{inf} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))] + \lambda \cdot \mathcal{D}_Z(Q_Z, P_Z)<br />
</math><br />
<br />
where <math>\mathcal{Q}</math> is any nonparametric set of probabilistic encoders, <math>\mathcal{D}_Z</math> is an arbitrary measure of distance between <math>Q_Z</math> and <math>P_Z</math>, and <math>\lambda &gt; 0</math> is a hyperparameter. As is the case with the VAEs, the<br />
authors propose using deep neural networks to parameterize both encoders <math>Q</math> and decoders <math>G</math>. Note that, unlike VAEs, WAE allows for non-random encoders deterministically mapping their inputs to their latent codes.<br />
<br />
The authors propose two different regularizers <math>\mathcal{D}_Z(Q_Z, P_Z)</math><br />
<br />
===GAN-based <math>\mathcal{D}_z</math>===<br />
One of the option is to use <math>\mathcal{D}_Z(Q_Z, P_Z) = \mathcal{D}_{JS}(Q_Z, P_Z)</math> along with adversarial training for estimation. In particular, the discriminator (adversary) is used in the latent space <math>\mathcal{Z}</math> to classify "true" points sampled for <math>P_X</math> and "fake" ones samples from <math>Q_Z</math>. This leads to the WAE-GAN as described in Algorithm 1 listed below. Even though WAE-GAN still uses max-min optimization, one positive feature is that it moves the adversary from the input (pixel) space <math>\mathcal{X}</math> to the latent space <math>\mathcal{Z}</math>. Additionally, the true latent space distribution <math>P_Z</math> might have a nice shape with a single mode (for a Gaussian prior), making the task of matching much easier as opposed to matching an unknown, complex, and possibly multi-modal distributions which is usually the case in GANs. This leads to the second penalty.<br />
<br />
===MMD-based <math>\mathcal{D}_z</math>===<br />
For a positive-definite reproducing kernel <math>k: \mathcal{Z} \times \mathcal{Z} &rarr; \mathcal{R}</math>, the maximum mean discrepancy (MMD) is defined as<br />
<br />
<math><br />
MMD_k(P_Z, Q_Z) = \left \Vert \int \limits_{\mathcal{Z}} {k(z, \cdot)dP_Z(z)} - \int \limits_{\mathcal{Z}} {k(z, \cdot)dQ_Z(z)} \right \|_{\mathcal{H}_k}<br />
</math>,<br />
<br />
where <math>\mathcal{H}_k</math> is the RKHS (reproducing kernel Hilbert space) of real-valued functions mappings <math>\mathcal{Z}</math> to <math>\mathcal{R}</math>. If <math>k</math> is characteristi then <math>MMD_k</math> defines a metric and can be used as a distance measure. The authors propose to use <math>\mathcal{D}_Z(P_Z, Q_Z) = MMD_k(P_Z, Q_Z)</math>. MMD also have an unbiased U-statistic estimator which can be used alongwith stochastic gradient descent (SGD) methods. This gives us WAE-MMD as described in the Algorithm 2 listed below. Note that MMD is known to perform well when matching high dimensional standard normal distributions, so it is expected that this penalty will work well when the prior <math>P_Z</math> is Gaussian.<br />
<br />
[[File:ka2khan_figure_2.png|800px|thumb|center|Algorithms]]<br />
<br />
=Related Work=<br />
==Literature on auto-encoders==<br />
Classical unregularized auto-encoders have an objective function which only tries to minimize the reconstruction cost. This results in distinct data points being encoded into distinct zones distributed chaotically across the latent space <math>\mathcal{Z}</math>. The latent space <math>\mathcal{Z}</math> in this scenario contains huge "holes" for which the decoder <math>P_G(X|Z)</math> has never been trained. In general, the encoder trained this way do not provide terribly useful representations and sampling from the latent space <math>\mathcal{Z}</math> becomes a difficult task [12].<br />
<br />
VAEs [1] minimize the KL-divergence <math>D_{KL}(P_X, P_G)</math> which consists of the reconstruction cost and the regularizer <math>\mathbb{E}_{P_X}[D_{KL}(Q(|X), P_Z)]</math>. The regularizer penalizes the difference in the encoded training images and the prior <math>P_Z</math>. But this penalty still does not guarantee that the overall encoded distribution matches the prior distribution as WAE does. In addition, VAEs require a non-degenerate (i.e. non-deterministic) Gaussian encoders along with random decoders. Another paper [11] later, proposed a method which allows the use of non-Gaussian encoders with VAEs. In the meanwhile, WAE minimizes <math>W_{c}(P_X, P_G)</math> and allows probabilistic and deterministic encoder and decoder pairs.<br />
<br />
When parameters are appropriately defined, WAE is able to generalize AAE in two ways: it can use any cost function in the input space and use any discrepancy measure <math>D_Z</math> in latent space <math>Z</math> other than the adversarial one.<br />
<br />
There has been work done on regularized auto-encoders called InfoVAE [14], which has objective similar to [4] but using different motivations and arguments.<br />
<br />
WAEs explicitly define the cost function <math>c(x,y)</math>, whereas VAEs rely on an implicitly through a negative log likelihood term. It theoretically can induce any arbitrary cost function, but in practice can require an estimation of the normalizing constant that can be different for values of <math>z</math>.<br />
<br />
==Literature on OT==<br />
[15] provides methods for computing OT cost for large-scale data using SGD and sampling. The WGAN [5] proposes a generative model which minimizes 1-Wasserstein distance <math>W_1(P_X, P_G)</math>. The WGAN algorithm does not provide an encoder and cannot be easily applied to any arbitrary cost <math>W_C</math>. The model proposed in [5] uses the dual form, in contrast, the model proposed in this paper uses the primal form. The primal form allows the use of any arbitrary cost function <math>c</math> and naturally, comes with an encoder. <br />
<br />
In order to compute <math>W_c(P_X, P_G)</math> or <math>W_1(P_X, P_G)</math>, the model needs to handle various non-trivial constraints, various methods has be proposed in the literature ([5], [2], 8[], [16], [15], [17], [18]) to avoid this difficulty .<br />
<br />
==Literature on GANs==<br />
A lot of the GAN variations which have been proposed in the literature come without an encoder. Examples include WGAN and f-GAN. These models are deficient in cases where a reconstruction of latent space is needed to use the learned manifold.<br />
<br />
There have been numerous models proposed in the literature which try to combine the adversarial training of GANs with auto-encoder architectures. Some examples are [19], [20], [21], and [22]. There has also been work done in which reproducing kernels have been used in the context of GANS ([23], [24]).<br />
<br />
=Experiments=<br />
Experiments were used to empirically evaluate the proposed WAE model. The authors conducted experiments using the following two real-world datasets: (1) MNIST [27] made up of 70k images, and (2) CelebA [28] consisting of approximately 203k images. <br />
<br />
The main evaluation criteria were to see if the WAE model can simultaneously achieve: <br />
<br />
<ol><br />
<li>accurate reconstruction of the data points</li><br />
<li>resonable geometry of the latent manifold</li><br />
<li>generation of high quality random samples</li><br />
</ol><br />
<br />
For the model to generalize well (1) and (2) should be met on both the training and test data set.<br />
<br />
The proposed model achieve reasonably good results as highlighted in the figures given below:<br />
<br />
[[File:ka2khan_figure_3.png|800px|thumb|center|Using CelebA dataset]]<br />
<br />
[[File:ka2khan_figure_4.png|800px|thumb|center|Using CelebA dataset, FID (Fréchet Inception Distance<br />
[32]): smaller is better, sharpness: larger is better]]<br />
<br />
=Conclusion=<br />
The authors proposed a new class of algorithms for building a generative model called Wasserstein Autoencoders based optimal transport cost. They related the newly proposed model to the existing probabilistic modeling techniques. They empirically evaluated the proposed models using two real-world datasets. They compared the results obtained using their proposed model with the results obtained using VAEs on the same dataset to show that the proposed models generate sample images of higher quality in addition to being easier to train and having good reconstruction quality of the data points.<br />
<br />
The authors claim that in future work, they will further explore the criteria for matching the encoding distribution <math>Q_Z</math> to the prior distribution <math>P_Z</math>, evaluate whether it is feasible to adversarially train the cost function <math>c</math>in the input space <math>\mathcal{X}</math>, and a theoretical analysis of the dual-formations for WAE-GAN and WAE-MMD.<br />
<br />
=Future Work=<br />
Following the work of this paper, another generative model was introduced by [34] that is based on the concept of optimal transport. Optimal transport is basically the distances between probability distributions by transporting one of the distributions to the other (and hence the name of optimal transport). Then, a new simple model called "Sliced-Wasserstein Autoencoders" (SWAE) is presented, which is easily implemented, and provides the capabilities of Wasserstein Autoencoders.<br />
<br />
([https://openreview.net/forum?id=HkL7n1-0b]) The results from MNIST and CelebA datasets look convincing, though could include additional evaluation to compare the adversarial loss with the straightforward MMD metric and potentially discuss their pros and cons. In some sense, given the challenges in evaluating and comparing closely related auto-encoder solutions, the authors could design demonstrative experiments for cases where Wassersterin distance helps and may be its potential limitations.<br />
<br />
=References=<br />
[1] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.<br />
<br />
[2] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In ICLR, 2016.<br />
<br />
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.<br />
<br />
[4] O. Bousquet, S. Gelly, I. Tolstikhin, C. J. Simon-Gabriel, and B. Schölkopf. From optimal transport to generative modeling: the VEGAN cookbook, 2017.<br />
<br />
[5] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017.<br />
<br />
[6] C. Villani. Topics in Optimal Transportation. AMS Graduate Studies in Mathematics, 2003.<br />
<br />
[7] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.<br />
<br />
[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Domoulin, and A. Courville. Improved training of wasserstein GANs, 2017.<br />
<br />
[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.<br />
<br />
[10] F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.<br />
<br />
[11] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks, 2017.<br />
<br />
[12] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, 35, 2013.<br />
<br />
[13] M. D. Hoffman and M. Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop on Advances in Approximate Bayesian Inference, 2016.<br />
<br />
[14] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders, 2017.<br />
<br />
[15] A. Genevay, M. Cuturi, G. Peyré, and F. R. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3432–3440, 2016. <br />
<br />
[16] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.<br />
<br />
[17] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: geometry and kantorovich formulation. arXiv preprint arXiv:1508.05216, 2015.<br />
<br />
[18] Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-transport problems and a new hellinger-kantorovich distance between positive measures. arXiv preprint arXiv:1508.07941, 2015.<br />
<br />
[19] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.<br />
<br />
[20] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.<br />
<br />
[21] D. Ulyanov, A. Vedaldi, and V. Lempitsky. It takes (only) two: Adversarial generator-encoder networks, 2017.<br />
<br />
[22] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks, 2017.<br />
<br />
[23] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015. <br />
<br />
[24] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.<br />
<br />
[25] R. Reddi, A. Ramdas, A. Singh, B. Poczos, and L. Wasserman. On the high-dimensional power of a linear-time two sample test under mean-shift alternatives. In AISTATS, 2015.<br />
<br />
[26] C. L. Li, W. C. Chang, Y. Cheng, Y. Yang, and B. Poczos. Mmd gan: Towards deeper understanding of moment matching network, 2017.<br />
<br />
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86(11), pages 2278–2324, 1998.<br />
<br />
[28] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.<br />
<br />
[29] D. P. Kingma and J. Lei. Adam: A method for stochastic optimization, 2014.<br />
<br />
[30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.<br />
<br />
[31] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.<br />
<br />
[32] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.<br />
<br />
[33] B. Poole, A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for GANs, 2016.<br />
<br />
[34] S. Kolouri, C. E. Martin, and G. K. Rohde. Sliced-wasserstein autoencoder: An embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=40333Countering Adversarial Images Using Input Transformations2018-11-20T09:08:59Z<p>D35kumar: /* Introduction */ Adding more information about the model [Technical]</p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to Machine Learning models so that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et. al) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate the Kerckhoffs principle, which states that adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, they investigated the following image transformations as a means for protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et. al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
<br />
These image transformations have been studied against Adversarial attacks such as the fast gradient sign method (Goodfelow et. al., 2015), its iterative extension (Kurakin et al., 2016a), Deepfool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math>attack. <br />
<br />
The authors in this paper try to focus on increasing the effectiveness of model-agnostic defense strategies through approaches that:<br />
# remove the adversarial perturbations from input images,<br />
# maintain sufficient information in input images to correctly classify them,<br />
# and are still effective in situations where the adversary has information about the defense strategy being used.<br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al [4], proposed a new adversary resistant technique that obstructs attackers from constructing impactful adversarial images. This is done by randomly nullifying features within images. Tramer et al [2], showed the state-of-the-art Ensemble Adversarial Training Method, which augments the training process but not only included adversarial images constructed from their model but also including adversarial images generated from an ensemble of other models. Their method implemented on an Inception V2 classifier finished 1st among 70 submissions of NIPS 2017 competition on Defenses against Adversarial Attacks. Graese, et al. [3], showed how input transformation such as shifting, blurring and noise can render the majority of the adversarial examples as non-adversarial. Xu et al.[5] demonstrated, how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defends against attacks. Dziugaite et al [6], studied the effect of JPG compression on adversarial images.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''' : Model Architecture and parameters are Public<br />
<br />
'''Black Box Attack''': Adversary does not have access to the model.<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in way such that image will be classified as a ''target'' class by the network.<br />
<br />
'''Defense''': A defense is a strategy that aims make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(.)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math>, such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(·, ·)</math> and <math>\rho \geq 0</math>. In the best case scenario, <math>d(·, ·)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but usually, Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshov distance (<math>||x - x'||_{\infty}</math>) are used.<br />
<br />
From a set of N clean images <math>[{x_{1}, …, x_{N}}]</math>, an adversarial attack aims to generate <math>[{x'_{1}, …, x'_{N}}]</math> images, such that (<math>x'_{n}</math>) is an adversary of (<math>x_{n}</math>).<br />
<br />
The success rate of an attack is given as: <br />
<br />
[[File:Attack.PNG|200px |]],<br />
<br />
which is the proportions of predictions that were altered by an attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack has a high rate, while its normalized L2-dissimilarity given by the above equation is less.<br />
<br />
==Adversarial Attacks==<br />
<br />
For the experimental purposes, below 4 attacks have been studied in the paper:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Given a source input <math>x</math>, and true label <math>y</math>, and let <math>l(.,.)</math> be the differentiable loss function used to train the classifier <math>h(.)</math>. Then the corresponding adversarial example is given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
<br />
2. '''Iterative FGSM ((I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{m-1}} l(x^{m-1}, y))</math><br />
<br />
where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshov distance between the inputs and the generated adversarial examples.<br />
<br />
3. '''DeepFool ((Moosavi-Dezfooliet al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by binary classifier h(.) for M iterations. This can be particularly effictive when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': propose an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversary image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant<br />
of CW-L2 finds a solution to the unconstrained optimization problem. It is given as:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshov distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
<br />
Below figure shows adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks, mentioned above.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
Defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example, and the particular structure of adversarial perturbations <math> x-x' </math> have been shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled during training time as part of data-augmentation. At test time, the predictions of randomly cropped are averaged.<br />
<br />
'''Bit Depth Reduction (Xu et. al) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
<br />
'''JPEG Compression and Decompression (Dziugaite etal., 2016)''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments<br />
<br />
'''Total Variance Minimization (Rudin et. al) [9]''' :<br />
This combines pixel dropout with total variance minimization. This approach randomly selects a small set of pixels, and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized.Specifically, we first select a random set of pixels by sampling a Bernoulli random variable <math>X(i; j; k)</math> for each pixel location <math>(i; j; k)</math>;we maintain a pixel when <math>(i; j; k)</math>= 1. Next, we use total variation, minimization to constructs an image z that is similar to the (perturbed) input image x for the selected<br />
set of pixels, whilst also being “simple” in terms of total variation by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image.<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches in the database for a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images ( without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors ( in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that resulting image only contains pixels that were not modified by the adversary - the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of defences. The first four experiments consider gray and black box attacks. The gray-box attack applies defenses on input adversarial images for the convolutional networks. The adversary is able to read model architecture and parameters but not the defence strategy. The black-box attack replaces convolutional network by a trained network with image-transformations. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model, with different kinds of attacks mentioned in Section5. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, L2 dissimilarity for each of the attack was set as below:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses have been fixed in all the experiments. Specifically the pixel dropout probability was set to <math>p</math>=0.5 and regularization parameter of total variation minimizer <math>\lambda_{TV}</math>=0.03.<br />
<br />
Below figure shows the difference between the set up in different experiments below. The network is either trained on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3 <br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation on adversarial images at test time before feeding them to a ResNet -50 which was trained to classify clean images. Below figure shows the results for five different transformations applied and their corresponding Top-1 accuracy. Few of the interesting observations from the plot are: All of the image transformations partly eliminate the effects of the attack, Crop ensemble gives the best accuracy around 40-60 percent, with an ensemble size of 30. The accuracy of Image Quilting Defense hardly deteriorates as the strength of the adversary increases. However, it does impact accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
ResNet-50 model was trained on transformed ImageNet Training images. Before feeding the images to the network for training, standard data augmentation (from He et al) along with bit depth reduction, JPEG Compression, TV Minimization, or Image Quilting were applied on the images. The classification accuracy on the same adversarial images as in the previous case is shown Figure below. (Adversary cannot get this trained model to generate new images - Hence this is assumed as a Black Box setting!). Below figure concludes that training Convolutional Neural Networks on images that are transformed in the same way at test time, dramatically improves the effectiveness of all transformation defenses. Nearly 80 -90 % of the attacks are defended successfully, even when the L2- dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks ResNet-50, ResNet-10, DenseNet-169, and Inception-v4 along with an ensemble of defenses were studied, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table conclude that Inception-v4 performs best. This could be due to that network having a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy could be found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and the related parameters (but does not have access to the input transformations applied at test time). From the network trained in-(BlackBox: Image Transformation at Training and Test Time), novel adversarial images were generated by the four attack methods. The results show that Bit-Depth Reduction and JPEG Compression are weak defenses in such a gray box setting. In contrast, image cropping, rescaling, variation minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in below figure. Networks using these defenses classify up to 50 % of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiment are compared with the state of the art ensemble adversarial training approach proposed by Tramer et al. [2]. Ensemble Training fits the parameters of a Convolutional Neural Network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model release by Tramer et al [2]: an Inception-Resnet-v2, trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. THe authors compared their ResNet-50 models with image cropping, total variance minimization and image quilting defenses. Two assumption differences need to be noticed. Their defenses assume the input transformation is unknown to the adversary and no prior knowledge of the attacks is being used. The results of ensemble training and the preprocessing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared it with already proposed ideas like Image Cropping- Rescaling, Bit Depth Reduction, JPEG Compression, and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be nondifferentiable and randomized. Two of the defenses - namely Total Variation Minimization and Image Quilting, both possess this property. It is also concluded that randomness is particularly important in developing strong defenses. Future work suggests applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al.[2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terminology of Black Box, White Box, and Grey Box attack is not exactly given and clear.<br />
<br />
2. White Box attacks could have been considered where the adversary has a full access to the model as well as the pre-processing techniques.<br />
<br />
3. Though the authors did a considerable work in showing the effect of four attacks on ImageNet database, much stronger attacks (Madry et al) [7], could have been evaluated.<br />
<br />
4. Authors claim that the success rate is generally measured as a function of the magnitude of perturbations, performed by the attack using the L2- dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
=References=<br />
<br />
1. Chuan Guo , Mayank Rana & Moustapha Ciss´e & Laurens van der Maaten , Countering Adversarial Images Using Input Transformations<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu .Towards Deep Learning Models Resistant to Adversarial Attacks, arXiv:1706.06083v3<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=40332Countering Adversarial Images Using Input Transformations2018-11-20T08:59:30Z<p>D35kumar: /* Motivation */</p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to Machine Learning models so that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et. al) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate the Kerckhoffs principle, which states that adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, they investigated the following image transformations as a means for protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et. al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
<br />
These image transformations have been studied against Adversarial attacks such as the fast gradient sign method (Goodfelow et. al., 2015), its iterative extension (Kurakin et al., 2016a), Deepfool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math>attack. <br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random which makes it difficult for an adversary to get around them.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al [4], proposed a new adversary resistant technique that obstructs attackers from constructing impactful adversarial images. This is done by randomly nullifying features within images. Tramer et al [2], showed the state-of-the-art Ensemble Adversarial Training Method, which augments the training process but not only included adversarial images constructed from their model but also including adversarial images generated from an ensemble of other models. Their method implemented on an Inception V2 classifier finished 1st among 70 submissions of NIPS 2017 competition on Defenses against Adversarial Attacks. Graese, et al. [3], showed how input transformation such as shifting, blurring and noise can render the majority of the adversarial examples as non-adversarial. Xu et al.[5] demonstrated, how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defends against attacks. Dziugaite et al [6], studied the effect of JPG compression on adversarial images.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''' : Model Architecture and parameters are Public<br />
<br />
'''Black Box Attack''': Adversary does not have access to the model.<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in way such that image will be classified as a ''target'' class by the network.<br />
<br />
'''Defense''': A defense is a strategy that aims make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(.)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math>, such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(·, ·)</math> and <math>\rho \geq 0</math>. In the best case scenario, <math>d(·, ·)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but usually, Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshov distance (<math>||x - x'||_{\infty}</math>) are used.<br />
<br />
From a set of N clean images <math>[{x_{1}, …, x_{N}}]</math>, an adversarial attack aims to generate <math>[{x'_{1}, …, x'_{N}}]</math> images, such that (<math>x'_{n}</math>) is an adversary of (<math>x_{n}</math>).<br />
<br />
The success rate of an attack is given as: <br />
<br />
[[File:Attack.PNG|200px |]],<br />
<br />
which is the proportions of predictions that were altered by an attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack has a high rate, while its normalized L2-dissimilarity given by the above equation is less.<br />
<br />
==Adversarial Attacks==<br />
<br />
For the experimental purposes, below 4 attacks have been studied in the paper:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Given a source input <math>x</math>, and true label <math>y</math>, and let <math>l(.,.)</math> be the differentiable loss function used to train the classifier <math>h(.)</math>. Then the corresponding adversarial example is given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
<br />
2. '''Iterative FGSM ((I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{m-1}} l(x^{m-1}, y))</math><br />
<br />
where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshov distance between the inputs and the generated adversarial examples.<br />
<br />
3. '''DeepFool ((Moosavi-Dezfooliet al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by binary classifier h(.) for M iterations. This can be particularly effictive when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': propose an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversary image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant<br />
of CW-L2 finds a solution to the unconstrained optimization problem. It is given as:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshov distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
<br />
Below figure shows adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks, mentioned above.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
Defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example, and the particular structure of adversarial perturbations <math> x-x' </math> have been shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled during training time as part of data-augmentation. At test time, the predictions of randomly cropped are averaged.<br />
<br />
'''Bit Depth Reduction (Xu et. al) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
<br />
'''JPEG Compression and Decompression (Dziugaite etal., 2016)''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments<br />
<br />
'''Total Variance Minimization (Rudin et. al) [9]''' :<br />
This combines pixel dropout with total variance minimization. This approach randomly selects a small set of pixels, and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized.Specifically, we first select a random set of pixels by sampling a Bernoulli random variable <math>X(i; j; k)</math> for each pixel location <math>(i; j; k)</math>;we maintain a pixel when <math>(i; j; k)</math>= 1. Next, we use total variation, minimization to constructs an image z that is similar to the (perturbed) input image x for the selected<br />
set of pixels, whilst also being “simple” in terms of total variation by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image.<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches in the database for a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images ( without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors ( in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that resulting image only contains pixels that were not modified by the adversary - the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of defences. The first four experiments consider gray and black box attacks. The gray-box attack applies defenses on input adversarial images for the convolutional networks. The adversary is able to read model architecture and parameters but not the defence strategy. The black-box attack replaces convolutional network by a trained network with image-transformations. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model, with different kinds of attacks mentioned in Section5. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, L2 dissimilarity for each of the attack was set as below:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses have been fixed in all the experiments. Specifically the pixel dropout probability was set to <math>p</math>=0.5 and regularization parameter of total variation minimizer <math>\lambda_{TV}</math>=0.03.<br />
<br />
Below figure shows the difference between the set up in different experiments below. The network is either trained on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3 <br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation on adversarial images at test time before feeding them to a ResNet -50 which was trained to classify clean images. Below figure shows the results for five different transformations applied and their corresponding Top-1 accuracy. Few of the interesting observations from the plot are: All of the image transformations partly eliminate the effects of the attack, Crop ensemble gives the best accuracy around 40-60 percent, with an ensemble size of 30. The accuracy of Image Quilting Defense hardly deteriorates as the strength of the adversary increases. However, it does impact accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
ResNet-50 model was trained on transformed ImageNet Training images. Before feeding the images to the network for training, standard data augmentation (from He et al) along with bit depth reduction, JPEG Compression, TV Minimization, or Image Quilting were applied on the images. The classification accuracy on the same adversarial images as in the previous case is shown Figure below. (Adversary cannot get this trained model to generate new images - Hence this is assumed as a Black Box setting!). Below figure concludes that training Convolutional Neural Networks on images that are transformed in the same way at test time, dramatically improves the effectiveness of all transformation defenses. Nearly 80 -90 % of the attacks are defended successfully, even when the L2- dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks ResNet-50, ResNet-10, DenseNet-169, and Inception-v4 along with an ensemble of defenses were studied, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table conclude that Inception-v4 performs best. This could be due to that network having a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy could be found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and the related parameters (but does not have access to the input transformations applied at test time). From the network trained in-(BlackBox: Image Transformation at Training and Test Time), novel adversarial images were generated by the four attack methods. The results show that Bit-Depth Reduction and JPEG Compression are weak defenses in such a gray box setting. In contrast, image cropping, rescaling, variation minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in below figure. Networks using these defenses classify up to 50 % of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiment are compared with the state of the art ensemble adversarial training approach proposed by Tramer et al. [2]. Ensemble Training fits the parameters of a Convolutional Neural Network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model release by Tramer et al [2]: an Inception-Resnet-v2, trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. THe authors compared their ResNet-50 models with image cropping, total variance minimization and image quilting defenses. Two assumption differences need to be noticed. Their defenses assume the input transformation is unknown to the adversary and no prior knowledge of the attacks is being used. The results of ensemble training and the preprocessing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared it with already proposed ideas like Image Cropping- Rescaling, Bit Depth Reduction, JPEG Compression, and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be nondifferentiable and randomized. Two of the defenses - namely Total Variation Minimization and Image Quilting, both possess this property. It is also concluded that randomness is particularly important in developing strong defenses. Future work suggests applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al.[2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terminology of Black Box, White Box, and Grey Box attack is not exactly given and clear.<br />
<br />
2. White Box attacks could have been considered where the adversary has a full access to the model as well as the pre-processing techniques.<br />
<br />
3. Though the authors did a considerable work in showing the effect of four attacks on ImageNet database, much stronger attacks (Madry et al) [7], could have been evaluated.<br />
<br />
4. Authors claim that the success rate is generally measured as a function of the magnitude of perturbations, performed by the attack using the L2- dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
=References=<br />
<br />
1. Chuan Guo , Mayank Rana & Moustapha Ciss´e & Laurens van der Maaten , Countering Adversarial Images Using Input Transformations<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu .Towards Deep Learning Models Resistant to Adversarial Attacks, arXiv:1706.06083v3<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias&diff=40331Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias2018-11-20T08:53:26Z<p>D35kumar: /* Learning the latent noise model */</p>
<hr />
<div>==Introduction==<br />
<br />
<br />
The use of data-driven approaches in robotics has increased in the last decade. Instead of using hand-designed models, these data-driven approaches work on large-scale datasets and learn appropriate policies that map from high-dimensional observations to actions. Since collecting data using an actual robot in real-time is very expensive, most of the data-driven approaches in robotics use simulators in order to collect simulated data. The concern here is whether these approaches have the capability to be robust enough to domain shift and to be used for real-world data. It is an undeniable fact that there is a wide reality gap between simulators and the real world.<br />
<br />
This has motivated the robotics community to increase their efforts in collecting real-world physical interaction data for a variety of tasks. This effort has been accelerated by the declining costs of hardware. This approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models are not good enough and tend to plateau fast. Furthermore, robotic action data did not lead to similar gains in other areas such as computer vision and natural language processing. As the paper claimed, the solution for all of these obstacles is using “real data”. Current robotic datasets lack diversity of environment. Learning-based approaches need to move out of simulators in the labs and go to real environments such as real homes so that they can learn from real datasets. <br />
<br />
Like every other process, the process of collecting real-world data is made difficult by a number of problems. First, there is a need for cheap and compact robots to collect data in homes but current industrial robots (i.e. Sawyer and Baxter) are too expensive. Secondly, cheap robots are not accurate enough to collect reliable data. Also, there is a lack of constant supervision for data collection in homes. Finally, there is also a circular dependency problem in home-robotics: there is a lack of real-world data which are needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges in addition to some other external factors will likely result in noisy data collection. In this paper, a first systematic effort has been presented for collecting a dataset inside homes. In accomplishing this goal, the authors: <br />
<br />
1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes<br />
<br />
2. Collect training data in 6 different homes and testing data in 3 homes<br />
<br />
3. Propose a method for modelling the noise in the labelled data<br />
<br />
4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation<br />
<br />
[[File:aa1.PNG|600px|thumb|center|]]<br />
<br />
==Overview==<br />
<br />
This paper emphasizes the importance of diversifying the data for robotic learning in order to have a greater generalization, by focusing on the task of grasping. A diverse dataset also allows for removing biases in the data. By considering these facts, the paper argues that even for simple tasks like grasping, datasets which are collected in labs suffer from strong biases such as simple backgrounds and same environment dynamics. Hence, the learning approaches cannot generalize the models and work well on real datasets.<br />
<br />
As a future possibility, there would be a need for having a low-cost robot to collect large-scale data inside a huge number of homes. For this reason, they introduced a customized mobile manipulator. They used a Dobot Magician which is a robotic arm mounted on a Kobuki which is a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. They also add an Intel R200 RGBD camera to their robot which is at a height of 1m above the ground. An Intel Core i5 processor is also used as an onboard laptop to perform all the processing. The whole system can run for 1.5 hours with a single charge.<br />
<br />
As there is always a trade-off, when we gain a low-cost robot, we are actually losing accuracy for controlling it. So, the low-cost robot which is built from cheaper components than the expensive setups such as Baxter and Sawyer suffers from higher calibration errors and execution errors. This means that the dataset collected with this approach is diverse and huge but it has noisy labels. To illustrate, consider when the robot wants to grasp at location <math> {(x, y)}</math>. Since there is a noise in the execution, the robot may perform this action in the location <math> {(x + \delta_{x}, y+ \delta_{y})}</math> which would assign the success or failure label of this action to a wrong place. Therefore, to solve the problem, they used an approach to learn from noisy data. They modeled noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.<br />
<br />
==Learning on low-cost robot data==<br />
<br />
This paper uses a patch grasping framework in its proposed architecture. Also, as mentioned before, there is a high tendency for noisy labels in the datasets which are collected by inaccurate and cheap robots. The cause of the noise in the labels could be due to the hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear, and tear, etc. Here are more explanations about different parts of the architecture in order to disentangle the noise of the low-cost robot’s actual and commanded executions.<br />
<br />
===Grasping Formulation===<br />
<br />
Planar grasping is the object of interest in this architecture. It means that all the objects are grasped at the same height and vertical to the ground (ie: a fixed end-effector pitch). The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degrees of freedom (roll of the end-effector). For the purpose of comparison, they used a model which does not predict the <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math>. Thus, the angle of grasp <math> {\theta}</math> is predicted from these patches. Also, in order to have multi-modal predictions, discrete steps of the angle <math> {\theta}</math>, <math> {\theta_{D}}</math> is used. <br />
<br />
Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label g. Then, the image <math> {I}</math> and the angle <math> {\theta}</math> are converted to image patch <math> {I_{P}}</math> and angle <math> {\theta_{D}}</math>. Then, to minimize the classification error, a binary cross entropy loss is used which minimizes the error between the predicted and ground truth label <math> g </math>. A convolutional neural network with weight initialization from pre-training on Imagenet is used for this formulation.<br />
<br />
(Note: On Cross Entropy:<br />
<br />
If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool. This is optimal, in that we can't encode the symbols using fewer bits on average.<br />
In contrast, cross entropy is the number of bits we'll need if we encode symbols from y using the wrong tool <math> {\hat h}</math> . This consists of encoding the <math> {i_{th}}</math> symbol using <math> {\log(\frac{1}{{\hat h_i}})}</math> bits instead of <math> {\log(\frac{1}{{ h_i}})}</math> bits. We of course still take the expected value to the true distribution y , since it's the distribution that truly generates the symbols:<br />
<br />
\begin{align}<br />
H(y,\hat y) = \sum_i{y_i\log{\frac{1}{\hat y_i}}}<br />
\end{align}<br />
<br />
Cross entropy is always larger than entropy; encoding symbols according to the wrong distribution <math> {\hat y}</math> will always make us use more bits. The only exception is the trivial case where y and <math> {\hat y}</math> are equal, and in this case entropy and cross entropy are equal.)<br />
<br />
===Modeling noise as latent variable===<br />
<br />
In order to tackle the problem of inaccurate position control and calibration due to cheap robot, they found a structure in the noise which is dependent on the robot and the design. They modeled this structure of noise as a latent variable and decoupled during training. The approach is shown in figure 2: <br />
<br />
<br />
[[File:aa2.PNG|600px|thumb|center|]]<br />
<br />
<br />
The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}; \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.<br />
<br />
The conditional probability of grasping at a noisy image patch <math>I_P</math> for this model is computed by:<br />
<br />
<br />
\[ { P(g|I_{P},\theta_{D}, \mathcal{R} ) = ∑_{( \widehat{I_P} \in \mathcal{P})} P(g│z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} | \theta_{D},I_P,\mathcal{R})} \]<br />
<br />
<br />
Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>.<math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R})</math> shows the noise which can be caused by <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g│z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> shows the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN and NMN.<br />
<br />
===Learning the latent noise model===<br />
<br />
They assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math>. To estimate the latent variable <math> {z}</math> given the global information <math>\mathcal{R}</math>, i.e <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>. Apart from the patch <math> I_{P} </math> and grasp information (x, y, θ), they use information like image of the entire scene, ID of the robot and the location of the raw pixel. They argue that the image of the full scene could contain some essential information about the system. They used direct optimization to learn both NMN and GPN with noisy labels. The entire image of the scene and the environment information are the inputs of the NMN, as well as robot ID and raw-pixel grasp location. The output of the NMN is the probability distribution of the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied to the marginalized output of these two networks and the true grasp label g.<br />
<br />
===Training details===<br />
<br />
They implemented their model in PyTorch using a pretrained ResNet-18 model. They concatenated 512 dimensional ResNet feature with a 1-hot vector of robot ID and the raw pixel location of the grasp for their NMN. Also, the inputs of the GPN are the original noisy patch plus 8 other equidistant patches from the original one.<br />
Their training process starts with training only GPN over 5 epochs of the data. Then, the NMN and the marginalization operator are added to the model. So, they train NMN and GPN simultaneously for the other 25 epochs.<br />
<br />
==Results==<br />
<br />
In the results part of the paper, they show that collecting dataset in homes is essential for generalizing learning from unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.<br />
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. They used an object detector (tiny-YOLO) as the input data were unstructured due to LCA limited memory and computational capabilities. With an object location detected, class information was discarded, and a grasp was attempted. The grasp location in 3D was computed using PointCloud data. They scattered different objects in homes within 2m area to prevent collision of the robot with obstacles and let the robot move randomly and grasp objects. Finally, they collected a dataset with 28K grasp results.<br />
<br />
[[File:aa3.PNG|600px|thumb|center|]]<br />
<br />
To evaluate their approach in a more quantitative way, they used three test settings:<br />
<br />
- The first one is a binary classification or held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of grasping, given a location and the angle. Using binary classification allows for testing a lot of models without running them on real robots. They collected two held-out datasets using LCA in lab and homes and the dataset for Baxter robot.<br />
<br />
- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this tests could measure the generalization of the model.<br />
<br />
- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running the model on the Sawyer robot which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.<br />
<br />
They used baselines for both their data which is collected in homes and their model which is Robust-Grasp. They used two datasets for the baseline. The dataset collected by (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).<br />
They compared their Robust-Grasp model with the noise independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) for a strong real-world grasping baseline.<br />
<br />
===Experiment 1: Performance on held-out data===<br />
<br />
Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA has a good performance on both lab data and home environment.<br />
<br />
[[File:aa4.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 2: Performance on Real LCA Robot===<br />
<br />
In table 2, the performance of the Home-LCA is compared against a pre-trained DexNet and the model trained on the Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be described by the possible noise in the depth images that are caused by the natural light. DexNet, which requires high-quality depth sensing, cannot perform well in these scenarios. By using cheap commodity RGBD cameras in LCA, the noise in the depth images is not a matter of concern, as the model has no expectation of high-quality sensing.<br />
<br />
[[File:aa5.PNG|600px|thumb|center|]]<br />
<br />
===Performance on Real Sawyer===<br />
<br />
To compare the performance of the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they used Lab-Baxter for benchmarking, which is an accurate and better calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to several recent papers, however, this model was trained and tested in a different environment. The Robust-Grasp model also outperforms the Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 shows that the corrections depend on both the pixel locations of the noisy grasp and the robot.<br />
<br />
[[File:aa6.PNG|600px|thumb|center|]]<br />
<br />
[[File:aa7.PNG|600px|thumb|center|]]<br />
<br />
==Related work==<br />
<br />
Over the last few years, the interest of scaling up robot learning with large-scale datasets has been increased. Hence, many papers were published in this area. A hand annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large-scale datasets for grasping. The work mentioned above used high-cost hardware and data labeling mechanisms. There were also many papers that worked on other robotic tasks like material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes, they all used lab data.<br />
<br />
Furthermore, since grasping is one of the basic problems in robotics, there were some efforts to improve grasping. Classical approaches focused on physics-based issues of grasping and required 3D models of the objects. However, recent works focused on data-driven approaches which learn from visual observations to grasp objects. Simulation and real-world robots are both required for large-scale data collection. A versatile grasping model was proposed to achieve a 90% performance for a bin-picking task. The point here is that they usually require high-quality depth as input which seems to be a barrier for practical use of robots in real environments. High-quality depth sensing means a high cost to implement in hardware and thus is a barrier for practical use.<br />
<br />
Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, there is few research that used low-cost robots. One of the examples is learning using a cheap inaccurate robot for stack multiple blocks. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether learning approaches are used in it alongside mapping and planning.<br />
<br />
Learning from noisy inputs is another challenge specifically in computer vision. A controversial question which is often raised in this area is whether learning from noise can improve the performance. Some works show it could have bad effects on the performance; however, some other works find it valuable when the noise is independent or statistically dependent on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.<br />
<br />
==Conclusion==<br />
<br />
All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach by using a mobile manipulator which is a lot cheaper than the existing industrial robots. They collected a dataset of 28K grasps in six different homes. In order to solve the problem of noisy labels which were caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with home dataset showed 43.7% improvement over the models trained with lab data. Their results also showed that their model can improve the grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.<br />
<br />
==Critiques==<br />
<br />
This paper does not contain a significant algorithmic contribution. They are just combining a large number of data engineering techniques for the robot learning problem. The authors claim that they have obtained 43.7% more accuracy than baseline models, but it does not seem to be a fair comparison as the data collection happened in simulated settings in the lab for other methods, whereas the authors use the home dataset. The authors must have also discussed safety issues when training robots in real environments as against simulated environments like labs. The authors are encouraging other researchers to look outside the labs, but are not discussing the critical safety issues in this approach.<br />
<br />
Another strange finding is that the paper mentions that they "follow a model architecture similar to [Pinto and Gupta [4]]," however, the proposed model is, in fact, a fine-tuned resnet-18 architecture. Pinto and Gupta, implement a version similar to AlexNet as shown below in Figure 5.<br />
<br />
[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]<br />
<br />
<br />
The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that in order to handle the noise in the dataset, they can model the noise as a latent variable and their model can improve the performance of grasping. Although learning from noisy data and achieving a good performance is valuable, it is better that they test their noise modeling network for other robots as well. Since their noise modelling network takes robot information as an input, it would be a good idea to generalize it by testing it using different inaccurate robots to ensure that it would perform well.<br />
<br />
They did not mention other aspects of their comparison, for example they could mention their training time compared to other models or the size of other datasets.<br />
<br />
==References==<br />
<br />
#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.<br />
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537,2017.<br />
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor-critic for image-based robot learning." Robotics Science and Systems, 2018.<br />
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509. 06825.<br />
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.<br />
# Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.<br />
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.<br />
#Pulkit Agarwal, Ashwin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/ abs/1606.07419<br />
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.<br />
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.<br />
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Searching_For_Efficient_Multi_Scale_Architectures_For_Dense_Image_Prediction&diff=38804Searching For Efficient Multi Scale Architectures For Dense Image Prediction2018-11-13T07:03:55Z<p>D35kumar: /* Motivation */ Editorial</p>
<hr />
<div><br />
[Need add more pics and references]<br />
=Introduction=<br />
<br />
The design of neural network architectures is an important component for the success of machine learning and data science projects. In recent years, the field of Neural Architecture Search (NAS) has emerged, which is to automatically find an optimal neural architecture for a given task in a well-defined architecture space. The resulting architectures have often outperform networks designed by human experts on tasks such as image classification and natural language processing. [2,3,4] <br />
<br />
This paper presents a meta-learning technique to have computers search for a neural architecture that performs well on the task of dense image segmentation.<br />
<br />
=Motivation=<br />
<br />
The part of deep neural networks(DNN) success is largely due to the fact that it greatly reduces the work in feature engineering. This is because DNNs have the ability to extract useful features given the raw input. However, this creates a new paradigm to look at - network engineering. In order to extract significant features, an appropriate network architecture must be used. Hence, the engineering work is shifted from feature engineering to network architecture design for better abstraction of features.<br />
<br />
The motivation for NAS is to establish a guiding theory behind how to design the optimal network architecture. Given that there is an <br />
abundant amount of computational resources available, an intuitive solution is to define a finite search space for a computer to search for optimal network structures and hyperparameters.<br />
<br />
=Related Work =<br />
<br />
This paper focuses on two main literature research topics. One is the neural architecture search (NAS) and the other is the Multi-Scale representation for sense image prediction. Neural architecture search trains a controller network to generate neural architectures. The following are the important research directions in this area: <br />
<br />
1) One kind of research transfers architectures learned on a proxy dataset to more challenging datasets and demonstrates superior performance over many human-invented architectures.<br />
<br />
2) Reinforcement learning, evolutionary algorithms and sequential model-based optimization have been used to learn network structures. <br />
<br />
3) Some other works focus on increasing model size, sharing model weights to accelerate model search or a continuous relaxation of the architecture representation. <br />
<br />
4) Some recent methods focus on proposing methods for embedding an exponentially large number of architectures in a grid arrangement for semantic segmentation tasks. <br />
<br />
In the area of multi-scale representation for dense image prediction the following are useful prior work: <br />
<br />
1) State of the art methods use Convolutional Neural Nets. There are different methods proposed for supplying global features and context information to perform pixel level classification. <br />
<br />
2) Some approaches focus on how to efficiently encode multi-scale context information in a network architecture like designing models that take an input an image pyramid so that large-scale objects are captured by the downsampled image. <br />
<br />
3) Research also tried to come up with a theme on how best to tune the architecture to extract context information. Some works focus on sampling rates in atrous convolution to encode multi-scale context. Some others build context module by gradually increasing the rate on top of belief maps.<br />
<br />
=NAS Overview=<br />
<br />
NAS essentially turns a design problem into a search problem. As a search problem in general, we need a clear definition of three things:<br />
<ol><br />
<li> Search space</li><br />
<li> Search strategy</li><br />
<li> Performance Estimation Strategy</li><br />
</ol><br />
The search space is easy to understand, for instance defining a hyperparameter space to consider for our optimal solution. In the field of NAS, the search space is heavily dependent on the assumptions we make on the neural architecture. The search strategy details how to explore the search space. The evaluation strategy refers to taking an input of a set of hyperparameters, and from there evaluating how well our model fits. In the field of NAS, it is typical to find architectures that achieve high predictive performance on unseen data. [5]<br />
<br />
We will take a deep dive into the above three dimensions of NAS in the following sections<br />
<br />
=Search Space=<br />
There are typically three ways of defining the search space.<br />
==Chain-structured neural networks ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen_Shot_2018-11-10_at_6.03.00_PM.png|100px]]<br />
</div><br />
[5]<br />
The chain structed network can be viewd as sequence of n layers, where the layer <math> i</math> recives input from <math> i-1</math> layer and the output serves<br />
the input to layer <math> i+1</math>.<br />
<br />
The search space is then parametrized by:<br />
1) Number of layers n<br />
2) Type of operations can be executed on each layer<br />
3) Hyperparameters associated with each layer<br />
<br />
==Multi-branch networks ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.03.08 PM.png|200px]]</div><br />
<br />
[5]<br />
This architecture allows significantly more degrees of freedom. It allows shortcuts and parallel branches. Some of the ideas are inspired by human hand-crafted networks. For example, the shortcut from shallow layers directly to the deep layers are coming from networks like ResNet [6]<br />
<br />
The search space includes the search space of chain-structured networks, with additional freedom of adding shortcut connections and allowing parallel branches to exist.<br />
<br />
==Cell/Block ==<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.03.31 PM.png|300px]]</div><br />
<br />
[6]<br />
This architecture defines a cell which is used as the building block of the neural network. A good analogy here is to think a cell as a lego piece, and you can define different types of cells as different<br />
lego pieces. And then you can combine them together to form a new neural structure. <br />
<br />
The search space includes the internal structure of the cell and how to combine these blocks to form the resulting architecture.<br />
<br />
==What they used in this paper ==<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.50.04 PM.png|500px]]<br />
</div><br />
[1]<br />
This paper's approach is very close to the Cell/Block approach above<br />
<br />
The paper defines two components: The "network backbone" and a cell unit called "DPC" which represented by a directed acyclic graph (DAG) with five branches (i.e. the optimal value). The network backbone's job is to take input image as a tensor and return a feature map f that is a supposedly good abstraction of the image. The DPC is what they introduced in this paper, short for Dense Prediction Cell. In theory, the search space consists of what they choose for the network backbone and the internal structure of the DPC. In practice, they just used MobileNet and Modified Xception net as the backbone. So the search space only consists of the internal structure of the DPC cell.<br />
<br />
For the network backbone, they simply choose from existing mature architecture. They used networks like Mobile-Net-v2, Inception-Net, and e.t.c. For the structure of DPC, they define a smaller unit of called branch. A branch is a triple of (Xi, OP, Yi), where Xi is an input tensor, and OP is the operation that can be done on the tensor, and Yi is the resulting after the Operation. <br />
<br />
In the paper, they set each DPC consists of 5 cells for the balance expressivity and computational tractability.<br />
<br />
The operator space, OP, is defined as the following set of functions:<br />
<ol><br />
<li>Convolution with a 1 × 1 kernel.</li><br />
<li>3×3 atrous separable convolution with rate rh×rw, where rh and rw ∈ {1, 3, 6, 9, . . . , 21}. </li><br />
<li>Average spatial pyramid pooling with grid size gh × gw, where gh and gw ∈ {1, 2, 4, 8}. </li><br />
</ol><br />
<br />
<br />
The operation spae has 1 + 8×8 + 4×4 = 81 functions in the operator space, resulting in i × 81 possible options. Therefore, for B = 5, the search space size is B! × 81^B ≈ 4.2 × 10^11 configurations.<br />
<br />
=Search Strategy=<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:search_strategy.png|500px]]<br />
</div><br />
<br />
There are some common search strategies used in the field of NAS, such as Reinforcement learning, Random search, Evolution algorithm, and Grid Search.<br />
<br />
The one they used in the paper is Random Search. It basically samples points from the search space uniformly at random as well as sampling<br />
some points that are close to the current observed best point. Intuitively it makes sense because it combines exploration and exploitation. When you sample points close to the current<br />
optimal point, you are doing exploitation. And when you sample points randomly, you are doing exploration.<br />
<br />
<br />
They quoted from another paper that claims random search performs the random search is competitive with reinforcement learning and other learning techniques. [7] <br />
In the implementation, they used Google's black box optimization tool Google vizier. It is not open source, but there is an open source implementation of it [8]<br />
<br />
=Performance Evaluation Strategy=<br />
<br />
The evaluation in this particular task is very tricky. The reason is we are evaluating neural network here. In order to evaluate it, we need to train it first. And we are doing pixel level classification on images with high resolutions, so the naive approach would require a tremendous amount of computational resources. <br />
<br />
The way they solve it in the paper is defining a proxy task. The proxy task is a task that requires sufficient less computational resources, while can still give a good estimate of the performance of the network. In most image classical tasks of NAS, the proxy<br />
task is to train the network on images of lower resolution. The assumption is, if the network performs well on images with lower density, it should reasonably perform well on images with higher resolution.<br />
<br />
However, the above approach does not work on this case. The reason is that the dense prediction tasks innately require high-resolution images as training data. The approach used in the paper is the flowing:<br />
<ol><br />
<li> Use a smaller backbone for proxy task</li><br />
<li> caching the feature maps produced by the network backbone on the training set and directly building a single DPC on top of it </li><br />
<li> Early stopping train for 30k iterations with a batch size of 8</li><br />
</ol><br />
<br />
If training on the large-scale backbone without fixing the weights of the backbone, they would need one week to train a network on a P100 GPU, but now they cut down the proxy task to be run 90 min. Then they rank the selected architectures, choosing the top 50 and do <br />
a full evaluation on it.<br />
<br />
The evaluation metric they used is called mIOU, which is pixel level intersection over union. Which just the area of the intersection<br />
of the ground truth and the prediction over the area of the union of the ground truth and the prediction.<br />
<br />
=Result=<br />
<br />
This method achieves state of art performances in many datasets. The following table quantifies the gain on performance on many datasets.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.51.14 PM.png| 400px]]<br />
</div><br />
The chose to train on modified Xception network as a backbone, and the following are the resulting architecture for the DPC.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-12 at 12.32.05 PM.png|400px]]<br />
</div><br />
<br />
Table 2 describes the results on scene parsing dataset. It sets a new state-of-the-art performance of 82.7% mIOU and outperforms other state-of-the-art models across 11 of the 19 categories.<br />
<br />
Table 3 describes the results on person part segmentation dataset. It achieve the state-of-the-art performance of 71.34% mIOU and outperforms other state-of-the-art models across 6 of the 7 categories.<br />
<br />
Table 4 describes the results on semantic image segmentation dataset. It achieve the state-of-the-art performance of 87.9% mIOU and outperforms other state-of-the-art models across 6 of the 20 categories.<br />
<br />
As we can see, the searched DPC model achieves better performance (measured by mIOU) with less than half of the computational resources(parameters), and 37% less of operations (add and multiply).<br />
<br />
=Future work=<br />
The author suggests that when increasing the number of branches in the DPC, there might be a further gain on the performance on the<br />
image segmentation task. However, although the random search in an exponentially growing space may become more challenging. There may need more intelligent search strategy.<br />
<br />
The author hope that this architecture search techniques can be ported into other domains such as depth prediction and object detection to achieve similar gains over human-invented designs.<br />
<br />
=Critique=<br />
<br />
1. Rich man's game<br />
<br />
The technique described in the paper can only be applied by parties with abundant computational resources, like Google, Facebook, Microsoft, and e.t.c. For small research groups and companies, this method is not that useful due to the lack of computational power. Future improvement will be needed on the design an even more efficient proxy task that can tell whether a network will perform<br />
well that requires fewer computations. <br />
<br />
2. Benefit/Cost ratio<br />
<br />
The technique here does outperform human designed network in many cases, but the gain is not huge. In Cityscapes dataset, the performance gain is 0.7%, wherein PASCAL-Person-Part dataset, the gain is 3.7%, and the PASCAL VOC 2012 dataset, it does not outperform human experts. (All measured by mIOU) Even though the push of the state-of-the-art is always something that worth celebrating, <br />
but in practice, one would argue after spending so many resources doing the search, the computer should achieve superhuman performance. (Like Chess Engine vs Chess Grand Master). In practice, one may simply go with the current state-of-the-art model to avoid the expensive search cost.<br />
<br />
3. Still Heavily influenced by Human Bias<br />
<br />
When we define the search space, we introduced human bias. Firstly, the network backbone is chosen from previous matured architectures, which may not actually be optimal. Secondly, the internal branches in the DPC also consist with layers whose operations are defined by us humans, and we define these operations based on previous experience. That also prevents the search algorithm to find something revolutionary.<br />
<br />
4. May have the potential to take away entry-level data science jobs.<br />
<br />
If there is a significant reduction in the search cost, it will be more cost effective to apply NAS rather than hire data scientists. Once matured, this technology will have the potential to take away entry-level data science jobs and make data science jobs only possessed by high-level researchers. <br />
<br />
There are some real-world applications that already deploy NAS techniques in production. Two good examples are Google AutoML and Microsoft Custom Vision AI.<br />
[9, 10]<br />
<br />
=References=<br />
1. Searching For Efficient Multi-Scale Architectures For Dense Image Prediction, [[https://arxiv.org/abs/1809.04184]].<br />
<br />
2. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv:1802.01548, 2018.<br />
<br />
3. C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018.<br />
<br />
4. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.<br />
<br />
5. Neural Architecture Search: A Survey [[https://arxiv.org/abs/1808.05377]]<br />
<br />
6. Deep Residual Learning for Image Recognition [[https://arxiv.org/pdf/1512.03385.pdf]]<br />
<br />
7. .J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.<br />
In the implementation wise, they used a Google vizier, which is a search tool for black box optimization. [D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In SIGKDD, 2017.]<br />
<br />
8. Github implementation of Google Vizer, a black-box optimization tool [https://github.com/tobegit3hub/advisor.]<br />
<br />
9. AutoML: https://cloud.google.com/automl/ <br />
<br />
10. Custom-vision: https://azure.microsoft.com/en-us/services/cognitive-services/custom-vision-service/</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Searching_For_Efficient_Multi_Scale_Architectures_For_Dense_Image_Prediction&diff=38792Searching For Efficient Multi Scale Architectures For Dense Image Prediction2018-11-13T05:12:41Z<p>D35kumar: /* Introduction */ (Editorial)</p>
<hr />
<div><br />
[Need add more pics and references]<br />
=Introduction=<br />
<br />
The design of neural network architectures is an important component for the success of machine learning and data science projects. In recent years, the field of Neural Architecture Search(NAS) has emerged, which looks at automatically finding an optimal neural architecture in a given task in a well-defined architecture space. Often, the resulting architectures have outperformed networks designed by human experts on many tasks such as image classification and natural language processing.[2,3,4] <br />
<br />
This paper presents a meta-learning technique of letting computers search for a neural architecture that performs well on the task of dense image segmentation.<br />
<br />
=Motivation=<br />
<br />
Deep Neural network's success is largely due to the fact that it greatly reduces the work in Feature Engineering, as DNN has the ability to automatically extract useful features given the raw input. However, it created a<br />
new type of engineering work - network engineering. In order to successfully extract features, you need to have the corresponding network architecture. So what really happened is the engineering work is shifted from feature engineering to how to design the network so that it can better abstract useful features.<br />
<br />
The motivation for NAS is that since there is no guiding theory on how to design the optimal network architecture, given that we have <br />
abundant computational resources, one intuitive solution is to define a finite search space and let the computers do the dirty work of searching for structures and hyperparameters.<br />
<br />
=Related Work =<br />
<br />
This paper focuses on two main literature research topics. One is the neural architecture search (NAS) and the other is the Multi-Scale representation for sense image prediction. Neural architecture search trains a controller network to generate neural architectures. The following are the important research directions in this area: <br />
<br />
1) One kind of research transfers architectures learned on a proxy dataset to more challenging datasets and demonstrates superior performance over many human-invented architectures.<br />
<br />
2) Reinforcement learning, evolutionary algorithms and sequential model-based optimization have been used to learn network structures. <br />
<br />
3) Some other works focus on increasing model size, sharing model weights to accelerate model search or a continuous relaxation of the architecture representation. <br />
<br />
4) Some recent methods focus on proposing methods for embedding an exponentially large number of architectures in a grid arrangement for semantic segmentation tasks. <br />
<br />
In the area of multi-scale representation for dense image prediction the following are useful prior work: <br />
<br />
1) State of the art methods use Convolutional Neural Nets. There are different methods proposed for supplying global features and context information to perform pixel level classification. <br />
<br />
2) Some approaches focus on how to efficiently encode multi-scale context information in a network architecture like designing models that take an input an image pyramid so that large-scale objects are captured by the downsampled image. <br />
<br />
3) Research also tried to come up with a theme on how best to tune the architecture to extract context information. Some works focus on sampling rates in atrous convolution to encode multi-scale context. Some others build context module by gradually increasing the rate on top of belief maps.<br />
<br />
=NAS Overview=<br />
<br />
NAS essentially turns a design problem into a search problem. As a search problem in general, we need a clear definition of three things:<br />
<ol><br />
<li> Search space</li><br />
<li> Search strategy</li><br />
<li> Performance Estimation Strategy</li><br />
</ol><br />
The search space is very intuitive to understand such as in what hyperparameter space we should look for our optimal solution. In the field of NAS, the search space is heavily dependent on the assumption we make on the neural architecture. The search strategy details how to explore the search space. The evaluation strategy is when we find a set of hyperparameters, how should we evaluate our model. In the field of NAS, it is typically to find architectures that achieve high predictive performance on unseen data. [5]<br />
<br />
We will take a deep dive into the above three dimensions of NAS in the following sections<br />
<br />
=Search Space=<br />
There are typically three ways of defining the search space.<br />
==Chain-structured neural networks ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen_Shot_2018-11-10_at_6.03.00_PM.png|100px]]<br />
</div><br />
[5]<br />
The chain structed network can be viewd as sequence of n layers, where the layer <math> i</math> recives input from <math> i-1</math> layer and the output serves<br />
the input to layer <math> i+1</math>.<br />
<br />
The search space is then parametrized by:<br />
1) Number of layers n<br />
2) Type of operations can be executed on each layer<br />
3) Hyperparameters associated with each layer<br />
<br />
==Multi-branch networks ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.03.08 PM.png|200px]]</div><br />
<br />
[5]<br />
This architecture allows significantly more degrees of freedom. It allows shortcuts and parallel branches. Some of the ideas are inspired by human hand-crafted networks. For example, the shortcut from shallow layers directly to the deep layers are coming from networks like ResNet [6]<br />
<br />
The search space includes the search space of chain-structured networks, with additional freedom of adding shortcut connections and allowing parallel branches to exist.<br />
<br />
==Cell/Block ==<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.03.31 PM.png|300px]]</div><br />
<br />
[6]<br />
This architecture defines a cell which is used as the building block of the neural network. A good analogy here is to think a cell as a lego piece, and you can define different types of cells as different<br />
lego pieces. And then you can combine them together to form a new neural structure. <br />
<br />
The search space includes the internal structure of the cell and how to combine these blocks to form the resulting architecture.<br />
<br />
==What they used in this paper ==<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.50.04 PM.png|500px]]<br />
</div><br />
[1]<br />
This paper's approach is very close to the Cell/Block approach above<br />
<br />
The paper defines two components: The "network backbone" and a cell unit called "DPC" which represented by a directed acyclic graph (DAG) with five branches (i.e. the optimal value). The network backbone's job is to take input image as a tensor and return a feature map f that is a supposedly good abstraction of the image. The DPC is what they introduced in this paper, short for Dense Prediction Cell. In theory, the search space consists of what they choose for the network backbone and the internal structure of the DPC. In practice, they just used MobileNet and Modified Xception net as the backbone. So the search space only consists of the internal structure of the DPC cell.<br />
<br />
For the network backbone, they simply choose from existing mature architecture. They used networks like Mobile-Net-v2, Inception-Net, and e.t.c. For the structure of DPC, they define a smaller unit of called branch. A branch is a triple of (Xi, OP, Yi), where Xi is an input tensor, and OP is the operation that can be done on the tensor, and Yi is the resulting after the Operation. <br />
<br />
In the paper, they set each DPC consists of 5 cells for the balance expressivity and computational tractability.<br />
<br />
The operator space, OP, is defined as the following set of functions:<br />
<ol><br />
<li>Convolution with a 1 × 1 kernel.</li><br />
<li>3×3 atrous separable convolution with rate rh×rw, where rh and rw ∈ {1, 3, 6, 9, . . . , 21}. </li><br />
<li>Average spatial pyramid pooling with grid size gh × gw, where gh and gw ∈ {1, 2, 4, 8}. </li><br />
</ol><br />
<br />
<br />
The operation spae has 1 + 8×8 + 4×4 = 81 functions in the operator space, resulting in i × 81 possible options. Therefore, for B = 5, the search space size is B! × 81^B ≈ 4.2 × 10^11 configurations.<br />
<br />
=Search Strategy=<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:search_strategy.png|500px]]<br />
</div><br />
<br />
There are some common search strategies used in the field of NAS, such as Reinforcement learning, Random search, Evolution algorithm, and Grid Search.<br />
<br />
The one they used in the paper is Random Search. It basically samples points from the search space uniformly at random as well as sampling<br />
some points that are close to the current observed best point. Intuitively it makes sense because it combines exploration and exploitation. When you sample points close to the current<br />
optimal point, you are doing exploitation. And when you sample points randomly, you are doing exploration.<br />
<br />
<br />
They quoted from another paper that claims random search performs the random search is competitive with reinforcement learning and other learning techniques. [7] <br />
In the implementation, they used Google's black box optimization tool Google vizier. It is not open source, but there is an open source implementation of it [8]<br />
<br />
=Performance Evaluation Strategy=<br />
<br />
The evaluation in this particular task is very tricky. The reason is we are evaluating neural network here. In order to evaluate it, we need to train it first. And we are doing pixel level classification on images with high resolutions, so the naive approach would require a tremendous amount of computational resources. <br />
<br />
The way they solve it in the paper is defining a proxy task. The proxy task is a task that requires sufficient less computational resources, while can still give a good estimate of the performance of the network. In most image classical tasks of NAS, the proxy<br />
task is to train the network on images of lower resolution. The assumption is, if the network performs well on images with lower density, it should reasonably perform well on images with higher resolution.<br />
<br />
However, the above approach does not work on this case. The reason is that the dense prediction tasks innately require high-resolution images as training data. The approach used in the paper is the flowing:<br />
<ol><br />
<li> Use a smaller backbone for proxy task</li><br />
<li> caching the feature maps produced by the network backbone on the training set and directly building a single DPC on top of it </li><br />
<li> Early stopping train for 30k iterations with a batch size of 8</li><br />
</ol><br />
<br />
If training on the large-scale backbone without fixing the weights of the backbone, they would need one week to train a network on a P100 GPU, but now they cut down the proxy task to be run 90 min. Then they rank the selected architectures, choosing the top 50 and do <br />
a full evaluation on it.<br />
<br />
The evaluation metric they used is called mIOU, which is pixel level intersection over union. Which just the area of the intersection<br />
of the ground truth and the prediction over the area of the union of the ground truth and the prediction.<br />
<br />
=Result=<br />
<br />
This method achieves state of art performances in many datasets. The following table quantifies the gain on performance on many datasets.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.51.14 PM.png| 400px]]<br />
</div><br />
The chose to train on modified Xception network as a backbone, and the following are the resulting architecture for the DPC.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-12 at 12.32.05 PM.png|400px]]<br />
</div><br />
<br />
As we can see, the searched DPC model achieves better performance (measured by mIOU) with less than half of the computational resources(parameters), and 37% less of operations (add and multiply).<br />
<br />
=Future work=<br />
The author suggests that when increasing the number of branches in the DPC, there might be a further gain on the performance on the<br />
image segmentation task. However, although the random search in an exponentially growing space may become more challenging. There may need more intelligent search strategy.<br />
<br />
The author hope that this architecture search techniques can be ported into other domains such as depth prediction and object detection to achieve similar gains over human-invented designs.<br />
<br />
=Critique=<br />
<br />
1. Rich man's game<br />
<br />
The technique described in the paper can only be applied by parties with abundant computational resources, like Google, Facebook, Microsoft, and e.t.c. For small research groups and companies, this method is not that useful due to the lack of computational power. Future improvement will be needed on the design an even more efficient proxy task that can tell whether a network will perform<br />
well that requires fewer computations. <br />
<br />
2. Benefit/Cost ratio<br />
<br />
The technique here does outperform human designed network in many cases, but the gain is not huge. In Cityscapes dataset, the performance gain is 0.7%, wherein PASCAL-Person-Part dataset, the gain is 3.7%, and the PASCAL VOC 2012 dataset, it does not outperform human experts. (All measured by mIOU) Even though the push of the state-of-the-art is always something that worth celebrating, <br />
but in practice, one would argue after spending so many resources doing the search, the computer should achieve superhuman performance. (Like Chess Engine vs Chess Grand Master). In practice, one may simply go with the current state-of-the-art model to avoid the expensive search cost.<br />
<br />
3. Still Heavily influenced by Human Bias<br />
<br />
When we define the search space, we introduced human bias. Firstly, the network backbone is chosen from previous matured architectures, which may not actually be optimal. Secondly, the internal branches in the DPC also consist with layers whose operations are defined by us humans, and we define these operations based on previous experience. That also prevents the search algorithm to find something revolutionary.<br />
<br />
4. May have the potential to take away entry-level data science jobs.<br />
<br />
If there is a significant reduction in the search cost, it will be more cost effective to apply NAS rather than hire data scientists. Once matured, this technology will have the potential to take away entry-level data science jobs and make data science jobs only possessed by high-level researchers. <br />
<br />
There are some real-world applications that already deploy NAS techniques in production. Two good examples are Google AutoML and Microsoft Custom Vision AI.<br />
[9, 10]<br />
<br />
=References=<br />
1. Searching For Efficient Multi-Scale Architectures For Dense Image Prediction, [[https://arxiv.org/abs/1809.04184]].<br />
<br />
2. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv:1802.01548, 2018.<br />
<br />
3. C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018.<br />
<br />
4. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.<br />
<br />
5. Neural Architecture Search: A Survey [[https://arxiv.org/abs/1808.05377]]<br />
<br />
6. Deep Residual Learning for Image Recognition [[https://arxiv.org/pdf/1512.03385.pdf]]<br />
<br />
7. .J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.<br />
In the implementation wise, they used a Google vizier, which is a search tool for black box optimization. [D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In SIGKDD, 2017.]<br />
<br />
8. Github implementation of Google Vizer, a black-box optimization tool [https://github.com/tobegit3hub/advisor.]<br />
<br />
9. AutoML: https://cloud.google.com/automl/ <br />
<br />
10. Custom-vision: https://azure.microsoft.com/en-us/services/cognitive-services/custom-vision-service/</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MULTI-VIEW_DATA_GENERATION_WITHOUT_VIEW_SUPERVISION&diff=38726MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION2018-11-12T22:45:02Z<p>D35kumar: /* Experiments and Results */ (Editorial)</p>
<hr />
<div>This page contains a summary of the paper "[https://openreview.net/forum?id=ryRh0bb0Z Multi-View Data Generation without Supervision]" by Mickael Chen, Ludovic Denoyer, Thierry Artieres. It was published at the International Conference on Learning Representations (ICLR) in 2018. <br />
<br />
==Introduction==<br />
<br />
===Motivation===<br />
High Dimensional Generative models have seen a surge of interest of late with the introduction of Variational Auto-Encoders and Generative Adversarial Networks. This paper focuses on a particular problem where one aims at generating samples corresponding to a number of objects under various views. The distribution of the data is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles of the same object). The paper proposes two models using this disentanglement of latent space - a generative model and a conditional variant of the same.<br />
<br />
===Related Work===<br />
<br />
The problem of handling multi-view inputs has mainly been studied from the predictive point of view where one wants, for example, to learn a model able to predict/classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variations from multiview datasets. The underlying idea is to consider that a particular data sample may be thought as the mix of a content information (e.g. related to its class label like a given person in a face dataset) and of a side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with/wo glasses...). So, all the samples of same class contain the same content but different view. A number of approaches have been proposed to disentangle the content from the view, also referred as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). The two common limitations the earlier approaches pose - as claimed by the paper - are that (i) they usually<br />
consider discrete views that are characterized by a domain or a set of discrete (binary/categorical) attributes (e.g. face with/wo glasses, the color of the hair, etc.) and could not easily scale to a large number of attributes or to continuous views. (ii) most models are trained using view supervision (e.g. the view attributes), which of course greatly helps learning such model, yet prevents their use on many datasets where this information is not available. <br />
<br />
===Contributions===<br />
<br />
The contributions that authors claim are the following: (i) A new generative model able to generate data with various content and high view diversity using a supervision on the content information only. (ii) Extend the generative model to a conditional model that allows generating new views over any input sample. (iii) Report experimental results on four different images datasets to prove that the models can generate realistic samples and capture (and generate with) the diversity of views.<br />
<br />
==Paper Overview==<br />
<br />
===Background===<br />
<br />
The paper uses concept of the popular GAN (Generative Adverserial Networks) proposed by Goodfellow et al.(2014).<br />
<br />
GENERATIVE ADVERSARIAL NETWORK:<br />
<br />
Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other (thus the “adversarial”). GANs were introduced in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs, Facebook’s AI research director Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML.”<br />
<br />
Let us denote <math>X</math> an input space composed of multidimensional samples x e.g. vector, matrix or tensor. Given a latent space <math>R^n</math> and a prior distribution <math>p_z(z)</math> over this latent space, any generator function <math>G : R^n → X</math> defines a distribution <math>p_G </math> on <math> X</math> which is the distribution of samples G(z) where <math>z ∼ p_z</math>. A GAN defines, in addition to G, a discriminator function D : X → [0; 1] which aims at differentiating between real inputs sampled from the training set and fake inputs sampled following <math>p_G</math>, while the generator is learned to fool the discriminator D. Usually both G and D are implemented with neural networks. The objective function is based on the following adversarial criterion:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x)] + Ep_z[log(1 − D(G(z)))]</math></div><br />
<br />
where <math>p_x</math> is the empirical data distribution on X .<br />
It has been shown in Goodfellow et al. (2014) that if G∗ and D∗ are optimal for the above criterion, the Jensen-Shannon divergence between <math>p_{G∗}</math> and the empirical distribution of the data <math>p_x</math> in the dataset is minimized, making GAN able to estimate complex continuous data distributions.<br />
<br />
CONDITIONAL GENERATIVE ADVERSARIAL NETWORK:<br />
<br />
In the Conditional GAN (CGAN), the generator learns to generate a fake sample with a specific condition or characteristics (such as a label associated with an image or more detailed tag) rather than a generic sample from unknown noise distribution. Now, to add such a condition to both generator and discriminator, we will simply feed some vector y, into both networks. Hence, both the discriminator D(X,y) and generator G(z,y) are jointly distributed with y. <br />
<br />
Now, the objective function of CGAN is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x,y)] + Ep_z[log(1 − D(G(y,z)))]</math></div><br />
<br />
The paper also suggests that many studies have reported that when dealing with high-dimensional input spaces, CGAN tends to collapse the modes of the data distribution, mostly ignoring the latent factor z and generating x only based on the condition y, exhibiting an almost deterministic behaviour.<br />
<br />
===Generative Multi-View Model===<br />
<br />
''' Objective and Notations: ''' The distribution of the data x ∈ X is assumed to be driven by two latent factors: a content factor denoted c which corresponds to the invariant proprieties of the object,and a view factor denoted v which corresponds to the factor of variations. Typically, if X is the space of people’s faces, c stands for the intrinsic features of a person’s face while v stands for the transient features and the viewpoint of a particular photo of the face, including the photo exposure<br />
and additional elements like a hat, glasses, etc.... These two factors c and v are assumed to be independent and these are the factors needed to learn.<br />
<br />
The paper defines two tasks here to be done: <br />
(i) '''Multi View Generation''': we want to be able to sample over X by controlling the two factors c and v. Given two priors, p(c) and p(v), this sampling will be possible if we are able to estimate p(x|c, v) from a training set.<br />
(ii) '''Conditional Multi-View Generation''': the second objective is to be able to sample different views of a given object. Given a prior p(v), this sampling will be achieved by learning the probability p(c|x), in addition to p(x|c, v). Ability to learn generative models able to generate from a disentangled latent space would allow controlling the sampling on the two different axis,<br />
the content and the view. The authors claim the originality of work is to learn such generative models without using any view labelling information.<br />
<br />
''' Generative Multi-view Model: '''<br />
<br />
Consider two prior distributions over the content and view factors denoted as <math>p_c</math> and <math>p_v</math>, corresponding to the prior distribution over content and latent factors. Moreover, we consider a generator G that implements a distribution over samples x, denoted as <math>p_G</math> by computing G(c, v) with <math>c ∼ p_c</math> and <math>v ∼ p_v</math>. Objective is to learn this generator so that its first input c corresponds to the content of the generated sample while its second input v, captures the underlying view of the sample. Doing so would allow one to control the output sample of the generator by tuning its content or its view (i.e. c and v).<br />
<br />
The key idea that authors propose is to focus on the distribution of pairs of inputs rather than on the distribution over individual samples. When no view supervision is available the only valuable pairs of samples that one may build from the dataset consist of two samples of a given object under two different views. When we choose any two samples randomly from the dataset from a same object, it is most likely that we get two different views. The paper explains that there are three goals here, (i) As in regular GAN, each sample generated by G needs to look realistic. (ii) As, real pairs are composed of two views of the same object, the generator should generate pairs of the same object. Since the two sampled view factors v1 and v2 are different, the only way this can be achieved is by encoding the content vector c which is invariant. (iii) It is expected that the discriminator should easily discriminate between a pair of samples corresponding to the same object under different views from a pair of samples corresponding to a same object under the same view. Because the pair shares the same content factor c, this should force the generator to use the view factors v1 and v2 to produce diversity in the generated pair.<br />
<br />
Now, the objective function of GMV Model is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2}[log D(x_1,x_2)] + E_{v_1,v_2}[log(1 − D(G(c,v_1),G(c,v_2)))]</math></div><br />
<br />
Once the model is learned, generator G that generates single samples by first sampling c and v following <math>p_c</math> and <math>p_v</math>, then by computing G(c, v). By freezing c or v, one may then generate samples corresponding to multiple views of any particular content, or corresponding to many contents under a particular view. One can also make interpolations between two given views over a particular content, or between two contents using a particular view<br />
<br />
<div style="text-align: center;font-size:100%">[[File:GMV.png]]</div><br />
<br />
===Conditional Generative Model (C-GMV)===<br />
<br />
C-GMV is proposed by the authors to be able to change the view of a given object that would be provided as an input to the model.This model extends the generative model's the ability to extract the content factor from any given input and to use this extracted content in order to generate new views of the corresponding object. To achieve such a goal, we must add to our generative model an encoder function denoted <math>E : X → R^C</math> that will map any input in X to the content space <math>R^C</math><br />
<br />
Input sample x is encoded in the content space using an encoder function, noted E (implemented as a neural network).<br />
This encoder serves to generate a content vector c = E(x) that will be combined with a randomly sampled view <math>v ∼ p_v</math> to generate an artificial example. The artificial sample is then combined with the original input x to form a negative pair. The issue with this approach is that CGAN are known to easily miss modes of the underlying distribution. The generator enters in a state where it ignores the noisy component v. To overcome this phenomenon, we use the same idea as in GMV. We build negative pairs <math>(G(c, v_1), G(c, v_2))</math> by randomly sampling two views <math>v_1</math> and <math>v_2</math> that are combined to get a unique content c. c is computed from a sample x using the encoder E, i.e. c= E(x). By doing so, the ability of our approach to generating pairs with view diversity is preserved. Since this diversity can only be captured by taking into account the two different view vectors provided to the model (<math>v_1</math> and <math>v_2</math>), this will encourage G(c, v) to generate samples containing both the content information c, and the view v. Positive pairs are sampled from the training set and correspond to two views of a given object.<br />
<br />
The Objective function for C-GMV will be:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2 ~ p_x|l(x_1)=l(x_2)}[log D(x_1,x_2)] + E_{v_1,v_2 ~ p_v,x~p_x}[log(1 − D(G(E(x),v_1),G(E(x),v_2)))]+E_{v∼p_v,x∼p_x}[log(1 − D(G(E(x), v), x))] </math></div><br />
<br />
<div style="text-align: center;font-size:100%">[[File:CGMV.png]]</div><br />
<br />
==Experiments and Results==<br />
<br />
The authors have given an exhaustive set of results and experiments.<br />
<br />
Datasets: The two models were evaluated by performing experiments over four image datasets of various domains. Note that when supervision is available on the views (like CelebA for example where images are labeled with attributes) it is not used for learning models. The only supervision that is used is if two samples correspond to the same object or not.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:table_data.png]]</div><br />
<br />
<br />
Model Architecture: Same architectures for every dataset. The images were rescaled to 3×64×64 tensors. The generator G and the discriminator D follow that of the DCGAN implementation proposed in Radford et al. (2015).<br />
<br />
Baselines: Most existing methods are learned on datasets with view labeling. To fairly compare with alternative models, authors have built baselines working in the same conditions as the models in this paper. In addition, models are compared with the model from Mathieu et al. (2016). Results gained with two implementations are reported, the first one based on the implementation provided by the authors2 (denoted Mathieu et al. (2016)), and the second one (denoted Mathieu et al. (2016) (DCGAN) ) that implements the same model using architectures inspired from DCGAN Radford et al. (2015), which is more stable and that was tuned to allow a fair comparison with our approach. For pure multi-view generative setting, generative model(GMV) is compared with standard GANs that are learned to approximate the joint generation of multiple samples: DCGANx2 is learned to output pairs of views over the same object, DCGANx4 is trained on quadruplets, and DCGANx8 on eight different views. <br />
<br />
===Generating Multiple Contents and Views===<br />
<br />
Figure 1 shows examples of generated images by our model and Figure 4 shows images sampled by DCGAN based models (DCGANx2, DCGANx4, and DCGANx8) on 3DChairs and CelebA datasets.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig1_gmv.png]]</div><br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig4_gmv.png]]</div><br />
<br />
<br />
Figure 5 shows additional results, using the same presentation, for the GMV model only on two other datasets<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig5_gmv.png]]</div><br />
<br />
Figure 6 shows generated samples obtained by interpolation between two different view factors (left) or two content factors (right). It allows us to have a better idea of the underlying view/content structure captured by GMV. We can see that our approach is able to smoothly move from one content/view to another content/view while keeping the other factor constant. This also illustrates that content and view factors are well independently handled by the generator i.e. changing the view<br />
does not modify the content and vice versa.<br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig6_gmv.png]]</div><br />
<br />
===Generating Multiple Views of a Given Object===<br />
<br />
The second set of experiments evaluates the ability of C-GMV to capture a particular content from an input sample, and to use this content to generate multiple views of the same object. Figure 7 and 8 illustrate the diversity of views in samples generated by our model and compare our results with those obtained with the CGAN model and to models from Mathieu et al. (2016). For each row the input sample is shown in the left column. New views are generated from that input and shown on the right.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig7_gmv.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig8_gmv.png]]</div><br />
<br />
=== Evaluation of the Quality of Generated Samples ===<br />
<br />
The authors did sets of experiments aimed at evaluating the quality of the generated samples. They have been made on the CelebA dataset and evaluate (i) the ability of the models to preserve the identity of a person in multiple generated views, (ii) to generate realistic samples, (iii) to preserve the diversity in the generated views and (iv) to capture the view distributions of the original dataset.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:tab3.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:tab4.png]]</div><br />
<br />
==Conclusion==<br />
<br />
The paper proposed a generative model, which can be learnt from multi-view data without any supervision. Moreover, it introduced a conditional version that allows generating new views of an input image. Using experiments, they proved that the model can capture content and view factors.<br />
<br />
==Future Work==<br />
The authors of the papers mentioned that they plan to explore using their model for data augmentation, as it can produce other data views for training, in both semi-supervised and one-shot/few-shot learning settings. <br />
<br />
==Critique==<br />
<br />
The main idea is to train the model with pairs of images with different views. It is not that clear as to what defines a view in particular. The algorithms are largely based on earlier concepts of GAN and CGAN The authors give reference to the previous papers tackling the same problem and clearly define that the novelty in this approach is not making use of view labels. The authors give a very thorough list of experiments which clearly establish the superiority of the proposed models to baselines.<br />
<br />
However, this paper only tested the model on rather constrained examples. As was observed in the results the proposed approach seems to have a high sample complexity relying on training samples covering the full range of variations for both specified and unspecified variations. Also, the proposed model does not attempt to disentangle variations within the specified and unspecified components.<br />
<br />
==References==<br />
<br />
[1] Mickael Chen, Ludovic Denoyer, Thierry Artieres. MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION. Published as a conference paper at ICLR 2018<br />
<br />
[2] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.<br />
<br />
[3] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.<br />
<br />
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[5] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MULTI-VIEW_DATA_GENERATION_WITHOUT_VIEW_SUPERVISION&diff=38724MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION2018-11-12T22:40:42Z<p>D35kumar: /* Background */ Editorial</p>
<hr />
<div>This page contains a summary of the paper "[https://openreview.net/forum?id=ryRh0bb0Z Multi-View Data Generation without Supervision]" by Mickael Chen, Ludovic Denoyer, Thierry Artieres. It was published at the International Conference on Learning Representations (ICLR) in 2018. <br />
<br />
==Introduction==<br />
<br />
===Motivation===<br />
High Dimensional Generative models have seen a surge of interest of late with the introduction of Variational Auto-Encoders and Generative Adversarial Networks. This paper focuses on a particular problem where one aims at generating samples corresponding to a number of objects under various views. The distribution of the data is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles of the same object). The paper proposes two models using this disentanglement of latent space - a generative model and a conditional variant of the same.<br />
<br />
===Related Work===<br />
<br />
The problem of handling multi-view inputs has mainly been studied from the predictive point of view where one wants, for example, to learn a model able to predict/classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variations from multiview datasets. The underlying idea is to consider that a particular data sample may be thought as the mix of a content information (e.g. related to its class label like a given person in a face dataset) and of a side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with/wo glasses...). So, all the samples of same class contain the same content but different view. A number of approaches have been proposed to disentangle the content from the view, also referred as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). The two common limitations the earlier approaches pose - as claimed by the paper - are that (i) they usually<br />
consider discrete views that are characterized by a domain or a set of discrete (binary/categorical) attributes (e.g. face with/wo glasses, the color of the hair, etc.) and could not easily scale to a large number of attributes or to continuous views. (ii) most models are trained using view supervision (e.g. the view attributes), which of course greatly helps learning such model, yet prevents their use on many datasets where this information is not available. <br />
<br />
===Contributions===<br />
<br />
The contributions that authors claim are the following: (i) A new generative model able to generate data with various content and high view diversity using a supervision on the content information only. (ii) Extend the generative model to a conditional model that allows generating new views over any input sample. (iii) Report experimental results on four different images datasets to prove that the models can generate realistic samples and capture (and generate with) the diversity of views.<br />
<br />
==Paper Overview==<br />
<br />
===Background===<br />
<br />
The paper uses concept of the popular GAN (Generative Adverserial Networks) proposed by Goodfellow et al.(2014).<br />
<br />
GENERATIVE ADVERSARIAL NETWORK:<br />
<br />
Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other (thus the “adversarial”). GANs were introduced in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs, Facebook’s AI research director Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML.”<br />
<br />
Let us denote <math>X</math> an input space composed of multidimensional samples x e.g. vector, matrix or tensor. Given a latent space <math>R^n</math> and a prior distribution <math>p_z(z)</math> over this latent space, any generator function <math>G : R^n → X</math> defines a distribution <math>p_G </math> on <math> X</math> which is the distribution of samples G(z) where <math>z ∼ p_z</math>. A GAN defines, in addition to G, a discriminator function D : X → [0; 1] which aims at differentiating between real inputs sampled from the training set and fake inputs sampled following <math>p_G</math>, while the generator is learned to fool the discriminator D. Usually both G and D are implemented with neural networks. The objective function is based on the following adversarial criterion:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x)] + Ep_z[log(1 − D(G(z)))]</math></div><br />
<br />
where <math>p_x</math> is the empirical data distribution on X .<br />
It has been shown in Goodfellow et al. (2014) that if G∗ and D∗ are optimal for the above criterion, the Jensen-Shannon divergence between <math>p_{G∗}</math> and the empirical distribution of the data <math>p_x</math> in the dataset is minimized, making GAN able to estimate complex continuous data distributions.<br />
<br />
CONDITIONAL GENERATIVE ADVERSARIAL NETWORK:<br />
<br />
In the Conditional GAN (CGAN), the generator learns to generate a fake sample with a specific condition or characteristics (such as a label associated with an image or more detailed tag) rather than a generic sample from unknown noise distribution. Now, to add such a condition to both generator and discriminator, we will simply feed some vector y, into both networks. Hence, both the discriminator D(X,y) and generator G(z,y) are jointly distributed with y. <br />
<br />
Now, the objective function of CGAN is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x,y)] + Ep_z[log(1 − D(G(y,z)))]</math></div><br />
<br />
The paper also suggests that many studies have reported that when dealing with high-dimensional input spaces, CGAN tends to collapse the modes of the data distribution, mostly ignoring the latent factor z and generating x only based on the condition y, exhibiting an almost deterministic behaviour.<br />
<br />
===Generative Multi-View Model===<br />
<br />
''' Objective and Notations: ''' The distribution of the data x ∈ X is assumed to be driven by two latent factors: a content factor denoted c which corresponds to the invariant proprieties of the object,and a view factor denoted v which corresponds to the factor of variations. Typically, if X is the space of people’s faces, c stands for the intrinsic features of a person’s face while v stands for the transient features and the viewpoint of a particular photo of the face, including the photo exposure<br />
and additional elements like a hat, glasses, etc.... These two factors c and v are assumed to be independent and these are the factors needed to learn.<br />
<br />
The paper defines two tasks here to be done: <br />
(i) '''Multi View Generation''': we want to be able to sample over X by controlling the two factors c and v. Given two priors, p(c) and p(v), this sampling will be possible if we are able to estimate p(x|c, v) from a training set.<br />
(ii) '''Conditional Multi-View Generation''': the second objective is to be able to sample different views of a given object. Given a prior p(v), this sampling will be achieved by learning the probability p(c|x), in addition to p(x|c, v). Ability to learn generative models able to generate from a disentangled latent space would allow controlling the sampling on the two different axis,<br />
the content and the view. The authors claim the originality of work is to learn such generative models without using any view labelling information.<br />
<br />
''' Generative Multi-view Model: '''<br />
<br />
Consider two prior distributions over the content and view factors denoted as <math>p_c</math> and <math>p_v</math>, corresponding to the prior distribution over content and latent factors. Moreover, we consider a generator G that implements a distribution over samples x, denoted as <math>p_G</math> by computing G(c, v) with <math>c ∼ p_c</math> and <math>v ∼ p_v</math>. Objective is to learn this generator so that its first input c corresponds to the content of the generated sample while its second input v, captures the underlying view of the sample. Doing so would allow one to control the output sample of the generator by tuning its content or its view (i.e. c and v).<br />
<br />
The key idea that authors propose is to focus on the distribution of pairs of inputs rather than on the distribution over individual samples. When no view supervision is available the only valuable pairs of samples that one may build from the dataset consist of two samples of a given object under two different views. When we choose any two samples randomly from the dataset from a same object, it is most likely that we get two different views. The paper explains that there are three goals here, (i) As in regular GAN, each sample generated by G needs to look realistic. (ii) As, real pairs are composed of two views of the same object, the generator should generate pairs of the same object. Since the two sampled view factors v1 and v2 are different, the only way this can be achieved is by encoding the content vector c which is invariant. (iii) It is expected that the discriminator should easily discriminate between a pair of samples corresponding to the same object under different views from a pair of samples corresponding to a same object under the same view. Because the pair shares the same content factor c, this should force the generator to use the view factors v1 and v2 to produce diversity in the generated pair.<br />
<br />
Now, the objective function of GMV Model is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2}[log D(x_1,x_2)] + E_{v_1,v_2}[log(1 − D(G(c,v_1),G(c,v_2)))]</math></div><br />
<br />
Once the model is learned, generator G that generates single samples by first sampling c and v following <math>p_c</math> and <math>p_v</math>, then by computing G(c, v). By freezing c or v, one may then generate samples corresponding to multiple views of any particular content, or corresponding to many contents under a particular view. One can also make interpolations between two given views over a particular content, or between two contents using a particular view<br />
<br />
<div style="text-align: center;font-size:100%">[[File:GMV.png]]</div><br />
<br />
===Conditional Generative Model (C-GMV)===<br />
<br />
C-GMV is proposed by the authors to be able to change the view of a given object that would be provided as an input to the model.This model extends the generative model's the ability to extract the content factor from any given input and to use this extracted content in order to generate new views of the corresponding object. To achieve such a goal, we must add to our generative model an encoder function denoted <math>E : X → R^C</math> that will map any input in X to the content space <math>R^C</math><br />
<br />
Input sample x is encoded in the content space using an encoder function, noted E (implemented as a neural network).<br />
This encoder serves to generate a content vector c = E(x) that will be combined with a randomly sampled view <math>v ∼ p_v</math> to generate an artificial example. The artificial sample is then combined with the original input x to form a negative pair. The issue with this approach is that CGAN are known to easily miss modes of the underlying distribution. The generator enters in a state where it ignores the noisy component v. To overcome this phenomenon, we use the same idea as in GMV. We build negative pairs <math>(G(c, v_1), G(c, v_2))</math> by randomly sampling two views <math>v_1</math> and <math>v_2</math> that are combined to get a unique content c. c is computed from a sample x using the encoder E, i.e. c= E(x). By doing so, the ability of our approach to generating pairs with view diversity is preserved. Since this diversity can only be captured by taking into account the two different view vectors provided to the model (<math>v_1</math> and <math>v_2</math>), this will encourage G(c, v) to generate samples containing both the content information c, and the view v. Positive pairs are sampled from the training set and correspond to two views of a given object.<br />
<br />
The Objective function for C-GMV will be:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2 ~ p_x|l(x_1)=l(x_2)}[log D(x_1,x_2)] + E_{v_1,v_2 ~ p_v,x~p_x}[log(1 − D(G(E(x),v_1),G(E(x),v_2)))]+E_{v∼p_v,x∼p_x}[log(1 − D(G(E(x), v), x))] </math></div><br />
<br />
<div style="text-align: center;font-size:100%">[[File:CGMV.png]]</div><br />
<br />
==Experiments and Results==<br />
<br />
The authors have given a exhaustive set of results and experiments.<br />
<br />
Datasets: The two models were evaluated by performing experiments over four image datasets of various domains. Note that when supervision is available on the views (like CelebA for example where images are labeled with attributes) it is not used for learning models. The only supervision that is used if two samples correspond or not to the same object.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:table_data.png]]</div><br />
<br />
<br />
Model Architecture: Same architectures for every dataset. The images were rescaled to 3×64×64 tensors. The generator G and the discriminator D follow that of the DCGAN implementation proposed in Radford et al. (2015).<br />
<br />
Baselines: Most existing methods are learned on datasets with view labeling. To fairly compare with alternative models authors have built baselines working in the same conditions as our models. In addition models are compared with the model from Mathieu et al. (2016). Results gained with two implementations are reported, the first one based on the implementation provided by the authors2 (denoted Mathieu et al. (2016)), and the second one (denoted Mathieu et al. (2016) (DCGAN) ) that implements the same model using architectures inspired from DCGAN Radford et al. (2015), which is more stable and that was tuned to allow a fair comparison with our approach. For pure multi-view generative setting, generative model(GMV) is compared with standard GANs that are learned to approximate the joint generation of multiple samples: DCGANx2 is learned to output pairs of views over the same object, DCGANx4 is trained on quadruplets, and DCGANx8 on eight different views. <br />
<br />
===Generating Multiple Contents and Views===<br />
<br />
Figure 1 shows examples of generated images by our model and Figure 4 shows images sampled by DCGAN based models (DCGANx2, DCGANx4, and DCGANx8) on 3DChairs and CelebA datasets.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig1_gmv.png]]</div><br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig4_gmv.png]]</div><br />
<br />
<br />
Figure 5 shows additional results, using the same presentation, for the GMV model only on two other datasets<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig5_gmv.png]]</div><br />
<br />
Figure 6 shows generated samples obtained by interpolation between two different view factors (left) or two content factors (right). It allows us to have a better idea of the underlying view/content structure captured by GMV. We can see that our approach is able to smoothly move from one content/view to another content/view while keeping the other factor constant. This also illustrates that content and view factors are well independently handled by the generator i.e. changing the view<br />
does not modify the content and vice versa.<br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig6_gmv.png]]</div><br />
<br />
===Generating Multiple Views of a Given Object===<br />
<br />
The second set of experiments evaluates the ability of C-GMV to capture a particular content from an input sample, and to use this content to generate multiple views of the same object. Figure 7 and 8 illustrate the diversity of views in samples generated by our model and compare our results with those obtained with the CGAN model and to models from Mathieu et al. (2016). For each row the input sample is shown in the left column. New views are generated from that input and shown on the right.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig7_gmv.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig8_gmv.png]]</div><br />
<br />
=== Evaluation of the Quality of Generated Samples ===<br />
<br />
The authors did sets of experiments aimed at evaluating the quality of the generated samples. They have been made on the CelebA dataset and evaluate (i) the ability of the models to preserve the identity of a person in multiple generated views, (ii) to generate realistic samples, (iii) to preserve the diversity in the generated views and (iv) to capture the view distributions of the original dataset.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:tab3.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:tab4.png]]</div><br />
<br />
==Conclusion==<br />
<br />
The paper proposed a generative model, which can be learnt from multi-view data without any supervision. Moreover, it introduced a conditional version that allows generating new views of an input image. Using experiments, they proved that the model can capture content and view factors.<br />
<br />
==Future Work==<br />
The authors of the papers mentioned that they plan to explore using their model for data augmentation, as it can produce other data views for training, in both semi-supervised and one-shot/few-shot learning settings. <br />
<br />
==Critique==<br />
<br />
The main idea is to train the model with pairs of images with different views. It is not that clear as to what defines a view in particular. The algorithms are largely based on earlier concepts of GAN and CGAN The authors give reference to the previous papers tackling the same problem and clearly define that the novelty in this approach is not making use of view labels. The authors give a very thorough list of experiments which clearly establish the superiority of the proposed models to baselines.<br />
<br />
However, this paper only tested the model on rather constrained examples. As was observed in the results the proposed approach seems to have a high sample complexity relying on training samples covering the full range of variations for both specified and unspecified variations. Also, the proposed model does not attempt to disentangle variations within the specified and unspecified components.<br />
<br />
==References==<br />
<br />
[1] Mickael Chen, Ludovic Denoyer, Thierry Artieres. MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION. Published as a conference paper at ICLR 2018<br />
<br />
[2] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.<br />
<br />
[3] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.<br />
<br />
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[5] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MULTI-VIEW_DATA_GENERATION_WITHOUT_VIEW_SUPERVISION&diff=38723MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION2018-11-12T22:37:35Z<p>D35kumar: /* Contributions */ Adding contribution [Technical]</p>
<hr />
<div>This page contains a summary of the paper "[https://openreview.net/forum?id=ryRh0bb0Z Multi-View Data Generation without Supervision]" by Mickael Chen, Ludovic Denoyer, Thierry Artieres. It was published at the International Conference on Learning Representations (ICLR) in 2018. <br />
<br />
==Introduction==<br />
<br />
===Motivation===<br />
High Dimensional Generative models have seen a surge of interest of late with the introduction of Variational Auto-Encoders and Generative Adversarial Networks. This paper focuses on a particular problem where one aims at generating samples corresponding to a number of objects under various views. The distribution of the data is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles of the same object). The paper proposes two models using this disentanglement of latent space - a generative model and a conditional variant of the same.<br />
<br />
===Related Work===<br />
<br />
The problem of handling multi-view inputs has mainly been studied from the predictive point of view where one wants, for example, to learn a model able to predict/classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variations from multiview datasets. The underlying idea is to consider that a particular data sample may be thought as the mix of a content information (e.g. related to its class label like a given person in a face dataset) and of a side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with/wo glasses...). So, all the samples of same class contain the same content but different view. A number of approaches have been proposed to disentangle the content from the view, also referred as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). The two common limitations the earlier approaches pose - as claimed by the paper - are that (i) they usually<br />
consider discrete views that are characterized by a domain or a set of discrete (binary/categorical) attributes (e.g. face with/wo glasses, the color of the hair, etc.) and could not easily scale to a large number of attributes or to continuous views. (ii) most models are trained using view supervision (e.g. the view attributes), which of course greatly helps learning such model, yet prevents their use on many datasets where this information is not available. <br />
<br />
===Contributions===<br />
<br />
The contributions that authors claim are the following: (i) A new generative model able to generate data with various content and high view diversity using a supervision on the content information only. (ii) Extend the generative model to a conditional model that allows generating new views over any input sample. (iii) Report experimental results on four different images datasets to prove that the models can generate realistic samples and capture (and generate with) the diversity of views.<br />
<br />
==Paper Overview==<br />
<br />
===Background===<br />
<br />
The paper uses concept of the popular GAN (Generative Adverserial Networks) proposed by Goodfellow et al.(2014).<br />
<br />
GENERATIVE ADVERSARIAL NETWORK:<br />
<br />
Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other (thus the “adversarial”). GANs were introduced in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs, Facebook’s AI research director Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML.”<br />
<br />
Let us denote <math>X</math> an input space composed of multidimensional samples x e.g. vector, matrix or tensor. Given a latent space <math>R^n</math> and a prior distribution <math>p_z(z)</math> over this latent space, any generator function <math>G : R^n → X</math> defines a distribution <math>p_G </math> on <math> X</math> which is the distribution of samples G(z) where <math>z ∼ p_z</math>. A GAN defines, in addition to G, a discriminator function D : X → [0; 1] which aims at differentiating between real inputs sampled from the training set and fake inputs sampled following <math>p_G</math>, while the generator is learned to fool the discriminator D. Usually both G and D are implemented with neural networks. The objective function is based on the following adversarial criterion:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x)] + Ep_z[log(1 − D(G(z)))]</math></div><br />
<br />
where <math>p_x</math> is the empirical data distribution on X .<br />
It has been shown in Goodfellow et al. (2014) that if G∗ and D∗ are optimal for the above criterion, the Jensen-Shannon divergence between <math>p_{G∗}</math> and the empirical distribution of the data <math>p_x</math> in the dataset is minimized, making GAN able to estimate complex continuous data distributions.<br />
<br />
CONDITIONAL GENERATIVE ADVERSARIAL NETWORK:<br />
<br />
In the Conditional GAN (CGAN), the generator learns to generate a fake sample with a specific condition or characteristics (such as a label associated with an image or more detailed tag) rather than a generic sample from unknown noise distribution. Now, to add such a condition to both generator and discriminator, we will simply feed some vector y, into both networks. Hence, both the discriminator D(X,y) and generator G(z,y) are jointly distributed with y. <br />
<br />
Now, the objective function of CGAN is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x,y)] + Ep_z[log(1 − D(G(y,z)))]</math></div><br />
<br />
The paper also suggests that, many studies have reported that on when dealing with high-dimensional input spaces, CGAN tends to collapse the modes of the data distribution,mostly ignoring the latent factor z and generating x only based on the condition y, exhibiting an almost deterministic behaviour.<br />
<br />
===Generative Multi-View Model===<br />
<br />
''' Objective and Notations: ''' The distribution of the data x ∈ X is assumed to be driven by two latent factors: a content factor denoted c which corresponds to the invariant proprieties of the object,and a view factor denoted v which corresponds to the factor of variations. Typically, if X is the space of people’s faces, c stands for the intrinsic features of a person’s face while v stands for the transient features and the viewpoint of a particular photo of the face, including the photo exposure<br />
and additional elements like a hat, glasses, etc.... These two factors c and v are assumed to be independent and these are the factors needed to learn.<br />
<br />
The paper defines two tasks here to be done: <br />
(i) '''Multi View Generation''': we want to be able to sample over X by controlling the two factors c and v. Given two priors, p(c) and p(v), this sampling will be possible if we are able to estimate p(x|c, v) from a training set.<br />
(ii) '''Conditional Multi-View Generation''': the second objective is to be able to sample different views of a given object. Given a prior p(v), this sampling will be achieved by learning the probability p(c|x), in addition to p(x|c, v). Ability to learn generative models able to generate from a disentangled latent space would allow controlling the sampling on the two different axis,<br />
the content and the view. The authors claim the originality of work is to learn such generative models without using any view labelling information.<br />
<br />
''' Generative Multi-view Model: '''<br />
<br />
Consider two prior distributions over the content and view factors denoted as <math>p_c</math> and <math>p_v</math>, corresponding to the prior distribution over content and latent factors. Moreover, we consider a generator G that implements a distribution over samples x, denoted as <math>p_G</math> by computing G(c, v) with <math>c ∼ p_c</math> and <math>v ∼ p_v</math>. Objective is to learn this generator so that its first input c corresponds to the content of the generated sample while its second input v, captures the underlying view of the sample. Doing so would allow one to control the output sample of the generator by tuning its content or its view (i.e. c and v).<br />
<br />
The key idea that authors propose is to focus on the distribution of pairs of inputs rather than on the distribution over individual samples. When no view supervision is available the only valuable pairs of samples that one may build from the dataset consist of two samples of a given object under two different views. When we choose any two samples randomly from the dataset from a same object, it is most likely that we get two different views. The paper explains that there are three goals here, (i) As in regular GAN, each sample generated by G needs to look realistic. (ii) As, real pairs are composed of two views of the same object, the generator should generate pairs of the same object. Since the two sampled view factors v1 and v2 are different, the only way this can be achieved is by encoding the content vector c which is invariant. (iii) It is expected that the discriminator should easily discriminate between a pair of samples corresponding to the same object under different views from a pair of samples corresponding to a same object under the same view. Because the pair shares the same content factor c, this should force the generator to use the view factors v1 and v2 to produce diversity in the generated pair.<br />
<br />
Now, the objective function of GMV Model is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2}[log D(x_1,x_2)] + E_{v_1,v_2}[log(1 − D(G(c,v_1),G(c,v_2)))]</math></div><br />
<br />
Once the model is learned, generator G that generates single samples by first sampling c and v following <math>p_c</math> and <math>p_v</math>, then by computing G(c, v). By freezing c or v, one may then generate samples corresponding to multiple views of any particular content, or corresponding to many contents under a particular view. One can also make interpolations between two given views over a particular content, or between two contents using a particular view<br />
<br />
<div style="text-align: center;font-size:100%">[[File:GMV.png]]</div><br />
<br />
===Conditional Generative Model (C-GMV)===<br />
<br />
C-GMV is proposed by the authors to be able to change the view of a given object that would be provided as an input to the model.This model extends the generative model's the ability to extract the content factor from any given input and to use this extracted content in order to generate new views of the corresponding object. To achieve such a goal, we must add to our generative model an encoder function denoted <math>E : X → R^C</math> that will map any input in X to the content space <math>R^C</math><br />
<br />
Input sample x is encoded in the content space using an encoder function, noted E (implemented as a neural network).<br />
This encoder serves to generate a content vector c = E(x) that will be combined with a randomly sampled view <math>v ∼ p_v</math> to generate an artificial example. The artificial sample is then combined with the original input x to form a negative pair. The issue with this approach is that CGAN are known to easily miss modes of the underlying distribution. The generator enters in a state where it ignores the noisy component v. To overcome this phenomenon, we use the same idea as in GMV. We build negative pairs <math>(G(c, v_1), G(c, v_2))</math> by randomly sampling two views <math>v_1</math> and <math>v_2</math> that are combined to get a unique content c. c is computed from a sample x using the encoder E, i.e. c= E(x). By doing so, the ability of our approach to generating pairs with view diversity is preserved. Since this diversity can only be captured by taking into account the two different view vectors provided to the model (<math>v_1</math> and <math>v_2</math>), this will encourage G(c, v) to generate samples containing both the content information c, and the view v. Positive pairs are sampled from the training set and correspond to two views of a given object.<br />
<br />
The Objective function for C-GMV will be:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2 ~ p_x|l(x_1)=l(x_2)}[log D(x_1,x_2)] + E_{v_1,v_2 ~ p_v,x~p_x}[log(1 − D(G(E(x),v_1),G(E(x),v_2)))]+E_{v∼p_v,x∼p_x}[log(1 − D(G(E(x), v), x))] </math></div><br />
<br />
<div style="text-align: center;font-size:100%">[[File:CGMV.png]]</div><br />
<br />
==Experiments and Results==<br />
<br />
The authors have given a exhaustive set of results and experiments.<br />
<br />
Datasets: The two models were evaluated by performing experiments over four image datasets of various domains. Note that when supervision is available on the views (like CelebA for example where images are labeled with attributes) it is not used for learning models. The only supervision that is used if two samples correspond or not to the same object.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:table_data.png]]</div><br />
<br />
<br />
Model Architecture: Same architectures for every dataset. The images were rescaled to 3×64×64 tensors. The generator G and the discriminator D follow that of the DCGAN implementation proposed in Radford et al. (2015).<br />
<br />
Baselines: Most existing methods are learned on datasets with view labeling. To fairly compare with alternative models authors have built baselines working in the same conditions as our models. In addition models are compared with the model from Mathieu et al. (2016). Results gained with two implementations are reported, the first one based on the implementation provided by the authors2 (denoted Mathieu et al. (2016)), and the second one (denoted Mathieu et al. (2016) (DCGAN) ) that implements the same model using architectures inspired from DCGAN Radford et al. (2015), which is more stable and that was tuned to allow a fair comparison with our approach. For pure multi-view generative setting, generative model(GMV) is compared with standard GANs that are learned to approximate the joint generation of multiple samples: DCGANx2 is learned to output pairs of views over the same object, DCGANx4 is trained on quadruplets, and DCGANx8 on eight different views. <br />
<br />
===Generating Multiple Contents and Views===<br />
<br />
Figure 1 shows examples of generated images by our model and Figure 4 shows images sampled by DCGAN based models (DCGANx2, DCGANx4, and DCGANx8) on 3DChairs and CelebA datasets.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig1_gmv.png]]</div><br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig4_gmv.png]]</div><br />
<br />
<br />
Figure 5 shows additional results, using the same presentation, for the GMV model only on two other datasets<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig5_gmv.png]]</div><br />
<br />
Figure 6 shows generated samples obtained by interpolation between two different view factors (left) or two content factors (right). It allows us to have a better idea of the underlying view/content structure captured by GMV. We can see that our approach is able to smoothly move from one content/view to another content/view while keeping the other factor constant. This also illustrates that content and view factors are well independently handled by the generator i.e. changing the view<br />
does not modify the content and vice versa.<br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig6_gmv.png]]</div><br />
<br />
===Generating Multiple Views of a Given Object===<br />
<br />
The second set of experiments evaluates the ability of C-GMV to capture a particular content from an input sample, and to use this content to generate multiple views of the same object. Figure 7 and 8 illustrate the diversity of views in samples generated by our model and compare our results with those obtained with the CGAN model and to models from Mathieu et al. (2016). For each row the input sample is shown in the left column. New views are generated from that input and shown on the right.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig7_gmv.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig8_gmv.png]]</div><br />
<br />
=== Evaluation of the Quality of Generated Samples ===<br />
<br />
The authors did sets of experiments aimed at evaluating the quality of the generated samples. They have been made on the CelebA dataset and evaluate (i) the ability of the models to preserve the identity of a person in multiple generated views, (ii) to generate realistic samples, (iii) to preserve the diversity in the generated views and (iv) to capture the view distributions of the original dataset.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:tab3.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:tab4.png]]</div><br />
<br />
==Conclusion==<br />
<br />
The paper proposed a generative model, which can be learnt from multi-view data without any supervision. Moreover, it introduced a conditional version that allows generating new views of an input image. Using experiments, they proved that the model can capture content and view factors.<br />
<br />
==Future Work==<br />
The authors of the papers mentioned that they plan to explore using their model for data augmentation, as it can produce other data views for training, in both semi-supervised and one-shot/few-shot learning settings. <br />
<br />
==Critique==<br />
<br />
The main idea is to train the model with pairs of images with different views. It is not that clear as to what defines a view in particular. The algorithms are largely based on earlier concepts of GAN and CGAN The authors give reference to the previous papers tackling the same problem and clearly define that the novelty in this approach is not making use of view labels. The authors give a very thorough list of experiments which clearly establish the superiority of the proposed models to baselines.<br />
<br />
However, this paper only tested the model on rather constrained examples. As was observed in the results the proposed approach seems to have a high sample complexity relying on training samples covering the full range of variations for both specified and unspecified variations. Also, the proposed model does not attempt to disentangle variations within the specified and unspecified components.<br />
<br />
==References==<br />
<br />
[1] Mickael Chen, Ludovic Denoyer, Thierry Artieres. MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION. Published as a conference paper at ICLR 2018<br />
<br />
[2] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.<br />
<br />
[3] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.<br />
<br />
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[5] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Attend_and_Predict:_Understanding_Gene_Regulation_by_Selective_Attention_on_Chromatin&diff=38649Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin2018-11-12T05:14:48Z<p>D35kumar: /* Main Contributions */ Adding more details to the contribution of the paper [Technical]</p>
<hr />
<div>= Background =<br />
<br />
Gene regulation is the process of controlling which genes in a cell's DNA are turned 'on' (expressed) or 'off' (not expressed). By this process functional product such as a protein is created. Even though all the cells of a multicellular organism (e.g., humans) contain the same DNA, different types of cells in that organism may express very different sets of genes. As a result each cell types have distinct functionality. In other words how a cell operates depends upon the genes expressed in that cell. Many factors including ‘Chromatin modification marks’ influence which genes are abundant in that cell.<br />
<br />
The function of chromatin is to efficiently wraps DNA around histones into a condensed volume to fit into the nucleus of a cell and protect the DNA structure and sequence during cell division and replication. Different chemical modifications in the histones of the chromatin, known as histone marks, changes spatial arrangement of the condensed DNA structure. Which in turn affects the gene’s expression of the histone mark’s neighboring region. Histone marks can promote (obstruct) the gene to be turned on by making the gene region accessible (restricted). This section of the DNA, where histone marks can potentially have an impact, is known as DNA flanking region or ‘gene region’ which is considered to cover 10k base pair centered at the transcription start site (TSS) (i.e., 5k base pair in each direction). Unlike genetic mutations, histone modifications are reversible [1]. Therefore, understanding influence of histone marks in determining gene regulation can assist in developing drugs for genetic disease. <br />
<br />
<br />
= Introduction = <br />
<br />
Revolution in genomic technologies now enables us to profile genome-wide chromatin mark signals. Therefore, biologists can now measure gene expressions and chromatin signals of the ‘gene region’ for different cell types covering whole human genome. The Roadmap Epigenome Project (REMC, publicly available) [2] recently released 2,804 genome-wide datasets of 100 separate “normal” (not diseased) human cells/tissues, among which 166 datasets are gene expression reads and the rest are signal reads of various histone marks. The goal is to understand which histone marks are the most important and how they interact together in gene regulation for each cell type.<br />
<br />
Signal reads for histone marks are high-dimensional and spatially structured. Influence of a histone modification mark can be anywhere in the gene region (covering 10k bp centering the TSS). It is important to understand how the impact of the mark on gene expression varies over the gene region. In other words, how histone signals over the gene region impacts the gene expression. There are different types of histone marks in human chromatin that can have an influence on gene regulation. Researchers have found five standard histone proteins. These five histone proteins can be altered in different combinations with different chemical modifications resulting in a large number of distinct histone modification marks. Different histone modification marks can act as a module to interact with each other and influence the gene expression.<br />
<br />
<br />
This paper proposes an attention-based deep learning model to find how this chromatin factors/ histone modification marks contributes to the gene expression of a particular cell. AttentiveChrome[3] utilizes a hierarchy of multiple LSTM to discover interactions between signals of each histone marks, and learn dependencies among the marks on expressing a gene. The authors included two levels of soft attention mechanism, (1) to attend to the most relevant signals of a histone mark, and (2) to attend to the important marks and their interactions. <br />
<br />
== Main Contributions ==<br />
The contributions of this work can be summarized as follows:<br />
<br />
* More accurate predictions than the state-of-the-art baselines. This is measured using datasets from REMC on 56 different cell types.<br />
* Better interpretation than the state-of-the-art methods for visualizing deep learning model. They compute the correlation of the attention scores of the model with the mark signal from REMC. <br />
* Like the application of attention models previously in indirectly hinting the parts of the input that the model deemed important, AttentiveChrome can too explain it's decisions by hinting at “what” and “where” it has focused.<br />
* This is the first time that the attention based deep learning approach is applied to a problem in molecular biology.<br />
<br />
= Previous Works = <br />
<br />
Previous machine learning approaches to address this issue either (1) failed to model the spatial dependencies among the marks, or (2) required additional feature analysis. <br />
<br />
= AttentiveChrome: Model Formulation =<br />
<br />
The authors proposed an end-to-end architecture which has the ability to simultaneously attend and predict. This method incorporates recurrent neural networks (RNN) composed of LSTM units to model the sequential spatial dependencies of the gene regions and predict gene expression level from The embedding vector, <math> h_t </math>, output of an LSTM module encodes the learned representation of the feature dependencies from the time step 0 to <math> t </math>. For this task, each bin position of the gene region is considered as a time step.<br />
<br />
The proposed AttentiveChrome framework contains following 5 important modules:<br />
<br />
* Bin-level LSTM encoder encoding the bin positions of the gene region (one for each HM mark)<br />
* Bin-level <math> \alpha </math>-Attention across all bin positions (one for each HM mark)<br />
* HM-level LSTM encoder (one encoder encoding all HM marks)<br />
* HM-level <math> \beta </math>-Attention among all HM marks (one)<br />
* The final classification module<br />
<br />
Figure 1 (Supplementary Figure 2) presents the overview of the proposed AttentiveChrome framework.<br />
<br />
<br />
<br />
<br />
== Input and Output ==<br />
<br />
Each dataset contains the gene expression labels and the histone signal reads for one specific cell type. The authors evaluated AttentiveChrome on 56 different cell types. For each mark, we have a feature/input vector containing the signals reads surrounding the gene’s TSS position (gene region) for the histone mark. The label of this input vector denotes the gene expression of the specific gene. This study considers binary labeling where <math> +1 </math> denotes gene is expressed (on) and <math> -1 </math> denotes that the gene is not expressed (off). Each histone marks will have one feature vector for each gene. The authors integrates the feature inputs and outputs of their previous work DeepChrome [4] into this research. The input feature is represented by a matrix <math> \textbf{X} </math> of size <math> M \times T </math>, where <math> M </math> is the number of HM marks considered in the input, and <math> T </math> is the number of bin positions taken into account to represent the gene region. The <math> j^{th} </math> row of the vector <math> \textbf{X} </math>, <math> x_j</math>, represents sequentially structured signals from the <math> j^{th} </math> HM mark, where <math> j\in \{1, \cdots, M\} </math>. Therefore, <math> x_j^t</math>, in the matrix <math> \textbf{X} </math> represents the value from the <math> t^{th}</math> bin belonging to the <math> j^{th} </math> HM mark, where <math> t\in \{1, \cdots, T\} </math>. If the training set contains <math>N_{tr} </math> labeled pairs, the <math> n^{th} </math> is specified as <math>( X^n, y^n)</math>, where <math> X^n </math> is a matrix of size <math> M \times T </math> and <math> y^n \in \{ -1, +1 \} </math> is the binary label, and <math> n \in \{ 1, \cdots, N_{tr} \} </math>.<br />
<br />
Figure 2 exhibits the input feature, and the output of AttentiveChrome for a particular gene (one sample).<br />
<br />
<br />
<br />
<br />
<br />
== Bin-Level Encoder (one LSTM for each HM) ==<br />
The sequentially ordered elements (each element actually is a bin position) of the gene region of <math> n^{th} </math> gene is represented by the <math> j_{th} </math> row vector <math> x^j </math>. The authors considered each bin position as a time step for LSTM. This study incorporates bidirectional LSTM to model the overall dependencies among a total of <math> T </math> bin positions in the gene region. The bidirectional LSTM contains two LSTMs<br />
* A forward LSTM, <math> \overrightarrow{LSTM_j} </math>, to model <math> x^j </math> from <math> x_1^j </math> to <math> x_T^j </math>, which outputs the embedding vector <math> \overrightarrow{h^t_j} </math>, of size <math> d </math> for each bin <math> t </math><br />
* A reverse LSTM, <math> \overleftarrow{LSTM_j} </math>, to model <math> x^j </math> from <math> x_T^j </math> to <math> x_1^j </math>, which outputs the embedding vector <math> \overleftarrow{h^j_t} </math>, of size <math> d </math> for each bin <math> t </math><br />
<br />
The final output of this layer, embedding vector at <math> t^{th} </math> bin for the <math> j^{th} </math> HM, <math> h^j_t </math>, of size <math> d </math>, is obtained by concatenating the two vectors from the both directions. Therefore, <math> h^j_t = [ \overrightarrow{h^j_t}, \overleftarrow{h^j_t}]</math>.<br />
<br />
<br />
<br />
== Bin-Level <math> \alpha</math>-attention ==<br />
<br />
Each bin contributes differently in the encoding of the entire <math> j^{th} </math> mark. To highlight the most important bins for prediction a soft attention weight vector <math> \alpha^j </math> of size <math> T </math> is learned for each <math> j </math>. To calculated the soft weight <math> \alpha^j_t </math>, for each <math> t </math>, the embedding vectors <math> \{h^j_1, \cdots, h^j_t \} </math> of all the bins are utilized. The following equation is used:<br />
<br />
<math> \alpha^j_t = \frac{exp(\textbf{W}_b h^j_t)}{\sum_{i=1}^T{exp(\textbf{W}_b h^j_i)}} </math><br />
<br />
<br />
The parameter <math> W_b </math> is learned alongside during the process. Therefore, the <math> j^{th} </math> HM mark can be represented by <math> m^j = \sum_{t=1}^T{\alpha^j_t \times h^j_t}</math>. Here, <math> h^j_t</math> is the embedding vector and <math> \alpha^t_j </math> is the importance weight of the <math> t^{th} </math> bin in the representation of the <math> j^{th} </math> HM mark. Intuitively <math> \textbf{W}_b </math> will learn the cell type.<br />
<br />
<br />
== HM-level Encoder (one LSTM) ==<br />
<br />
Studies observed that HMs work cooperatively to provoke or subdue gene expression [5]. The HM-level encoder utilizes one bidirectional LSTM to capture this relationship between the HMs. To formulate the sequential dependency a random sequence is imagined as the authors did not find influence of any specific ordering of the HMs. The representation <math> m_j </math>of the <math> j^{th} </math> HM, <math> HM_j </math>, which is calculated from the bin-level attention layer, is the input of this step. This set based encoder outputs an embedding vector <math> s^j </math> of size <math> d’ </math>, which is the encoding for the <math> j^{th} </math> HM.<br />
<br />
<math> s^j = [ \overrightarrow{LSTM_s}(m_j), \overleftarrow{LSTM_s}(m_j) ] </math><br />
<br />
The dependencies between <math> j^{th} </math> HM and the other HM marks are encoded in <math> s^j </math>, whereas <math> m^j </math> from the previous step encodes the bin dependencies of the <math> j^{th} </math> HM.<br />
<br />
HM-Level <math> \beta</math>-attention<br />
This second soft attention level finds the important HM marks for classifying a gene’s expression by learning the importance weights, <math> \beta_j </math>, for each <math> HM_j </math>, where <math> j \in \{ 1, \cdots, M \} </math>. The equation is <br />
<br />
<math> \beta^j = \frac{exp(\textbf{W}_s s^j)}{\sum_{i=1}^M{exp(\textbf{W}_s s^j)}} </math><br />
<br />
The HM-level context parameter <math> \textbf{W}_s </math> is trained jointly in the process. Intuitively <math> \textbf{W}_s </math> learns how the HMs are significant for a cell type. Finally the entire gene region is encoded in a hidden representation <math> \textbf{v} </math>, using the weighted sum of the embedding of all HM marks. <br />
<br />
<br />
<math> \textbf{v} = \sum_{j=1}^MT{\beta^j \times s^j}</math><br />
<br />
<br />
== End-to-end training ==<br />
<br />
The embedding vector <math> \textbf{v} </math> is fed to a simple classification module, <math> f(\textbf{v}) = </math>softmax<math> (\textbf{W}_c\textbf{v}+b_c) </math>, where <math> \textbf{W}_c </math>, and <math> b_c </math> are learnable parameters. The output is the probability of gene expression being high (expressed) or low (suppressed).<br />
The whole model including the attention modules are differentiable. Thus backpropagation can perform end-to-end learning trivially. Negative log-likelihood loss function is minimized in the learning.<br />
<br />
= Reference =<br />
<br />
[1] Andrew J Bannister and Tony Kouzarides. Regulation of chromatin by histone modifications. Cell research, 21(3):381–395, 2011.<br />
<br />
[2] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.<br />
<br />
[3] Singh, Ritambhara, et al. "Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin." Advances in Neural Information Processing Systems. 2017.<br />
<br />
[4] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. Deepchrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.<br />
<br />
[5] Joanna Boros, Nausica Arnoult, Vincent Stroobant, Jean-François Collet, and Anabelle Decottignies. Polycomb repressive complex 2 and h3k27me3 cooperate with h3k9 methylation to maintain heterochromatin protein 1α at chromatin. Molecular and cellular biology, 34(19):3662–3674, 2014.<br />
<br />
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning<br />
to align and translate. arXiv preprint arXiv:1409.0473, 2014.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Attend_and_Predict:_Understanding_Gene_Regulation_by_Selective_Attention_on_Chromatin&diff=38648Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin2018-11-12T05:02:36Z<p>D35kumar: /* Introduction */ (Editorial)</p>
<hr />
<div>= Background =<br />
<br />
Gene regulation is the process of controlling which genes in a cell's DNA are turned 'on' (expressed) or 'off' (not expressed). By this process functional product such as a protein is created. Even though all the cells of a multicellular organism (e.g., humans) contain the same DNA, different types of cells in that organism may express very different sets of genes. As a result each cell types have distinct functionality. In other words how a cell operates depends upon the genes expressed in that cell. Many factors including ‘Chromatin modification marks’ influence which genes are abundant in that cell.<br />
<br />
The function of chromatin is to efficiently wraps DNA around histones into a condensed volume to fit into the nucleus of a cell and protect the DNA structure and sequence during cell division and replication. Different chemical modifications in the histones of the chromatin, known as histone marks, changes spatial arrangement of the condensed DNA structure. Which in turn affects the gene’s expression of the histone mark’s neighboring region. Histone marks can promote (obstruct) the gene to be turned on by making the gene region accessible (restricted). This section of the DNA, where histone marks can potentially have an impact, is known as DNA flanking region or ‘gene region’ which is considered to cover 10k base pair centered at the transcription start site (TSS) (i.e., 5k base pair in each direction). Unlike genetic mutations, histone modifications are reversible [1]. Therefore, understanding influence of histone marks in determining gene regulation can assist in developing drugs for genetic disease. <br />
<br />
<br />
= Introduction = <br />
<br />
Revolution in genomic technologies now enables us to profile genome-wide chromatin mark signals. Therefore, biologists can now measure gene expressions and chromatin signals of the ‘gene region’ for different cell types covering whole human genome. The Roadmap Epigenome Project (REMC, publicly available) [2] recently released 2,804 genome-wide datasets of 100 separate “normal” (not diseased) human cells/tissues, among which 166 datasets are gene expression reads and the rest are signal reads of various histone marks. The goal is to understand which histone marks are the most important and how they interact together in gene regulation for each cell type.<br />
<br />
Signal reads for histone marks are high-dimensional and spatially structured. Influence of a histone modification mark can be anywhere in the gene region (covering 10k bp centering the TSS). It is important to understand how the impact of the mark on gene expression varies over the gene region. In other words, how histone signals over the gene region impacts the gene expression. There are different types of histone marks in human chromatin that can have an influence on gene regulation. Researchers have found five standard histone proteins. These five histone proteins can be altered in different combinations with different chemical modifications resulting in a large number of distinct histone modification marks. Different histone modification marks can act as a module to interact with each other and influence the gene expression.<br />
<br />
<br />
This paper proposes an attention-based deep learning model to find how this chromatin factors/ histone modification marks contributes to the gene expression of a particular cell. AttentiveChrome[3] utilizes a hierarchy of multiple LSTM to discover interactions between signals of each histone marks, and learn dependencies among the marks on expressing a gene. The authors included two levels of soft attention mechanism, (1) to attend to the most relevant signals of a histone mark, and (2) to attend to the important marks and their interactions. <br />
<br />
== Main Contributions ==<br />
The contributions of this work can be summarized as follows:<br />
<br />
* More accurate predictions than the state-of-the-art baselines.<br />
* Better interpretation than the state-of-the0art methods for visualizing deep learning model.<br />
* Can explain the model's decision by provisioning “what” and “where” it has focused.<br />
* First attention based deep learning method for a problem of molecular biology.<br />
<br />
= Previous Works = <br />
<br />
Previous machine learning approaches to address this issue either (1) failed to model the spatial dependencies among the marks, or (2) required additional feature analysis. <br />
<br />
= AttentiveChrome: Model Formulation =<br />
<br />
The authors proposed an end-to-end architecture which has the ability to simultaneously attend and predict. This method incorporates recurrent neural networks (RNN) composed of LSTM units to model the sequential spatial dependencies of the gene regions and predict gene expression level from The embedding vector, <math> h_t </math>, output of an LSTM module encodes the learned representation of the feature dependencies from the time step 0 to <math> t </math>. For this task, each bin position of the gene region is considered as a time step.<br />
<br />
The proposed AttentiveChrome framework contains following 5 important modules:<br />
<br />
* Bin-level LSTM encoder encoding the bin positions of the gene region (one for each HM mark)<br />
* Bin-level <math> \alpha </math>-Attention across all bin positions (one for each HM mark)<br />
* HM-level LSTM encoder (one encoder encoding all HM marks)<br />
* HM-level <math> \beta </math>-Attention among all HM marks (one)<br />
* The final classification module<br />
<br />
Figure 1 (Supplementary Figure 2) presents the overview of the proposed AttentiveChrome framework.<br />
<br />
<br />
<br />
<br />
== Input and Output ==<br />
<br />
Each dataset contains the gene expression labels and the histone signal reads for one specific cell type. The authors evaluated AttentiveChrome on 56 different cell types. For each mark, we have a feature/input vector containing the signals reads surrounding the gene’s TSS position (gene region) for the histone mark. The label of this input vector denotes the gene expression of the specific gene. This study considers binary labeling where <math> +1 </math> denotes gene is expressed (on) and <math> -1 </math> denotes that the gene is not expressed (off). Each histone marks will have one feature vector for each gene. The authors integrates the feature inputs and outputs of their previous work DeepChrome [4] into this research. The input feature is represented by a matrix <math> \textbf{X} </math> of size <math> M \times T </math>, where <math> M </math> is the number of HM marks considered in the input, and <math> T </math> is the number of bin positions taken into account to represent the gene region. The <math> j^{th} </math> row of the vector <math> \textbf{X} </math>, <math> x_j</math>, represents sequentially structured signals from the <math> j^{th} </math> HM mark, where <math> j\in \{1, \cdots, M\} </math>. Therefore, <math> x_j^t</math>, in the matrix <math> \textbf{X} </math> represents the value from the <math> t^{th}</math> bin belonging to the <math> j^{th} </math> HM mark, where <math> t\in \{1, \cdots, T\} </math>. If the training set contains <math>N_{tr} </math> labeled pairs, the <math> n^{th} </math> is specified as <math>( X^n, y^n)</math>, where <math> X^n </math> is a matrix of size <math> M \times T </math> and <math> y^n \in \{ -1, +1 \} </math> is the binary label, and <math> n \in \{ 1, \cdots, N_{tr} \} </math>.<br />
<br />
Figure 2 exhibits the input feature, and the output of AttentiveChrome for a particular gene (one sample).<br />
<br />
<br />
<br />
<br />
<br />
== Bin-Level Encoder (one LSTM for each HM) ==<br />
The sequentially ordered elements (each element actually is a bin position) of the gene region of <math> n^{th} </math> gene is represented by the <math> j_{th} </math> row vector <math> x^j </math>. The authors considered each bin position as a time step for LSTM. This study incorporates bidirectional LSTM to model the overall dependencies among a total of <math> T </math> bin positions in the gene region. The bidirectional LSTM contains two LSTMs<br />
* A forward LSTM, <math> \overrightarrow{LSTM_j} </math>, to model <math> x^j </math> from <math> x_1^j </math> to <math> x_T^j </math>, which outputs the embedding vector <math> \overrightarrow{h^t_j} </math>, of size <math> d </math> for each bin <math> t </math><br />
* A reverse LSTM, <math> \overleftarrow{LSTM_j} </math>, to model <math> x^j </math> from <math> x_T^j </math> to <math> x_1^j </math>, which outputs the embedding vector <math> \overleftarrow{h^j_t} </math>, of size <math> d </math> for each bin <math> t </math><br />
<br />
The final output of this layer, embedding vector at <math> t^{th} </math> bin for the <math> j^{th} </math> HM, <math> h^j_t </math>, of size <math> d </math>, is obtained by concatenating the two vectors from the both directions. Therefore, <math> h^j_t = [ \overrightarrow{h^j_t}, \overleftarrow{h^j_t}]</math>.<br />
<br />
<br />
<br />
== Bin-Level <math> \alpha</math>-attention ==<br />
<br />
Each bin contributes differently in the encoding of the entire <math> j^{th} </math> mark. To highlight the most important bins for prediction a soft attention weight vector <math> \alpha^j </math> of size <math> T </math> is learned for each <math> j </math>. To calculated the soft weight <math> \alpha^j_t </math>, for each <math> t </math>, the embedding vectors <math> \{h^j_1, \cdots, h^j_t \} </math> of all the bins are utilized. The following equation is used:<br />
<br />
<math> \alpha^j_t = \frac{exp(\textbf{W}_b h^j_t)}{\sum_{i=1}^T{exp(\textbf{W}_b h^j_i)}} </math><br />
<br />
<br />
The parameter <math> W_b </math> is learned alongside during the process. Therefore, the <math> j^{th} </math> HM mark can be represented by <math> m^j = \sum_{t=1}^T{\alpha^j_t \times h^j_t}</math>. Here, <math> h^j_t</math> is the embedding vector and <math> \alpha^t_j </math> is the importance weight of the <math> t^{th} </math> bin in the representation of the <math> j^{th} </math> HM mark. Intuitively <math> \textbf{W}_b </math> will learn the cell type.<br />
<br />
<br />
== HM-level Encoder (one LSTM) ==<br />
<br />
Studies observed that HMs work cooperatively to provoke or subdue gene expression [5]. The HM-level encoder utilizes one bidirectional LSTM to capture this relationship between the HMs. To formulate the sequential dependency a random sequence is imagined as the authors did not find influence of any specific ordering of the HMs. The representation <math> m_j </math>of the <math> j^{th} </math> HM, <math> HM_j </math>, which is calculated from the bin-level attention layer, is the input of this step. This set based encoder outputs an embedding vector <math> s^j </math> of size <math> d’ </math>, which is the encoding for the <math> j^{th} </math> HM.<br />
<br />
<math> s^j = [ \overrightarrow{LSTM_s}(m_j), \overleftarrow{LSTM_s}(m_j) ] </math><br />
<br />
The dependencies between <math> j^{th} </math> HM and the other HM marks are encoded in <math> s^j </math>, whereas <math> m^j </math> from the previous step encodes the bin dependencies of the <math> j^{th} </math> HM.<br />
<br />
HM-Level <math> \beta</math>-attention<br />
This second soft attention level finds the important HM marks for classifying a gene’s expression by learning the importance weights, <math> \beta_j </math>, for each <math> HM_j </math>, where <math> j \in \{ 1, \cdots, M \} </math>. The equation is <br />
<br />
<math> \beta^j = \frac{exp(\textbf{W}_s s^j)}{\sum_{i=1}^M{exp(\textbf{W}_s s^j)}} </math><br />
<br />
The HM-level context parameter <math> \textbf{W}_s </math> is trained jointly in the process. Intuitively <math> \textbf{W}_s </math> learns how the HMs are significant for a cell type. Finally the entire gene region is encoded in a hidden representation <math> \textbf{v} </math>, using the weighted sum of the embedding of all HM marks. <br />
<br />
<br />
<math> \textbf{v} = \sum_{j=1}^MT{\beta^j \times s^j}</math><br />
<br />
<br />
== End-to-end training ==<br />
<br />
The embedding vector <math> \textbf{v} </math> is fed to a simple classification module, <math> f(\textbf{v}) = </math>softmax<math> (\textbf{W}_c\textbf{v}+b_c) </math>, where <math> \textbf{W}_c </math>, and <math> b_c </math> are learnable parameters. The output is the probability of gene expression being high (expressed) or low (suppressed).<br />
The whole model including the attention modules are differentiable. Thus backpropagation can perform end-to-end learning trivially. Negative log-likelihood loss function is minimized in the learning.<br />
<br />
= Reference =<br />
<br />
[1] Andrew J Bannister and Tony Kouzarides. Regulation of chromatin by histone modifications. Cell research, 21(3):381–395, 2011.<br />
<br />
[2] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.<br />
<br />
[3] Singh, Ritambhara, et al. "Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin." Advances in Neural Information Processing Systems. 2017.<br />
<br />
[4] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. Deepchrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.<br />
<br />
[5] Joanna Boros, Nausica Arnoult, Vincent Stroobant, Jean-François Collet, and Anabelle Decottignies. Polycomb repressive complex 2 and h3k27me3 cooperate with h3k9 methylation to maintain heterochromatin protein 1α at chromatin. Molecular and cellular biology, 34(19):3662–3674, 2014.<br />
<br />
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning<br />
to align and translate. arXiv preprint arXiv:1409.0473, 2014.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Predicting_Floor_Level_For_911_Calls_with_Neural_Network_and_Smartphone_Sensor_Data&diff=38647Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data2018-11-12T04:26:29Z<p>D35kumar: /* Method */ Correcting the detail about data</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
In highly populated cities, where there are many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.<br />
<br />
The motivation for this problem in the context of 911 calls: victims trapped in a tall building who seek immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult. <br />
<br />
In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.<br />
<br />
In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]<br />
[[File:19floor.png|250px]]</div><br />
<br />
<br />
In this work, there are two major contributions. The first is that they trained an LSTM to classify whether a smartphone was either inside or outside a building using GPS, RSSI, and magnetometer sensor readings. The model is compared with baseline models like feedforward neural networks, logistic regression, SVM, HMM, and Random Forests. The second contribution is the algorithm which uses the output of the trained LSTM to aid in predicting the change in the barometric pressure of the smartphone from once it entered the building to its current location within the building. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.<br />
<br />
The model does not rely on external sensors placed inside the building, prior knowledge of the building, or user movement behavior. The only input it looks at is the GPS and the barometric signal from the phone. Finally, they also talk about the application of this algorithm in a variety of other real-world situations.<br />
<br />
=Related Work=<br />
<br />
<br />
In general, previous work falls under two categories. The first category of methods is classification methods based on the user's activity. <br />
Therefore, some current methods leverage the user's activity to predict which is based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.<br />
The second set of methods focus more on the use of a barometer which measures the atmospheric pressure. As a result, utilizing a barometer can provide the changes in altitude.<br />
<br />
Avinash Parnandi and his coauthors used multiple classifiers in the predicting the floor level [2]. The steps in their algorithmic process are: <br />
<ol><br />
<li> Classifier to predict whether the user is indoors or outdoors</li><br />
<li> Classifier to identify if the activity of the user, i.e. walking, standing still etc. </li><br />
<li> Classifier to measure the displacement</li><br />
</ol><br />
<br />
One of the downsides of this work is that in order to achieve high accuracy the user's step size is needed, therefore heavily relying on pre-training to the specific user. In a real world application of this method this would not be practical.<br />
<br />
<br />
Song and his colleagues model the way or cause of ascent. That is, was the ascent a result of taking the elevator, stairs or escalator [3]. Then by using infrastructure support of the buildings and as well as additional tuning they are able to predict floor level. <br />
This method also suffers from relying on data specific to the building. <br />
<br />
Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.<br />
<br />
=Method=<br />
<br />
<br />
In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].<br />
<br />
To collect data they designed and developed an iOS application (called Sensory) that ran on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second and each datum contained the different sensor measurements mentions earlier along with environment contexts like building floors, environment activity, city name, country name, and magnetic strength.<br />
<br />
The data collection procedure for indoor-outdoor classifier was described as follows:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and<br />
out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost<br />
door) set indoors to 1. 6) As soon as we exit, set indoors to 0. 7) Stop recording. 8) Save data as CSV<br />
for analysis. This procedure can start either outside or inside a building without loss of generality.<br />
<br />
The following procedure generates data used to predict a floor change from the entrance floor to the end floor:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and<br />
out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost<br />
door) set indoors to 1. 6) Finally, enter a building and ascend/descend to any story. 7) Ascend<br />
through any method desired, stairs, elevator, escalator, etc. 8) Once at the floor, stop recording. 9)<br />
Save data as CSV for analysis.<br />
<br />
Their algorithm used to predict floor level is a 3 part process:<br />
<br />
<ol><br />
<li> Classifying whether smartphone is indoor or outdoor </li><br />
<li> Indoor/Outdoor Transition detector</li><br />
<li> Estimating vertical height and resolving to absolute floor level </li><br />
</ol><br />
<br />
==1) Classifying Indoor/Outdoor ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div><br />
<br />
From [5] they are using 6 features which were found through forests of trees feature reduction. The features are smartphone's barometric pressure, GPS vertical accuracy, GPS horizontal accuracy, GPS speed, device RSSI level, and magnetometer total reading.<br />
<br />
The magnetometer total reading was calculated from given the 3-dimensional reading <math>x, y, z </math><br />
<br />
<br />
<div style="text-align: center;">Total Magnetic field strength <math>= \sqrt{x^{2} + y^{2} + z^{2}}</math></div><br />
<br />
They used a 3 layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if smartphone is indoor and <math> y = 0 </math> if smartphone is outdoor.<br />
<br />
In their design they set <math> d = 3</math> by random search [6]. The point to make is that they wanted the network to learn the relationship given a little bit of information from both the past and future.<br />
<br />
For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|800px]] </div><br />
<br />
This is a critical part of their system and they only focus on the predictions in the subspace of being indoors. <br />
<br />
They have trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math>. <br />
<br />
The cost function is shown below:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div><br />
<br />
The final output of the LSTM is a time-series <math> T = {t_1, t_2, ..., t_i, t_n} </math> where each <math> t_i = 0, t_i = 1 </math> if the point is outside or inside respectively.<br />
<br />
==2) Transition Detector ==<br />
<br />
Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div><br />
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions T and they pick a subset <math>s_i </math> such that the Jacard distance (defined below) is <math> >= 0.4 </math><br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div><br />
Jacard Distance:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div><br />
<br />
After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.<br />
<br />
==3) Vertical height and floor level ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div><br />
<br />
In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.<br />
<br />
The second parameter is <math> p_1 </math> which is the current pressure reading. In order to generate the relative change in height <math> m_\Delta</math><br />
<br />
After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].<br />
<br />
In order to resolve to an absolute floor level they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.<br />
<br />
=Experiments and Results=<br />
<br />
==Dataset==<br />
<br />
In this paper, an iOS app called Sensory is developed which is used to collect data on an iPhone 6. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.<br />
<br />
As soon as the user enters or exits a building, the indoor-outdoor data has to be manually entered. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning is being used, the actual floor level was recorded manually for validation purposes only.<br />
<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div><br />
<br />
All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation. <br />
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006. <br />
Although the baselines performed considerably well the objective here was to show that an LSTM can be used in the future to model the entire system with an LSTM.<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|500px]] </div><br />
<br />
The above chart shows the success that their system is able to achieve in floor level prediction.<br />
<br />
=Future Work=<br />
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate floor level predictions solely from sensor reading data.<br />
<br />
=Critique=<br />
<br />
In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which has not been done. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit to their system is that they don't need any additional infrastructure support in advance making it a practical solution for deployment. <br />
<br />
A weakness is that they claim that they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%. <br />
<br />
It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.<br />
<br />
They also go against their claim stated at the beginning of the paper where they say they "..does not require the use of beacons, prior knowledge of the building infrastructure..." as in their clustering step they are in a way using prior knowledge from previous visits [4].<br />
<br />
The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm can be implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.<br />
<br />
Overall this paper is not too novel, as they don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out of the box solutions. There is no clear intuition why the proposed work well for the authors. This application could be solved using simpler methods like having an emergency push button on each floor. Authors don't provide sufficient motivation for why deep learning would be a good solution to this problem. <br />
<br />
=References=<br />
<br />
[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):<br />
1735–1780, 1997.<br />
<br />
[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.<br />
<br />
[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.<br />
<br />
[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018<br />
<br />
[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).<br />
<br />
[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.<br />
<br />
[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Predicting_Floor_Level_For_911_Calls_with_Neural_Network_and_Smartphone_Sensor_Data&diff=38646Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data2018-11-12T04:18:53Z<p>D35kumar: /* Introduction */ Adding more information about the model [Technical]</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
In highly populated cities, where there are many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.<br />
<br />
The motivation for this problem in the context of 911 calls: victims trapped in a tall building who seek immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult. <br />
<br />
In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.<br />
<br />
In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]<br />
[[File:19floor.png|250px]]</div><br />
<br />
<br />
In this work, there are two major contributions. The first is that they trained an LSTM to classify whether a smartphone was either inside or outside a building using GPS, RSSI, and magnetometer sensor readings. The model is compared with baseline models like feedforward neural networks, logistic regression, SVM, HMM, and Random Forests. The second contribution is the algorithm which uses the output of the trained LSTM to aid in predicting the change in the barometric pressure of the smartphone from once it entered the building to its current location within the building. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.<br />
<br />
The model does not rely on external sensors placed inside the building, prior knowledge of the building, or user movement behavior. The only input it looks at is the GPS and the barometric signal from the phone. Finally, they also talk about the application of this algorithm in a variety of other real-world situations.<br />
<br />
=Related Work=<br />
<br />
<br />
In general, previous work falls under two categories. The first category of methods is classification methods based on the user's activity. <br />
Therefore, some current methods leverage the user's activity to predict which is based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.<br />
The second set of methods focus more on the use of a barometer which measures the atmospheric pressure. As a result, utilizing a barometer can provide the changes in altitude.<br />
<br />
Avinash Parnandi and his coauthors used multiple classifiers in the predicting the floor level [2]. The steps in their algorithmic process are: <br />
<ol><br />
<li> Classifier to predict whether the user is indoors or outdoors</li><br />
<li> Classifier to identify if the activity of the user, i.e. walking, standing still etc. </li><br />
<li> Classifier to measure the displacement</li><br />
</ol><br />
<br />
One of the downsides of this work is that in order to achieve high accuracy the user's step size is needed, therefore heavily relying on pre-training to the specific user. In a real world application of this method this would not be practical.<br />
<br />
<br />
Song and his colleagues model the way or cause of ascent. That is, was the ascent a result of taking the elevator, stairs or escalator [3]. Then by using infrastructure support of the buildings and as well as additional tuning they are able to predict floor level. <br />
This method also suffers from relying on data specific to the building. <br />
<br />
Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.<br />
<br />
=Method=<br />
<br />
<br />
In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].<br />
<br />
To collect data they designed and developed an iOS application (called Sensory) that ran on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second and each datum contained the different sensor measurements mentions earlier along with environment context, environment mean bldg floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.<br />
<br />
The data collection procedure for indoor-outdoor classifier was described as follows:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and<br />
out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost<br />
door) set indoors to 1. 6) As soon as we exit, set indoors to 0. 7) Stop recording. 8) Save data as CSV<br />
for analysis. This procedure can start either outside or inside a building without loss of generality.<br />
<br />
The following procedure generates data used to predict a floor change from the entrance floor to the end floor:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and<br />
out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost<br />
door) set indoors to 1. 6) Finally, enter a building and ascend/descend to any story. 7) Ascend<br />
through any method desired, stairs, elevator, escalator, etc. 8) Once at the floor, stop recording. 9)<br />
Save data as CSV for analysis.<br />
<br />
Their algorithm used to predict floor level is a 3 part process:<br />
<br />
<ol><br />
<li> Classifying whether smartphone is indoor or outdoor </li><br />
<li> Indoor/Outdoor Transition detector</li><br />
<li> Estimating vertical height and resolving to absolute floor level </li><br />
</ol><br />
<br />
==1) Classifying Indoor/Outdoor ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div><br />
<br />
From [5] they are using 6 features which were found through forests of trees feature reduction. The features are smartphone's barometric pressure, GPS vertical accuracy, GPS horizontal accuracy, GPS speed, device RSSI level, and magnetometer total reading.<br />
<br />
The magnetometer total reading was calculated from given the 3-dimensional reading <math>x, y, z </math><br />
<br />
<br />
<div style="text-align: center;">Total Magnetic field strength <math>= \sqrt{x^{2} + y^{2} + z^{2}}</math></div><br />
<br />
They used a 3 layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if smartphone is indoor and <math> y = 0 </math> if smartphone is outdoor.<br />
<br />
In their design they set <math> d = 3</math> by random search [6]. The point to make is that they wanted the network to learn the relationship given a little bit of information from both the past and future.<br />
<br />
For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|800px]] </div><br />
<br />
This is a critical part of their system and they only focus on the predictions in the subspace of being indoors. <br />
<br />
They have trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math>. <br />
<br />
The cost function is shown below:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div><br />
<br />
The final output of the LSTM is a time-series <math> T = {t_1, t_2, ..., t_i, t_n} </math> where each <math> t_i = 0, t_i = 1 </math> if the point is outside or inside respectively.<br />
<br />
==2) Transition Detector ==<br />
<br />
Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div><br />
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions T and they pick a subset <math>s_i </math> such that the Jacard distance (defined below) is <math> >= 0.4 </math><br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div><br />
Jacard Distance:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div><br />
<br />
After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.<br />
<br />
==3) Vertical height and floor level ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div><br />
<br />
In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.<br />
<br />
The second parameter is <math> p_1 </math> which is the current pressure reading. In order to generate the relative change in height <math> m_\Delta</math><br />
<br />
After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].<br />
<br />
In order to resolve to an absolute floor level they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.<br />
<br />
=Experiments and Results=<br />
<br />
==Dataset==<br />
<br />
In this paper, an iOS app called Sensory is developed which is used to collect data on an iPhone 6. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.<br />
<br />
As soon as the user enters or exits a building, the indoor-outdoor data has to be manually entered. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning is being used, the actual floor level was recorded manually for validation purposes only.<br />
<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div><br />
<br />
All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation. <br />
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006. <br />
Although the baselines performed considerably well the objective here was to show that an LSTM can be used in the future to model the entire system with an LSTM.<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|500px]] </div><br />
<br />
The above chart shows the success that their system is able to achieve in floor level prediction.<br />
<br />
=Future Work=<br />
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate floor level predictions solely from sensor reading data.<br />
<br />
=Critique=<br />
<br />
In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which has not been done. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit to their system is that they don't need any additional infrastructure support in advance making it a practical solution for deployment. <br />
<br />
A weakness is that they claim that they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%. <br />
<br />
It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.<br />
<br />
They also go against their claim stated at the beginning of the paper where they say they "..does not require the use of beacons, prior knowledge of the building infrastructure..." as in their clustering step they are in a way using prior knowledge from previous visits [4].<br />
<br />
The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm can be implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.<br />
<br />
Overall this paper is not too novel, as they don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out of the box solutions. There is no clear intuition why the proposed work well for the authors. This application could be solved using simpler methods like having an emergency push button on each floor. Authors don't provide sufficient motivation for why deep learning would be a good solution to this problem. <br />
<br />
=References=<br />
<br />
[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):<br />
1735–1780, 1997.<br />
<br />
[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.<br />
<br />
[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.<br />
<br />
[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018<br />
<br />
[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).<br />
<br />
[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.<br />
<br />
[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Predicting_Floor_Level_For_911_Calls_with_Neural_Network_and_Smartphone_Sensor_Data&diff=38645Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data2018-11-12T04:14:32Z<p>D35kumar: /* Introduction */ Adding more information about the model [Technical]</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
In highly populated cities, where there are many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.<br />
<br />
The motivation for this problem in the context of 911 calls: victims trapped in a tall building who seek immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult. <br />
<br />
In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.<br />
<br />
In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]<br />
[[File:19floor.png|250px]]</div><br />
<br />
<br />
In this work, there are two major contributions. The first is that they trained an LSTM to classify whether a smartphone was either inside or outside a building using GPS, RSSI, and magnetometer sensor readings. The model is compared with baseline models like feedforward neural networks, logistic regression, SVM, HMM, and Random Forests. The second contribution is the algorithm which uses the output of the trained LSTM to aid in predicting the change in the barometric pressure of the smartphone from once it entered the building to its current location within the building. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.<br />
<br />
=Related Work=<br />
<br />
<br />
In general, previous work falls under two categories. The first category of methods is classification methods based on the user's activity. <br />
Therefore, some current methods leverage the user's activity to predict which is based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.<br />
The second set of methods focus more on the use of a barometer which measures the atmospheric pressure. As a result, utilizing a barometer can provide the changes in altitude.<br />
<br />
Avinash Parnandi and his coauthors used multiple classifiers in the predicting the floor level [2]. The steps in their algorithmic process are: <br />
<ol><br />
<li> Classifier to predict whether the user is indoors or outdoors</li><br />
<li> Classifier to identify if the activity of the user, i.e. walking, standing still etc. </li><br />
<li> Classifier to measure the displacement</li><br />
</ol><br />
<br />
One of the downsides of this work is that in order to achieve high accuracy the user's step size is needed, therefore heavily relying on pre-training to the specific user. In a real world application of this method this would not be practical.<br />
<br />
<br />
Song and his colleagues model the way or cause of ascent. That is, was the ascent a result of taking the elevator, stairs or escalator [3]. Then by using infrastructure support of the buildings and as well as additional tuning they are able to predict floor level. <br />
This method also suffers from relying on data specific to the building. <br />
<br />
Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.<br />
<br />
=Method=<br />
<br />
<br />
In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].<br />
<br />
To collect data they designed and developed an iOS application (called Sensory) that ran on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second and each datum contained the different sensor measurements mentions earlier along with environment context, environment mean bldg floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.<br />
<br />
The data collection procedure for indoor-outdoor classifier was described as follows:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and<br />
out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost<br />
door) set indoors to 1. 6) As soon as we exit, set indoors to 0. 7) Stop recording. 8) Save data as CSV<br />
for analysis. This procedure can start either outside or inside a building without loss of generality.<br />
<br />
The following procedure generates data used to predict a floor change from the entrance floor to the end floor:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and<br />
out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost<br />
door) set indoors to 1. 6) Finally, enter a building and ascend/descend to any story. 7) Ascend<br />
through any method desired, stairs, elevator, escalator, etc. 8) Once at the floor, stop recording. 9)<br />
Save data as CSV for analysis.<br />
<br />
Their algorithm used to predict floor level is a 3 part process:<br />
<br />
<ol><br />
<li> Classifying whether smartphone is indoor or outdoor </li><br />
<li> Indoor/Outdoor Transition detector</li><br />
<li> Estimating vertical height and resolving to absolute floor level </li><br />
</ol><br />
<br />
==1) Classifying Indoor/Outdoor ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div><br />
<br />
From [5] they are using 6 features which were found through forests of trees feature reduction. The features are smartphone's barometric pressure, GPS vertical accuracy, GPS horizontal accuracy, GPS speed, device RSSI level, and magnetometer total reading.<br />
<br />
The magnetometer total reading was calculated from given the 3-dimensional reading <math>x, y, z </math><br />
<br />
<br />
<div style="text-align: center;">Total Magnetic field strength <math>= \sqrt{x^{2} + y^{2} + z^{2}}</math></div><br />
<br />
They used a 3 layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if smartphone is indoor and <math> y = 0 </math> if smartphone is outdoor.<br />
<br />
In their design they set <math> d = 3</math> by random search [6]. The point to make is that they wanted the network to learn the relationship given a little bit of information from both the past and future.<br />
<br />
For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|800px]] </div><br />
<br />
This is a critical part of their system and they only focus on the predictions in the subspace of being indoors. <br />
<br />
They have trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math>. <br />
<br />
The cost function is shown below:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div><br />
<br />
The final output of the LSTM is a time-series <math> T = {t_1, t_2, ..., t_i, t_n} </math> where each <math> t_i = 0, t_i = 1 </math> if the point is outside or inside respectively.<br />
<br />
==2) Transition Detector ==<br />
<br />
Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div><br />
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions T and they pick a subset <math>s_i </math> such that the Jacard distance (defined below) is <math> >= 0.4 </math><br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div><br />
Jacard Distance:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div><br />
<br />
After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.<br />
<br />
==3) Vertical height and floor level ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div><br />
<br />
In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.<br />
<br />
The second parameter is <math> p_1 </math> which is the current pressure reading. In order to generate the relative change in height <math> m_\Delta</math><br />
<br />
After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].<br />
<br />
In order to resolve to an absolute floor level they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.<br />
<br />
=Experiments and Results=<br />
<br />
==Dataset==<br />
<br />
In this paper, an iOS app called Sensory is developed which is used to collect data on an iPhone 6. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.<br />
<br />
As soon as the user enters or exits a building, the indoor-outdoor data has to be manually entered. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning is being used, the actual floor level was recorded manually for validation purposes only.<br />
<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div><br />
<br />
All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation. <br />
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006. <br />
Although the baselines performed considerably well the objective here was to show that an LSTM can be used in the future to model the entire system with an LSTM.<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|500px]] </div><br />
<br />
The above chart shows the success that their system is able to achieve in floor level prediction.<br />
<br />
=Future Work=<br />
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate floor level predictions solely from sensor reading data.<br />
<br />
=Critique=<br />
<br />
In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which has not been done. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit to their system is that they don't need any additional infrastructure support in advance making it a practical solution for deployment. <br />
<br />
A weakness is that they claim that they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%. <br />
<br />
It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.<br />
<br />
They also go against their claim stated at the beginning of the paper where they say they "..does not require the use of beacons, prior knowledge of the building infrastructure..." as in their clustering step they are in a way using prior knowledge from previous visits [4].<br />
<br />
The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm can be implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.<br />
<br />
Overall this paper is not too novel, as they don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out of the box solutions. There is no clear intuition why the proposed work well for the authors. This application could be solved using simpler methods like having an emergency push button on each floor. Authors don't provide sufficient motivation for why deep learning would be a good solution to this problem. <br />
<br />
=References=<br />
<br />
[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):<br />
1735–1780, 1997.<br />
<br />
[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.<br />
<br />
[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.<br />
<br />
[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018<br />
<br />
[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).<br />
<br />
[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.<br />
<br />
[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Predicting_Floor_Level_For_911_Calls_with_Neural_Network_and_Smartphone_Sensor_Data&diff=38644Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data2018-11-12T04:04:56Z<p>D35kumar: /* Introduction */ (Editorial)</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
In highly populated cities, where there are many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.<br />
<br />
The motivation for this problem in the context of 911 calls: victims trapped in a tall building who seek immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult. <br />
<br />
In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.<br />
<br />
In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]<br />
[[File:19floor.png|250px]]</div><br />
<br />
<br />
In this work, there are two major contributions. The first is that they trained a recurrent neural network to classify whether a smartphone was either inside or outside of buildings. The second contribution is that they used the output of their previously trained classifier to aid in predicting the change in the barometric pressure of the smartphone from once it entered the building to its current location. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.<br />
<br />
=Related Work=<br />
<br />
<br />
In general, previous work falls under two categories. The first category of methods is classification methods based on the user's activity. <br />
Therefore, some current methods leverage the user's activity to predict which is based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.<br />
The second set of methods focus more on the use of a barometer which measures the atmospheric pressure. As a result, utilizing a barometer can provide the changes in altitude.<br />
<br />
Avinash Parnandi and his coauthors used multiple classifiers in the predicting the floor level [2]. The steps in their algorithmic process are: <br />
<ol><br />
<li> Classifier to predict whether the user is indoors or outdoors</li><br />
<li> Classifier to identify if the activity of the user, i.e. walking, standing still etc. </li><br />
<li> Classifier to measure the displacement</li><br />
</ol><br />
<br />
One of the downsides of this work is that in order to achieve high accuracy the user's step size is needed, therefore heavily relying on pre-training to the specific user. In a real world application of this method this would not be practical.<br />
<br />
<br />
Song and his colleagues model the way or cause of ascent. That is, was the ascent a result of taking the elevator, stairs or escalator [3]. Then by using infrastructure support of the buildings and as well as additional tuning they are able to predict floor level. <br />
This method also suffers from relying on data specific to the building. <br />
<br />
Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.<br />
<br />
=Method=<br />
<br />
<br />
In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].<br />
<br />
To collect data they designed and developed an iOS application (called Sensory) that ran on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second and each datum contained the different sensor measurements mentions earlier along with environment context, environment mean bldg floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.<br />
<br />
The data collection procedure for indoor-outdoor classifier was described as follows:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and<br />
out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost<br />
door) set indoors to 1. 6) As soon as we exit, set indoors to 0. 7) Stop recording. 8) Save data as CSV<br />
for analysis. This procedure can start either outside or inside a building without loss of generality.<br />
<br />
The following procedure generates data used to predict a floor change from the entrance floor to the end floor:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and<br />
out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost<br />
door) set indoors to 1. 6) Finally, enter a building and ascend/descend to any story. 7) Ascend<br />
through any method desired, stairs, elevator, escalator, etc. 8) Once at the floor, stop recording. 9)<br />
Save data as CSV for analysis.<br />
<br />
Their algorithm used to predict floor level is a 3 part process:<br />
<br />
<ol><br />
<li> Classifying whether smartphone is indoor or outdoor </li><br />
<li> Indoor/Outdoor Transition detector</li><br />
<li> Estimating vertical height and resolving to absolute floor level </li><br />
</ol><br />
<br />
==1) Classifying Indoor/Outdoor ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div><br />
<br />
From [5] they are using 6 features which were found through forests of trees feature reduction. The features are smartphone's barometric pressure, GPS vertical accuracy, GPS horizontal accuracy, GPS speed, device RSSI level, and magnetometer total reading.<br />
<br />
The magnetometer total reading was calculated from given the 3-dimensional reading <math>x, y, z </math><br />
<br />
<br />
<div style="text-align: center;">Total Magnetic field strength <math>= \sqrt{x^{2} + y^{2} + z^{2}}</math></div><br />
<br />
They used a 3 layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if smartphone is indoor and <math> y = 0 </math> if smartphone is outdoor.<br />
<br />
In their design they set <math> d = 3</math> by random search [6]. The point to make is that they wanted the network to learn the relationship given a little bit of information from both the past and future.<br />
<br />
For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|800px]] </div><br />
<br />
This is a critical part of their system and they only focus on the predictions in the subspace of being indoors. <br />
<br />
They have trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math>. <br />
<br />
The cost function is shown below:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div><br />
<br />
The final output of the LSTM is a time-series <math> T = {t_1, t_2, ..., t_i, t_n} </math> where each <math> t_i = 0, t_i = 1 </math> if the point is outside or inside respectively.<br />
<br />
==2) Transition Detector ==<br />
<br />
Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div><br />
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions T and they pick a subset <math>s_i </math> such that the Jacard distance (defined below) is <math> >= 0.4 </math><br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div><br />
Jacard Distance:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div><br />
<br />
After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.<br />
<br />
==3) Vertical height and floor level ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div><br />
<br />
In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.<br />
<br />
The second parameter is <math> p_1 </math> which is the current pressure reading. In order to generate the relative change in height <math> m_\Delta</math><br />
<br />
After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].<br />
<br />
In order to resolve to an absolute floor level they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.<br />
<br />
=Experiments and Results=<br />
<br />
==Dataset==<br />
<br />
In this paper, an iOS app called Sensory is developed which is used to collect data on an iPhone 6. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.<br />
<br />
As soon as the user enters or exits a building, the indoor-outdoor data has to be manually entered. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning is being used, the actual floor level was recorded manually for validation purposes only.<br />
<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div><br />
<br />
All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation. <br />
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006. <br />
Although the baselines performed considerably well the objective here was to show that an LSTM can be used in the future to model the entire system with an LSTM.<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|500px]] </div><br />
<br />
The above chart shows the success that their system is able to achieve in floor level prediction.<br />
<br />
=Future Work=<br />
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate floor level predictions solely from sensor reading data.<br />
<br />
=Critique=<br />
<br />
In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which has not been done. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit to their system is that they don't need any additional infrastructure support in advance making it a practical solution for deployment. <br />
<br />
A weakness is that they claim that they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%. <br />
<br />
It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.<br />
<br />
They also go against their claim stated at the beginning of the paper where they say they "..does not require the use of beacons, prior knowledge of the building infrastructure..." as in their clustering step they are in a way using prior knowledge from previous visits [4].<br />
<br />
The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm can be implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.<br />
<br />
Overall this paper is not too novel, as they don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out of the box solutions. There is no clear intuition why the proposed work well for the authors. This application could be solved using simpler methods like having an emergency push button on each floor. Authors don't provide sufficient motivation for why deep learning would be a good solution to this problem. <br />
<br />
=References=<br />
<br />
[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):<br />
1735–1780, 1997.<br />
<br />
[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.<br />
<br />
[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.<br />
<br />
[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018<br />
<br />
[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).<br />
<br />
[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.<br />
<br />
[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN&diff=38297Annotating Object Instances with a Polygon RNN2018-11-08T04:18:00Z<p>D35kumar: </p>
<hr />
<div>Summary of the CVPR '17 best [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf ''paper'']<br />
<br />
The presentation video of paper is available here[https://www.youtube.com/watch?v=S1UUR4FlJ84].<br />
<br />
= Background =<br />
<br />
If a snapshot of an image is given to a human, how will he/she describe a scene? He/she might identify that there is a car parked near the curb, or that the car is parked right beside a street light. This ability to decompose objects in scenes into separate entities is key to understanding what is around us and it helps to reason about the behavior of objects in the scene.<br />
<br />
Automating this process is a classic computer vision problem and is often termed "object detection". There are four distinct levels of detection (refer to Figure 1 for a visual cue):<br />
<br />
1. Classification + Localization: This is the most basic method that detects whether '''an''' object is either present or absent in the image and then identifies the position of the object within the image in the form of a bounding box overlayed on the image.<br />
<br />
2. Object Detection: The classic definition of object detection points to the detection and localization of '''multiple''' objects of interest in the image. The output of the detection is still a bounding box overlayed on the image at the position corresponding to the location of the objects in the image.<br />
<br />
3. Semantic Segmentation: This is a pixel level approach, i.e., each pixel in the image is assigned to a category label. Here, there is no difference between instances; this is to say that there are objects present from three distinct categories in the image, without tracking or reporting the number of appearances of each instance within a category. <br />
<br />
4. Instance Segmentation (''This paper performs this''): The goal is to not only to assign pixel-level categorical labels, but to identify each entity separately as sheep 1, sheep 2, sheep 3, grass, and so on.<br />
<br />
[[File:Figure_1.jpeg | 450px|thumb|center|Figure 1: Different levels of detection in an image.]]<br />
<br />
<br />
== Motivation ==<br />
<br />
Semantic segmentation helps us achieve a deeper understanding of images than image classification or object detection. Over and above this, instance segmentation is crucial in applications where multiple objects of the same category are to be tracked, especially in autonomous driving, mobile robotics, and medical image processing. This paper deals with a novel method to tackle the instance segmentation problem pertaining specifically to the field of autonomous driving, but shown to generalize well in other fields such as medical image processing.<br />
<br />
== Goal ==<br />
<br />
Most of the recent approaches to on instance segmentation are based on deep neural networks and have demonstrated impressive performance. Given that these approaches require a lot of computational resources and that their performance depends on the amount of accessible training data, there has been an increase in the demand to label/annotate large-scale datasets. This is both expensive and time-consuming. <br />
<br />
{| class=wikitable width=700 align=center<br />
|Thus, the '''main goal''' of the paper is to enable '''semi-automatic''' annotation of object instances.<br />
|}<br />
<br />
Figure 2 demonstrates how the interface looks like for better clarity.<br />
<br />
Most of the datasets available pass through a stage where annotators manually outline the objects with a closed polygon. Polygons allow annotation of objects with a small number of clicks (30 - 40) compared to other methods. This approach works as the silhouette of an object is typically connected without holes. <br />
<br />
{| class=wikitable width=900 align=center<br />
|Thus, the authors suggest to adopt this same technique to annotate images using polygons, except they plan to automate the method and replace/reduce manual labeling. The '''intuition''' behind the success of this method is the '''sparse''' nature of these polygons that allow annotating of an object through a cluster of pixels rather than classification at the pixel-level.<br />
|}<br />
<br />
[[File:Annotating Object Instances Example.png | 450px|thumb|center|Figure 2: Given a bounding box, polygon outlining the the object instance inside the box is predicted. This approach is designed to facilitation annotation, and easily incorporates user corrections of points to improve the overall object’s polygon. ]]<br />
<br />
<br />
= Related Works =<br />
<br />
Some of the techniques used in semi-automatic annotation are as follows:<br />
<br />
1. '''GrabCut''': Some researchers use multiple scribbles from users to aid the model in defining the foreground and background. <br />
<br />
[[File:GrabCut_Example.png | 450px|thumb|center|Figure 3: Illustration of GrabCut.]]<br />
<br />
2. '''GrabCut + CNN''': Scribbles have also been used to train CNNs for semantic image segmentation. <br />
<br />
3. '''Superpixels''': Superpixels in the form of small polygons where the color intensity within each superpixel is similar, to a certain threshold, have been used to provide a sparse representation of the large number of pixels in an image. However, the performance of this technique depends on the scale of the superpixels and hence sometimes merges small objects.<br />
<br />
[[File:Superpixel_idea.jpg | 450px|thumb|center|Figure 4: Illustration of the superpixel idea.]] <br />
<br />
<br />
= Model =<br />
<br />
As an '''input''' to the model, an annotator or perhaps another neural network provides a bounding box containing an object of interest and the model auto-generates a polygon outlining the object instance using a Recurrent Neural Network which they call: Polygon-RNN.<br />
<br />
The RNN model predicts the vertices of the polygon at each time step given a CNN representation of the image, the last two time steps, and the first vertex location. The location of the first vertex is defined differently and will be defined shortly. The information regarding the previous two-time steps helps the RNN create a polygon in a specific direction and the first vertex provides a cue for loop closure of the polygon edges.<br />
<br />
The polygon is parametrized as a sequence of 2D vertices and it is assumed that the polygon is closed. In addition, the polygon generation is fixed to follow a clockwise orientation since there are multiple ways to create a polygon given that it is cyclic structure. However, the starting point of the sequence is defined so that it can be any of the vertices of the polygon.<br />
<br />
== Architecture ==<br />
<br />
There are two primary networks at play: 1. CNN with skip connections, and 2. One-to-many type RNN.<br />
<br />
[[File:Figure_2_Neel.JPG | 800px|thumb|center|Figure 5: Model architecture for Polygon-RNN depicting a CNN with skip connections feeding into a 2 layer ConvLSTM (One-to-many type) ('''Note''': A possible point of confusion - the authors have only shown the layers of VGG16 architecture here that have the skip connections introduced).]]<br />
<br />
1. '''CNN with skip connections''':<br />
<br />
The authors have adopted the VGG16 feature extractor architecture with a few modifications pertaining to the preservation of features fused together in a tensor that can feed into the RNN (refer to Figure 5). Namely, the last max-pooling layer (''pool5'') present in the VGG16 CNN has been removed. The image fed into the CNN is pre-shrunk to a 224x224x3 tensor(3 being the Red, Green, and Blue channels). The image passes through 2 pooling layers and 2 convolutional layers. Since, the features extracted after each operation are to be preserved and fused later on, at each of these four steps, the idea is to have a tensor with a common width of 512; so the output tensor at pool2 is convolved with 4 3x3x128 filters and the output tensor at pool3 is convolved with 2 3x3x256 filters. The skip connections from the four layers allow the CNN to extract low-level edge and corner features as well as boundary/semantic information about the instances. Finally, a 3x3 convolution applied along with a ReLU non-linearity results in a 28x28x128 tensor that contains semantic information pertinent to the image frame and is taken as an input by the RNN.<br />
<br />
2. '''RNN - 2 Layer ConvLSTM'''<br />
<br />
The RNN is employed to capture information about the previous vertices in the time-series. Specifically, a Convolutional LSTM is used as a decoder. The ConvLSTM allows preservation of the spatial information in 2D and reduces the number of parameters compared to a Fully Connected RNN. The polygon is modeled with a kernel size of 3x3 and 16 channels outputting a vertex at each time step. The ConvLSTM gets as input a tensor step t which<br />
concatenates 4 features: the CNN feature representation of the image, one-hot encoding of the previous predicted vertex and the vertex predicted<br />
from two time steps ago, as well as the one-hot encoding of the first predicted vertex. <br />
<br />
The Convolutional LSTM computes the hidden state <math display = "inline">h_t</math> given the input <math display = "inline">x_t</math> based on the following equations:<br />
<center><br />
<math display="block"><br />
\begin{pmatrix}<br />
i_t \\<br />
f_t \\<br />
o_t \\<br />
g_t \\<br />
\end{pmatrix}<br />
= W_h * h_{t-1} + W_x * x_t + b<br />
</math><br />
<br />
<math display="block"><br />
c_t = \sigma(f_t) \bigodot c_{t-1} + \sigma(i_t) \bigodot tanh(g_t)<br />
</math><br />
<br />
<math display="block"><br />
h_t = \sigma(o_t) \bigodot tanh(c_t)<br />
</math><br />
</center><br />
where <math display = "inline">i, f, o</math> denote the input, forget, and output gate, <math display = "inline">h</math> is the hidden state and <math display = "inline">c</math> is the cell state. Also, <math display = "inline">\sigma</math> denotes the signoid function, <math display = "inline">\bigodot</math> indicates an element-wise product and <math display = "inline">*</math> a convolution. <math display = "inline">W_h</math> denotes the hidden-to-state convolution kernel and <math display = "inline">W_x</math> the input-to-state convolution kernel.<br />
<br />
The authors have treated the vertex prediction task as a classification task in that the location of the vertices is through a one-hot representation of dimension DxD + 1 (D chosen to be 28 by the authors in tests). The one additional dimension is the storage cue for loop closure for the polygon. Given that, the one-hot representation of the two previously predicted vertices and the first vertex are taken in as an input, a clockwise (or for that reason any fixed direction) direction can be forced for the creation of the polygon. Coming back to the prediction of the first vertex, this is done through further modification of the CNN by adding two DxD layers with one branch predicting object instance boundaries while the other takes in this output as well as the image features to predict the first vertex. This CNN is trained separately. Here, <math display = "inline">y_t</math> denotes the one-hot encoding of the vertex and is the output at time step t.<br />
<br />
== Training ==<br />
<br />
The training of the model is done as follows:<br />
<br />
1. Cross-entropy is used for the RNN loss function.<br />
<br />
2. Instead of Stochastic Gradient Descent, Adam is used for optimization: batch size = 8, learning rate = 1e^-4 (learning rate decays after 10 epochs by a factor of 10) <br />
<br />
3. For the first vertex prediction, the modified CNN mentioned previously, is trained using a multi-task cost function.<br />
<br />
The reported time for training is one day on a Nvidia Titan-X GPU.<br />
<br />
== Importance of Human Annotator in the Loop ==<br />
<br />
The model allows for the prediction at a given time step to be corrected and this corrected vertex is then fed into the next time step of the RNN, effectively rejecting the network predicted vertex. This has the simple effect of putting the model "back on the right track". Note that this is only possible due to the adoption of the RNN architecture i.e. the inherent nature of the RNN to accept previous outputs allows incorporation of the user's judgement. The typical inference time as quoted by the paper is 250ms per object.<br />
<br />
= Results =<br />
<br />
== Evaluation Metrics ==<br />
<br />
The evaluation of the model performance was conducted based on the Cityscapes and KITTI Datasets. There are two metrics used for evaluation:<br />
<br />
1. '''IoU''': The standard Intersection over Union (IoU) measure is used for comparison. In add The calculation for IoU takes both the predicted and ground-truth object boundaries. The intersection (area contained in both boundaries at once) is divided by the union (the area contained by at least one, or both, of the boundaries). A low score of this metric would mean that there is little overlap between the boundaries, or large areas on non-overlap, and a score of 1.0 would indicate that the two boundaries contain the same area.<br />
<br />
2. '''Number of Clicks''': To evaluate the speed up factor, the checkerboard distance is used to measure the distance between the ground truth (GT) and the output of the Polygon RNN. A set of distance thresholds are set <math display = "inline">T &isin; [1,2,3,4]</math> and if the distance exceeds the particular threshold, the correction is made by an annotator to match the GT and the '''Number of Clicks''' is used to evaluate the speed up factor.<br />
<br />
== Baseline Techniques ==<br />
<br />
1. '''SharpMask''': a 50 layer ResNet considered as the state of the art annotation method.<br />
<br />
2. '''DeepMask''': a build-up on the 50 layer ResNet with an addition of another CNN.<br />
<br />
3. '''Dilation10''': another simple technique using purely convolutional operations.<br />
<br />
4. '''SquareBox''': a simple technique where an entire bounding box is labeled as an object<br />
<br />
== Quantitative Results ==<br />
<br />
The Polygon RNN method outperforms the baselines in 6 out of the 8 categories and has a mean IoU greater than all of the baselines. Particularly, in the car, person, and rider categories, a 12%, 7%, and 6% higher performance than SharpMask is achieved.<br />
<br />
[[File:Table_1_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]<br />
<br />
In addition, with the help of the annotator, the speedup factor was 7.3 times with under 5 clicks which the authors claim is the main advantage of this method.<br />
<br />
[[File:Table_0_Neel.JPG | 800px|thumb|center|Table 2: IoU performance on Cityscapes data with annotator intervention.]]<br />
<br />
The method also works well with other datasets such as KITTI:<br />
<br />
[[File:Table_2_Neel.JPG | 800px|thumb|center|Table 3: IoU performance on KITTI data.]]<br />
<br />
== Qualitative Results ==<br />
<br />
In addition, most of the comparisons with human annotators show that the method is at par with human-level annotation.<br />
<br />
<gallery widths=500px heights=500px perrow=2 mode="packed"><br />
File:Figure_3_Neel.JPG|Figure 6: Qualitative results: comparison with human annotator.|alt=alt language<br />
File:Figure_4_Neel.JPG|Figure 7: Qualitative results: comparison with human annotator.|alt=alt language<br />
</gallery><br />
<br />
=Conclusion=<br />
<br />
The important conclusions from this paper are:<br />
<br />
1. The paper presented a powerful generic annotation tool for modelling complex annotations as a simple polygon that works on different unseen datasets. <br />
<br />
2. Significant improvement in annotation time can be achieved with the Polygon-RNN method itself (speed-up factor of 4.74).<br />
<br />
3. However, the flexibility of having inputs from a human annotator helps increase the IoU for a certain range of clicks.<br />
<br />
4. The model architecture has a down-sampling factor of 16 and the final output resolution and accuracy is sensitive to object size.<br />
<br />
5. Another downside of the model architecture is that training time is increased due to the training of the CNN for the first vertex.<br />
<br />
=Critique=<br />
<br />
1. With the human annotator in the loop, the model speeds up the process of annotation by over 7 times which is perhaps a big cost and time cutting improvement for companies.<br />
<br />
2. Given that this model uses the VGG16 architecture compared to the 50 layer ResNet in SharpMask, this method is quite efficient.<br />
<br />
3. This paper requires training of an entire CNN for the first vertex and is inefficient in that sense as it introduces additional parameters adding to the computation time and resource demand.<br />
<br />
4. The baseline methods have an upper hand compared to this model when it comes to larger objects since the nature of the down-scaled structure adopted by this model.<br />
<br />
5. In terms of future work, elimination of the additional CNN for the first vertex as well as an enhanced architecture to remain insensitive to the size of the object to be annotated should be implemented.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Annotating_Object_Instances_Example.png&diff=38293File:Annotating Object Instances Example.png2018-11-08T04:13:12Z<p>D35kumar: Explains Goal of the paper</p>
<hr />
<div>Explains Goal of the paper</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Figure6.png&diff=38292File:Figure6.png2018-11-08T04:11:47Z<p>D35kumar: D35kumar uploaded a new version of File:Figure6.png</p>
<hr />
<div></div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Example_Goal.png&diff=38291File:Example Goal.png2018-11-08T04:09:39Z<p>D35kumar: </p>
<hr />
<div></div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18&diff=38031stat946F182018-11-06T07:19:19Z<p>D35kumar: /* Record your contributions here [https://docs.google.com/spreadsheets/d/1SxkjNfhOg_eXWpUnVHuIP93E6tEiXEdpm68dQGencgE/edit?usp=sharing] */</p>
<hr />
<div>== [[F18-STAT946-Proposal| Project Proposal ]] ==<br />
<br />
=Paper presentation=<br />
<br />
[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1SxkjNfhOg_eXWpUnVHuIP93E6tEiXEdpm68dQGencgE/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
<br />
<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]]<br />
|-<br />
|Oct 25 || Dhruv Kumar || 1 || Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs || [https://openreview.net/pdf?id=rkRwGg-0Z Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/images/e/ea/Beyond_Word_Importance.pdf Slides]<br />
|-<br />
|Oct 25 || Amirpasha Ghabussi || 2 || DCN+: Mixed Objective And Deep Residual Coattention for Question Answering || [https://openreview.net/pdf?id=H1meywxRW Paper] ||<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering Summary]<br />
|-<br />
|Oct 25 || Juan Carrillo || 3 || Hierarchical Representations for Efficient Architecture Search || [https://arxiv.org/abs/1711.00436 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/images/1/15/HierarchicalRep-slides.pdf Slides]<br />
|-<br />
|Oct 30 || Manpreet Singh Minhas || 4 || End-to-end Active Object Tracking via Reinforcement Learning || [http://proceedings.mlr.press/v80/luo18a/luo18a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End_to_end_Active_Object_Tracking_via_Reinforcement_Learning Summary]<br />
|-<br />
|Oct 30 || Marvin Pafla || 5 || Fairness Without Demographics in Repeated Loss Minimization || [http://proceedings.mlr.press/v80/hashimoto18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization Summary]<br />
|-<br />
|Oct 30 || Glen Chalatov || 6 || Pixels to Graphs by Associative Embedding || [http://papers.nips.cc/paper/6812-pixels-to-graphs-by-associative-embedding Paper] ||<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pixels_to_Graphs_by_Associative_Embedding Summary]<br />
|-<br />
|Nov 1 || Sriram Ganapathi Subramanian || 7 ||Differentiable plasticity: training plastic neural networks with backpropagation || [http://proceedings.mlr.press/v80/miconi18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/images/3/3c/Deep_learning_course_presentation.pdf Slides]<br />
|-<br />
|Nov 1 || Hadi Nekoei || 8 || Synthesizing Programs for Images using Reinforced Adversarial Learning || [http://proceedings.mlr.press/v80/ganin18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Synthesizing_Programs_for_Images_usingReinforced_Adversarial_Learning Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Synthesizing_Programs_for_Images_using_Reinforced_Adversarial_Learning.pdf Slides]<br />
|-<br />
|Nov 1 || Henry Chen || 9 || DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks || [https://ieeexplore.ieee.org/abstract/document/7989236 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DeepVO_Towards_end_to_end_visual_odometry_with_deep_RNN Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DeepVO_Presentation_Henry.pdf Slides] <br />
|-<br />
|Nov 6 || Nargess Heydari || 10 ||Wavelet Pooling For Convolutional Neural Networks Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks Summary]<br />
|-<br />
|Nov 6 || Aravind Ravi || 11 || Towards Image Understanding from Deep Compression Without Decoding || [https://openreview.net/forum?id=HkXWCMbRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding Summary]<br />
|-<br />
|Nov 6 || Ronald Feng || 12 || Learning to Teach || [https://openreview.net/pdf?id=HJewuJWCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach Summary]<br />
|-<br />
|Nov 8 || Neel Bhatt || 13 || Annotating Object Instances with a Polygon-RNN || [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN Summary]<br />
|-<br />
|Nov 8 || Jacob Manuel || 14 || Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels || [https://arxiv.org/pdf/1804.06872.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Co-Teaching Summary]<br />
|-<br />
|Nov 8 || Charupriya Sharma|| 15 || Tighter Variational Bounds are Not Necessarily Better || [https://arxiv.org/pdf/1802.04537.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Tighter_Variational_Bounds_are_Not_Necessarily_Better Summary]<br />
|-<br />
|NOv 13 || Sagar Rajendran || 16 || Zero-Shot Visual Imitation || [https://openreview.net/pdf?id=BkisuzWRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Zero-Shot_Visual_Imitation Summary]<br />
|-<br />
<br />
|Nov 13 || Ruijie Zhang || 17 || Searching for Efficient Multi-Scale Architectures for Dense Image Prediction || [https://arxiv.org/pdf/1809.04184.pdf Paper]||<br />
|-<br />
|Nov 13 || Neil Budnarain || 18 || Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data || [https://openreview.net/pdf?id=ryBnUWb0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Predicting_Floor_Level_For_911_Calls_with_Neural_Network_and_Smartphone_Sensor_Data Summary]<br />
|-<br />
|NOv 15 || Zheng Ma || 19 || Reinforcement Learning of Theorem Proving || [https://arxiv.org/abs/1805.07563 Paper] || <br />
|-<br />
|Nov 15 || Abdul Khader Naik || 20 || || ||<br />
|-<br />
|Nov 15 || Johra Muhammad Moosa || 21 || Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin || [https://papers.nips.cc/paper/7255-attend-and-predict-understanding-gene-regulation-by-selective-attention-on-chromatin.pdf Paper] || <br />
|-<br />
|NOv 20 || Zahra Rezapour Siahgourabi || 22 || || || <br />
|-<br />
|Nov 20 || Shubham Koundinya || 23 || || || <br />
|-<br />
|Nov 20 || Salman Khan || 24 || Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples || [http://proceedings.mlr.press/v80/athalye18a.html paper] || <br />
|-<br />
|NOv 22 ||Soroush Ameli || 25 || Learning to Navigate in Cities Without a Map || [https://arxiv.org/abs/1804.00168 paper] || <br />
|-<br />
|Nov 22 ||Ivan Li || 26 || Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction || [https://arxiv.org/pdf/1802.05451v3.pdf Paper] ||<br />
|-<br />
|Nov 22 ||Sigeng Chen || 27 || || ||<br />
|-<br />
|Nov 27 || Aileen Li || 28 || Spatially Transformed Adversarial Examples ||[https://openreview.net/pdf?id=HyydRMZC- Paper] || <br />
|-<br />
|NOv 27 ||Xudong Peng || 29 || Multi-Scale Dense Networks for Resource Efficient Image Classification || [https://openreview.net/pdf?id=Hk2aImxAb Paper] || <br />
|-<br />
|Nov 27 ||Xinyue Zhang || 30 || An Inference-Based Policy Gradient Method for Learning Options || [http://proceedings.mlr.press/v80/smith18a/smith18a.pdf Paper] || <br />
|-<br />
|NOv 29 ||Junyi Zhang || 31 || Autoregressive Convolutional Neural Networks for Asynchronous Time Series || [http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Paper] ||<br />
|-<br />
|Nov 29 ||Travis Bender || 32 || Automatic Goal Generation for Reinforcement Learning Agents || [http://proceedings.mlr.press/v80/florensa18a/florensa18a.pdf Paper] ||<br />
|-<br />
|Nov 29 ||Patrick Li || 33 || Matrix Capsules with EM Routing || [https://openreview.net/pdf?id=HJWLfGWRb Paper] ||<br />
|-<br />
|Makeup || Jiazhen Chen || 34 || || || <br />
|-<br />
|Makeup || Ahmed Afify || 35 ||Don't Decay the Learning Rate, Increase the Batch Size || [https://openreview.net/pdf?id=B1Yy1BxCZ Paper]||<br />
|-<br />
|Makeup || Gaurav Sahu || 36 || TBD || ||<br />
|-<br />
|Makeup || Kashif Khan || 37 || Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] ||<br />
|-<br />
|Makeup || Shala Chen || 38 || A NEURAL REPRESENTATION OF SKETCH DRAWINGS || ||<br />
|-<br />
|Makeup || Ki Beom Lee || 39 || Detecting Statistical Interactions from Neural Network Weights|| [https://openreview.net/forum?id=ByOfBggRZ Paper] ||<br />
|-<br />
|Makeup || Wesley Fisher || 40 || Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling || [http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling Summary]</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Beyond_Word_Importance.pdf&diff=38030File:Beyond Word Importance.pdf2018-11-06T07:17:23Z<p>D35kumar: </p>
<hr />
<div></div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Co-Teaching&diff=37781Co-Teaching2018-11-05T01:21:19Z<p>D35kumar: /* Intuition */ Minor grammar corrections</p>
<hr />
<div>=Introduction=<br />
==Title of Paper==<br />
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels<br />
==Contributions==<br />
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed architecture, named ‘co-teaching’, maintains two networks simultaneously. The experiments are conducted on noisy versions of MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness<br />
of deep learning models trained by the Co-teaching approach is much superior to state-of-the-art<br />
baselines.<br />
<br />
==Terminology==<br />
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data. <br />
<br />
Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels.<br />
<br />
=Intuition=<br />
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually while learning on a batch of noisy data only the error from the network itself is transferred back to facilitate learning. But in the case of Co-Teaching the two networks that are used are able to filter the different type of errors which flows back to themselves as well as the other network. Therefore the models learn mutually, i.e., from themselves as well as from the partner network.<br />
<br />
=Motivation=<br />
The paper draws motivation from two key facts:<br />
• That many data collection processes yield noisy labels. <br />
• That deep neural networks have a high capacity to overfit to noisy labels. <br />
Because of these facts, it is challenging to train deep networks to be robust with noisy labels. <br />
=Related Works=<br />
<br />
Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into 3 strands: surrogate loss, noise estimation, and probabilistic modeling. <br />
<br />
In the surrogate loss category, one work proposes an unbiased estimator to provide the noise corrected loss approach. Another work presented a robust non-convex loss, which is the special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of ROC curve. In the probabilistic modeling category, there is a two coin model proposed to handle noise labels from multiple annotators. <br />
<br />
In the Deep learning based approaches we have works suggesting a unified framework to distill the knowledge from clean labels and knowledge graph. Another work trained a label cleaning network by a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously. <br />
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper ads a crowd layer after the output layer for noisy labels from multiple annotators. <br />
<br />
Learning to teach methods is another approach to this problem. The methods are made up by the teacher and student networks. The teacher network selects more informative instances for better training of student networks. <br />
<br />
=Summary of Experiment=<br />
==Proposed Method==<br />
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network. <br />
[[File:Co-Teaching Fig 1.png|600px|center]] <br />
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate). <br />
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself. <br />
==Dataset Corruption==<br />
To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transformation matrix. Two methods are used for generating such noise transformation matrices: pair flipping and symmetry. <br />
[[File:Co-Teaching Fig 2.png|600px|center]] <br />
Three noise conditions are simulated for comparing co-teaching with baseline methods.<br />
{| class="wikitable"<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Method<br />
|width="100pt"|Noise Rate<br />
|width="700pt"|Rationale<br />
|-<br />
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels. <br />
|-<br />
| Symmetry || 50% || Half of the instances have noisy labels. Further rationale can be found at [1].<br />
|-<br />
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario. <br />
|}<br />
|}<br />
==Baseline Comparisons==<br />
The co-teaching method is compared with several baseline approaches, which have varying:<br />
• proficiency in dealing with a large number of classes,<br />
• ability to resist heavy noise,<br />
• need to combine with specific network architectures, and<br />
• need to be pretrained. <br />
<br />
[[File:Co-Teaching Fig 3.png|600px|center]] <br />
===Bootstrap===<br />
A method that deems a weighted combination of predicted and original labels as correct, and then solves kernels by backpropagation [2].<br />
===S-Model===<br />
Using an additional softmax layer to model the noise transition matrix [3].<br />
===F-Correction===<br />
Correcting the prediction by using a noise transition matrix which is estimated by a standard network [4].<br />
===Decoupling===<br />
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].<br />
===MentorNet===<br />
A teacher network is trained to identify and discard noisy instances in order to train the student network on cleaner instances [6].<br />
==Implementation Details==<br />
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. <br />
[[File: Co-Teaching Table 3.png|center]] <br />
=Results and Discussion=<br />
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows. <br />
==MNIST==<br />
[[File:Co-Teaching Table 4.png|550px|center]]<br />
<br />
[[File:Co-Teaching Graphs MNIST.PNG|center]]<br />
==CIFAR10==<br />
[[File:Co-Teaching Table 5.png|550px|center]] <br />
<br />
[[File:Co-Teaching Graphs CIFAR10.PNG|center]]<br />
==CIFAR100==<br />
[[File:Co-Teaching Table 6.png|550px|center]] <br />
<br />
[[File: Co-Teaching Graphs CIFAR100.PNG|center]]<br />
<br />
=Conclusions=<br />
The main goal of the paper is to introduce “Co-teaching” learning paradigm that uses two deep neural networks learning at the same time to avoid noisy labels. Experiments are performed on several datasets such as MNIST, CIFAR-10, and CIFAR-100. The performance varied depends on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios, (pair-45% and symmetry-50%), the co-teaching methods outperforms baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).<br />
<br />
=Critique=<br />
==Lack of Task Diversity==<br />
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality. <br />
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==<br />
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm. <br />
==Lack of Theoretical Development (Mentioned in conclusion)==<br />
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly. <br />
=References=<br />
[1] A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. Neuroimage, 86:446–460, 2014. <br />
<br />
[2] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.<br />
<br />
[3] M. Jas, T. Dupré La Tour, U. Şimşekli, and A. Gramfort. Learning the morphology of brain signals using alpha-stable convolutional sparse coding. In Advances in Neural Information Processing Systems (NIPS), pages 1–15, 2017.<br />
<br />
[4] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.<br />
<br />
[5] R. Chalasani, J. C. Principe, and N. Ramakrishnan. A fast proximal method for convolutional sparse coding. In International Joint Conference on Neural Networks (IJCNN), pages 1–5, 2013. ISBN 9781467361293.<br />
<br />
[6] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In Computer Vision and Pattern Recognition (CVPR), pages 5135–5143. IEEE, 2015.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Co-Teaching&diff=37780Co-Teaching2018-11-05T01:19:06Z<p>D35kumar: /* Introduction */ Adding Intuition [Technical]</p>
<hr />
<div>=Introduction=<br />
==Title of Paper==<br />
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels<br />
==Contributions==<br />
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed architecture, named ‘co-teaching’, maintains two networks simultaneously. The experiments are conducted on noisy versions of MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness<br />
of deep learning models trained by the Co-teaching approach is much superior to state-of-the-art<br />
baselines.<br />
<br />
==Terminology==<br />
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data. <br />
<br />
Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels.<br />
<br />
=Intuition=<br />
Similar to Decoupling, the Co-teaching architecture also maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually while learning on a batch of data only the error from the network itself is transferred back to facilitate learning. But in the case of Co-Teaching the two networks that are used are able to filter the different type of errors which flows to themselves as well as the other network. Therefore the models learn mutually from themselves as well as from the partner network.<br />
<br />
=Motivation=<br />
The paper draws motivation from two key facts:<br />
• That many data collection processes yield noisy labels. <br />
• That deep neural networks have a high capacity to overfit to noisy labels. <br />
Because of these facts, it is challenging to train deep networks to be robust with noisy labels. <br />
=Related Works=<br />
<br />
Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into 3 strands: surrogate loss, noise estimation, and probabilistic modeling. <br />
<br />
In the surrogate loss category, one work proposes an unbiased estimator to provide the noise corrected loss approach. Another work presented a robust non-convex loss, which is the special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of ROC curve. In the probabilistic modeling category, there is a two coin model proposed to handle noise labels from multiple annotators. <br />
<br />
In the Deep learning based approaches we have works suggesting a unified framework to distill the knowledge from clean labels and knowledge graph. Another work trained a label cleaning network by a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously. <br />
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper ads a crowd layer after the output layer for noisy labels from multiple annotators. <br />
<br />
Learning to teach methods is another approach to this problem. The methods are made up by the teacher and student networks. The teacher network selects more informative instances for better training of student networks. <br />
<br />
=Summary of Experiment=<br />
==Proposed Method==<br />
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network. <br />
[[File:Co-Teaching Fig 1.png|600px|center]] <br />
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate). <br />
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself. <br />
==Dataset Corruption==<br />
To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transformation matrix. Two methods are used for generating such noise transformation matrices: pair flipping and symmetry. <br />
[[File:Co-Teaching Fig 2.png|600px|center]] <br />
Three noise conditions are simulated for comparing co-teaching with baseline methods.<br />
{| class="wikitable"<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Method<br />
|width="100pt"|Noise Rate<br />
|width="700pt"|Rationale<br />
|-<br />
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels. <br />
|-<br />
| Symmetry || 50% || Half of the instances have noisy labels. Further rationale can be found at [1].<br />
|-<br />
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario. <br />
|}<br />
|}<br />
==Baseline Comparisons==<br />
The co-teaching method is compared with several baseline approaches, which have varying:<br />
• proficiency in dealing with a large number of classes,<br />
• ability to resist heavy noise,<br />
• need to combine with specific network architectures, and<br />
• need to be pretrained. <br />
<br />
[[File:Co-Teaching Fig 3.png|600px|center]] <br />
===Bootstrap===<br />
A method that deems a weighted combination of predicted and original labels as correct, and then solves kernels by backpropagation [2].<br />
===S-Model===<br />
Using an additional softmax layer to model the noise transition matrix [3].<br />
===F-Correction===<br />
Correcting the prediction by using a noise transition matrix which is estimated by a standard network [4].<br />
===Decoupling===<br />
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].<br />
===MentorNet===<br />
A teacher network is trained to identify and discard noisy instances in order to train the student network on cleaner instances [6].<br />
==Implementation Details==<br />
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. <br />
[[File: Co-Teaching Table 3.png|center]] <br />
=Results and Discussion=<br />
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows. <br />
==MNIST==<br />
[[File:Co-Teaching Table 4.png|550px|center]]<br />
<br />
[[File:Co-Teaching Graphs MNIST.PNG|center]]<br />
==CIFAR10==<br />
[[File:Co-Teaching Table 5.png|550px|center]] <br />
<br />
[[File:Co-Teaching Graphs CIFAR10.PNG|center]]<br />
==CIFAR100==<br />
[[File:Co-Teaching Table 6.png|550px|center]] <br />
<br />
[[File: Co-Teaching Graphs CIFAR100.PNG|center]]<br />
<br />
=Conclusions=<br />
The main goal of the paper is to introduce “Co-teaching” learning paradigm that uses two deep neural networks learning at the same time to avoid noisy labels. Experiments are performed on several datasets such as MNIST, CIFAR-10, and CIFAR-100. The performance varied depends on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios, (pair-45% and symmetry-50%), the co-teaching methods outperforms baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).<br />
<br />
=Critique=<br />
==Lack of Task Diversity==<br />
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality. <br />
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==<br />
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm. <br />
==Lack of Theoretical Development (Mentioned in conclusion)==<br />
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly. <br />
=References=<br />
[1] A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. Neuroimage, 86:446–460, 2014. <br />
<br />
[2] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.<br />
<br />
[3] M. Jas, T. Dupré La Tour, U. Şimşekli, and A. Gramfort. Learning the morphology of brain signals using alpha-stable convolutional sparse coding. In Advances in Neural Information Processing Systems (NIPS), pages 1–15, 2017.<br />
<br />
[4] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.<br />
<br />
[5] R. Chalasani, J. C. Principe, and N. Ramakrishnan. A fast proximal method for convolutional sparse coding. In International Joint Conference on Neural Networks (IJCNN), pages 1–5, 2013. ISBN 9781467361293.<br />
<br />
[6] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In Computer Vision and Pattern Recognition (CVPR), pages 5135–5143. IEEE, 2015.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding&diff=37773stat946w18/Towards Image Understanding From Deep Compression Without Decoding2018-11-05T00:46:38Z<p>D35kumar: /* Introduction */ (Editorial)</p>
<hr />
<div>Paper Title: Towards Image Understanding from Deep Compression Without Decoding - ICLR 2018<br />
<br />
Presented By: Aravind Ravi<br />
<br />
== Introduction ==<br />
Recent advances in the deep neural network (DNN) based image compression methods have shown potential improvements in image quality, savings in storage and bandwidth reduction. These methods leverage common neural network architectures such as convolutional autoencoders or recurrent neural networks to compress and reconstruct RGB images and outperform classical techniques such as JPEG2000 and BPG on perceptual metrics such as structural similarity index (SSIM) and multi-scale structural similarity index (MS-SSIM).<br />
<br />
These approaches encode an image <math>x </math> to some feature map (compressed representation), which is subsequently quantized to a set of symbols <math>z </math>. These symbols are then losslessly compressed to a bitstream, from which a decoder reconstructs an image <math>{\hat{x}} </math>, of the same dimensions as <math>x </math>.<br />
<br />
In this paper, the authors explore the idea of applying the learned representations to perform inference without reconstructing the compressed image. Specifically, instead of reconstructing an RGB image from the compressed representation and feeding it to a network for inference, the paper proposes to use a modified network that bypasses reconstruction of the RGB image.<br />
<br />
The rationale behind this approach is that the neural network architectures commonly used for learned compression (in particular the encoders) are similar to the ones commonly used for inference, and learned image encoders are hence, in principle, capable of extracting features relevant for inference tasks. The encoder might learn features relevant for inference purely by training on the compression task, and can be forced to learn these features by training on the compression and inference tasks jointly<br />
<br />
The advantage of learning an encoder for image compression which produces compressed representation containing features relevant for inference is obvious in scenarios where images are transmitted (e.g. from a mobile device) before processing (e.g. in the cloud), as it saves reconstruction of the RGB image as well as part of the feature extraction and hence speeds up processing. A typical use case is a cloud photo storage application where every image is processed immediately upon upload for indexing and search purposes.<br />
<br />
Note: [https://en.wikipedia.org/wiki/Structural_similarity More Information on SSIM, MSSIM]<br />
<br />
== Intuition ==<br />
<br />
Compression techniques (something as common as zipping) are commonly used by us in the day to day file handling tasks. Most often we use the engineered compression techniques. Deep Neural Networks (DNNs) are nonlinear function approximators which act as feature extractors, extracting features from inputs (like images or sound files). These can be seen as learning based compression techniques as they can perform compression and they can be trained using back propagation as well. If image classification can be done on these compressed files, large image data sets like hyperspectral images and MRI images can be stored efficiently and the compressed files can be used directly by the DNNs for classification or reinforcement learning tasks.<br />
<br />
==Motivation and Contributions==<br />
The authors propose to perform image understanding tasks such as image classification and segmentation directly on DNN based compressed representations. Performing the image understanding tasks on the compressed representations/encoded feature maps has two advantages. <br />
# This method bypasses the process of decoding the image into the RGB space before classification<br />
# The authors show that it reduces the overall computational complexity up to 2 times<br />
<br />
=== Contributions of the Paper ===<br />
* A method to perform image classification and semantic segmentation from compressed representations.<br />
* The proposed method offers classification accuracy similar to that achieved on decompressed images while reducing the computational complexity by 2 times.<br />
* Semantic segmentation has been shown to be as accurate as performance on decompressed images for moderate compression rates and higher accuracy for aggressive compression rates. In addition, this method achieves lower computational complexity.<br />
* Joint training for image compression and classification has been shown to improve the quality of the image and increase in accuracy of classification and segmentation<br />
<br />
==Related Work==<br />
<br />
The prior work has shown image classification from compressed images based on engineered codecs. Some of the works in this area are:<br />
<br />
* In video analysis domain: Action recognition (Yeo et al., 2008; Kantorov & Laptev, 2014)<br />
* Classification of compressed hyperspectral images (Hahn et al., 2014; Aghagolzadeh & Radha, 2015)<br />
* Discrete Cosine Transform based compression performed on images before feeding into a neural network, which shows an improvement in training speed by up to 10 times Fu & Guimaraes (2016)<br />
* Video analysis on compressed video (using engineered codecs) has also been studied in the past (Babu et al., 2016)<br />
* Criticism on document image analysis methods (Javed et al.2017)<br />
<br />
The authors propose a method that does inference on top of learned feature representation and hence has a direct relation to unsupervised feature learning using autoencoders.<br />
They also claim that so far there hasn't been any work using learned compressed representations for image classification and segmentation.<br />
<br />
==Learned Deeply Compressed Representations==<br />
<br />
The image compression task is performed based on a convolutional autoencoder architecture proposed by Theis et al. 2017 (shown in the figure below), and a variant of the training procedure described by Agustsson et. al 2017. <br />
<br />
[[File:AR_theisAutoencoder.png|600px|center]]<br />
<br />
=== Compression Architecture ===<br />
<br />
The compression network is an autoencoder that takes an input image <math>x </math> and outputs <math>{\hat{x}} </math> as the approximation to the input. <br />
<br />
[[File:AR_Fig2a.png|300px|center]]<br />
<br />
The encoder has the following structure: It starts with 2 convolutional layers with spatial subsampling by a factor of 2, followed by 3 residual units, and a final convolutional layer with spatial subsampling by a factor of 2. This results in a <math>w/8</math> x <math>h/8</math> x <math>C</math> dimensional representation, where <math>w </math> and <math>h </math> are the spatial dimensions of <math>x </math>, and the number of channels C is a hyperparameter related to the rate <math>R </math>. This representation is then quantized to a discrete set of symbols, forming a compressed representation, <math>z </math>.<br />
<br />
To get the reconstruction <math>{\hat{x}} </math>, the compressed representation is fed into the decoder, which mirrors the encoder, but uses upsampling and deconvolutions instead of subsampling and convolutions.<br />
<br />
Quantizing the compressed representation imposes a distortion <math>D </math> on <math>{\hat{x}} </math> w.r.t. <math>x </math>, i.e., it increases the reconstruction error. This is traded for a decrease in entropy of the quantized compressed representation<br />
<math>z </math> which leads to a decrease of the length of the bitstream as measured by the rate <math>R </math>. Thus, to train the image compression network, the classical rate-distortion trade-off <math>D + \beta R</math> is minimized. As a metric for <math>D </math>, the mean squared error (MSE) between <math>x </math> and <math>{\hat{x}} </math> are used and <math>R</math> is estimated using<br />
<math>H(q)</math>. <math>H(q)</math> is the entropy of the probability distribution over the symbols and is estimated using a histogram of the probability distribution (as done by Agustsson et al., 2017). The trade-off between MSE and the entropy is controlled by adjusting <math>\beta </math>. For each <math>\beta </math> an operating point is derived where the images have a certain bit rate, as measured by bits per pixel (bpp), and corresponding MSE. To better control the bpp, a target entropy Ht is introduced by the authors to formulate the loss defined as:<br />
<br />
\begin{align}<br />
\mathcal{L_c} = \text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0)<br />
\end{align}<br />
<br />
Agustsson et. al 2017, proposed a method to overcome the issue of non-differentiability of the quantization step by proposing a differentiable approximation to the quantization. This method has been adapted to suit the current application in the paper.<br />
<br />
Three operating points at 0.0983 bpp (C=8), 0.330 bpp (C=16), and 0.635 bpp (C=32) are obtained empirically. All further experiments are performed with these three operating points and the results for the same are presented in the following sections.<br />
<br />
==Image Classification from Compressed Representations==<br />
<br />
=== Classification on RGB Images ===<br />
<br />
For the image classification task based on the RGB images, the authors use the ResNet-50 architecture. <br />
Further information on residual networks can be found in the following links: <br />
[https://youtu.be/K0uoBKBQ1gA ResNets Part-1]<br />
[https://youtu.be/GSsKdtoatm8 ResNets Part-2]<br />
<br />
The details of the architecture are presented in the table below:<br />
<br />
[[File:AR_Tab1.png|400px|center]]<br />
<br />
In this paper, the number of 14x14 (conv4_x) blocks have been modified to obtain a new architecture called ResNet-71. <br />
<br />
=== Classification on Compressed Representations ===<br />
<br />
For input images with spatial dimension 224x224, the encoder of the compression network outputs a compressed representation with dimensions 28x28xC, where C is the number of channels. To use this compressed representation as input to the classification network, a simple variant of the ResNet architecture is proposed. This variant is referred to as cResNet-k, where c stands for “compressed representation” and k is the<br />
number of convolutional layers in the network. These networks are constructed by simply “cutting off” the front of the regular (RGB) ResNet. The root-block of the network and the residual layers that have a larger spatial dimension than 28x28 are removed. To adjust the number of layers k, the ResNet architecture proposed by He et al. (2015) is used and the number of 14x14 (conv4 x) residual blocks are modified.<br />
<br />
In this way, three different architectures are derived:<br />
* cResNet-39 is ResNet-50 with the first 11 layers removed as described above, and this significantly reduces computational cost<br />
* cResNet-51<br />
* cResNet-72<br />
<br />
cResNet-51 and cResNet-72 are obtained by adding 14x14 residual blocks to match the computational cost of ResNet-50 and ResNet-71 respectively.<br />
<br />
The detailed description of all the network architectures are presented below:<br />
<br />
[[File:AR_Tab3.png|600px|center]]<br />
<br />
==Semantic Segmentation from Compressed Representations==<br />
<br />
For semantic segmentation, the ResNet based DeepLab architecture is adapted for the proposed application. The cResNet<br />
and ResNet image classification architectures are re-purposed with atrous<br />
convolutions, where the filters are upsampled instead of downsampling the feature maps. This is<br />
done to increase their receptive field and to prevent aggressive subsampling of the feature maps. For segmentation, the ResNet architecture is restructured such<br />
that the output feature map has 8 times smaller spatial dimension than the original RGB image (instead<br />
subsampling by a factor 32 times like for classification). When using the cResNets the output feature<br />
map has the same spatial dimensions as the input compressed representation (instead of subsampling<br />
4 times like for classification). This results in comparably sized feature maps for both the compressed<br />
representation and the reconstructed RGB images. Finally the last 1000-way classification layer of<br />
these classification architectures is replaced by an atrous spatial pyramid pooling (ASPP) with four<br />
parallel branches with rates {6, 12, 18, 24}, which provides the final pixel-wise classification.<br />
<br />
==Joint Training for Compression and Image Classification==<br />
<br />
The authors propose a joint training strategy to combine compression and classification tasks. To do this, the proposed method combines the compression network and the cResNet-51 architecture. The figure below shows the combined pipeline:<br />
<br />
[[File:AR_Fig2b.png|300px|center]]<br />
<br />
All parts, encoder, decoder, and inference network, are trained at the same time. The compressed representation is fed<br />
to the decoder to optimize for mean-squared reconstruction error and to a cResNet-51 network to<br />
optimize for classification using a cross-entropy loss. The combined loss function takes the form:<br />
<br />
\begin{align}<br />
\mathcal{L_c} = \gamma(\text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0))+l_{ce}(y,{\hat{y}})<br />
\end{align}<br />
<br />
where the loss terms for the compression network, <math> \mathcal{L_c} = \text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0)</math>, are the same as in training for compression only. <math> l_{ce}</math> is the cross-entropy loss for classification.<br />
<math>\gamma </math> controls the trade-off between the compression loss and the classification loss.<br />
<br />
==Experiments and Results==<br />
<br />
=== Learned Deeply Compressed Representations Results ===<br />
<br />
All experiments have been performed on the ILSVRC2012 dataset.<br />
<br />
The metrics used to measure the compression quality are as follows: <br />
* PSNR (Peak Signal-to-Noise Ratio) is a standard measure, depending monotonically on mean squared error defined as: <br />
<br />
\begin{align}<br />
PSNR = 10(\log_{10}(255^2/MSE))<br />
\end{align}<br />
<br />
* SSIM (Structural Similarity Index) and MS-SSIM (Multi-Scale SSIM) are metrics proposed to measure the similarity of images as perceived by humans<br />
<br />
The figure below depicts the performance of the deep compression models vs. standard JPEG and JPEG2000. Higher values are better. The proposed technique outperforms the JPEG and JPEC2000 at the operating points used in this paper.<br />
<br />
[[File:AR_Fig8.png|600px|center]]<br />
<br />
The learned compressed representations are illustrated in the figure below. <br />
<br />
[[File:AR_Fig9.png|500px|center]]<br />
<br />
In the above figure, the original RGB-image is shown along with compressed versions of the RGB image which are reconstructed from the compressed representations. The 4 channels with the highest entropy are shown in the visualizations. These visualizations indicate how the networks compress an image, as the rate (bpp) gets lower the entropy cost of the network forces the<br />
compressed representation to use fewer quantization levels, as can clearly be seen. For the most aggressive compression, the channel maps use only 2 levels for the compressed representation.<br />
<br />
=== Classification on Compressed Representations ===<br />
<br />
All experiments have been performed on the ILSVRC2012 dataset. It consists of 1.28 million training images and 50k validation images. These images are distributed across 1000 diverse classes. For image classification, the top-1 classification accuracy and top-5 classification accuracy are reported on the validation set on 224x224 center crops for RGB images and 28x28 center crops for the compressed representation.<br />
<br />
==== Training Procedure ====<br />
<br />
The compression network is fixed while training the classification network, both when training with compressed representations and with reconstructed compressed RGB images. For the compressed representations, the output of the fixed encoder (the compressed representation) is provided input to the cResNets (decoder is not needed). When training on the reconstructed compressed RGB images, the output of the fixed encoder-decoder (RGB image) is provided as input to the ResNet. This is done for each operating point.<br />
<br />
Refer to Appendix A Section A4, of the paper for details on the hyperparameters and optimization used for training the network [1].<br />
<br />
==== Classification Results ====<br />
<br />
The tables below present the results of the classification at each operating point, both classifying from the compressed representation and the corresponding reconstructed compressed RGB images.<br />
<br />
[[File:AR_Tab2.png|400|center]]<br />
<br />
Figure below shows the validation curves for ResNet-50, cResNet-51, and cResNet-39. <br />
<br />
[[File:AR_Fig3.png|700|center]]<br />
<br />
For the 2 classification architectures with the same computational complexity (ResNet-50 and cResNet-51), the validation curves at the 0.635 bpp compression operating point almost coincide, with ResNet-50 performing slightly better. As the rate (bpp) gets smaller this performance gap gets smaller. The table above shows the<br />
classification results when the different architectures have converged. At the 0.635 bpp operating point, ResNet-50 only performs 0.5% better in top-5 accuracy than cResNet-51, while for the 0.0983 bpp operating point this difference is only 0.3%.<br />
Using the same pre-processing and the same learning rate schedule but starting from the original uncompressed RGB images yields 89.96% top-5 accuracy. The top-5 accuracy obtained from the compressed representation at the 0.635 bpp compression operating point, 87.85%, is even competitive<br />
with that obtained for the original images at a significantly lower storage cost. Specifically, at 0.635 bpp the ImageNet dataset requires 24.8 GB of storage space instead of 144 GB for the original version, a reduction by a factor 5.8 times.<br />
<br />
Notes on top-1 and top-5 accuracy:<br />
<br />
* Top-1 accuracy: This is the conventional accuracy metric used in machine learning. Wherein if the true label of the input to a model matches the highest probability class of the last layer of the output of CNN (predicted class probability), then the given input is correctly classified, else it is considered as incorrectly classified.<br />
* Top-5 accuracy: In this case, if any of the model's 5 highest classification probabilities match with the true label of the input, then this is considered as a correct classification, else it is an incorrect classification.<br />
<br />
===Semantic Segmentation Results===<br />
<br />
All experiments have been performed on the PASCAL VOC-2012 dataset for semantic segmentation. It has 20 object foreground classes and 1 background class. The dataset<br />
consists of 1464 training and 1449 validation images. In every image, each pixel is annotated with<br />
one of the 20 + 1 classes. The original dataset is furthermore augmented with extra annotations, so the final dataset has 10,582 images for training and 1449 images for validation.<br />
<br />
All performance is measured on pixelwise intersection-over-union (IoU) averaged over all the classes or mean-intersection-over-union (mIoU) on the validation set. <br />
<br />
[https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ Details on IoU]<br />
<br />
==== Training Procedure ====<br />
The cResNet/ResNet networks are pre-trained on the ImageNet dataset using the procedure described earlier on the image classification task, the encoder and decoder is fixed as in the earlier scenario. The architectures are then adapted with dilated convolutions, cResNet-d/ResNet-d, and<br />
finetuned on the semantic segmentation task.<br />
<br />
Refer to Appendix A Section A5, of the paper for details on the hyperparameters and optimization used for training the network [1].<br />
<br />
==== Segmentation Results ====<br />
<br />
The table below shows the mIoU results for the segmentation task.<br />
<br />
[[File:AR_Tab2.png|450|center]]<br />
<br />
The figure below illustrates the segmentation results with respect to each compression operating point.<br />
<br />
[[File:AR_Fig4.png|700|center]]<br />
<br />
For semantic segmentation ResNet-50-d and cResNet-51-d perform equally well at the 0.635 bpp compression operating point. For the<br />
0.330 bpp operating point, segmentation from the compressed representation performs slightly better, 0.37%, and at the 0.0983 bpp operating point segmentation from the compressed representation<br />
performs considerably better than for the reconstructed compressed RGB images, by 1.65%.<br />
<br />
[[File:AR_Fig5.png|600px|center]]<br />
<br />
The above figure shows the predicted segmentation visually for both the cResNet-51-d and the ResNet-50-d<br />
architecture at each operating point. Along with the segmentation, it also shows the original uncompressed<br />
RGB image and the reconstructed compressed RGB image. These images highlight<br />
the challenging nature of these segmentation tasks, but they can nevertheless be performed using the<br />
compressed representation. They also clearly indicate that the compression affects the segmentation,<br />
as lowering the rate (bpp) progressively removes details in the image. Comparing the segmentation<br />
from the reconstructed RGB images to the segmentation from the compressed representation visually,<br />
the performance is similar.<br />
<br />
The figure below is another example of visual results of segmentation from compressed representation and reconstructed RGB<br />
images. The performance is visually similar for all operating points except for the 0.0983<br />
bpp operating point where the reconstructed RGB image fails to capture the back part of<br />
the train, while the compressed representation manages to capture that aspect of the image in the<br />
segmentation.<br />
<br />
[[File:AR_Fig10.png|600px|center]]<br />
<br />
=== Results on Computational Gains ===<br />
<br />
[[File:AR_Fig6.png|400px|center]]<br />
<br />
=====Computational Gains on Classification=====<br />
<br />
The figure on the left illustrates, the top-5 classification accuracy as a function of computational<br />
complexity for the 0.0983 bpp compression operating point.<br />
Looking at a fixed computational cost, the reconstructed compressed RGB images perform about 0.25% better. Looking at a fixed classification cost, inference from the compressed representation costs about 0.6 * 10^9 FLOPs more. However when accounting for the decoding cost at a fixed<br />
classification performance, inference from the reconstructed compressed RGB images costs 2.2*10^9 FLOPs more than inference from the compressed representation.<br />
<br />
=====Computational Gains on Segmentation=====<br />
<br />
In the figure on the right illustrates, the mIoU validation performance is shown as a function of computational complexity for<br />
the 0.0983 bpp compression operating point. <br />
Here, even without accounting for the decoding cost of the reconstructed images, the compressed representation<br />
performs better. At a fixed computational cost, segmentation from the compressed representation gives about 0.7% better mIoU. And at a fixed mIoU the computational cost is about 3.3*10^9 FLOPs<br />
lower for compressed representations. Accounting for the decoding costs this difference becomes 6.1*10^9 FLOPs. due to the nature of the dilated convolutions and the increased feature map size, the<br />
relative computational gains for segmentation are not as pronounced as for classification.<br />
<br />
===Joint Training for Compression and Image Classification===<br />
<br />
==== Training Procedure ====<br />
<br />
When doing joint training, the compression network and the classification networks are first initialized<br />
from a trained state obtained as described previously. After initialization, the networks are<br />
both finetuned jointly. For a detailed<br />
description of hyperparameters used and the training schedule see Appendix A8.<br />
<br />
To control that the change in classification accuracy is not only due to (1) a better compression<br />
operating point or (2) the fact that the cResNet is trained longer, the following is done. A new operating point is obtained by finetuning the compression network only using the schedule described<br />
above. The cResNet-51 is trained on top of this new operating point from scratch. Finally, the compression network is fixed at the new operating point, and the cResNet-51 is trained for 9 epochs. <br />
<br />
To obtain segmentation results, the jointly trained network is used. The operating point is fixed and the jointly finetuned classification network is adopted fro segmentation (cResNet-51-d).<br />
<br />
==== Joint Training Results ====<br />
<br />
[[File:AR_Fig7.png|400px|center]]<br />
<br />
It can be seen from the figure, that the classification and segmentation results “move<br />
up” from the baseline through finetuning. When training jointly the improvement for classification are larger and<br />
a significant improvement for segmentation is achieved. For the 0.635 bpp operating point the classification performance is similar for training the network jointly and training<br />
the compression network only, but when using these operating points for segmentation the difference is considerable.<br />
<br />
The results presented by the authors suggest an improvement in classification by 2%, a performance gain which would<br />
require an additional 75% of the computational complexity of cResNet-51. The segmentation<br />
performance after training the networks jointly is 1.7% better in mIoU than training only<br />
the compression network.<br />
<br />
==Critique==<br />
<br />
The paper proposes how previous work in autoencoders and image compression can be extended effectively to a novel task of a combined image compression and recognition task. The work has provided extensive experimental evaluation and evidence that suggests that learned compressed representations can be effective in classification and segmentation tasks. While maintaining the performance of the techniques to state of the art performance, the authors show that the proposed method can offer significant computational gains. The applications of this can be in<br />
multimedia communication, wireless transmission of images, video surveillance on the mobile edge, etc. With the advent of 5G and other new wireless technologies, this method offers capabilities that can be utilized to conserve wireless bandwidth, savings on storage while retaining the perceptual quality of images.<br />
The joint training of compression and classification network provides some added advantages and also shows that at aggressive compression rates the performance in classification and segmentation can be improved significantly.<br />
<br />
The authors mention that the complexity of the current approach is still high in comparison with methods like JPEG or JPEG2000. They also mention that this can be overcome when the networks are trained and run on GPUs. Although this has been seen as a drawback, with subsequent improvements in physical hardware and more specialized deep learning platforms, the limitation of the current approach can be overcome. Finally, in the light of providing extensive experimental contributions,<br />
the authors have written a quite lengthy paper. There are parts of the paper where the ideas have been repeated frequently, and this could've been avoided leading to a more well-balanced length of the article.<br />
<br />
==Conclusion==<br />
<br />
The paper proposes an inference task using compressed image representations without the need to decode for classification and semantic segmentation. The paper has successfully demonstrated through a set of rigorous experiments the approach<br />
for performing the intended tasks. The results show significant improvements in computational complexity while maintaining state of the art classification and segmentation performance. The authors also intend to explore other computer vision tasks based on using compressed representation as part of the future work. They also suggest that this could potentially lead to gaining a better understanding of the features/compressed representations learned by image compression networks leading to applications in unsupervised or semi-supervised learning.<br />
<br />
==References==<br />
# Torfason, R., Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., & Van Gool, L. (2018). Towards image understanding from deep compression without decoding. arXiv preprint arXiv:1803.06131.<br />
# Theis, L., Shi, W., Cunningham, A., & Huszár, F. (2017). Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.<br />
# Agustsson, E., Mentzer, F., Tschannen, M., Cavigelli, L., Timofte, R., Benini, L., & Gool, L. V. (2017). Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems (pp. 1141-1151).<br />
# He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
# Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4), 834-848.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=37772stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-11-05T00:33:09Z<p>D35kumar: /* Results and Discussion */ Added details to the experimental setup (Technical)</p>
<hr />
<div>=Wavelet Pooling For Convolutional Neural Networks=<br />
<br />
[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]<br />
<br />
<br />
== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Nets (CNN): These are networks with layered structures that conform to the shape of inputs and consistently obtain high accuracies in the classification of images and objects. Researchers continue to focus on CNN to improve their performances. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some of the pooling methods, including max pooling and average pooling, are deterministic. This means they are efficient and simple but hinder the potential for optimal network learning. However, mixed pooling and stochastic pooling use probabilistic approach. So, they can address some problems of deterministic methods. Neighborhood approach is used in all the mentioned pooling methods due to simplicity and efficiency. Nevertheless, it suffers from edge halos, blurring, and aliasing which needs to be minimized. This paper introduces wavelet pooling which uses a second-level wavelet decomposition to subsample features. The nearest neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second level decomposition and discards first level subbands to reduce feature dimensions. This method is compared to other state-of-the-art pooling methods to demonstrate superior results. The tests are conducted on benchmark classification tests like MNIST, CIFAR10, SHVN and KDEF.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling by providing a more structural feature dimension reduction method. Max pooling is addressed to have over-fitting problems and average pooling is mentioned to smooth data.<br />
<br />
== History ==<br />
<br />
A history of different pooling methods have been introduced and referenced in this study:<br />
* manual subsampling at 1979<br />
* Max pooling at 1992<br />
* Mixed pooling at 2014<br />
* pooling methods with probabilistic approaches at 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (happens if the main details have less intensity than the insignificant details) and also commonly '''over-fits''' the training data. The max-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (happens for data with values much lower than the significant details) The avg-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at<br />
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
<br />
<br />
'''How the researchers try to '''combat these issues'''?'''<br />
Using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': which combines max and average pooling by randomly selecting one over the other during training in three separate ways:<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda. max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda).\frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>\lambda</math> is a random value 0 or 1, indicating max or average pooling.<br />
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l; where: l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
The figure below describes the process of Stochastic Pooling. The figure on the left shows the activations of a given region, and the corresponding probability is shown in the center. The activations with the highest probability is selected by the pooling method. However, any activation can be selected. In this case, the midrange activation of 13% is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
<br />
3. "Top-k activation pooling" is the method that picks the top-k activation in every pooling region. This makes sure that the maximum information can pass through subsampling gates. It is to be used with max pooling, but after max pooling, to further improve the representation capability, they pick top-k activation, sum them up, and constrain the summation by a constant. <br />
Details in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
<br />
'''Wavelets and Wavelet Transform'''<br />
A wavelet is a representation of some signal. For use in wavelet transforms, they are generally represented as combinations of basis signal functions.<br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency, or in time (or presumably, image location). For example, a sine wave will be useful to detect signals with its own frequency, but cannot detect where along the sine wave this alignment of signals is occuring. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
<br />
== Proposed Method ==<br />
<br />
The proposed pooling method uses wavelets to reduce the dimensions of the feature maps. They use wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\psi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, and <math>\psi</math> is the detail function, <math>W_{\varphi},W_{\psi}</math> are called approximation and detail coefficients. <math>h_{\varphi[-n]}</math> and <math>h_{\psi[-n]}</math> are the time reversed scaling and wavelet vectors, (n) represents the sample in the vector, while (j) denotes the resolution level<br />
<br />
When using the FWT on images, it is applied twice (once on the rows, then again on the columns). By doing this in combination, the detail sub-bands (LH, HL, HH) at each decomposition level, and approximation sub-band (LL) for the highest decomposition level is obtained.<br />
After performing the 2nd order decomposition, the image features are reconstructed, but only using the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT) which is based off the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands up-sample by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5, illustrates the wavelet pooling backpropagation algorithm in details:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
<br />
== Results and Discussion ==<br />
<br />
All experiments have been performed using the MatConvNet(Vedaldi & Lenc, 2015) architecture. Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except for MNIST use a network loosely based on Zeilers network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013) and the Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SHVN (Dropout only) to examine how these regularization techniques change the pooling results. The authors have tested the proposed method on four different datasets as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Different methods based on Max, Average, Mixed, Stochastic and Wavelet have been used at the pooling section of each architecture. Accuracy and Model Energy have been used as the metrics to evaluate the performance of the proposed methods. These have been evaluated and their performances have been compared on different data-sets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch-normalization, inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows their proposed method outperforms all methods. Given the small number of epochs, max pooling is the only method to start to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. For both cases, with no dropout, and with dropout, Tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SHVN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization same as what happened in the previous datasets.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SHVN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SHVN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images they extract from the extra training set of 531,131 images, as well as the full testing set of 26,032 images. For both cases, with no dropout, and with dropout, Tables below show their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that need to match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
<br />
The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
== Conclusion ==<br />
<br />
They prove wavelet pooling has the potential to equal or eclipse some of the traditional methods currently utilized in CNNs. Their proposed method outperforms all others in the MNIST dataset, outperforms all but one in the CIFAR-10 and KDEF datasets, and performs within respectable ranges of the pooling methods that outdo it in the SHVN dataset. The addition of dropout and batch normalization show their proposed methods response to network regularization. Like the non-dropout cases, it outperforms all but one in both the CIFAR-10 & KDEF datasets and performs within respectable ranges of the pooling methods that outdo it in the SHVN dataset.<br />
<br />
== Suggested Future work ==<br />
<br />
Upsampling and downsampling factors in decomposition and reconstruction needs to be changed to achieve more feature reduction.<br />
The subbands that we previously discard should be kept for higher accuracies. To achieve higher computational efficiency, improving the FTW method is needed.<br />
<br />
== Critiques and Suggestions ==<br />
*The functionality of backpropagation process which can be a positive point of the study is not described enough comparing to the existing methods.<br />
* The main study is on wavelet decomposition while the reason of using Haar as mother wavelet and the number of decomposition levels selection has not been described and are just mentioned as a future study! <br />
* At the beginning, the study mentions that the pooling method is not under attention as it should be. In the end, results show that choosing the pooling method depends on the dataset and they mention trial and test as a reasonable approach to choose the pooling method. In my point of view, the authors have not really been focused on providing a pooling method which can help the current conditions to be improved effectively. At least, trying to extract a better pattern for relating results to the dataset structure could be so helpful.<br />
* Average pooling origins which are mentioned as the main pooling algorithm to compare with, is not even referenced in the introduction.<br />
* Combination of the wavelet, Max and Average pooling can be an interesting option to investigate more on this topic; both in a row(Max/Avg after wavelet pooling) and combined like mix pooling option.<br />
* While the current datasets express the performance of the proposed method in an appropriate way, it could be a good idea to evaluate the method using some larger datasets. Maybe it helps to understand whether the size of a dataset can affect the overfitting behavior of max pooling which is mentioned in the paper.<br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=37771stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-11-05T00:23:32Z<p>D35kumar: /* Background */ Correcting minor writing issues [Editorial]</p>
<hr />
<div>=Wavelet Pooling For Convolutional Neural Networks=<br />
<br />
[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]<br />
<br />
<br />
== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Nets (CNN): These are networks with layered structures that conform to the shape of inputs and consistently obtain high accuracies in the classification of images and objects. Researchers continue to focus on CNN to improve their performances. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some of the pooling methods, including max pooling and average pooling, are deterministic. This means they are efficient and simple but hinder the potential for optimal network learning. However, mixed pooling and stochastic pooling use probabilistic approach. So, they can address some problems of deterministic methods. Neighborhood approach is used in all the mentioned pooling methods due to simplicity and efficiency. Nevertheless, it suffers from edge halos, blurring, and aliasing which needs to be minimized. This paper introduces wavelet pooling which uses a second-level wavelet decomposition to subsample features. The nearest neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second level decomposition and discards first level subbands to reduce feature dimensions. This method is compared to other state-of-the-art pooling methods to demonstrate superior results. The tests are conducted on benchmark classification tests like MNIST, CIFAR10, SHVN and KDEF.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling by providing a more structural feature dimension reduction method. Max pooling is addressed to have over-fitting problems and average pooling is mentioned to smooth data.<br />
<br />
== History ==<br />
<br />
A history of different pooling methods have been introduced and referenced in this study:<br />
* manual subsampling at 1979<br />
* Max pooling at 1992<br />
* Mixed pooling at 2014<br />
* pooling methods with probabilistic approaches at 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (happens if the main details have less intensity than the insignificant details) and also commonly '''over-fits''' the training data. The max-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (happens for data with values much lower than the significant details) The avg-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at<br />
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
<br />
<br />
'''How the researchers try to '''combat these issues'''?'''<br />
Using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': which combines max and average pooling by randomly selecting one over the other during training in three separate ways:<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda. max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda).\frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>\lambda</math> is a random value 0 or 1, indicating max or average pooling.<br />
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l; where: l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
The figure below describes the process of Stochastic Pooling. The figure on the left shows the activations of a given region, and the corresponding probability is shown in the center. The activations with the highest probability is selected by the pooling method. However, any activation can be selected. In this case, the midrange activation of 13% is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
<br />
3. "Top-k activation pooling" is the method that picks the top-k activation in every pooling region. This makes sure that the maximum information can pass through subsampling gates. It is to be used with max pooling, but after max pooling, to further improve the representation capability, they pick top-k activation, sum them up, and constrain the summation by a constant. <br />
Details in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
<br />
'''Wavelets and Wavelet Transform'''<br />
A wavelet is a representation of some signal. For use in wavelet transforms, they are generally represented as combinations of basis signal functions.<br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency, or in time (or presumably, image location). For example, a sine wave will be useful to detect signals with its own frequency, but cannot detect where along the sine wave this alignment of signals is occuring. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
<br />
== Proposed Method ==<br />
<br />
The proposed pooling method uses wavelets to reduce the dimensions of the feature maps. They use wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\psi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, and <math>\psi</math> is the detail function, <math>W_{\varphi},W_{\psi}</math> are called approximation and detail coefficients. <math>h_{\varphi[-n]}</math> and <math>h_{\psi[-n]}</math> are the time reversed scaling and wavelet vectors, (n) represents the sample in the vector, while (j) denotes the resolution level<br />
<br />
When using the FWT on images, it is applied twice (once on the rows, then again on the columns). By doing this in combination, the detail sub-bands (LH, HL, HH) at each decomposition level, and approximation sub-band (LL) for the highest decomposition level is obtained.<br />
After performing the 2nd order decomposition, the image features are reconstructed, but only using the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT) which is based off the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands up-sample by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5, illustrates the wavelet pooling backpropagation algorithm in details:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
<br />
== Results and Discussion ==<br />
<br />
All experiments have been performed using the MatConvNet architecture. Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. The authors have tested the proposed method on four different datasets as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Different methods based on Max, Average, Mixed, Stochastic and Wavelet have been used at the pooling section of each architecture. Accuracy and Model Energy have been used as the metrics to evaluate the performance of the proposed methods. These have been evaluated and their performances have been compared on different data-sets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch-normalization, inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows their proposed method outperforms all methods. Given the small number of epochs, max pooling is the only method to start to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. For both cases, with no dropout, and with dropout, Tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SHVN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization same as what happened in the previous datasets.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SHVN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SHVN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images they extract from the extra training set of 531,131 images, as well as the full testing set of 26,032 images. For both cases, with no dropout, and with dropout, Tables below show their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that need to match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
<br />
The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
== Conclusion ==<br />
<br />
They prove wavelet pooling has the potential to equal or eclipse some of the traditional methods currently utilized in CNNs. Their proposed method outperforms all others in the MNIST dataset, outperforms all but one in the CIFAR-10 and KDEF datasets, and performs within respectable ranges of the pooling methods that outdo it in the SHVN dataset. The addition of dropout and batch normalization show their proposed methods response to network regularization. Like the non-dropout cases, it outperforms all but one in both the CIFAR-10 & KDEF datasets and performs within respectable ranges of the pooling methods that outdo it in the SHVN dataset.<br />
<br />
== Suggested Future work ==<br />
<br />
Upsampling and downsampling factors in decomposition and reconstruction needs to be changed to achieve more feature reduction.<br />
The subbands that we previously discard should be kept for higher accuracies. To achieve higher computational efficiency, improving the FTW method is needed.<br />
<br />
== Critiques and Suggestions ==<br />
*The functionality of backpropagation process which can be a positive point of the study is not described enough comparing to the existing methods.<br />
* The main study is on wavelet decomposition while the reason of using Haar as mother wavelet and the number of decomposition levels selection has not been described and are just mentioned as a future study! <br />
* At the beginning, the study mentions that the pooling method is not under attention as it should be. In the end, results show that choosing the pooling method depends on the dataset and they mention trial and test as a reasonable approach to choose the pooling method. In my point of view, the authors have not really been focused on providing a pooling method which can help the current conditions to be improved effectively. At least, trying to extract a better pattern for relating results to the dataset structure could be so helpful.<br />
* Average pooling origins which are mentioned as the main pooling algorithm to compare with, is not even referenced in the introduction.<br />
* Combination of the wavelet, Max and Average pooling can be an interesting option to investigate more on this topic; both in a row(Max/Avg after wavelet pooling) and combined like mix pooling option.<br />
* While the current datasets express the performance of the proposed method in an appropriate way, it could be a good idea to evaluate the method using some larger datasets. Maybe it helps to understand whether the size of a dataset can affect the overfitting behavior of max pooling which is mentioned in the paper.<br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach&diff=37770Learning to Teach2018-11-05T00:10:57Z<p>D35kumar: /* Framework */ Adding more technical details for different states. [Technical]</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.<br />
<br />
In modern human society, the role of teaching is heavily implicated in our education system, the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence and specifically machine learning, researchers have focused most of their efforts on the ''student'' ie. designing various optimization algorithms to enhance the learning ability of intelligent agents. The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can select training data which corresponds to choosing the right teaching materials (e.g. textbooks); designing the loss functions corresponding to setting up targeted examinations; defining the hypothesis space corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.<br />
<br />
Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the Machine Learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the Reinforcement Learning process to update its parameters. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra efforts on re-training.<br />
<br />
To demonstrate the practical value of the proposed approach, a specific problem is chosen, '''training data scheduling''', as an example. The authors show that by using the proposed method to adaptively select the most<br />
suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks including multi-layer perceptron (MLP), convolutional neural networks (CNNs)<br />
and recurrent neural networks (RNNs), for different applications including image classification and text understanding.<br />
<br />
=Related Work=<br />
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)<br />
<br />
The second is the teaching which can be classified into machine-teaching (Zhu, 2015) [2] and hardness based methods. The former seeks to construct a minimal training set for the student to learn a target model (ie. an oracle). The latter assumes an order of data from easy instances to hard ones, hardness being determined in different ways. In curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by loss on data. <br />
<br />
The limitations of these works boil down to a lack of formally defined teaching problem as well as the reliance on heuristics and fixed rules for teaching which hinders generalization of the teaching task.<br />
<br />
=Learning to Teach=<br />
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.<br />
<br />
In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.<br />
<br />
<br />
<br />
==Problem Definition==<br />
The student model, denoted &mu;(), takes input: the set of training data <math> D </math>, the function class <math> Ω </math>, and loss function <math> L </math> to output a function, <math> f(ω) </math>, with parameter <math>ω^*</math> which minimizes risk <math>R(ω)</math>.<br />
<br />
The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.<br />
<br />
::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.<br />
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.<br />
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, eg. middle school students taught math with basic algebra while undergraduate students are taught with calculus. Different Ω leads to different errors and optimization problem (Mohri et al., 2012).<br />
<br />
==Framework==<br />
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. The L2T process is outlined in figure below:<br />
<br />
[[File: L2T_process.png | 500px|center]]<br />
<br />
* <math> s_t &isin; S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model.<br />
* <math> a_t &isin; A </math> represents action taken the teacher model at time <math> t </math>. Can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.<br />
* <math> φ_θ : S → A </math> is policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math><br />
* Student model takes <math> a_t </math> as input and outputs function <math> f_t </math>, by using the conventional ML techniques.<br />
<br />
Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.<br />
<br />
=Application=<br />
<br />
There are different approaches to training the teacher model, this paper will apply reinforcement learning with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure will be classification accuracy. Its learning rule will be mini-batch stochastic gradient descent, where batches of data will arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of the mini-batch of data will be fed to the student.<br />
<br />
The optimizer for training the teacher model is maximum expected reward: <br />
<br />
\begin{align} <br />
J(θ) = E_{φ_θ(a|s)}[R(s,a)]<br />
\end{align}<br />
<br />
Which is non-differentiable w.r.t. <math> θ </math>, thus a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4]<br />
<br />
==Experiments==<br />
<br />
The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN). <br />
<br />
The student tasks are Image classification for MNIST, for CIFAR-10, and sentiment classification for IMDB movie review dataset. <br />
<br />
The strategy will be benchmarked against the following teaching strategies:<br />
<br />
::'''NoTeach''': Outputting a good training set D, analogous to human teachers providing students with proper learning materials such as textbooks<br />
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness.<br />
::'''L2T''': The Learning to Teach framework.<br />
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).<br />
<br />
When teaching a new student with the same model architecture, we observe that L2T achieves significantly faster convergence than other strategies across all tasks:<br />
<br />
[[File: L2T_speed.png | 1100px|center]]<br />
<br />
===Filtration Number===<br />
<br />
When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. Thus for the two image classification tasks, the student model can learn from hard instances of data from the beginning for training while for the natural language task, the student model must first learn from easy data instances.<br />
<br />
[[File: L2T_fig3.png | 1100px|center]]<br />
<br />
===Teaching New Student with Different Model Architecture===<br />
<br />
In this part, first a teacher model is trained by interacting with a student model. Then using the teacher model, another student model<br />
which has a different model architecture is taught.<br />
The results of Applying the teacher trained on ResNet32 to teach other architectures is shown below.<br />
<br />
[[File: L2T_fig4.png | 1100px|center]]<br />
<br />
===Training Time Analysis===<br />
<br />
The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.<br />
<br />
[[File: L2T_fig5.png | 600px|center]]<br />
<br />
===Accuracy Improvement===<br />
<br />
When comparing training accuracy on the IMDB sentiment classification task, L2T improves on teaching policy over NoTeach and SPL.<br />
<br />
[[File: L2T_t1.png | 500px|center]]<br />
<br />
=Future Work=<br />
<br />
There is some useful future work that can be extended from this work: <br />
<br />
1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper. <br />
<br />
2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework. <br />
<br />
3) It would be interesting to try out the framework suggested in this paper (L2T) in Imperfect information and partially observable settings. <br />
<br />
=Critique=<br />
<br />
While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'' which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies whereas the difference in accuracy less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task, the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach&diff=37769Learning to Teach2018-11-05T00:00:25Z<p>D35kumar: Adding more technical context to the introduction [Technical]</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.<br />
<br />
In modern human society, the role of teaching is heavily implicated in our education system, the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence and specifically machine learning, researchers have focused most of their efforts on the ''student'' ie. designing various optimization algorithms to enhance the learning ability of intelligent agents. The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can select training data which corresponds to choosing the right teaching materials (e.g. textbooks); designing the loss functions corresponding to setting up targeted examinations; defining the hypothesis space corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.<br />
<br />
Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the Machine Learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the Reinforcement Learning process to update its parameters. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra efforts on re-training.<br />
<br />
To demonstrate the practical value of the proposed approach, a specific problem is chosen, '''training data scheduling''', as an example. The authors show that by using the proposed method to adaptively select the most<br />
suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks including multi-layer perceptron (MLP), convolutional neural networks (CNNs)<br />
and recurrent neural networks (RNNs), for different applications including image classification and text understanding.<br />
<br />
=Related Work=<br />
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)<br />
<br />
The second is the teaching which can be classified into machine-teaching (Zhu, 2015) [2] and hardness based methods. The former seeks to construct a minimal training set for the student to learn a target model (ie. an oracle). The latter assumes an order of data from easy instances to hard ones, hardness being determined in different ways. In curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by loss on data. <br />
<br />
The limitations of these works boil down to a lack of formally defined teaching problem as well as the reliance on heuristics and fixed rules for teaching which hinders generalization of the teaching task.<br />
<br />
=Learning to Teach=<br />
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.<br />
<br />
In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.<br />
<br />
<br />
<br />
==Problem Definition==<br />
The student model, denoted &mu;(), takes input: the set of training data <math> D </math>, the function class <math> Ω </math>, and loss function <math> L </math> to output a function, <math> f(ω) </math>, with parameter <math>ω^*</math> which minimizes risk <math>R(ω)</math>.<br />
<br />
The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.<br />
<br />
::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.<br />
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.<br />
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, eg. middle school students taught math with basic algebra while undergraduate students are taught with calculus. Different Ω leads to different errors and optimization problem (Mohri et al., 2012).<br />
<br />
==Framework==<br />
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. The L2T process is outlined in figure below:<br />
<br />
[[File: L2T_process.png | 500px|center]]<br />
<br />
* <math> s_t &isin; S </math> represents information available to the teacher model at time <math> t </math><br />
* <math> a_t &isin; A </math> represents action taken the teacher model at time <math> t </math>. Can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.<br />
* <math> φ_θ : S → A </math> is policy used by teach moderl to generate action <math> φ_θ(s_t) = a_t </math><br />
* Student model takes <math> a_t </math> as input and outputs function <math> f_t </math> <br />
<br />
Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.<br />
<br />
=Application=<br />
<br />
There are different approaches to training the teacher model, this paper will apply reinforcement learning with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure will be classification accuracy. Its learning rule will be mini-batch stochastic gradient descent, where batches of data will arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of the mini-batch of data will be fed to the student.<br />
<br />
The optimizer for training the teacher model is maximum expected reward: <br />
<br />
\begin{align} <br />
J(θ) = E_{φ_θ(a|s)}[R(s,a)]<br />
\end{align}<br />
<br />
Which is non-differentiable w.r.t. <math> θ </math>, thus a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4]<br />
<br />
==Experiments==<br />
<br />
The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN). <br />
<br />
The student tasks are Image classification for MNIST, for CIFAR-10, and sentiment classification for IMDB movie review dataset. <br />
<br />
The strategy will be benchmarked against the following teaching strategies:<br />
<br />
::'''NoTeach''': Outputting a good training set D, analogous to human teachers providing students with proper learning materials such as textbooks<br />
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness.<br />
::'''L2T''': The Learning to Teach framework.<br />
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).<br />
<br />
When teaching a new student with the same model architecture, we observe that L2T achieves significantly faster convergence than other strategies across all tasks:<br />
<br />
[[File: L2T_speed.png | 1100px|center]]<br />
<br />
===Filtration Number===<br />
<br />
When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. Thus for the two image classification tasks, the student model can learn from hard instances of data from the beginning for training while for the natural language task, the student model must first learn from easy data instances.<br />
<br />
[[File: L2T_fig3.png | 1100px|center]]<br />
<br />
===Teaching New Student with Different Model Architecture===<br />
<br />
In this part, first a teacher model is trained by interacting with a student model. Then using the teacher model, another student model<br />
which has a different model architecture is taught.<br />
The results of Applying the teacher trained on ResNet32 to teach other architectures is shown below.<br />
<br />
[[File: L2T_fig4.png | 1100px|center]]<br />
<br />
===Training Time Analysis===<br />
<br />
The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.<br />
<br />
[[File: L2T_fig5.png | 600px|center]]<br />
<br />
===Accuracy Improvement===<br />
<br />
When comparing training accuracy on the IMDB sentiment classification task, L2T improves on teaching policy over NoTeach and SPL.<br />
<br />
[[File: L2T_t1.png | 500px|center]]<br />
<br />
=Future Work=<br />
<br />
There is some useful future work that can be extended from this work: <br />
<br />
1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper. <br />
<br />
2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework. <br />
<br />
3) It would be interesting to try out the framework suggested in this paper (L2T) in Imperfect information and partially observable settings. <br />
<br />
=Critique=<br />
<br />
While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'' which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies whereas the difference in accuracy less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task, the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37317Fairness Without Demographics in Repeated Loss Minimization2018-10-29T03:58:46Z<p>D35kumar: /* Related Works */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute fewer data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify the language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As consequence, minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. provide a strategy for controlling the worst case risk for all groups. They first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups at every step of time and is fair on models that ERM turns unfair by applying it to Amazon Mechanical Turk task.<br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
===Related Works===<br />
The recent advancements on the topic of fairness in Machine learning can be classified into the following approaches:<br />
<br />
1. Rawls Difference principle (Rawls, 2001, p155) - Defines that maximizing the welfare of the worst-off group is fair and stable over time ensuring the consent of the minorities thus increasing the chances of maintaining the status-quo. The current work builds on this as it sees predictive accuracy as a resource to be allocated.<br />
<br />
2. Labels of minorities present in the data:<br />
* Chouldechova,2017: Use of race in recidivism protection.<br />
* Barocas & Selbst, 2016: Guaranteeing fairness for a protected label through constraints such as equalized odds, disparate impact, and calibration.<br />
In the case specific to this paper, this information is not present.<br />
<br />
3. Fairness when minority grouping are not present explicitly<br />
* Dwork et al., 2012 used Individual notions of fairness using fixed similarity function whereas Kearns et al., 2018; Hebert-Johnson et al., 2017 used subgroups of a set of protected labels.<br />
* Kearns et al. (2018); Hebert-Johnson et al. (2017) consider subgroups of a set of protected features.<br />
Again for the specific case in this paper, this is not possible.<br />
<br />
4. Online settings<br />
* Joseph et al., 2016; Jabbari et al., 2017 looked at fairness in bandit learning using algorithms compatible with Rawls’ principle on equality of opportunity.<br />
* Liu et al. (2018) analyzed fairness temporally in the context of constraint-based fairness criteria. It showed that fairness is not ensured over time when static fairness constraints are enforced.<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. <br />
<br />
The expected loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. <br />
<br />
If input queries are made by users from <math display="inline">K</math> groups, then the distribution over all queries can be re-written as <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, where <math display="inline">\alpha_k</math> is the population portion of group <math display="inline">k</math> and <math display="inline">P_k</math> is its individual distribution.<br />
<br />
The risk associated with group <math>k</math> can be written as, <math>\mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]</math>.<br />
<br />
The worst-case risk over all groups can then be defined as,<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{max} \mathcal{R}_k(\theta).<br />
\end{align}<br />
<br />
Minimizing this function is equivalent to minimizing the risk for the worst-off group. <br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. A model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high).<br />
<br />
Often, groups are latent and <math display="inline">k, P_k</math> are unknown and the worst-case risks are inaccessible. The technique proposed by Hashimoto et al does not require direct access to these.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> are not constant, but vary depending on past losses. <br />
<br />
At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the fraction of retained users from the previous optimization, <math>\nu(x)</math> is a function that decreases as <math>x</math> increases, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. To put simply, the number of expected users of a group depends on the number of new users of that group and the fraction of users that continue to use the system from the previous optimization step. If fewer users from minority groups return to the model (i.e. the model has a low retention rate of minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math>. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = arg min_{\theta \in \Theta} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.<br />
<br />
=Distributonally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. developed a distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this sections' formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because obviously not every outlier should be up-weighed). This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution P. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> for all models <math display="inline">\theta \in \Theta</math> where <math display="inline">r_k := (1/a_k -1)^2</math> bounds the risk <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math> for each group with risk <math display="inline">\mathcal{R}_k(\theta)</math>. Furthermore, if the lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> by forming an upper bound that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/a_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq a_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.<br />
<br />
<br />
=Critiques=<br />
<br />
This paper works on representational disparity which is a critical problem to contribute to. The methods are well developed and the paper reads coherently. However, the authors have several limiting assumptions that are not very intuitive or scientifically suggestive. The first assumption is that, the <math display="inline">\eta</math> function denoting the fraction of users retained is differentiable and strictly decreasing function. This assumption does not seem practical. The second assumption is that the learned parameters are having a Poisson distribution. There is no explanation of such an assumption and reasons hinted at is hand-wavy at best. Though the authors are building a case against the Empirical risk minimization method, this method is exactly solvable when the data is linearly separable. The DRO method is computationally more complex than ERM and is not entirely clear if it will always have an advantage for a different class of problems.<br />
<br />
=Other Sources=<br />
# [https://blog.acolyer.org/2018/08/17/fairness-without-demographics-in-repeated-loss-minimization/] is a easy-to-read paper description.<br />
<br />
=References=<br />
Rawls, J. Justice as fairness: a restatement. Harvard University Press, 2001.<br />
<br />
Barocas, S. and Selbst, A. D. Big data’s disparate impact. 104 California Law Review, 3:671–732, 2016.<br />
<br />
Chouldechova, A. A study of bias in recidivism prediciton instruments. Big Data, pp. 153–163, 2017<br />
<br />
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pp. 214–226, 2012.<br />
<br />
Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2018.<br />
<br />
Hebert-Johnson, ´ U., Kim, M. P., Reingold, O., and Roth-blum, G. N. Calibration for the (computationallyidentifiable) masses. arXiv preprint arXiv:1711.08513, 2017.<br />
<br />
Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Rawlsian fairness for machine learning. In FATML, 2016.<br />
<br />
Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1617–1626, 2017.<br />
<br />
Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37316Fairness Without Demographics in Repeated Loss Minimization2018-10-29T03:52:38Z<p>D35kumar: /* References */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute fewer data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify the language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As consequence, minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. provide a strategy for controlling the worst case risk for all groups. They first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups at every step of time and is fair on models that ERM turns unfair by applying it to Amazon Mechanical Turk task.<br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
===Related Works===<br />
The following summarises the recent advancements in fairness in Machine learning <br />
<br />
1. Rawls Difference principle (Rawls, 2001, p155) - Defines that maximizing the welfare of the worst-off group is fair and stable over time ensuring the consent of the minorities thus increasing the chances of maintaining the status-quo. The current work builds on this as it sees predictive accuracy as a resource to be allocated.<br />
<br />
2. Labels of minorities present in the data:<br />
* Chouldechova,2017: Use of race in recidivism protection.<br />
* Barocas & Selbst, 2016: Guaranteeing fairness for a protected label through constraints such as equalized odds, disparate impact, and calibration.<br />
In the case specific to this paper, this information is not present.<br />
<br />
3. Fairness when minority grouping are not present explicitly<br />
* Dwork et al., 2012 used Individual notions of fairness using fixed similarity function whereas Kearns et al., 2018; Hebert-Johnson et al., 2017 used subgroups of a set of protected labels.<br />
* Kearns et al. (2018); Hebert-Johnson et al. (2017) consider subgroups of a set of protected features.<br />
Again for the specific case in this paper, this is not possible.<br />
<br />
4. Online settings<br />
* Joseph et al., 2016; Jabbari et al., 2017 looked at fairness in bandit learning using algorithms compatible with Rawls’ principle on equality of opportunity.<br />
* Liu et al. (2018) analyzed fairness temporally in the context of constraint-based fairness criteria. It showed that fairness is not ensured over time when static fairness constraints are enforced.<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. <br />
<br />
The expected loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. <br />
<br />
If input queries are made by users from <math display="inline">K</math> groups, then the distribution over all queries can be re-written as <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, where <math display="inline">\alpha_k</math> is the population portion of group <math display="inline">k</math> and <math display="inline">P_k</math> is its individual distribution.<br />
<br />
The risk associated with group <math>k</math> can be written as, <math>\mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]</math>.<br />
<br />
The worst-case risk over all groups can then be defined as,<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{max} \mathcal{R}_k(\theta).<br />
\end{align}<br />
<br />
Minimizing this function is equivalent to minimizing the risk for the worst-off group. <br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. A model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high).<br />
<br />
Often, groups are latent and <math display="inline">k, P_k</math> are unknown and the worst-case risks are inaccessible. The technique proposed by Hashimoto et al does not require direct access to these.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> are not constant, but vary depending on past losses. <br />
<br />
At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the fraction of retained users from the previous optimization, <math>\nu(x)</math> is a function that decreases as <math>x</math> increases, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. To put simply, the number of expected users of a group depends on the number of new users of that group and the fraction of users that continue to use the system from the previous optimization step. If fewer users from minority groups return to the model (i.e. the model has a low retention rate of minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math>. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = arg min_{\theta \in \Theta} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.<br />
<br />
=Distributonally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. developed a distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this sections' formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because obviously not every outlier should be up-weighed). This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution P. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> for all models <math display="inline">\theta \in \Theta</math> where <math display="inline">r_k := (1/a_k -1)^2</math> bounds the risk <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math> for each group with risk <math display="inline">\mathcal{R}_k(\theta)</math>. Furthermore, if the lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> by forming an upper bound that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/a_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq a_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.<br />
<br />
<br />
=Critiques=<br />
<br />
This paper works on representational disparity which is a critical problem to contribute to. The methods are well developed and the paper reads coherently. However, the authors have several limiting assumptions that are not very intuitive or scientifically suggestive. The first assumption is that, the <math display="inline">\eta</math> function denoting the fraction of users retained is differentiable and strictly decreasing function. This assumption does not seem practical. The second assumption is that the learned parameters are having a Poisson distribution. There is no explanation of such an assumption and reasons hinted at is hand-wavy at best. Though the authors are building a case against the Empirical risk minimization method, this method is exactly solvable when the data is linearly separable. The DRO method is computationally more complex than ERM and is not entirely clear if it will always have an advantage for a different class of problems.<br />
<br />
=Other Sources=<br />
# [https://blog.acolyer.org/2018/08/17/fairness-without-demographics-in-repeated-loss-minimization/] is a easy-to-read paper description.<br />
<br />
=References=<br />
Rawls, J. Justice as fairness: a restatement. Harvard University Press, 2001.<br />
<br />
Barocas, S. and Selbst, A. D. Big data’s disparate impact. 104 California Law Review, 3:671–732, 2016.<br />
<br />
Chouldechova, A. A study of bias in recidivism prediciton instruments. Big Data, pp. 153–163, 2017<br />
<br />
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pp. 214–226, 2012.<br />
<br />
Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2018.<br />
<br />
Hebert-Johnson, ´ U., Kim, M. P., Reingold, O., and Roth-blum, G. N. Calibration for the (computationallyidentifiable) masses. arXiv preprint arXiv:1711.08513, 2017.<br />
<br />
Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Rawlsian fairness for machine learning. In FATML, 2016.<br />
<br />
Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1617–1626, 2017.<br />
<br />
Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37315Fairness Without Demographics in Repeated Loss Minimization2018-10-29T03:52:21Z<p>D35kumar: (T) adding references</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute fewer data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify the language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As consequence, minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. provide a strategy for controlling the worst case risk for all groups. They first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups at every step of time and is fair on models that ERM turns unfair by applying it to Amazon Mechanical Turk task.<br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
===Related Works===<br />
The following summarises the recent advancements in fairness in Machine learning <br />
<br />
1. Rawls Difference principle (Rawls, 2001, p155) - Defines that maximizing the welfare of the worst-off group is fair and stable over time ensuring the consent of the minorities thus increasing the chances of maintaining the status-quo. The current work builds on this as it sees predictive accuracy as a resource to be allocated.<br />
<br />
2. Labels of minorities present in the data:<br />
* Chouldechova,2017: Use of race in recidivism protection.<br />
* Barocas & Selbst, 2016: Guaranteeing fairness for a protected label through constraints such as equalized odds, disparate impact, and calibration.<br />
In the case specific to this paper, this information is not present.<br />
<br />
3. Fairness when minority grouping are not present explicitly<br />
* Dwork et al., 2012 used Individual notions of fairness using fixed similarity function whereas Kearns et al., 2018; Hebert-Johnson et al., 2017 used subgroups of a set of protected labels.<br />
* Kearns et al. (2018); Hebert-Johnson et al. (2017) consider subgroups of a set of protected features.<br />
Again for the specific case in this paper, this is not possible.<br />
<br />
4. Online settings<br />
* Joseph et al., 2016; Jabbari et al., 2017 looked at fairness in bandit learning using algorithms compatible with Rawls’ principle on equality of opportunity.<br />
* Liu et al. (2018) analyzed fairness temporally in the context of constraint-based fairness criteria. It showed that fairness is not ensured over time when static fairness constraints are enforced.<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. <br />
<br />
The expected loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. <br />
<br />
If input queries are made by users from <math display="inline">K</math> groups, then the distribution over all queries can be re-written as <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, where <math display="inline">\alpha_k</math> is the population portion of group <math display="inline">k</math> and <math display="inline">P_k</math> is its individual distribution.<br />
<br />
The risk associated with group <math>k</math> can be written as, <math>\mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]</math>.<br />
<br />
The worst-case risk over all groups can then be defined as,<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{max} \mathcal{R}_k(\theta).<br />
\end{align}<br />
<br />
Minimizing this function is equivalent to minimizing the risk for the worst-off group. <br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. A model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high).<br />
<br />
Often, groups are latent and <math display="inline">k, P_k</math> are unknown and the worst-case risks are inaccessible. The technique proposed by Hashimoto et al does not require direct access to these.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> are not constant, but vary depending on past losses. <br />
<br />
At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the fraction of retained users from the previous optimization, <math>\nu(x)</math> is a function that decreases as <math>x</math> increases, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. To put simply, the number of expected users of a group depends on the number of new users of that group and the fraction of users that continue to use the system from the previous optimization step. If fewer users from minority groups return to the model (i.e. the model has a low retention rate of minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math>. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = arg min_{\theta \in \Theta} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.<br />
<br />
=Distributonally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. developed a distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this sections' formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because obviously not every outlier should be up-weighed). This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution P. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> for all models <math display="inline">\theta \in \Theta</math> where <math display="inline">r_k := (1/a_k -1)^2</math> bounds the risk <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math> for each group with risk <math display="inline">\mathcal{R}_k(\theta)</math>. Furthermore, if the lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> by forming an upper bound that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/a_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq a_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.<br />
<br />
<br />
=Critiques=<br />
<br />
This paper works on representational disparity which is a critical problem to contribute to. The methods are well developed and the paper reads coherently. However, the authors have several limiting assumptions that are not very intuitive or scientifically suggestive. The first assumption is that, the <math display="inline">\eta</math> function denoting the fraction of users retained is differentiable and strictly decreasing function. This assumption does not seem practical. The second assumption is that the learned parameters are having a Poisson distribution. There is no explanation of such an assumption and reasons hinted at is hand-wavy at best. Though the authors are building a case against the Empirical risk minimization method, this method is exactly solvable when the data is linearly separable. The DRO method is computationally more complex than ERM and is not entirely clear if it will always have an advantage for a different class of problems.<br />
<br />
=Other Sources=<br />
# [https://blog.acolyer.org/2018/08/17/fairness-without-demographics-in-repeated-loss-minimization/] is a easy-to-read paper description.<br />
<br />
==References==<br />
Rawls, J. Justice as fairness: a restatement. Harvard University Press, 2001.<br />
<br />
Barocas, S. and Selbst, A. D. Big data’s disparate impact. 104 California Law Review, 3:671–732, 2016.<br />
<br />
Chouldechova, A. A study of bias in recidivism prediciton instruments. Big Data, pp. 153–163, 2017<br />
<br />
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pp. 214–226, 2012.<br />
<br />
Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2018.<br />
<br />
Hebert-Johnson, ´ U., Kim, M. P., Reingold, O., and Roth-blum, G. N. Calibration for the (computationallyidentifiable) masses. arXiv preprint arXiv:1711.08513, 2017.<br />
<br />
Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Rawlsian fairness for machine learning. In FATML, 2016.<br />
<br />
Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1617–1626, 2017.<br />
<br />
Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37314Fairness Without Demographics in Repeated Loss Minimization2018-10-29T03:50:11Z<p>D35kumar: /* Related Works */</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute fewer data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify the language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As consequence, minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. provide a strategy for controlling the worst case risk for all groups. They first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups at every step of time and is fair on models that ERM turns unfair by applying it to Amazon Mechanical Turk task.<br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
===Related Works===<br />
The following summarises the recent advancements in fairness in Machine learning <br />
<br />
1. Rawls Difference principle (Rawls, 2001, p155) - Defines that maximizing the welfare of the worst-off group is fair and stable over time ensuring the consent of the minorities thus increasing the chances of maintaining the status-quo. The current work builds on this as it sees predictive accuracy as a resource to be allocated.<br />
<br />
2. Labels of minorities present in the data:<br />
* Chouldechova,2017: Use of race in recidivism protection.<br />
* Barocas & Selbst, 2016: Guaranteeing fairness for a protected label through constraints such as equalized odds, disparate impact, and calibration.<br />
In the case specific to this paper, this information is not present.<br />
<br />
3. Fairness when minority grouping are not present explicitly<br />
* Dwork et al., 2012 used Individual notions of fairness using fixed similarity function.<br />
* Kearns et al. (2018); Hebert-Johnson et al. (2017) consider subgroups of a set of protected features.<br />
Again for the specific case in this paper, this is not possible.<br />
<br />
4. Online settings<br />
* Joseph et al., 2016; Jabbari et al., 2017 looked at fairness in bandit learning using algorithms compatible with Rawls’ principle on equality of opportunity.<br />
* Liu et al. (2018) analyzed fairness temporally in the context of constraint-based fairness criteria. It showed that fairness is not ensured over time when static fairness constraints are enforced.<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. <br />
<br />
The expected loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. <br />
<br />
If input queries are made by users from <math display="inline">K</math> groups, then the distribution over all queries can be re-written as <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, where <math display="inline">\alpha_k</math> is the population portion of group <math display="inline">k</math> and <math display="inline">P_k</math> is its individual distribution.<br />
<br />
The risk associated with group <math>k</math> can be written as, <math>\mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]</math>.<br />
<br />
The worst-case risk over all groups can then be defined as,<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{max} \mathcal{R}_k(\theta).<br />
\end{align}<br />
<br />
Minimizing this function is equivalent to minimizing the risk for the worst-off group. <br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. A model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high).<br />
<br />
Often, groups are latent and <math display="inline">k, P_k</math> are unknown and the worst-case risks are inaccessible. The technique proposed by Hashimoto et al does not require direct access to these.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> are not constant, but vary depending on past losses. <br />
<br />
At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the fraction of retained users from the previous optimization, <math>\nu(x)</math> is a function that decreases as <math>x</math> increases, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. To put simply, the number of expected users of a group depends on the number of new users of that group and the fraction of users that continue to use the system from the previous optimization step. If fewer users from minority groups return to the model (i.e. the model has a low retention rate of minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math>. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = arg min_{\theta \in \Theta} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.<br />
<br />
=Distributonally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. developed a distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this sections' formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because obviously not every outlier should be up-weighed). This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution P. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> for all models <math display="inline">\theta \in \Theta</math> where <math display="inline">r_k := (1/a_k -1)^2</math> bounds the risk <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math> for each group with risk <math display="inline">\mathcal{R}_k(\theta)</math>. Furthermore, if the lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> by forming an upper bound that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/a_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq a_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.<br />
<br />
<br />
=Critiques=<br />
<br />
This paper works on representational disparity which is a critical problem to contribute to. The methods are well developed and the paper reads coherently. However, the authors have several limiting assumptions that are not very intuitive or scientifically suggestive. The first assumption is that, the <math display="inline">\eta</math> function denoting the fraction of users retained is differentiable and strictly decreasing function. This assumption does not seem practical. The second assumption is that the learned parameters are having a Poisson distribution. There is no explanation of such an assumption and reasons hinted at is hand-wavy at best. Though the authors are building a case against the Empirical risk minimization method, this method is exactly solvable when the data is linearly separable. The DRO method is computationally more complex than ERM and is not entirely clear if it will always have an advantage for a different class of problems.<br />
<br />
=Other Sources=<br />
# [https://blog.acolyer.org/2018/08/17/fairness-without-demographics-in-repeated-loss-minimization/] is a easy-to-read paper description.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization&diff=37313Fairness Without Demographics in Repeated Loss Minimization2018-10-29T03:43:45Z<p>D35kumar: (T)Added more literature survey and organized the already present work.</p>
<hr />
<div>This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018. <br />
<br />
=Introduction=<br />
<br />
Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute fewer data to the model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify the language. This disparity even increases as minority users suffer higher error rates, and therefore, are less likely to use the system in the future. As consequence, minority groups further shrink, and therefore, less data is available for the next optimization of the model. With less data the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''. <br />
<br />
In this paper, Hashimoto et al. provide a strategy for controlling the worst case risk for all groups. They first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity and its amplification over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed Hashimoto et al. are able to show that DRO can bound the loss for minority groups at every step of time and is fair on models that ERM turns unfair by applying it to Amazon Mechanical Turk task.<br />
<br />
===Note on Fairness===<br />
<br />
Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group rather than the whole group (cf. utilitarianism).<br />
<br />
===Related Works===<br />
The following summarises the recent advancements in fairness in Machine learning <br />
<br />
1. Rawls Difference principle (Rawls, 2001, p155) - Defines that maximizing the welfare of the worst-off group is fair and stable over time ensuring the consent of the minorities thus increasing the chances of maintaining the status-quo. The current work builds on this as it sees predictive accuracy as a resource to be allocated.<br />
<br />
2. Labels of minorities present in the data:<br />
* Chouldechova,2017: Use of race in recidivism protection.<br />
* Barocas & Selbst, 2016: Guaranteeing fairness for a protected label through constraints such as equalized odds, disparate impact, and calibration.<br />
In the case specific to this paper, this information is not present.<br />
<br />
3. Fairness when minority grouping are not present explicitly<br />
* Dwork et al., 2012 used Individual notions of fairness using fixed similarity function whereas Kearns et al., 2018; Hebert-Johnson et al., 2017 used subgroups of a set of protected labels.<br />
* Kearns et al. (2018); Hebert-Johnson et al. (2017) consider subgroups of a set of protected features.<br />
Again for the specific case in this paper, this is not possible.<br />
<br />
4. Online settings<br />
* Joseph et al., 2016; Jabbari et al., 2017 looked at fairness in bandit learning using algorithms compatible with Rawls’ principle on equality of opportunity.<br />
* Liu et al. (2018) analyzed fairness temporally in the context of constraint-based fairness criteria. It showed that fairness is not ensured over time when static fairness constraints are enforced.<br />
<br />
=Representation Disparity=<br />
<br />
If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>. <br />
<br />
The expected loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>. <br />
<br />
If input queries are made by users from <math display="inline">K</math> groups, then the distribution over all queries can be re-written as <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, where <math display="inline">\alpha_k</math> is the population portion of group <math display="inline">k</math> and <math display="inline">P_k</math> is its individual distribution.<br />
<br />
The risk associated with group <math>k</math> can be written as, <math>\mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]</math>.<br />
<br />
The worst-case risk over all groups can then be defined as,<br />
\begin{align}<br />
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{max} \mathcal{R}_k(\theta).<br />
\end{align}<br />
<br />
Minimizing this function is equivalent to minimizing the risk for the worst-off group. <br />
<br />
There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. A model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high).<br />
<br />
Often, groups are latent and <math display="inline">k, P_k</math> are unknown and the worst-case risks are inaccessible. The technique proposed by Hashimoto et al does not require direct access to these.<br />
<br />
=Disparity Amplification=<br />
<br />
Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> are not constant, but vary depending on past losses. <br />
<br />
At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by <br />
\begin{align}<br />
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k<br />
\end{align}<br />
<br />
where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the fraction of retained users from the previous optimization, <math>\nu(x)</math> is a function that decreases as <math>x</math> increases, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>. To put simply, the number of expected users of a group depends on the number of new users of that group and the fraction of users that continue to use the system from the previous optimization step. If fewer users from minority groups return to the model (i.e. the model has a low retention rate of minority group users), Hashimoto et al. argue that the representation disparity amplifies. <br />
<br />
==Empirical Risk Minimization (ERM)==<br />
<br />
Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math>. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:<br />
<br />
\begin{align}<br />
\theta^{(t)} = arg min_{\theta \in \Theta} \sum_i \ell(\theta; Z_i^{(t)})<br />
\end{align}<br />
<br />
However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.<br />
<br />
=Distributonally Robust Optimization (DRO)=<br />
<br />
To overcome the unfairness of ERM, Hashimoto et al. developed a distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this sections' formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known. Therefore, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss. <br />
<br />
To do this Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because obviously not every outlier should be up-weighed). This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution P. This ball is defined so that <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]<br />
\end{align}<br />
<br />
which for <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> for all models <math display="inline">\theta \in \Theta</math> where <math display="inline">r_k := (1/a_k -1)^2</math> bounds the risk <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math> for each group with risk <math display="inline">\mathcal{R}_k(\theta)</math>. Furthermore, if the lower bound on the group proportions <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math> is specified, and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> can be controlled by <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> by forming an upper bound that can be minimized.<br />
<br />
==Optimization of DRO==<br />
<br />
To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math> Hashimoto et al. look at the dual of this maximization problem (i.e. every maximization problem can be transformed into a minimization problem and vice-versa). This dual is given by the minimization problem<br />
<br />
\begin{align}<br />
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}<br />
\end{align}<br />
<br />
with <math display="inline">C = (2(1/a_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq a_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.<br />
<br />
<br />
=Critiques=<br />
<br />
This paper works on representational disparity which is a critical problem to contribute to. The methods are well developed and the paper reads coherently. However, the authors have several limiting assumptions that are not very intuitive or scientifically suggestive. The first assumption is that, the <math display="inline">\eta</math> function denoting the fraction of users retained is differentiable and strictly decreasing function. This assumption does not seem practical. The second assumption is that the learned parameters are having a Poisson distribution. There is no explanation of such an assumption and reasons hinted at is hand-wavy at best. Though the authors are building a case against the Empirical risk minimization method, this method is exactly solvable when the data is linearly separable. The DRO method is computationally more complex than ERM and is not entirely clear if it will always have an advantage for a different class of problems.<br />
<br />
=Other Sources=<br />
# [https://blog.acolyer.org/2018/08/17/fairness-without-demographics-in-repeated-loss-minimization/] is a easy-to-read paper description.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering&diff=37221DCN plus: Mixed Objective And Deep Residual Coattention for Question Answering2018-10-25T03:02:22Z<p>D35kumar: Minor grammar corrections (Editorial)</p>
<hr />
<div>== Introduction ==<br />
Question Answering(QA) is one of the challenging computer science tasks that need an understanding of the natural language and the ability to reason efficiently. To accurately answer the question, the model must first have a detailed understanding of the context the question is being asked from. Because the questions are usually very detailed, having a shallow knowledge from the context would lead to poor and unacceptable performance. Moreover, the model should gather all the information provided in the question and match them with its knowledge from the context. Generating the answer is another interesting task. Based on the dataset the model is meant for, the output of the model might be in a completely different form.<br />
In the past years, QA datasets have improved significantly. Previous datasets were really simple and they usually did not simulate a real-world question-answer pair. For example, Children's book test was one of the popular datasets that have been used for QA for a long time. But the real task for this dataset was to just fill empty spaces in given sentences with the appropriate words. During the past years, the importance of the QA tasks and their practical uses encouraged many to gather and crowdsource useful and more realistic datasets. The Stanford Question Answering Dataset(SQuAD), Microsoft MAchine Reading COmprehension Dataset(MS MARCO), and Visual Question Answering Dataset(VQA) are only a few examples of the currently advanced datasets.<br />
As a result of these advancements, many researchers are focusing to improve the performance of the question answering models on these datasets. Deep neural networks were able to outperform the human accuracy on a few of these datasets, but in many cases, there is still a gap between the state-of-the-art and human performance. Previously, Dynamic Coattention Networks(DCN) proved to be efficient on the SQuAD, achieving state-of-the-art performance at the time. In this work, a further modification to DCN has been done which improves the accuracy of the model by proposing a mixed objective that combines cross entropy loss with self-critical policy learning. Moreover, the rewards used are based on the word overlap to find a solution for the evaluation metric and objective misalignment.<br />
<br />
==Overview of previous work==<br />
Most of the current QA models are made from different modules and usually stacked on top of each other. Improving one of the modules would lead to an overall performance of the model. Thus, to evaluate the efficiency of an improvement, researchers usually take a previously submitted model and replace their own improved module with the current one in the model. This is mostly because QA is an interesting discipline and has practical uses.<br />
<br />
The state of the art approaches to this problem can be divided into 3. <br />
<br />
1. Neural Models for question answering: Models like coattention, bidirectional attention flow and self-matching attention build codependent representations of the question and the document. After building these representations, the models predict the answer by generating the start position and the end position corresponding to the estimated answer span. The generation process utilizes a pointer network. Another approach uses a dynamic decoder that iteratively proposes answers by alternating between start position and end position estimates, which in some cases allows it to recover from initial mistakes in predictions. <br />
<br />
2. Neural Attention Models: Models like self-attention have been applied to language modelling and sentimental analysis. A deep version of the same called deep self-attention networks attained state-of-the-art results in machine translation. Coattention, bidirectional attention and self-matching attention are some of the methods that build codependent representation between the question and the document. <br />
<br />
3. Reinforcement learning in NLP: Hierarchical RL techniques have been proposed for generating text in a simulated way finding domain. DQN have been used to learn policies in text-based games using game rewards as feedback. Neural conversational model have been proposed, that is trained using policy gradient methods, whose reward function consisted of heuristics for ease of answering, information flow, and semantic coherence. General actor-critic temporal-difference methods for sequence prediction have also been experimented, performing metric optimization on language modelling and machine translation. Direct word overlap metric optimization have also been applied to summarization and machine translation. <br />
<br />
<br />
==Important Terms==<br />
<br />
#Embedding layer: This layer maps each word (or images in the case of visual QA) to a vector space. There are many options to choose for the embedding layer. While pre-trained GloVes or Word2Vecs showed promising results on many tasks, most models use a combination of GloVe and character level embeddings. The character level embeddings are especially useful when dealing with out-of-vocab words. In the case of dealing with images, the embeddings are usually generated using pre-trained ResNets. Using different embedding layers for images has shown to change the overall performance of the model drastically.<br />
#Contextual_layer: The purpose of this layer is to add more features to each word embedding based on the surrounding words and the context. This layer is not presented in many models including the DCN.<br />
#Attention layer: There has been a lot of investigation on the attention mechanisms in recent years. These works, mostly inspired by Bahdanau et al. (2014), try to either modify the basic matrix-based attention mechanism or to develop innovative ones. The sole purpose of the attention mechanism is to make the model able to understand a context, based on the information gathered from somewhere else. For example, in image-based QA, attention layer helps the model to understand the question based on the information provided in the image such as object classes. This way, the model can realize what parts of the question are more important. This model uses '''co-attention layers''' (Xiong, 2017). Given two inputs sources (text and question), internal representations are built conditioned on one of the sources. In a way, this can be thought of as retaining (attending to) parts of the input that are relevant to the other source. From the text, only parts that are 'useful' to the question are kept, while from the question, parts that are useful for the text are retained. The intuition stems from the fact that it is easier to answer a question from a text, knowing the question beforehand compared with when the question is only available at the end. In the former, only information relevant to the question is kept, while in the latter case, all information from the text needs to be kept.<br />
#Output layer: This is the final layer of all models, generating the answer of the question based on the information provided from all the previous layers.<br />
<br />
==DCN+ structure==<br />
The DCN+ is an improvement on the previous DCN model. The overall structure of the model is the same as before. The first improvement is on the coattention module. By introducing a deep residual coattention encoder, the output of the attention layer becomes more feature-rich. The second improvement is achieved by mixing the previous cross-entropy loss with reinforcement learning rewards from self-critical policy learning. DCN+ has a decoder module that is only applicable to the SQuAD dataset since the decoder only predicts an answer span from the given context.<br />
<br />
===Deep residual coattention encoder===<br />
The previous coattention module was unable to grasp complex information based on the context and the question. Recent studies showed that stacked attention mechanisms are outperforming the single layer attention modules. In DCN+, the coattention module is stacked to make it able to self-attend to the context and grasp more information. The second modification is to use residual connectors when merging the coattention output from each layer.<br />
<br />
[[File:Coattention.png|700px|centre]]<br />
<br />
let <math>L^D \in R^{m×d}</math> and <math>L^Q \in R^{n×d}</math> denote the word embedding for the context and the question respectively. Here, <math>d, m, n</math> are the embedding vector size, document word count, and question word count respectively. The model uses a bidirectional LSTM as the contextual layer with shared wights. Also, an additional sentinel token is added at the end of the document and question to make it possible for the model to distinguish between the document and question. <math>E^D</math> and <math>E^Q</math> are outputs of the encoder(contextual) layer.<br />
<br />
\begin{align}<br />
E_1^D = BiLSTM_1(L^D) \in R^{(h×(m+1))}<br />
\end{align}<br />
\begin{align}<br />
E_1^Q = tanh(W BiLSTM_1(L^Q) \in R^{(h×(n+1))}<br />
\end{align}<br />
<br />
Here <math>h</math> is the hidden size of the LSTM. The affinity matrix is created based on the output of the encoder. The affinity matrix is the matrix that the has been used in the attention module from the introduction of attention. By performing a column-wise softmax function on the affinity matrix a vector would be generated that is a representation of the importance of each question token, based on the model's understanding of the context. Similarly, if a row-wise softmax function is applied to the affinity matrix, the output vector would represent the importance of each context word, based on the question. By multiplying these vectors to the outputs of the encoder layer, question-aware context and context-aware question representations would be created.<br />
<br />
\begin{align}<br />
A = {(E_1^D)}^T E_1^Q \in R^{(m+1)×(n+1)}<br />
\end{align}<br />
\begin{align}<br />
{S_1^D} = E_1^Q softmax(A^T) \in R^{h×(m+1)}<br />
\end{align}<br />
\begin{align}<br />
{S_1^Q} = E_1^D softmax(A) \in R^{h×(n+1)}<br />
\end{align}<br />
<br />
To make the question-aware context representation even deeper and more feature-rich, an output (called the co-attention context, <math> C_1^D </math>) of the first co-attention layer is fed directly into the decoder using a residual connection. <br />
<br />
\begin{align}<br />
{C_1^D} = S_1^Q softmax(A^T) \in R^{h×m}<br />
\end{align}<br />
<br />
Note that the model drops the dimension corresponding to the sentinel vector. The summaries also get encoded after this stage, using two bidirectional LSTMs with shared variables.<br />
<br />
\begin{align}<br />
{E_2^D} = BiLSTM_2(S_1^Q) \in R^{2h×m}<br />
\end{align}<br />
\begin{align}<br />
{E_2^Q} = BiLSTM_2(S_1^D) \in R^{2h×n}<br />
\end{align}<br />
<br />
Finally, The <math>E_2^D</math> and <math>E_2^Q</math> are fed into the second co-attention layer. Similar to the first co-attention layer, three outputs are produced, <math>S_2^D, S_2^Q, C_2^D </math>. However, <math>S_2^Q</math> is not used. These co-attention modules can easily get stacked to create a deeper attention mechanism. <br />
<br />
The output of the second co-attention layer are concatenated with residual connections from <math>C_1^D, S_1^D, E_2^D</math>. The final output of model is obtained by passing the concatenated representation through another bi-direction LSTM:<br />
<br />
\begin{align}<br />
U = BiLSTM(concat(E_1^D;E_2^D;S_1^D;S_2^D;C_1^D;C_2^D) \in R^{2h×m}<br />
\end{align}<br />
<br />
===Mixed objective using self-critical policy learning===<br />
DCN produces a distribution over that start and end positions of the answer span. Because of the dynamic nature of the decoder module, it estimates separate distributions over the start and end position of the answer dynamically.<br />
<br />
\begin{align}<br />
l_{ce}(\theta) = - \sum_{t} (log \ p_t^{start}(s|s_{t-1},e_{t-1};\theta) + log \ p_t^{end}(e|s_{t-1},e_{t-1};\theta))<br />
\end{align}<br />
<br />
In the above equation, <math>s</math> and <math>e</math> denote the respective start and end points of the ground truth answer. <math>s_t</math> and <math>e_t</math> denote the greedy estimation of the start and end positions at the <math>t</math>th decoding time step. Similarly, <math>p_t^{start} \in R^m</math> and <math>p_t^{end} \in R^m</math> denote the distribution of the start and end positions respectively. The problem with the above loss functions is that it does not consider the F1 metric for evaluation of the model. There are two metrics to estimate QA models accuracy. The first metric is the exact match and it is a binary score. If the answer string does not match with the ground truth answer even by a single character, the exact match score would be zero. The second metric is the F1 score. F1 score is basically the degree of the overlap between the predicted answer and the ground truth. <br />
For example, suppose there are more than two correct answer spans in a context, <math>A</math> and <math>B</math>, but none of the match the ground truth positions. If A has an exact string match but B does not, The cross-entropy loss would penalize both of them equally. However, if we include can F1 scores in our calculations, the loss function would penalize B and not A. <br />
<br />
The main problem with including F1 score directly into cost functions is that it is non-differentiable. A trick from (Sutton et al.,1999; Schulman et al., 2015) is used to approximate the expected gradient. <br />
For this, DCN+ uses a self-critical reinforcement learning objective.<br />
<br />
\begin{align}<br />
l_{rl}(\theta) = -E_{\hat{\tau} \sim p_\tau} [R(s,e,\hat{s}_T,\hat{e}_T;\theta)]<br />
\end{align}<br />
<br />
\begin{align}<br />
\approx -E_{\hat{\tau} \sim p_\tau} [F_1 (ans(\hat{s}_T, \hat{e}_T), ans(s, e)) - F_1(ans(s_T, e_T), ans(s, e))]<br />
\end{align}<br />
<br />
Here <math>\hat{s} \sim p_t^{start}</math> and <math>\hat{e} \sim p_t^{end}</math> denote the sampled start and end positions respectively from the estimated distributions at <math>t</math>th decoding step. <math>\hat{\tau}</math> is the sequence of sampled start and end positions during all <math>T</math> decoder steps, <math>R</math> is the expected reward, <math>F_1</math> is the F1 score between the predicted answer and the expected answer. Rather than using the raw F1 score, mean subtracted F1 score is used (baseline). Previous studies show that using a baseline for the reward reduces the variance of gradient estimates and facilitates convergence. DCN+ uses a self-critic that uses the F1 produced during greedy inference by the current model.<br />
<br />
[[File:loss.png|700px|centre]]<br />
<br />
==Dataset==<br />
The dataset SQuAD (Reference: Stanford NP Group) was used in training the network. The SQuAD 1.1 dataset contains 100 000 questions, based on a set of Wikipedia articles. These questions are designed to be answered by a segment of text from the article. The solutions to each question are represented by a start location, and the text of the answer. An example question-answer pair of the SQuAD 2.0 dataset is: Q: "When did Beyonce start becoming popular?", A: (text: "in the late 1990s", start: 269).<br />
<br />
The SQuAD 2.0 dataset augments the SQuAD 1.1 collection with 50 000 unanswerable questions, designed adversarily. Samples were generated by crowdworkers in both cases.<br />
<br />
==Experiments==<br />
To achieve optimal performance, the hyperparameters and training environment are fine-tuned. The hyperparameters of DCN are duplicated. The model was trained and evaluated using the Stanford Question Answering Dataset (SQuAD). For tokenizing the documents, the Stanford CoreNLP reversible tokenizers has been used. For word embeddings, a pre-trained GloVE (trained on 840B common crawl), as well as character ngram embeddings by Hashimoto et al. (2017), is used. Furthermore, these embeddings are then concatenated with context vectors (CoVe) trained on WMT. Words which are not found in the vocabulary have their embedding and context vectors set to zero. The optimizer has been set to Adam and a dropout is also applied on word embeddings that zeros a word embedding with a probability of 0.075. PyTorch is used to build the model.<br />
<br />
==Results==<br />
At the time of submission, the model was able to achieve state-of-the-art results on the SQuAD, outperforming the second model on the leaderboard by 2.0% both on the exact match and F1 scores. It is worth mentioning that a 5% improvement was also achieved with respect to the original DCN model.<br />
<br />
[[File:dcn_resutls1.png|700px|centre]]<br />
<br />
In general, DCN+ was able to a achieve consistent performance improvement in almost every question category.<br />
<br />
[[File:dcn_results2.png|700px|centre]]<br />
<br />
===Ablation Study===<br />
An analysis of the significance of each part of the model found that the deep residual coattention contributed the most to the overall performance. The second highest contributor was the mixed objective. The sparse mixture of experts layer in the decoder also provided some minor contributions to improving the overall performance.<br />
<br />
==Summary and Critiques==<br />
<br />
This paper introduces a novel model for the task of question answering where the cross-entropy loss commonly used for such problems previously has been combined with self-critical policy learning. The rewards are obtained from word overlap to solve misalignment metric and optimization objective. This paper improves the state of the art in a popular question-answer data set. The critical drawback in this paper is that it only shows experimental improvements on one question answer dataset. Previous works in the same field have considered performances on at least three different comprehensive question answer data sets. This paper is only an incremental improvement over the previous algorithm DCN which was released a year back. For the policy learning objective, the authors consider the task as a multi-task learning problem where the dual losses are linearly combined. The authors should have used a weighted combination instead as the positional match objective using cross entropy is far more important than the word overlap objective with ground truth. Additionally, some methods adopted by the authors are not intuitive and not much explanation is given for the same. For example, it is not very clear why the F1 scores have been used as RL rewards as against some other distance objectives commonly used in previous works in the same field like cross entropy. The authors mention a common problem in using Reinforcement learning in NLP problems. NLP domains are discontinuous and discrete domains which the agents have to repeatedly explore to find a good policy. RL is very data hungry, but NLP domains don't offer sufficient datasets for exploration in most cases. The paper says that it is treating the optimization problem as a multi-task learning problem to get around the exploration problem. It is not clear how this is effected. <br />
<br />
==Other Sources==<br />
# An easy to understand blog on the base DCN model can be found at [https://einstein.ai/research/blog/state-of-the-art-deep-learning-model-for-question-answering].<br />
# Tensorflow Source code for this model can be found at [https://github.com/andrejonasson/dynamic-coattention-network-plus]<br />
<br />
<br />
<br />
==References==<br />
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly<br />
learning to align and translate. In ICLR, 2015.<br />
<br />
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau,<br />
Aaron C. Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In<br />
ICLR, 2017.<br />
<br />
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain<br />
questions. In ACL, 2017.<br />
<br />
Nina Dethlefs and Heriberto Cuayahuitl. Combining hierarchical reinforcement learning and ´<br />
bayesian networks for natural language generation in situated dialogue. In Proceedings of the<br />
13th European Workshop on Natural Language Generation, pp. 110–120. Association for Computational<br />
Linguistics, 2011.<br />
<br />
Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient<br />
estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530, 2001.<br />
<br />
Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. A joint many-task<br />
model: Growing a neural network for multiple NLP tasks. In EMNLP, 2017.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.<br />
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–<br />
778, 2016.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9 8:<br />
1735–80, 1997.<br />
<br />
Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses<br />
for scene geometry and semantics. CoRR, abs/1705.07115, 2017.<br />
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,<br />
abs/1412.6980, 2014.<br />
<br />
Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In NIPS, 1999.<br />
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement<br />
learning for dialogue generation. In EMNLP, 2016.<br />
<br />
Rui Liu, Junjie Hu, Wei Wei, Zi Yang, and Eric Nyberg. Structural embedding of syntactic trees for<br />
machine comprehension. In ACL, 2017.<br />
<br />
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention<br />
for visual question answering. In NIPS, 2016.<br />
<br />
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In ACL, 2014.<br />
<br />
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In NIPS, 2017.<br />
<br />
Microsoft Asia Natural Language Computing Group. R-net: Machine reading comprehension with self-matching networks. 2017.<br />
<br />
Karthik Narasimhan, Tejas D. Kulkarni, and Regina Barzilay. Language understanding for textbased games using deep reinforcement learning. In EMNLP, 2015.<br />
<br />
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. CoRR, abs/1705.04304, 2017.<br />
<br />
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for machine comprehension of text. In EMNLP, 2016.<br />
<br />
John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In NIPS, 2015.<br />
<br />
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017.<br />
<br />
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.<br />
<br />
Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1047–1055. ACM, 2017.<br />
<br />
Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.<br />
<br />
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017. <br />
<br />
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.<br />
<br />
Shuohang Wang and Jing Jiang. Machine comprehension using match-lstm and answer pointer. In ICLR, 2017.<br />
<br />
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural qa as simple as possible but not simpler. In CoNLL, 2017.<br />
<br />
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.<br />
<br />
Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question<br />
answering. In ICLR, 2017.<br />
<br />
Stanford NLP Group. Squad 2.0: The Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuAD-explorer/. Accessed October 24, 2018.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering&diff=37220DCN plus: Mixed Objective And Deep Residual Coattention for Question Answering2018-10-25T02:54:40Z<p>D35kumar: Minor grammar corrections (Editorial)</p>
<hr />
<div>== Introduction ==<br />
Question Answering(QA) is one of the challenging computer science tasks that need an understanding of the natural language and the ability to reason efficiently. To accurately answer the question, the model must first have a detailed understanding of the context the question is being asked from. Because the questions are usually very detailed, having a shallow knowledge from the context would lead to poor and unacceptable performance. Moreover, The model should gather all the information provided in the question and match them with its knowledge from the context. Generating the answer is another interesting task. Based on the dataset the model is meant for, the output of the model might be in a completely different form.<br />
In the past years, QA datasets have improved significantly. Previous datasets were really simple and they usually did not simulate a real-world question-answer pair. For example, Children's book test was one of the popular datasets that have been used for QA for a long time. But the real task for this dataset was to just fill empty spaces in given sentences with the appropriate words. During the past years, the importance of the QA tasks and their practical uses encouraged many to gather and crowdsource useful and more realistic datasets. The Stanford Question Answering Dataset(SQuAD), Microsoft MAchine Reading COmprehension Dataset(MS MARCO), and Visual Question Answering Dataset(VQA) are only a few examples of the currently advanced datasets.<br />
As a result of these advancements, many researchers are focusing to improve the performance of the question answering models on these datasets. Deep neural networks were able to outperform the human accuracy on a few of these datasets, but in many cases, there is still a gap between the state-of-the-art and human performance. Previously, Dynamic Coattention Networks(DCN) proved to be efficient on the SQuAD, achieving state-of-the-art performance at the time. In this work, a further modification to DCN has been done which improves the accuracy of the model by proposing a mixed objective that combines cross entropy loss with self-critical policy learning. Moreover, the rewards used are based on the word overlap to find a solution for the evaluation metric and objective misalignment.<br />
<br />
==Overview of previous work==<br />
Most of the current QA models are made from different modules and usually stacked on top of each other. Improving one of the modules would lead to an overall performance of the model. Thus, to evaluate the efficiency of an improvement, researchers usually take a previously submitted model and replace their own improved module with the current one in the model. This is mostly because QA is an interesting discipline and has practical uses.<br />
<br />
The state of the art approaches to this problem can be divided into 3. <br />
<br />
1. Neural Models for question answering: Models like coattention, bidirectional attention flow and self matching attention build codependent representations of the question and the document. After building these representations, the models predict the answer by generating the start position and the end position corresponding to the estimated answer span. The generation process utilizes a pointer network. Another approach uses a dynamic decoder that iteratively proposes answers by alternating between start position and end position estimates, which in some cases allows it to recover from initial mistakes in predictions. <br />
<br />
2. Neural Attention Models: Models like self-attention have been applied to language modelling and sentimental analysis. Deep version of the same called deep self-attention networks attained state-of-the-art results in machine translation. Coattention, bidirectional attention and self matching attention are some of the methods that build codependent representation between the question and the document. <br />
<br />
3. Reinforcement learning in NLP: Hierarchical RL techniques have been proposed for generating text in a simulated way finding domain. DQN have been used to learn policies in text-based games using game rewards as feedback. Neural conversational model have been proposed, that is trained using policy gradient methods, whose reward function consisted of heuristics for ease of answering, information flow, and semantic coherence. General actor-critic temporal-difference methods for sequence prediction have also been experimented, performing metric optimization on language modelling and machine translation. Direct word overlap metric optimization have also been applied to summarization and machine translation. <br />
<br />
<br />
==Important Terms==<br />
<br />
#Embedding layer: This layer maps each word (or images in the case of visual QA) to a vector space. There are many options to choose for the embedding layer. While pre-trained GloVes or Word2Vecs showed promising results on many tasks, most models use a combination of GloVe and character level embeddings. The character level embeddings are especially useful when dealing with out-of-vocab words. In the case of dealing with images, the embeddings are usually generated using pre-trained ResNets. Using different embedding layers for images has shown to change the overall performance of the model drastically.<br />
#Contextual_layer: The purpose of this layer is to add more features to each word embedding based on the surrounding words and the context. This layer is not presented in many models including the DCN.<br />
#Attention layer: There has been a lot of investigation on the attention mechanisms in recent years. These works, mostly inspired by Bahdanau et al. (2014), try to either modify the basic matrix-based attention mechanism or to develop innovative ones. The sole purpose of the attention mechanism is to make the model able to understand a context, based on the information gathered from somewhere else. For example, in image-based QA, attention layer helps the model to understand the question based on the information provided in the image such as object classes. This way, the model can realize what parts of the question are more important. This model uses '''co-attention layers''' (Xiong, 2017). Given two inputs sources (text and question), internal representations are built conditioned on one of the sources. In a way, this can be thought of as retaining (attending to) parts of the input that are relevant to the other source. From the text, only parts that are 'useful' to the question are kept, while from the question, parts that are useful for the text are retained. The intuition stems from the fact that it is easier to answer a question from a text, knowing the question beforehand compared with when the question is only available at the end. In the former, only information relevant to the question is kept, while in the latter case, all information from the text needs to be kept.<br />
#Output layer: This is the final layer of all models, generating the answer of the question based on the information provided from all the previous layers.<br />
<br />
==DCN+ structure==<br />
The DCN+ is an improvement on the previous DCN model. The overall structure of the model is the same as before. The first improvement is on the coattention module. By introducing a deep residual coattention encoder, the output of the attention layer becomes more feature-rich. The second improvement is achieved by mixing the previous cross-entropy loss with reinforcement learning rewards from self-critical policy learning. DCN+ has a decoder module that is only applicable to the SQuAD dataset since the decoder only predicts an answer span from the given context.<br />
<br />
===Deep residual coattention encoder===<br />
The previous coattention module was unable to grasp complex information based on the context and the question. Recent studies showed that stacked attention mechanisms are outperforming the single layer attention modules. In DCN+, the coattention module is stacked to make it able to self-attend to the context and grasp more information. The second modification is to use residual connectors when merging the coattention output from each layer.<br />
<br />
[[File:Coattention.png|700px|centre]]<br />
<br />
let <math>L^D \in R^{m×d}</math> and <math>L^Q \in R^{n×d}</math> denote the word embedding for the context and the question respectively. Here, <math>d, m, n</math> are the embedding vector size, document word count, and question word count respectively. The model uses a bidirectional LSTM as the contextual layer with shared wights. Also, an additional sentinel token is added at the end of the document and question to make it possible for the model to distinguish between the document and question. <math>E^D</math> and <math>E^Q</math> are outputs of the encoder(contextual) layer.<br />
<br />
\begin{align}<br />
E_1^D = BiLSTM_1(L^D) \in R^{(h×(m+1))}<br />
\end{align}<br />
\begin{align}<br />
E_1^Q = tanh(W BiLSTM_1(L^Q) \in R^{(h×(n+1))}<br />
\end{align}<br />
<br />
Here <math>h</math> is the hidden size of the LSTM. The affinity matrix is created based on the output of the encoder. The affinity matrix is the matrix that the has been used in the attention module from the introduction of attention. By performing a column-wise softmax function on the affinity matrix a vector would be generated that is a representation of the importance of each question token, based on the model's understanding of the context. Similarly, if a row-wise softmax function is applied to the affinity matrix, the output vector would represent the importance of each context word, based on the question. By multiplying these vectors to the outputs of the encoder layer, question-aware context and context-aware question representations would be created.<br />
<br />
\begin{align}<br />
A = {(E_1^D)}^T E_1^Q \in R^{(m+1)×(n+1)}<br />
\end{align}<br />
\begin{align}<br />
{S_1^D} = E_1^Q softmax(A^T) \in R^{h×(m+1)}<br />
\end{align}<br />
\begin{align}<br />
{S_1^Q} = E_1^D softmax(A) \in R^{h×(n+1)}<br />
\end{align}<br />
<br />
To make the question-aware context representation even deeper and more feature-rich, an output (called the co-attention context, <math> C_1^D </math>) of the first co-attention layer is fed directly into the decoder using a residual connection. <br />
<br />
\begin{align}<br />
{C_1^D} = S_1^Q softmax(A^T) \in R^{h×m}<br />
\end{align}<br />
<br />
Note that the model drops the dimension corresponding to the sentinel vector. The summaries also get encoded after this stage, using two bidirectional LSTMs with shared variables.<br />
<br />
\begin{align}<br />
{E_2^D} = BiLSTM_2(S_1^Q) \in R^{2h×m}<br />
\end{align}<br />
\begin{align}<br />
{E_2^Q} = BiLSTM_2(S_1^D) \in R^{2h×n}<br />
\end{align}<br />
<br />
Finally, The <math>E_2^D</math> and <math>E_2^Q</math> are fed into the second co-attention layer. Similar to the first co-attention layer, three outputs are produced, <math>S_2^D, S_2^Q, C_2^D </math>. However, <math>S_2^Q</math> is not used. These co-attention modules can easily get stacked to create a deeper attention mechanism. <br />
<br />
The output of the second co-attention layer are concatenated with residual connections from <math>C_1^D, S_1^D, E_2^D</math>. The final output of model is obtained by passing the concatenated representation through another bi-direction LSTM:<br />
<br />
\begin{align}<br />
U = BiLSTM(concat(E_1^D;E_2^D;S_1^D;S_2^D;C_1^D;C_2^D) \in R^{2h×m}<br />
\end{align}<br />
<br />
===Mixed objective using self-critical policy learning===<br />
DCN produces a distribution over that start and end positions of the answer span. Because of the dynamic nature of the decoder module, it estimates separate distributions over the start and end position of the answer dynamically.<br />
<br />
\begin{align}<br />
l_{ce}(\theta) = - \sum_{t} (log \ p_t^{start}(s|s_{t-1},e_{t-1};\theta) + log \ p_t^{end}(e|s_{t-1},e_{t-1};\theta))<br />
\end{align}<br />
<br />
In the above equation, <math>s</math> and <math>e</math> denote the respective start and end points of the ground truth answer. <math>s_t</math> and <math>e_t</math> denote the greedy estimation of the start and end positions at the <math>t</math>th decoding time step. Similarly, <math>p_t^{start} \in R^m</math> and <math>p_t^{end} \in R^m</math> denote the distribution of the start and end positions respectively. The problem with the above loss functions is that it does not consider the F1 metric for evaluation of the model. There are two metrics to estimate QA models accuracy. The first metric is the exact match and it is a binary score. If the answer string does not match with the ground truth answer even by a single character, the exact match score would be zero. The second metric is the F1 score. F1 score is basically the degree of the overlap between the predicted answer and the ground truth. <br />
For example, suppose there are more than two correct answer spans in a context, <math>A</math> and <math>B</math>, but none of the match the ground truth positions. If A has an exact string match but B does not, The cross-entropy loss would penalize both of them equally. However, if we include can F1 scores in our calculations, the loss function would penalize B and not A. <br />
<br />
The main problem with including F1 score directly into cost functions is that it is non-differentiable. A trick from (Sutton et al.,1999; Schulman et al., 2015) is used to approximate the expected gradient. <br />
For this, DCN+ uses a self-critical reinforcement learning objective.<br />
<br />
\begin{align}<br />
l_{rl}(\theta) = -E_{\hat{\tau} \sim p_\tau} [R(s,e,\hat{s}_T,\hat{e}_T;\theta)]<br />
\end{align}<br />
<br />
\begin{align}<br />
\approx -E_{\hat{\tau} \sim p_\tau} [F_1 (ans(\hat{s}_T, \hat{e}_T), ans(s, e)) - F_1(ans(s_T, e_T), ans(s, e))]<br />
\end{align}<br />
<br />
Here <math>\hat{s} \sim p_t^{start}</math> and <math>\hat{e} \sim p_t^{end}</math> denote the sampled start and end positions respectively from the estimated distributions at <math>t</math>th decoding step. <math>\hat{\tau}</math> is the sequence of sampled start and end positions during all <math>T</math> decoder steps, <math>R</math> is the expected reward, <math>F_1</math> is the F1 score between the predicted answer and the expected answer. Rather than using the raw F1 score, mean subtracted F1 score is used (baseline). Previous studies show that using a baseline for the reward reduces the variance of gradient estimates and facilitates convergence. DCN+ uses a self-critic that uses the F1 produced during greedy inference by the current model.<br />
<br />
[[File:loss.png|700px|centre]]<br />
<br />
==Dataset==<br />
The dataset SQuAD (Reference: Stanford NP Group) was used in training the network. The SQuAD 1.1 dataset contains 100 000 questions, based on a set of Wikipedia articles. These questions are designed to be answered by a segment of text from the article. The solutions to each question are represented by a start location, and the text of the answer. An example question-answer pair of the SQuAD 2.0 dataset is: Q: "When did Beyonce start becoming popular?", A: (text: "in the late 1990s", start: 269).<br />
<br />
The SQuAD 2.0 dataset augments the SQuAD 1.1 collection with 50 000 unanswerable questions, designed adversarily. Samples were generated by crowdworkers in both cases.<br />
<br />
==Experiments==<br />
To achieve optimal performance, the hyperparameters and training environment are fine-tuned. The hyperparameters of DCN are duplicated. The model was trained and evaluated using the Stanford Question Answering Dataset (SQuAD). For tokenizing the documents, the Stanford CoreNLP reversible tokenizers has been used. For word embeddings, a pre-trained GloVE (trained on 840B common crawl) as well as character ngram embeddings by Hashimoto et al. (2017) is used. Furthermore, these embeddings are then concatenated with context vectors (CoVe) trained on WMT. Words which are not found in the vocabulary have their embedding and context vectors set to zero. The optimizer has been set to Adam and a dropout is also applied on word embeddings that zeros a word embedding with a probability of 0.075. PyTorch is used to build the model.<br />
<br />
==Results==<br />
At the time of submission, the model was able to achieve state-of-the-art results on the SQuAD, outperforming the second model on the leaderboard by 2.0% both on the exact match and F1 scores. It is worth mentioning that a 5% improvement was also achieved with respect to the original DCN model.<br />
<br />
[[File:dcn_resutls1.png|700px|centre]]<br />
<br />
In general, DCN+ was able to a achieve consistent performance improvement in almost every question category.<br />
<br />
[[File:dcn_results2.png|700px|centre]]<br />
<br />
===Ablation Study===<br />
An analysis of the significance of each part of the model found that the deep residual coattention contributed the most to the overall performance. The second highest contributor was the mixed objective. The sparse mixture of experts layer in the decoder also provided some minor contributions to improving the overall performance.<br />
<br />
==Summary and Critiques==<br />
<br />
This paper introduces a novel model for the task of question answering where the cross entropy loss commonly used for such problems previously has been combined with self critical policy learning. The rewards are obtained from word overlap to solve misalignment metric and optimization objective. This paper improves the state of the art in a popular question answer data set. The critical drawback in this paper is that it only shows experimental improvements on one question answer dataset. Previous works in the same field have considered performances on at least three different comprehensive question answer data sets. This paper is only an incremental improvement over the previous algorithm DCN which was released a year back. For the policy learning objective, the authors consider the task as a multi task learning problem where the dual losses are linearly combined. The authors should have used a weighted combination instead as the positional match objective using cross entropy is far more important than the word overlap objective with ground truth. Additionally some methods adapted by the authors are not intuitive and not much explanation is given for the same. For example, it is not very clear why the F1 scores have been used as RL rewards as against some other distance objectives commonly used in previous works in the same field like cross entropy. The authors mention a common problem in using Reinforcement learning in NLP problems. NLP domains are discontinuous and discrete domains which the agents have to repeatedly explore to find a good policy. RL is very data hungry, but NLP domains don't offer sufficient datasets for exploration in most cases. The paper says that it is treating the optimization problem as a multi task learning problem to get around the exploration problem. It is not clear how this is effected. <br />
<br />
==Other Sources==<br />
# An easy to understand blog on the base DCN model can be found at [https://einstein.ai/research/blog/state-of-the-art-deep-learning-model-for-question-answering].<br />
# Tensorflow Source code for this model can be found at [https://github.com/andrejonasson/dynamic-coattention-network-plus]<br />
<br />
<br />
<br />
==References==<br />
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly<br />
learning to align and translate. In ICLR, 2015.<br />
<br />
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau,<br />
Aaron C. Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In<br />
ICLR, 2017.<br />
<br />
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer opendomain<br />
questions. In ACL, 2017.<br />
<br />
Nina Dethlefs and Heriberto Cuayahuitl. Combining hierarchical reinforcement learning and ´<br />
bayesian networks for natural language generation in situated dialogue. In Proceedings of the<br />
13th European Workshop on Natural Language Generation, pp. 110–120. Association for Computational<br />
Linguistics, 2011.<br />
<br />
Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient<br />
estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530, 2001.<br />
<br />
Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. A joint many-task<br />
model: Growing a neural network for multiple NLP tasks. In EMNLP, 2017.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.<br />
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–<br />
778, 2016.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9 8:<br />
1735–80, 1997.<br />
<br />
Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses<br />
for scene geometry and semantics. CoRR, abs/1705.07115, 2017.<br />
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,<br />
abs/1412.6980, 2014.<br />
<br />
Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In NIPS, 1999.<br />
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement<br />
learning for dialogue generation. In EMNLP, 2016.<br />
<br />
Rui Liu, Junjie Hu, Wei Wei, Zi Yang, and Eric Nyberg. Structural embedding of syntactic trees for<br />
machine comprehension. In ACL, 2017.<br />
<br />
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention<br />
for visual question answering. In NIPS, 2016.<br />
<br />
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In ACL, 2014.<br />
<br />
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In NIPS, 2017.<br />
<br />
Microsoft Asia Natural Language Computing Group. R-net: Machine reading comprehension with self-matching networks. 2017.<br />
<br />
Karthik Narasimhan, Tejas D. Kulkarni, and Regina Barzilay. Language understanding for textbased games using deep reinforcement learning. In EMNLP, 2015.<br />
<br />
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. CoRR, abs/1705.04304, 2017.<br />
<br />
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for machine comprehension of text. In EMNLP, 2016.<br />
<br />
John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In NIPS, 2015.<br />
<br />
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017.<br />
<br />
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.<br />
<br />
Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1047–1055. ACM, 2017.<br />
<br />
Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.<br />
<br />
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017. <br />
<br />
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.<br />
<br />
Shuohang Wang and Jing Jiang. Machine comprehension using match-lstm and answer pointer. In ICLR, 2017.<br />
<br />
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural qa as simple as possible but not simpler. In CoNLL, 2017.<br />
<br />
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.<br />
<br />
Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question<br />
answering. In ICLR, 2017.<br />
<br />
Stanford NLP Group. Squad 2.0: The Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuAD-explorer/. Accessed October 24, 2018.</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs&diff=36934stat946F18/Beyond Word Importance Contextual Decomposition to Extract Interactions from LSTMs2018-10-20T03:47:40Z<p>D35kumar: /* Linearizing activation functions */</p>
<hr />
<div>== Introduction ==<br />
The main reason behind the recent success of LSTM (Long Short-Term Memory Networks) and deep neural networks, in general, has been their ability to model complex and non-linear interactions. Our inability to fully comprehend these relationships has led to these state-of-the-art models being regarded as black-boxes. The paper "Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs" by W. James Murdoch, Peter J. Liu, and Bin Yu propose an interpretation algorithm called Contextual Decomposition(CD) for analyzing individual predictions made by the LSTMs without any change to the underlying original model. The problem of sentiment analysis is chosen for the evaluation of the model.<br />
<br />
==Overview of previous work==<br />
<br />
There has been research towards developing methods to understand the evaluations provided by LSTMs. Some of them are in line with the work done in this particular paper while others have followed some different approaches.<br />
<br />
Approaches similar to the one provided in this paper - All of the approaches in this category have tried to look into computing just the word-level importance scores with varying evaluation methods.<br />
#Murdoch & Szlam (2017): introduced a decomposition of the LSTM's output embedding and learned the importance score of certain words and phrases from those words (sum of the importance of words). A classifier is then built that searches for these learned phrases that are important and predicts the associated class which is then compared with the output of the LSTMs for validation.<br />
#Li et al. (2016): (Leave one out) They observed the change in log probability of a function by replacing a word vector (with zero) and studying the change in the prediction of the LSTM. It is completely anecdotal. Additionally provided an RL model to find the minimal set of words that must be erased to change the model's decision (although this is irrelevant to interpretability).<br />
#Sundararajan et al. 2017 (Integrated Gradients): a general gradient based technique to learn importance evaluated theoretically and empirically. Built up as an improvement to methods which were trying to do something quite similar. It is tested on image, text and chemistry models.<br />
<br />
Decomposition-based approaches for CNN:<br />
#Bach et al. 2015: proposed a solution to the problem of understanding classification decisions of CNNs by pixel-wise decomposition. Pixel contributions are visualized as heat maps.<br />
#Shrikumar et al. 2016 (DeepLift): an algorithm based on a method similar to backpropagation to learn the importance score for the inputs for a given output. The algorithm to learn these input scores is not dependent on the gradient therefore learning can also happen if the gradient is zero during backpropagation.<br />
<br />
Focussing on analyzing the gate activations:<br />
#Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. For eg. a cell being activated for keeping track of open parathesis or quotes.<br />
#Strobelt et al. 2016: built a visual tool for understanding and analyzing raw gate activations.<br />
<br />
Attention-based models:<br />
#Bahdanau et al. (2014): These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. But the problem with it is twofold. Firstly, it is only an indirect indicator of importance and does not provide directionality. Secondly, they have not been evaluated empirically or otherwise as an interpretation technique. Although they have been used in multiple other applications and architectures for solving a variety of problems.<br />
<br />
==Long Short-Term Memory Networks==<br />
Over the past few years, LSTM has become a core component of neural NLP systems and sequence modeling systems in general. LSTMs are a special kind of Recurrent Neural Network(RNNs) which in many cases work better than the standard RNN by solving the vanishing gradient problem. To put it simply they are much more efficient in learning long-term dependencies. Like a standard RNN, LSTMs are made up of chains of repeating modules. The difference is that the modules are little more complicated. Instead of having a single tanh layer like in an RNN, they have four (called gates), interacting in a special way. Additionally, they have a cell state which runs through the entire chain of the network. It helps in managing the information from the previous cells in the chain.<br />
<br />
Let's now define it more formally and mathematically. Given a sequence of word embeddings <math>x_1, ..., x_T \in R^{d_1}</math>, a cell and state vector <math>c_t, h_t \in R^{d_2}</math> are computed for each element by iteratively applying the below equations, with initializations <math>h_0 = c_0 = 0</math>.<br />
<br />
\begin{align}<br />
o_t = \sigma(W_ox_t + V_oh_{t−1} + b_o)<br />
\end{align}<br />
\begin{align}<br />
f_t = \sigma(W_fx_t + V_fh_{t−1} + b_f)<br />
\end{align}<br />
\begin{align}<br />
i_t = \sigma(W_ix_t + V_ih_{t−1} + b_i)<br />
\end{align}<br />
\begin{align}<br />
g_t = tanh(W_gx_t + V_gh_{t−1} + b_g)<br />
\end{align}<br />
\begin{align}<br />
c_t = f_t \odot c_{t−1} + i_t \odot g_t<br />
\end{align}<br />
\begin{align}<br />
h_t = o_t \odot tanh(c_t)<br />
\end{align}<br />
<br />
Where <math>W_o, W_i, W_f , W_g \in R^{{d_1}×{d_2}} , V_o, V_f , V_i , V_g \in R^{{d_2}×{d_2}}, b_o, b_g, b_i, b_g \in R^{d_2} </math> and <math> \odot </math> denotes element-wise multiplication. <math> o_t, f_t </math> and <math> i_t </math> are often referred to as output, forget and input gates, respectively, due to the fact that their values are bounded between 0 and 1, and that they are used in element-wise multiplication. Intuitively we can think of the forget gate as how much previous memory(information) do we want to forget; input gate as controlling whether or not to let new input in; g gate controlling what do we want to add and finally the output gate as controlling how much the current information(at current time step) should flow out.<br />
<br />
After processing the full sequence of words, the final state <math>h_T</math> is fed to a multinomial logistic regression, to return a probability distribution over C classes.<br />
<br />
\begin{align}<br />
p_j = SoftMax(Wh_T)_j = \frac{\exp(W_jh_T)}{\sum_{k=1}^C\exp(W_kh_T) }<br />
\end{align}<br />
<br />
==Contextual Decomposition(CD) of LSTM==<br />
CD decomposes the output of the LSTM into a sum of two contributions:<br />
# those resulting solely from the given phrase<br />
# those involving other factors<br />
<br />
One important thing that is crucial to understand is that this method does not affect the architecture or the predictive accuracy of the model in any way. It just takes the trained model and tries to break it down into the two components mentioned above. It takes in a particular phrase that the user wants to understand or the entire sentence and returns the vectors with the contributions.<br />
<br />
Now let's define this more formally. Let the arbitrary input phrase be <math>x_q, ..., x_r</math>, where <math>1 \leq q \leq r \leq T </math>, where T represents the length of the sentence. CD decomposes the output and cell state (<math>c_t, h_t </math>) of each cell into a sum of 2 contributions as shown in the equations below.<br />
<br />
\begin{align}<br />
h_t = \beta_t + \gamma_t<br />
\end{align}<br />
\begin{align}<br />
c_t = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
In the decomposition <math>\beta_t </math> corresponds to the contributions given to <math> h_t </math> solely from the given phrase while <math> \gamma_t </math> denotes contribution atleast in part from the other factors. Similarly, <math>\beta_t^c </math> and <math> \gamma_t^c </math> represents the contributions given to <math> c_t </math> solely from the given phrase and atleast in part from the other factors respectively.<br />
<br />
Using this decomposition the softmax function can be represented as follows<br />
\begin{align}<br />
p = SoftMax(W\beta_T + W\gamma_T)<br />
\end{align}<br />
<br />
As this score corresponds to the input to a logistic regression, it may be interpreted in the same way as a standard logistic regression coefficient.<br />
<br />
===Disambiguation Interaction between gates===<br />
<br />
In the equations for the calculation of <math>i_t </math> and <math>g_t </math> in the LSTM, we use the contribution at that time step, <math>x_t</math> as well the output of the previous state <math>h_t</math>. Therefore when the <math>i_t \odot g_t</math> is calculated, the contributions made by <math>x_t</math> to <math>i_t</math> interact with contributions made by <math>h_t</math> to <math>g_t</math> and vice versa. This insight is used to construct the decomposition.<br />
<br />
At this stage we need to make an assumption that the non-linear operations at the gate can be represented in a linear fashion. How this is done will be explained in a later part of the summary. Therefore writing equations 1 as a linear sum of contributions from the inputs we have<br />
\begin{align}<br />
i_t &= \sigma(W_ix_t + V_ih_{t−1} + b_i) \\<br />
& = L_\sigma(W_ix_t) + L_\sigma(V_ih_{t−1}) + L_\sigma(b_i)<br />
\end{align}<br />
<br />
The important thing to notice now is that after using this linearization, the products between gates also become linear sums of contributions from the 2 factors mentioned above. To expand we can learn whether they resulted solely from the phrase (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\beta_{t-1})</math>), solely from the other factors (<math>L_\sigma(b_i) \odot L_{tanh}(V_g\gamma_{t-1})</math>) or as an interaction between the phrase and other factors (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1})</math>).<br />
<br />
Since we are able to calculate gradients values recursively in LSTMs, we would use the same procedure to recursively compute the decompositions with the initializations <math>\beta_0 = \beta_0^c = \gamma_0 = \gamma_0^c = 0</math>. The derivations can vary a little depending on cases whether the current time step is contained within the phrase (<math> q \leq t \leq r </math>) or not(<math> t < q, t > r</math>). In this summary, we will derive the equations for the former case. A very important thing to understand is that given any word/phrase within a sentence this algorithm would make a full pass over the LSTM to compute the 2 different contributions.<br />
<br />
So essentially what we are going to do now is linearize each of the gates, then expand the product of sums of these gates and then finally group the terms we get depending on which type of interaction they represent (solely from phase, solely from other factors and a combination of both).<br />
<br />
Terms are determined to be derived solely from the specific phrase if they involve products from some combination of <math>\beta_{t-1}, \beta_{t-1}^c, x_t</math> and <math> b_i </math> or <math> b_g </math>(but not both). In the other case when t is not within the phrase, products involving <math> x_t </math> are treated as not deriving from the phrase. This(the other case) can be observed by seeing the equations for this specific case in the appendix of the original paper.<br />
<br />
\begin{align}<br />
f_t\odot c_{t-1} &= (L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_f\gamma_{t-1}) + L_\sigma(b_f)) \odot (\beta_{t-1}^c + \gamma_{t-1}^c) \\<br />
& = ([L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(b_f)] \odot \beta_{t-1}^c) + (L_\sigma(V_f\gamma_{t-1}) \odot \beta_{t-1}^c + f_t \odot \gamma_{t-1}^c) \\<br />
& = \beta_t^f + \gamma_t^f<br />
\end{align}<br />
<br />
Similarly<br />
<br />
\begin{align}<br />
i_t\odot g_t &= [(L_\sigma(W_ix_t) + L_\sigma(V_i\beta_{t-1}) + L_\sigma(V_i\gamma_{t-1}) + L_\sigma(b_i))] \\<br />
& \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(V_g\gamma_{t-1}) + L_{tanh}(b_g))] \\<br />
& = ([L_\sigma(W_ix_t) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g))] \\ <br />
&+ L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_i\beta_{t-1}) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g))] \\ <br />
&+ L_\sigma(b_i) \odot [(L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1})] \\ <br />
&+ [L_\sigma(V_i\gamma_{t-1}) \odot g_t + i_t \odot L_{tanh}(V_g\gamma_{t-1}) - L_\sigma(V_i\gamma_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1}) \\ <br />
&+ L_\sigma(b_i) \odot L_{tanh}(b_g)] \\<br />
& = \beta_t^u + \gamma_t^u<br />
\end{align}<br />
<br />
Thus we can represent <math>c_t</math> as<br />
<br />
\begin{align}<br />
c_t &= \beta_t^f + \gamma_t^f + \beta_t^u + \gamma_t^u \\<br />
& = \beta_t^f + \beta_t^u + \gamma_t^f + \gamma_t^u \\<br />
& = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
So once we have the decomposition of <math> c_t </math>, then we can rather simply calculate the transformation of <math> h_t </math> by linearizing the <math> tanh</math> function. Again at this point, we just assume that a linearizing function for <math> tanh </math> exists. Similar to the decomposition of the forget and input gate we can decompose the output gate as well but empirically it was found that it did not produce improved results. So finally <math> h_t </math> can be written as<br />
<br />
\begin{align}<br />
h_t &= o_t \odot tanh(c_t) \\<br />
& = o_t \odot [L_{tanh}(\beta_t^c) + L_{tanh}(\gamma_t^c)] \\<br />
& = o_t \odot L_{tanh}(\beta_t^c) + o_t \odot L_{tanh}(\gamma_t^c) \\<br />
& = \beta_t + \gamma_t<br />
\end{align}<br />
<br />
===Linearizing activation functions ===<br />
<br />
This section will explain the big assumption that we took earlier about the linearizing functions <math> L_{\sigma} </math> and <math> L_{tanh} </math>. For arbitrary { <math> y_1, ..., y_N </math> } <math> \in R </math>, the problem that we intend to solve is essentially<br />
\begin{align}<br />
tanh(\sum_{i=1}^Ny_i) = \sum_{i=1}^NL_{tanh}(y_i)<br />
\end{align}<br />
<br />
In cases where {<math> y_i </math>} follow a natural ordering, work in Murdoch & Szlam, 2017 where the difference of partial sums is utilized as a linearization technique could be used. This could be shown by the equation below<br />
\begin{align}<br />
L^{'}_{tanh}(y_k) = tanh(\sum_{j=1}^ky_j) - tanh(\sum_{j=1}^{k-1}y_j)<br />
\end{align}<br />
<br />
But in our case the terms do not follow any particular ordering, for e.g. while calculating <math> i_t </math> we could write it as a sum of <math> W_ix_t, V_ih_{t−1}, b_i </math> or <math> b_i, V_ih_{t−1}, W_ix_t </math>. Thus, we average over all the possible orderings. Let <math>\pi_i, ..., \pi_{M_n} </math> denote the set of all permutations of <math>1, ..., N</math>, then the score could be given as below<br />
<br />
\begin{align}<br />
L_{tanh}(y_k) = \frac{1}{M_N}\sum_{i=1}^{M_N}[tanh(\sum_{j=1}^{\pi_i^{-1}(k)}y_{\pi_i(j)}) - tanh(\sum_{j=1}^{\pi_i^{-1}(k) - 1}y_{\pi_i(j)})]<br />
\end{align}<br />
<br />
We can similarly derive <math> L_{\sigma} </math>. An important empirical observation to note here is that in the case when one of the terms of the decomposition is a bias, improvements were seen when restricting to permutations where the bias was the first term.<br />
In our case, the value of N only ranges from 2 to 4, which makes the linearization take very simple forms. An example of a case where N=2 is shown below.<br />
<br />
\begin{align}<br />
L_{tanh}(y_1) = \frac{1}{2}([tanh(y_1) - tanh(0)] + [tanh(y_2 + y_1) - tanh(y_1)])<br />
\end{align}<br />
<br />
==Experiments==<br />
As mentioned earlier, the empirical validation of CD is done on the task of sentiment analysis. The paper verifies the following 3 tasks with the experiments:<br />
# It should work on the standard problem of word-level importance scores<br />
# It should behave for words as well as phrases especially in situations involving compositionality.<br />
# It should be able to extract instances of positive and negative negation.<br />
<br />
An important fact worth mentioning again is that the primary objective of the paper is to produce meaningful interpretations on a pre-trained LSTM model rather than achieving the state-of-the-art results on the task of sentiment analysis. Therefore standard practices are used for tuning the models. The models are implemented in Torch using default parameters for weight initializations. The code can be found at "https://github.com/jamie-murdoch/ContextualDecomposition". The model was trained using Adam with the learning rate of 0.001 and using early stopping on the validation set. Additionally, a bag of words linear model was used.<br />
<br />
All the experiments were performed on the Stanford Sentiment Treebank(SST) [Socher et al., 2013] dataset and the Yelp Polarity(YP) [Zhang et al., 2015] dataset. SST is a standard NLP benchmark which consists of movie reviews ranging from 2 to 52 words long. It is important to note that the SST dataset has one key feature that is perfect for this task, which is in addition to review-level labels, <br />
it also provides labels for each phrase in the binarized constituency parse tree. This enables us to examine that if the model can identify negative phrases out of a positive review, or vice versa. The word embedding used in LSTM is pretrained Glove vectors with length equal to 300, and the hidden representations of the LSTM is set to be 168. The LSTM model attained an accuracy of 87.2% whereas the logistic regression model with the bag of words features attained an accuracy of 83.2%. In the case of YP, the task is to perform a binary sentiment classification task. The reviews considered were only which were of at most 40 words. The LSTM model attained a 4.6% error as compared the 5.7% error for the regression model.<br />
<br />
<br />
===Baselines===<br />
The interpretations are compared with 4 state-of-the-art baselines for interpretability.<br />
# Cell Decomposition(Murdoch & Szlam, 2017), <br />
# Integrated Gradients (Sundararajan et al., 2017),<br />
# Leave One Out (Li et al., 2016),<br />
# Gradient times input [gradient of the output probability with respect to the word embeddings is computed which is finally reported as a dot product with the word vector]<br />
<br />
To obtain phrase scores for word-based baselines integrated gradients, cell decomposition, and gradients, the paper sums the scores of the words contained within the phrase.<br />
<br />
<br />
===Unigram(Word) Scores===<br />
Logistic regression(LR) coefficients while being sufficiently accurate for prediction are considered the gold standard for interpretability. For the task of sentiment analysis, the importance of the words is given by their coefficient values. Thus we would expect the CD scores extracted from an LSTM, to have meaningful relationships and comparison with the logistic regression coefficients. This comparison is done using scatter plots(Fig 4) which measures the Pearson correlation coefficient between the importance scores extracted by LR coefficients and LSTM. This is done for multiple words which are represented as a point in the scatter plots. For SST, CD and Integrated Gradients, with correlations of 0.76 and 0.72, respectively, are substantially better than other methods, which have correlations of at most 0.51. On Yelp, the gap is not as big, but CD is still very competitive, having correlation 0.52 with other methods ranging from 0.34 to 0.56. The complete results are shown in Table 4.<br />
[[File:Dhruv Table4.png|600px|centre]]<br />
<br />
===Benefits===<br />
Having verified reasonably strong results in the base case, the paper then proceeds to show the benefits of CD.<br />
====Identifying Dissenting Subphrases====<br />
First, the paper shows that the existing methods are not able to recognize sub-phrases in a phrase(a phrase is considered to be of at most 5 words) with different sentiments. For example, consider the phrase "used to be my favorite". The word "favorite" is strongly positive which is also shown by it having a high linear regression coefficient. Nonetheless, the existing methods identify "favorite" as being highly negative or neutral in this context. However, as shown in table 1 CD is able to correctly identify it being strongly positive, and the subphrase "used to be" as highly negative. This particular identification is itself the main reason for using the LSTMs over other methods in text comprehension. Thus, it is quite important that an interpretation algorithm is able to properly uncover how these interactions are being handled. A search across the datasets is done to find similar cases where a negative phrase contains a positive sub-phrase and vice versa. Phrases are scored using the logistic regression over n-gram features and included if their overall score is over 1.5.<br />
[[File:Dhruv Table1.png|600px|centre]]<br />
It is to be noted that for an efficient interpretation algorithm the distribution of scores for these positive and negative dissenting subphrases should be significantly separate with the positive subphrases having positive scores and vice-versa. However, as shown in figure 2, this is not the case with the previous interpretation algorithms.<br />
<br />
====Examining High-Level Compositionality====<br />
The paper now studies the cases where a sizable portion of a review(between one and two-thirds) has a different polarity from the final sentiment. An example is shown in Table 2. SST contains phrase-level sentiment labels too. Therefore the authors conduct a search in SST where a sizable phrase in a review is of the opposite polarity than the review-level label. The figure shows the distribution of the resulting positive and negative phrases for different attribution methods. We should note that a successful interpretation method would have a sizeable gap between these two distributions. Notice that the previous methods fail to satisfy this criterion. The paper additionally provides a two-sample Kolmogorov-Smirnov one-sided test statistic, to quantify this difference in performance. This statistic is a common difference measure for the difference of distributions with values ranging from 0 to 1. As shown in Figure 3 CD gets a score of 0.74 while the other models achieve a score of 0(Cell decomposition), 0.33(Integrated Gradients), 0.58(Leave One Out) and 0.61(gradient). The methods leave one out and gradient perform relatively better than the other 2 baselines but they were the weakest performers in the unigram scores. This inconsistency in other methods performance further strengthens the superiority of CD.<br />
[[File:Dhruv Table2.png|600px|centre]]<br />
<br />
====Capturing Negation====<br />
The paper also shows a way to empirically show how the LSTMs capture negations in a sentence. To search for negations, the following list of negation words were used: not, n’t, lacks, nobody, nor, nothing, neither, never, none, nowhere, remotely. Again using the phrase labels present in SST, the authors search over the training set for instances of negation. Both the positive as well as negative negations are identified. For a given negation phrase, they extract a negation interaction by computing the CD score of the entire phrase and subtracting the CD scores of the phrase being negated and the negation term itself. The resulting score can be interpreted as an n-gram feature. Apart from CD, only leave one out is capable of producing such interaction scores. The distribution of extracted scores is presented in Figure 1. For CD there is a clear distinction between positive and negative negations. Leave one out is able to capture some of the interactions, but has a noticeable overlap between positive and negative negations around zero, indicating a high rate of false negatives.<br />
<br />
====Identifying Similar Phrases====<br />
A key aspect of the CD algorithm is that it helps us learn the value of <math> \beta_t </math> which is essentially a dense embedding vector for a word or a phrase. The way the authors did in the paper is to calculate the AVG(<math> \beta_t </math> ), and then, using similarity measures in the embedding space(eg. cosine similarity) we can easily find similar phrases/words given a phrase/word. The results as shown in Table 3 are qualitatively sensible for 3 different kinds of interactions: positive negation, negative negation and modification, as well as positive and negative words.<br />
[[File:Dhruv Table3.png|600px|centre]]<br />
<br />
==Conclusions==<br />
The paper provides an algorithm called Contextual Decomposition(CD) to interpret predictions made by LSTMs without modifying their architecture. It takes in a trained LSTM and breaks it down in components and quantifies the interpretability of its decision. In both NLP and<br />
in general applications CD produces importance scores for words (single variables in general), phrases (several variables together) and word interactions (variable interactions). It also compares the algorithm with state-of-the-art baselines and shows that it performs favorably. It also shows that CD is capable of identifying phrases of varying sentiment and extracting meaningful word (or variable) interactions. It shows the shortcomings of the traditional word-based interpretability approaches for understanding LSTMs and advances the state-of-the-art.<br />
<br />
==Critique/Future Work==<br />
While the method itself is novel in the sense that it moves past the traditional approach of looking just at word level importance scores; it only looks at one specific architecture which is applied to a very simple problem. The authors don't talk about any future directions for this work in the paper itself but a discussion about it happened during the Oral presentation of the paper at ICLR 2018. Following are the importance points:<br />
#We could look at interpreting a more complex model, for example, say seq2seq. The author pointed out that he was affirmative that this model could be extended for such purposes although the computational complexity would increase since we would be predicting multiple outputs in this case.<br />
#We could also look at whether this approach could be generalized to completely different architectures like CNN. As of now given a new model, we need to manually work out the math for the specific model. Could we develop some general approach towards this? Although the author pointed out that they are working towards using this approach to interpret CNNs.<br />
<br />
==References==<br />
W. James Murdoch, Peter J. Liu, Bin Yu. Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs. ICLR 2018<br />
<br />
Sebastian Bach, Alexander Binder, Gregoire Montavon, Frederick Klauschen, Klaus-Robert Muller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural Computation, 9(8): 1735–1780, 1997.<br />
<br />
Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.<br />
<br />
Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016. URL http://arxiv.org/abs/1612.08220.<br />
<br />
W James Murdoch and Arthur Szlam. Automatic rule extraction from long short-term memory networks. ICLR, 2017.<br />
<br />
Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.<br />
<br />
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017<br />
<br />
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.<br />
<br />
Hendrik Strobelt, Sebastian Gehrmann, Bernd Huber, Hanspeter Pfister, and Alexander M Rush. Visual analysis of hidden state dynamics in recurrent neural networks. arXiv preprint arXiv:1606.07461, 2016.<br />
<br />
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. CoRR, abs/1703.01365, 2017. URL http://arxiv.org/abs/1703.01365<br />
<br />
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657, 2015<br />
<br />
==Appendix==<br />
<br />
[[File:Dhruv Figure4.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure2.png|600px|right]]<br />
<br />
<br />
[[File:Dhruv Figure3.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure1.png|600px|right]]</div>D35kumarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs&diff=36932stat946F18/Beyond Word Importance Contextual Decomposition to Extract Interactions from LSTMs2018-10-20T01:22:02Z<p>D35kumar: /* Disambiguation Interaction between gates */</p>
<hr />
<div>== Introduction ==<br />
The main reason behind the recent success of LSTM (Long Short-Term Memory Networks) and deep neural networks, in general, has been their ability to model complex and non-linear interactions. Our inability to fully comprehend these relationships has led to these state-of-the-art models being regarded as black-boxes. The paper "Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs" by W. James Murdoch, Peter J. Liu, and Bin Yu propose an interpretation algorithm called Contextual Decomposition(CD) for analyzing individual predictions made by the LSTMs without any change to the underlying original model. The problem of sentiment analysis is chosen for the evaluation of the model.<br />
<br />
==Overview of previous work==<br />
<br />
There has been research towards developing methods to understand the evaluations provided by LSTMs. Some of them are in line with the work done in this particular paper while others have followed some different approaches.<br />
<br />
Approaches similar to the one provided in this paper - All of the approaches in this category have tried to look into computing just the word-level importance scores with varying evaluation methods.<br />
#Murdoch & Szlam (2017): introduced a decomposition of the LSTM's output embedding and learned the importance score of certain words and phrases from those words (sum of the importance of words). A classifier is then built that searches for these learned phrases that are important and predicts the associated class which is then compared with the output of the LSTMs for validation.<br />
#Li et al. (2016): (Leave one out) They observed the change in log probability of a function by replacing a word vector (with zero) and studying the change in the prediction of the LSTM. It is completely anecdotal. Additionally provided an RL model to find the minimal set of words that must be erased to change the model's decision (although this is irrelevant to interpretability).<br />
#Sundararajan et al. 2017 (Integrated Gradients): a general gradient based technique to learn importance evaluated theoretically and empirically. Built up as an improvement to methods which were trying to do something quite similar. It is tested on image, text and chemistry models.<br />
<br />
Decomposition-based approaches for CNN:<br />
#Bach et al. 2015: proposed a solution to the problem of understanding classification decisions of CNNs by pixel-wise decomposition. Pixel contributions are visualized as heat maps.<br />
#Shrikumar et al. 2016 (DeepLift): an algorithm based on a method similar to backpropagation to learn the importance score for the inputs for a given output. The algorithm to learn these input scores is not dependent on the gradient therefore learning can also happen if the gradient is zero during backpropagation.<br />
<br />
Focussing on analyzing the gate activations:<br />
#Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. For eg. a cell being activated for keeping track of open parathesis or quotes.<br />
#Strobelt et al. 2016: built a visual tool for understanding and analyzing raw gate activations.<br />
<br />
Attention-based models:<br />
#Bahdanau et al. (2014): These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. But the problem with it is twofold. Firstly, it is only an indirect indicator of importance and does not provide directionality. Secondly, they have not been evaluated empirically or otherwise as an interpretation technique. Although they have been used in multiple other applications and architectures for solving a variety of problems.<br />
<br />
==Long Short-Term Memory Networks==<br />
Over the past few years, LSTM has become a core component of neural NLP systems and sequence modeling systems in general. LSTMs are a special kind of Recurrent Neural Network(RNNs) which in many cases work better than the standard RNN by solving the vanishing gradient problem. To put it simply they are much more efficient in learning long-term dependencies. Like a standard RNN, LSTMs are made up of chains of repeating modules. The difference is that the modules are little more complicated. Instead of having a single tanh layer like in an RNN, they have four (called gates), interacting in a special way. Additionally, they have a cell state which runs through the entire chain of the network. It helps in managing the information from the previous cells in the chain.<br />
<br />
Let's now define it more formally and mathematically. Given a sequence of word embeddings <math>x_1, ..., x_T \in R^{d_1}</math>, a cell and state vector <math>c_t, h_t \in R^{d_2}</math> are computed for each element by iteratively applying the below equations, with initializations <math>h_0 = c_0 = 0</math>.<br />
<br />
\begin{align}<br />
o_t = \sigma(W_ox_t + V_oh_{t−1} + b_o)<br />
\end{align}<br />
\begin{align}<br />
f_t = \sigma(W_fx_t + V_fh_{t−1} + b_f)<br />
\end{align}<br />
\begin{align}<br />
i_t = \sigma(W_ix_t + V_ih_{t−1} + b_i)<br />
\end{align}<br />
\begin{align}<br />
g_t = tanh(W_gx_t + V_gh_{t−1} + b_g)<br />
\end{align}<br />
\begin{align}<br />
c_t = f_t \odot c_{t−1} + i_t \odot g_t<br />
\end{align}<br />
\begin{align}<br />
h_t = o_t \odot tanh(c_t)<br />
\end{align}<br />
<br />
Where <math>W_o, W_i, W_f , W_g \in R^{{d_1}×{d_2}} , V_o, V_f , V_i , V_g \in R^{{d_2}×{d_2}}, b_o, b_g, b_i, b_g \in R^{d_2} </math> and <math> \odot </math> denotes element-wise multiplication. <math> o_t, f_t </math> and <math> i_t </math> are often referred to as output, forget and input gates, respectively, due to the fact that their values are bounded between 0 and 1, and that they are used in element-wise multiplication. Intuitively we can think of the forget gate as how much previous memory(information) do we want to forget; input gate as controlling whether or not to let new input in; g gate controlling what do we want to add and finally the output gate as controlling how much the current information(at current time step) should flow out.<br />
<br />
After processing the full sequence of words, the final state <math>h_T</math> is fed to a multinomial logistic regression, to return a probability distribution over C classes.<br />
<br />
\begin{align}<br />
p_j = SoftMax(Wh_T)_j = \frac{\exp(W_jh_T)}{\sum_{k=1}^C\exp(W_kh_T) }<br />
\end{align}<br />
<br />
==Contextual Decomposition(CD) of LSTM==<br />
CD decomposes the output of the LSTM into a sum of two contributions:<br />
# those resulting solely from the given phrase<br />
# those involving other factors<br />
<br />
One important thing that is crucial to understand is that this method does not affect the architecture or the predictive accuracy of the model in any way. It just takes the trained model and tries to break it down into the two components mentioned above. It takes in a particular phrase that the user wants to understand or the entire sentence and returns the vectors with the contributions.<br />
<br />
Now let's define this more formally. Let the arbitrary input phrase be <math>x_q, ..., x_r</math>, where <math>1 \leq q \leq r \leq T </math>, where T represents the length of the sentence. CD decomposes the output and cell state (<math>c_t, h_t </math>) of each cell into a sum of 2 contributions as shown in the equations below.<br />
<br />
\begin{align}<br />
h_t = \beta_t + \gamma_t<br />
\end{align}<br />
\begin{align}<br />
c_t = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
In the decomposition <math>\beta_t </math> corresponds to the contributions given to <math> h_t </math> solely from the given phrase while <math> \gamma_t </math> denotes contribution atleast in part from the other factors. Similarly, <math>\beta_t^c </math> and <math> \gamma_t^c </math> represents the contributions given to <math> c_t </math> solely from the given phrase and atleast in part from the other factors respectively.<br />
<br />
Using this decomposition the softmax function can be represented as follows<br />
\begin{align}<br />
p = SoftMax(W\beta_T + W\gamma_T)<br />
\end{align}<br />
<br />
As this score corresponds to the input to a logistic regression, it may be interpreted in the same way as a standard logistic regression coefficient.<br />
<br />
===Disambiguation Interaction between gates===<br />
<br />
In the equations for the calculation of <math>i_t </math> and <math>g_t </math> in the LSTM, we use the contribution at that time step, <math>x_t</math> as well the output of the previous state <math>h_t</math>. Therefore when the <math>i_t \odot g_t</math> is calculated, the contributions made by <math>x_t</math> to <math>i_t</math> interact with contributions made by <math>h_t</math> to <math>g_t</math> and vice versa. This insight is used to construct the decomposition.<br />
<br />
At this stage we need to make an assumption that the non-linear operations at the gate can be represented in a linear fashion. How this is done will be explained in a later part of the summary. Therefore writing equations 1 as a linear sum of contributions from the inputs we have<br />
\begin{align}<br />
i_t &= \sigma(W_ix_t + V_ih_{t−1} + b_i) \\<br />
& = L_\sigma(W_ix_t) + L_\sigma(V_ih_{t−1}) + L_\sigma(b_i)<br />
\end{align}<br />
<br />
The important thing to notice now is that after using this linearization, the products between gates also become linear sums of contributions from the 2 factors mentioned above. To expand we can learn whether they resulted solely from the phrase (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\beta_{t-1})</math>), solely from the other factors (<math>L_\sigma(b_i) \odot L_{tanh}(V_g\gamma_{t-1})</math>) or as an interaction between the phrase and other factors (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1})</math>).<br />
<br />
Since we are able to calculate gradients values recursively in LSTMs, we would use the same procedure to recursively compute the decompositions with the initializations <math>\beta_0 = \beta_0^c = \gamma_0 = \gamma_0^c = 0</math>. The derivations can vary a little depending on cases whether the current time step is contained within the phrase (<math> q \leq t \leq r </math>) or not(<math> t < q, t > r</math>). In this summary, we will derive the equations for the former case. A very important