Dynamic Routing Between Capsules STAT946 (2018-04-04T14:53:03Z)<p>H5tahir: /* MultiMNIST */</p>
<hr />
<div>= Presented by =<br />
<br />
Yang, Tong(Richard)<br />
<br />
= Contributions =<br />
<br />
This paper introduces the concept of "capsules" and an approach to implementing this concept in neural networks. A capsule is a group of neurons used to represent various properties of an entity/object present in the image, such as pose, deformation, and even the existence of the entity. Instead of the obvious representation of a logistic unit for the probability of existence, the paper explores using the length of the capsule output vector to represent existence, and the orientation to represent other properties of the entity. The paper makes the following major contributions:<br />
<br />
* Proposed an alternative approach to max-pooling, which is called routing-by-agreement.<br />
* Demonstrated a mathematical structure for capsule layers and a routing mechanism that together form a prototype architecture for capsule networks. <br />
* Presented promising results for CapsNet that confirm its value as a new direction for development in deep learning.<br />
<br />
= Hinton's Critiques on CNN =<br />
<br />
In a past talk, Hinton explained why he regards max-pooling as the biggest problem in current convolutional network architectures. Here are some highlights from his talk. <br />
<br />
== Four arguments against pooling ==<br />
<br />
* It is a bad fit to the psychology of shape perception: It does not explain why we assign intrinsic coordinate frames to objects and why they have such huge effects.<br />
<br />
* It solves the wrong problem: We want equivariance, not invariance. Disentangling rather than discarding.<br />
<br />
* It fails to use the underlying linear structure: It does not make use of the natural linear manifold that perfectly handles the largest source of variance in images.<br />
<br />
* Pooling is a poor way to do dynamic routing: We need to route each part of the input to the neurons that know how to deal with it. Finding the best routing is equivalent to parsing the image.<br />
<br />
===Intuition Behind Capsules ===<br />
We try to achieve viewpoint invariance in the activities of neurons by doing max-pooling. Invariance here means that when the input changes a little, the output stays the same, where the activity is simply the output signal of a neuron. In other words, when the object we want to detect is shifted slightly in the input image, the network's activities (the outputs of its neurons) will not change because of max-pooling, and the network will still detect the object. However, spatial relationships are not preserved by this approach, so capsules are used instead, because they encapsulate all important information about the state of the features they detect in the form of a vector. Capsules encode the probability that a feature is detected as the length of their output vector, and the state of the detected feature as the direction in which that vector points. So when a detected feature moves around the image or its state changes, the probability stays the same (the length of the vector does not change), but the orientation of the vector changes.<br />
<br />
== Equivariance ==<br />
<br />
To deal with the invariance problem of CNNs, Hinton proposes a concept called equivariance, which is the foundation of the capsule concept.<br />
<br />
=== Two types of equivariance ===<br />
<br />
==== Place-coded equivariance ====<br />
If a low-level part moves to a very different position it will be represented by a different capsule.<br />
<br />
==== Rate-coded equivariance ====<br />
If a part only moves a small distance it will be represented by the same capsule but the pose outputs of the capsule will change.<br />
<br />
Higher-level capsules have bigger domains so low-level place-coded equivariance gets converted into high-level rate-coded equivariance.<br />
<br />
= Dynamic Routing =<br />
<br />
In the second section of the paper, the authors give a mathematical description of two key components of the routing algorithm in a capsule network: squashing and agreement. The general setting for the algorithm involves two arbitrary capsules i and j. Capsule j is an arbitrary capsule from any layer other than the first capsule layer, and capsule i is an arbitrary capsule from the layer below. The purpose of the routing algorithm is to generate a vector output for the routing decision between capsule j and capsule i; this vector output is then used to choose the dynamic routing. <br />
<br />
== Routing Algorithm ==<br />
<br />
The routing algorithm is as follows:<br />
<br />
[[File:DRBC_Figure_1.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
In the following sections, each part of this algorithm is explained in detail.<br />
<br />
=== Log Prior Probability ===<br />
<br />
<math>b_{ij}</math> represents the log prior probability that capsule i should be coupled to capsule j, and it is updated in each routing iteration. As line 2 of the algorithm shows, the initial values of <math>b_{ij}</math> for all pairs of capsules are set to 0, so <math>b_{ij}</math> equals zero in the very first routing iteration. In each routing iteration, <math>b_{ij}</math> is updated by adding the value of the agreement, which is explained later.<br />
<br />
=== Coupling Coefficient === <br />
<br />
<math>c_{ij}</math> represents the coupling coefficient between capsule j and capsule i. It is calculated by applying the softmax function on the log prior probability <math>b_{ij}</math>. The mathematical transformation is shown below (Equation 3 in paper): <br />
<br />
\begin{align}<br />
c_{ij} = \frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}<br />
\end{align}<br />
<br />
The <math>c_{ij}</math> serve as weights for computing the weighted sum and can be read as probabilities. As probabilities, they have the following properties:<br />
<br />
\begin{align}<br />
c_{ij} \geq 0, \forall i, j<br />
\end{align}<br />
<br />
and, <br />
<br />
\begin{align}<br />
\sum_{j}c_{ij} = 1, \forall i<br />
\end{align}<br />
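<br />
As a rough illustration, the softmax that turns the logits of one lower-level capsule i into coupling coefficients can be sketched in plain Python. The function name <code>coupling_coefficients</code> and the example logits are assumptions made for this sketch, not part of the paper.<br />

```python
import math

def coupling_coefficients(b_i):
    """Softmax over the routing logits b_ij of a single lower-level capsule i.

    b_i holds one logit per parent capsule j (hypothetical example input).
    """
    m = max(b_i)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(b - m) for b in b_i]
    total = sum(exps)
    return [e / total for e in exps]

# All logits start at zero, so the first iteration couples uniformly.
c_init = coupling_coefficients([0.0] * 10)

# After some updates, larger logits receive larger coefficients.
c_later = coupling_coefficients([2.0, 0.0, -1.0])
```

With ten parent capsules and all logits at zero, every coefficient is 0.1, which matches the uniform coupling used in the first routing iteration.<br />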
<br />
=== Predicted Output from Layer Below === <br />
<br />
<math>u_{i}</math> is the output vector of capsule i in the lower layer, and <math>\hat{u}_{j|i}</math> is the input vector for capsule j, the "prediction vector" from capsule i in the layer below. <math>\hat{u}_{j|i}</math> is produced by multiplying <math>u_{i}</math> by a weight matrix <math>W_{ij}</math>, as follows:<br />
<br />
\begin{align}<br />
\hat{u}_{j|i} = W_{ij}u_i<br />
\end{align}<br />
<br />
where <math>W_{ij}</math> encodes some spatial relationship between capsule j and capsule i.<br />
<br />
=== Capsule ===<br />
<br />
By using the definitions from previous sections, the total input vector for an arbitrary capsule j can be defined as:<br />
<br />
\begin{align}<br />
s_j = \sum_{i}c_{ij}\hat{u}_{j|i}<br />
\end{align}<br />
<br />
which is a weighted sum over all prediction vectors by using coupling coefficients.<br />
<br />
=== Squashing ===<br />
<br />
The length of <math>s_j</math> is arbitrary, which needs to be addressed. The next step is to rescale its length to lie between 0 and 1, since we want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The "squashing" process is shown below:<br />
<br />
\begin{align}<br />
v_j = \frac{||s_j||^2}{1+||s_j||^2}\frac{s_j}{||s_j||}<br />
\end{align}<br />
<br />
Notice that "squashing" does not simply normalize the vector to unit length. It applies an additional non-linear transformation that ensures short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below 1. The reason for doing this is to make the routing decision between capsule layers, which is called "routing by agreement", much easier to make.<br />
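<br />
The effect on short and long vectors can be checked numerically. The following is a minimal sketch of the squashing formula in plain Python; the helper names <code>squash</code> and <code>length</code> and the zero-vector guard are assumptions of this sketch.<br />

```python
import math

def length(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def squash(s):
    """v = (|s|^2 / (1 + |s|^2)) * (s / |s|), applied to a list of floats."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        # Guard added for this sketch: the zero vector stays zero.
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

short = squash([0.1, 0.0])    # |s| = 0.1, so |v| = 0.01 / 1.01
long_ = squash([30.0, 40.0])  # |s| = 50, so |v| = 2500 / 2501
```

The direction of the input is preserved in both cases; only the length is rescaled into the interval [0, 1).<br />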
<br />
=== Agreement ===<br />
<br />
The final step of a routing iteration is to compute the routing agreement <math>a_{ij}</math>, which is represented as a scalar product:<br />
<br />
\begin{align}<br />
a_{ij} = v_{j} \cdot \hat{u}_{j|i}<br />
\end{align}<br />
<br />
As mentioned in the "squashing" section, the length of <math>v_{j}</math> tends to be either close to 0 or close to 1, which affects the magnitude of <math>a_{ij}</math>. The magnitude of <math>a_{ij}</math> therefore indicates how strongly the routing algorithm agrees on taking the route between capsule j and capsule i. In each routing iteration, the log prior probability <math>b_{ij}</math> is updated by adding its agreement value, which affects how the coupling coefficients are computed in the next routing iteration. Because of the "squashing" process, we eventually end up with one capsule j whose <math>v_{j}</math> has length close to 1 while all other capsules have <math>v_{j}</math> with length close to 0, which indicates that this capsule j should be activated.<br />
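<br />
Putting the pieces together, one possible reading of the routing loop (softmax, weighted sum, squashing, agreement update) is sketched below in plain Python. The function names, the nested-list layout <code>u_hat[i][j]</code>, and the toy prediction vectors are assumptions of this sketch; it is not the authors' implementation.<br />

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def squash(s):
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def route(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector of lower capsule i for parent j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # log priors start at zero
    for _ in range(iterations):
        c = [softmax(row) for row in b]       # coupling coefficients per i
        v = []
        for j in range(n_out):
            # Weighted sum of prediction vectors, then squashing.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                # Agreement a_ij = v_j . u_hat_{j|i} is added to the log prior.
                b[i][j] += sum(v[j][d] * u_hat[i][j][d] for d in range(dim))
    return v, c  # c holds the coefficients used in the final iteration

# Toy input: three lower capsules agree on parent 0 and disagree on parent 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat, iterations=3)
```

In this toy case the parent that receives consistent predictions ends up with the longer output vector and attracts more than half of each lower capsule's coupling.<br />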
<br />
= CapsNet Architecture =<br />
<br />
The second part of the paper discusses the experimental results from a 3-layer CapsNet. The architecture can be divided into two parts, an encoder and a decoder. <br />
<br />
== Encoder == <br />
<br />
[[File:DRBC_Architecture.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
=== How many routing iterations to use? === <br />
In Appendix A of the paper, the authors show empirical results from 500 epochs of training with different numbers of routing iterations. According to their observations, more routing iterations increase the capacity of CapsNet but tend to bring an additional risk of overfitting. Moreover, CapsNet with fewer than three routing iterations is not effective in general. As a result, they suggest 3 iterations of routing for all experiments.<br />
<br />
=== Margin loss for digit existence ===<br />
<br />
The experiments include segmenting overlapping digits on the MultiMNIST data set, so the loss function has to be adjusted for the presence of multiple digits. The margin loss <math>L_k</math> for each capsule k is calculated by:<br />
<br />
\begin{align}<br />
L_k = T_k \max(0, m^+ - ||v_k||)^2 + \lambda(1 - T_k) \max(0, ||v_k|| - m^-)^2<br />
\end{align}<br />
<br />
where <math>m^+ = 0.9</math>, <math>m^- = 0.1</math>, and <math>\lambda = 0.5</math>.<br />
<br />
<math>T_k</math> is an indicator for the presence of a digit of class k; it takes the value 1 if and only if class k is present. If class k is absent, <math>\lambda</math> down-weights the loss, which stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules. The loss is designed this way because we would like the top-level capsule for digit class k to have a long instantiation vector if and only if that digit class is present in the input.<br />
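<br />
A small numerical sketch of this loss is given below in plain Python. The function name <code>margin_loss</code> and the example capsule lengths are assumptions of this sketch; summing the per-capsule terms into a total follows the paper's description of the total loss.<br />

```python
def margin_loss(lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of L_k over capsules.

    lengths : ||v_k|| for each digit capsule (hypothetical example values)
    targets : T_k, 1 if class k is present and 0 otherwise
    """
    total = 0.0
    for v_len, t in zip(lengths, targets):
        present = t * max(0.0, m_plus - v_len) ** 2
        absent = lam * (1 - t) * max(0.0, v_len - m_minus) ** 2
        total += present + absent
    return total

# Confident and correct: a long vector for the present class, short otherwise.
loss_good = margin_loss([0.95, 0.05], [1, 0])

# Uncertain: 0.4**2 for the present class plus 0.5 * 0.5**2 for the absent one.
loss_bad = margin_loss([0.5, 0.6], [1, 0])
```

Because several targets can be 1 at once, the same function covers the MultiMNIST case where two digit classes are present in one image.<br />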
<br />
=== Layer 1: Conv1 === <br />
<br />
This is the first layer of CapsNet. As in a CNN, it is simply a convolutional layer that converts pixel intensities into the activities of local feature detectors. <br />
<br />
* Layer Type: Convolutional Layer.<br />
* Input: <math>28 \times 28</math> pixels.<br />
* Kernel size: <math>9 \times 9</math>.<br />
* Number of Kernels: 256.<br />
* Activation function: ReLU.<br />
* Output: <math>20 \times 20 \times 256</math> tensor.<br />
<br />
=== Layer 2: PrimaryCapsules ===<br />
<br />
The second layer is formed by 32 primary 8D capsules. By 8D, it means that each primary capsule contains 8 convolutional units with a <math>9 \times 9</math> kernel and a stride of 2. Each capsule will take a <math>20 \times 20 \times 256</math> tensor from Conv1 and produce an output of a <math>6 \times 6 \times 8</math> tensor.<br />
<br />
* Layer Type: Convolutional Layer<br />
* Input: <math>20 \times 20 \times 256</math> tensor.<br />
* Number of capsules: 32.<br />
* Number of convolutional units in each capsule: 8.<br />
* Size of the output grid of each convolutional unit: <math>6 \times 6</math>.<br />
* Output: <math>6 \times 6 \times 32</math> 8-dimensional vectors.<br />
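<br />
The spatial sizes quoted for Conv1 and PrimaryCapsules can be reproduced with the usual output-size formula for a convolution. The sketch below assumes no padding and a stride of 1 for Conv1 (the text above only states the stride of 2 for PrimaryCapsules); the helper name <code>conv_output_size</code> is hypothetical.<br />

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Output width of a convolution without padding (assumed here)."""
    return (input_size - kernel_size) // stride + 1

conv1_size = conv_output_size(28, 9, stride=1)            # Conv1 output grid
primary_size = conv_output_size(conv1_size, 9, stride=2)  # PrimaryCapsules grid

# 32 capsule channels, each producing a primary_size x primary_size grid
# of 8-dimensional vectors.
num_primary_capsules = primary_size * primary_size * 32
```

Under these assumptions the grids come out as 20 by 20 and 6 by 6, giving 1152 primary capsule outputs in total.<br />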
<br />
=== Layer 3: DigitsCaps ===<br />
<br />
The last layer has 10 16D capsules, one for each digit. Unlike the PrimaryCapsules layer, this layer is fully connected. Since this is the top capsule layer, the dynamic routing mechanism is applied between PrimaryCapsules and DigitsCaps. The process begins by transforming the outputs of the PrimaryCapsules layer into prediction vectors. Each output is an 8-dimensional vector, which needs to be mapped to a 16-dimensional space. Therefore, the weight matrix <math>W_{ij}</math> is an <math>8 \times 16</math> matrix. The next step is to obtain the coupling coefficients from the routing algorithm and to perform "squashing" to get the output. <br />
<br />
* Layer Type: Fully connected layer.<br />
* Input: <math>6 \times 6 \times 32</math> 8-dimensional vectors.<br />
* Output: <math>16 \times 10 </math> matrix.<br />
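<br />
To make the shapes concrete, the sketch below maps one 8-dimensional capsule output to a 16-dimensional prediction vector and counts the transformation weights. It assumes a row-vector convention (an 8-vector times an 8 by 16 matrix), one separate <math>W_{ij}</math> per pair of primary and digit capsule, and 6 x 6 x 32 primary capsules; the function name and the weight values are arbitrary choices for illustration.<br />

```python
def predict(u_i, W_ij):
    """Row vector u_i (length 8) times W_ij (8 x 16) gives a length-16 vector."""
    n_out = len(W_ij[0])
    return [sum(u_i[r] * W_ij[r][c] for r in range(len(u_i)))
            for c in range(n_out)]

# Arbitrary illustrative weights, not trained values.
W = [[0.01 * (r + c) for c in range(16)] for r in range(8)]
u = [1.0] * 8
u_hat = predict(u, W)

num_primary = 6 * 6 * 32   # primary capsule outputs
num_digit = 10             # digit capsules
num_weights = num_primary * num_digit * 8 * 16
num_logits = num_primary * num_digit  # one routing logit b_ij per pair
```

Under these assumptions the transformation matrices alone hold about 1.47 million weights, while the routing logits are recomputed for every input and are not learned parameters of this kind.<br />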
<br />
=== The loss function ===<br />
<br />
The target supplied to the loss function is a ten-dimensional one-hot encoded vector with 9 zeros and a single one at the position of the correct digit.<br />
<br />
<br />
== Regularization Method: Reconstruction ==<br />
<br />
This is a regularization method introduced in the implementation of CapsNet. The method adds a reconstruction loss (scaled down by 0.0005) to the margin loss during training. The authors argue that this encourages the digit capsules to encode the instantiation parameters of the input digits. During training, all reconstructions use the true label of the input image. The experimental results also confirm that adding the reconstruction regularizer enforces the pose encoding in CapsNet and thus boosts the performance of the routing procedure. <br />
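<br />
How the two terms combine can be sketched as follows. The reconstruction term is taken here to be the sum of squared differences between the input pixels and the decoder output, as described in the paper; the function names and the four-pixel example are assumptions of this sketch.<br />

```python
def reconstruction_loss(pixels, reconstruction):
    """Sum of squared differences between input and reconstructed pixels."""
    return sum((p - r) ** 2 for p, r in zip(pixels, reconstruction))

def total_loss(margin, pixels, reconstruction, scale=0.0005):
    """Margin loss plus the scaled-down reconstruction loss."""
    return margin + scale * reconstruction_loss(pixels, reconstruction)

# Tiny made-up example with four pixels instead of 28 x 28.
loss = total_loss(0.285, [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 1.0, 0.0])
```

The small scale factor keeps the reconstruction term from dominating the margin loss during training.<br />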
<br />
=== Decoder ===<br />
<br />
The decoder consists of 3 fully connected layers, which together map the DigitCaps representation to pixel intensities. The number of parameters in each layer and the activation functions used are indicated in the figure below:<br />
<br />
[[File:DRBC_Decoder.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
=== Result ===<br />
<br />
The authors include some CapsNet classification test accuracy results to justify the use of reconstruction. We can see that for CapsNet with 1 routing iteration and CapsNet with 3 routing iterations, adding reconstruction yields significant improvements on both the MNIST and MultiMNIST data sets. These improvements show the importance of routing and of the reconstruction regularizer. <br />
<br />
[[File:DRBC_Reconstruction.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
= Experiment Results for CapsNet = <br />
<br />
In this part, the authors present experimental results for CapsNet on different data sets, including MNIST and variations of MNIST such as expanded MNIST, affNIST, and MultiMNIST. They also briefly discuss performance on some other popular data sets such as CIFAR10. <br />
<br />
== MNIST ==<br />
<br />
=== Highlights ===<br />
<br />
* CapsNet achieves state-of-the-art performance on MNIST with significantly fewer parameters (the 3-layer baseline CNN model has 35.4M parameters, compared to 8.2M for CapsNet with the reconstruction network).<br />
* CapsNet with a shallow structure (3 layers) achieves performance that was previously achieved only by deeper networks.<br />
<br />
=== Interpretation of Each Capsule ===<br />
<br />
The authors report evidence that some dimensions of a capsule consistently capture a single kind of variation of the digit, while others represent global combinations of different variations, which opens up the possibility of interpreting capsules in the future. After computing the activity vector for the correct digit capsule, the authors fed perturbed versions of that activity vector to the decoder to examine the effect on the reconstruction. Some results from these perturbations are shown below, where each row shows the reconstructions when one of the 16 dimensions in the DigitCaps representation is tweaked in steps of 0.05 over the range [-0.25, 0.25]: <br />
<br />
[[File:DRBC_Dimension.png|650px|center||Source: Sabour, Frosst, Hinton, 2017]]<br />
<br />
== affNIST == <br />
<br />
The affNIST data set contains different affine transformations of the original MNIST data set. By the concept of capsules, CapsNet should gain robustness from its equivariant nature, and the results confirm this. Compared to the baseline CNN, CapsNet achieves a 13% improvement in accuracy.<br />
<br />
== MultiMNIST ==<br />
<br />
MultiMNIST is essentially an overlapped version of MNIST. An important point to notice is that this data set is generated by overlaying a digit on top of another digit from the same set but of a different class. In other words, stacking digits from the same class is not allowed in MultiMNIST. For example, stacking a 5 on a 0 is allowed, but stacking a 5 on another 5 is not. The reason is that CapsNet suffers from the "crowding" effect, which is discussed in the section on the weaknesses of CapsNet.<br />
<br />
The architecture used for training is the same as the one used for the MNIST dataset. However, the decay step of the learning rate is 10x larger to account for the larger dataset. Even with the overlap in MultiMNIST, the network is able to segment both digits separately, which shows that the network is able to capture the position and style of the objects in the image.<br />
<br />
[[File:multimnist.PNG | 700px|thumb|center|This figure shows some sample reconstructions on the MultiMNIST dataset using CapsNet. CapsNet reconstructs both of the digits in the image in different colours (green and red). It can be seen that the right most images have incorrect classifications with the 9 being classified as a 0 and the 7 being classified as an 8. ]]<br />
<br />
== Other data sets ==<br />
<br />
CapsNet is also applied to other data sets such as CIFAR10, smallNORB and SVHN. The results are not comparable with state-of-the-art performance, but they are still promising since this architecture is the very first of its kind, while other networks have been under development for a long time. The authors point out that one drawback of CapsNet is that capsules tend to account for everything in the input image: in the CIFAR10 dataset, the image backgrounds were too varied to model in a reasonably sized network, which partly explains the poorer results.<br />
<br />
= Conclusion = <br />
<br />
This paper discusses a specific part of the capsule network, namely the routing-by-agreement mechanism. <br />
<br />
The authors suggest this is a promising approach to solving the current problem with max-pooling in convolutional neural networks. We see that the design of the capsule builds upon the design of the artificial neuron, but expands it to vector form to allow for more powerful representational capabilities. It also introduces matrix weights to encode important hierarchical relationships between features of different layers. The result achieves the goal of the designers: equivariance of neuronal activity with respect to changes in the inputs, and invariance in the probabilities of feature detection. <br />
<br />
Moreover, as the authors mention, the approach in this paper is only one possible implementation of the capsule concept. Approaches like [https://openreview.net/pdf?id=HJWLfGWRb/ this] have also been proposed to test other routing techniques.<br />
<br />
The preliminary results from experiments using a simple shallow CapsNet also demonstrate strong performance, which indicates that capsules are a direction worth exploring.<br />
<br />
= Weakness of Capsule Network =<br />
<br />
* The routing algorithm introduces internal loops for each capsule. As the number of capsules and layers increases, these internal loops may greatly increase the training time. <br />
* Capsule networks suffer from a perceptual phenomenon called "crowding", which is common in human vision as well. To address this weakness, capsules have to make the very strong representational assumption that at each location of the image, there is at most one instance of the type of entity that the capsule represents. This is also the reason for not allowing digits from the same class to be overlaid when generating MultiMNIST.<br />
* Other criticisms include that the design of capsule networks requires domain knowledge or feature engineering, contrary to the abstraction-oriented goals of deep learning.<br />
<br />
= Implementations = <br />
1) TensorFlow Implementation: https://github.com/naturomics/CapsNet-Tensorflow<br />
<br />
2) Keras Implementation: https://github.com/XifengGuo/CapsNet-Keras<br />
<br />
= References =<br />
# S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017<br />
# “XifengGuo/CapsNet-Keras.” GitHub, 14 Dec. 2017, github.com/XifengGuo/CapsNet-Keras. <br />
# “Naturomics/CapsNet-Tensorflow.” GitHub, 6 Mar. 2018, github.com/naturomics/CapsNet-Tensorflow.</div>
stat946w18/Tensorized LSTMs (2018-03-28T15:29:00Z)<p>H5tahir: /* Accuracy Analysis */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, widening usually introduces additional parameters, while adding layers increases the time required for model training and evaluation. As an alternative, the paper "Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning" proposes an LSTM-based model called the Tensorized LSTM, in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
The paper also presents experiments on five challenging sequence learning tasks that show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step <math>t \in \{1, ..., T\}</math> given an observed input sequence <math>x_{1:t} = \{x_1, x_2, \cdots, x_t\}</math>, where <math>x_t \in R^R</math> and <math>y_t \in R^S</math> are vectors. An RNN learns how to use a hidden state vector <math>h_t \in R^M</math> to encapsulate the relevant features of the entire input history <math>x_{1:t}</math> (all inputs from the initial time-step up to the final step before prediction, as illustrated below) up to time-step t.<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state <math>h_t</math> is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
<math>W^h \in R^{(R+M) \times M}</math> guarantees that each hidden state passed on from the previous step has dimension M, <math>a_t \in R^M</math> is the hidden activation, and <math>\Phi(\cdot)</math> is the element-wise "tanh" function. Finally, the output <math>y_t</math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y \in R^{M \times S}</math> and <math>b^y \in R^S</math>, and <math>\varphi(\cdot)</math> can be any differentiable function. Note that "Phi" is the element-wise function that introduces non-linearity and generates the next '''hidden state''', while "Curly Phi" is applied to generate the '''output'''.<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
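<br />
The hidden-state update of Equations (1) to (3) can be sketched in plain Python as follows. The row-vector convention matches the equations above; the helper names, the sizes R = 2 and M = 3, and the constant weights are assumptions made only for this illustration.<br />

```python
import math

def matvec_row(x, W):
    """Row vector x (length n) times matrix W (n x m), as nested lists."""
    return [sum(x[r] * W[r][c] for r in range(len(x)))
            for c in range(len(W[0]))]

def rnn_step(x_t, h_prev, W_h, b_h):
    h_cat = x_t + h_prev                         # Eq. (1): concatenation, length R + M
    a_t = [a + b for a, b in zip(matvec_row(h_cat, W_h), b_h)]  # Eq. (2)
    return [math.tanh(a) for a in a_t]           # Eq. (3): element-wise tanh

R, M = 2, 3
W_h = [[0.1] * M for _ in range(R + M)]  # illustrative constant weights
b_h = [0.0] * M
h0 = [0.0] * M
h1 = rnn_step([1.0, 2.0], h0, W_h, b_h)  # every pre-activation is 0.1 * (1 + 2)
h2 = rnn_step([0.0, 0.0], h1, W_h, b_h)  # depends on the input only through h1
```

The second step receives a zero input, so any non-zero value in its result comes from the history carried by the hidden state.<br />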
<br />
However, one shortfall of the RNN is vanishing/exploding gradients. This shortfall is especially significant when modelling long-range dependencies. One alternative is to apply LSTMs (Long Short-Term Memories), which alleviate these problems by employing memory cells to preserve information for longer and by adopting gating mechanisms to modulate the information flow. Since the LSTM is successful in sequence modelling, it is natural to consider how to increase the complexity of the model to accommodate more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
We can consider the capacity of a network to consist of two components: the '''width''' (the amount of information handled in parallel) and the '''depth''' (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, which enables more flexible parameter sharing and allows the network to be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
Go through the methodology.<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Third-order tensorization of a vector]]<br />
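<br />
As a toy illustration of the definition, a vector of 24 elements can be tensorized into a matrix or a third-order tensor by reshaping. The sketch below uses nested Python lists and assumes row-major ordering of the elements, which the definition above does not prescribe; the function name <code>tensorize</code> is hypothetical.<br />

```python
def tensorize(vector, shape):
    """Reshape a flat list into nested lists of the given shape (row-major).

    All entries of shape are assumed to be positive integers.
    """
    size = 1
    for d in shape:
        size *= d
    if size != len(vector):
        raise ValueError("shape does not match vector length")
    if len(shape) == 1:
        return list(vector)
    step = size // shape[0]
    return [tensorize(vector[k * step:(k + 1) * step], shape[1:])
            for k in range(shape[0])]

v = list(range(24))
matrix = tensorize(v, (4, 6))      # second-order result
tensor3 = tensorize(v, (2, 3, 4))  # third-order result
```

The data is unchanged by the mapping; only the way it is indexed differs.<br />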
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cut down the parameter number for RNNs, since compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
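<br />
The scalability point can be illustrated with a simple count. The sketch below compares a fully connected transition, whose weight matrix has shape (R + M) x M, with a shared convolution kernel of shape K x M x M whose size does not depend on the tensor size P. The function names and the example sizes are assumptions, and the input projection of the tensorized model is left out of the count.<br />

```python
def rnn_transition_params(R, M):
    """Fully connected transition: (R + M) x M weights plus M biases."""
    return (R + M) * M + M

def shared_kernel_params(K, M):
    """Shared kernel: K x M x M weights plus M biases, independent of P."""
    return K * M * M + M

small = rnn_transition_params(10, 100)  # hidden size 100
wide = rnn_transition_params(10, 200)   # hidden size doubled
kernel = shared_kernel_params(2, 100)   # same count for any tensor size P
```

Doubling the hidden size of the plain RNN roughly quadruples its transition parameters, whereas the shared kernel stays the same size when only P grows.<br />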
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t \in R^{M}</math> to become <math>H_t \in R^{P \times M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. One can also compare <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see the figure below). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L = 3 is shown in the figure below.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} \text{if } p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} \text{if } p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi(A_t) \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L-1, P} W^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. The receptive field of <math>y_t</math> must cover only the current and previous inputs <math>x_{1:t}</math> (see the skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b^x</math>''' connect the input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b^h</math>''' convolve across layers<br />
<br />
'''3. <math> W^y</math> and <math>b^y</math>''' produce the output at each stage<br />
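<br />
As a rough sketch of equations (5) to (8), the following pure-Python function performs one tRNN update. The function name, the use of tanh for <math>\Phi</math>, and the padding convention (output location p reads rows p to p+K-1 of the concatenated state, with zeros below the last row) are assumptions made for illustration; the summary does not fix them.<br />

```python
import math


def trnn_step(x_t, H_prev, W_x, b_x, W_h, b_h):
    """One tRNN update on a P x M hidden state (hypothetical helper).

    x_t: length R, H_prev: P x M, W_x: R x M, b_x: length M,
    W_h: K x M x M (kernel size, input channels, output channels), b_h: length M.
    """
    P, M = len(H_prev), len(b_x)
    K = len(W_h)
    # eq. (5): project the input to M channels and place it at the top
    top = [sum(x_t[r] * W_x[r][m] for r in range(len(x_t))) + b_x[m]
           for m in range(M)]
    # eq. (6): the previous hidden state is shifted down by one location
    H_cat = [top] + [row[:] for row in H_prev]  # (P + 1) x M
    H_new = []
    for p in range(P):
        a = list(b_h)
        for k in range(K):
            q = p + k
            if q > P:  # zero padding below the last row (assumption)
                continue
            for i in range(M):
                for o in range(M):
                    a[o] += H_cat[q][i] * W_h[k][i][o]  # eq. (7)
        H_new.append([math.tanh(v) for v in a])  # eq. (8), Phi = tanh
    return H_new
```

With K = 2 and a kernel that only weights the upper row of each window, information injected at the top moves down one location per time step, which matches the shifting behaviour described in Part 2. Note that the number of entries in <math>W^h</math> does not depend on P.<br />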
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
These are very similar to the tRNN case; the main differences are in the memory cells of the tLSTM (<math>C_t</math>):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>G_t</math>:''' Activation of new content<br />
<br />
2. '''<math>I_t</math>:''' Input gate<br />
<br />
3. '''<math>F_t</math>:''' Forget gate<br />
<br />
4. '''<math>O_t</math>:''' Output gate<br />
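<br />
The gating in equations (11) to (13) is element-wise, so it can be sketched in a few lines of plain Python. The helper names are hypothetical and <math>\Phi</math> is taken to be tanh, as in the RNN section; the four activation arrays are assumed to be already produced by the convolution in equation (10).<br />

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def tlstm_cell_update(A_g, A_i, A_f, A_o, C_prev):
    """Apply eqs. (11)-(13) element-wise to P x M activations."""
    P, M = len(C_prev), len(C_prev[0])
    C_new = [[0.0] * M for _ in range(P)]
    H_new = [[0.0] * M for _ in range(P)]
    for p in range(P):
        for m in range(M):
            g = math.tanh(A_g[p][m])   # new content
            i = sigmoid(A_i[p][m])     # input gate
            f = sigmoid(A_f[p][m])     # forget gate
            o = sigmoid(A_o[p][m])     # output gate
            C_new[p][m] = g * i + C_prev[p][m] * f        # eq. (12)
            H_new[p][m] = math.tanh(C_new[p][m]) * o      # eq. (13)
    return C_new, H_new
```

With all activations at zero, every gate equals 0.5 and the new content is 0, so the memory cell is simply halved.<br />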
<br />
The graph below illustrates the tLSTM without memory cell convolution:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve the tLSTM and capture long-range dependencies from multiple directions, we additionally introduce a novel '''Memory Cell Convolution''', by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>\{W^h, b^h\}</math> has additional <math>\langle K \rangle</math> output channels to generate the activation <math>A^q_t ∈ R^{P×\langle K \rangle}</math> for the dynamic kernel bank <math>Q_t∈R^{P × \langle K \rangle}</math>, <math>q_{t,p}∈R^{\langle K \rangle}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Note that the paper also employs a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>, which can also stabilize the value of memory cells and help to prevent vanishing/exploding gradients. The process is illustrated below:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
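<br />
The memory cell convolution of equations (15) to (17) can be sketched as follows for the 2D case. This is only an illustration: the function names are hypothetical, and the centred window with zero padding at both ends is an assumption, since the summary does not state how the borders are handled.<br />

```python
import math


def softmax(v):
    mx = max(v)
    e = [math.exp(a - mx) for a in v]
    s = sum(e)
    return [a / s for a in e]


def memory_cell_conv(C_prev, A_q):
    """Convolve C_{t-1} (P x M) with a per-location kernel of size K.

    A_q is P x K; row p holds the unnormalised kernel for location p.
    """
    P, M = len(C_prev), len(C_prev[0])
    K = len(A_q[0])
    C_conv = []
    for p in range(P):
        q = softmax(A_q[p])          # eq. (15): normalise the K kernel entries
        row = [0.0] * M
        for k in range(K):
            src = p + k - K // 2     # centred window (assumption)
            if 0 <= src < P:         # zero padding outside the tensor
                for m in range(M):
                    # the same kernel is applied to every channel (eqs. 16-17)
                    row[m] += q[k] * C_prev[src][m]
        C_conv.append(row)
    return C_conv
```

Because the softmax weights are non-negative and sum to one, each convolved value is a weighted average of neighbouring memory cells, so its magnitude cannot exceed the largest magnitude in the window. The result replaces <math>C_{t-1}</math> in equation (18).<br />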
<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of the models in the tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundamentals:''' For each configuration, fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. One can also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters drawn from an alphabet of 205 different characters, including letters, XML markup and special symbols. We model the dataset at the character level, and try to predict the next character of the input sequence.<br />
<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: Wikipedia performance]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We have two tasks on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
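<br />
The two input orderings can be sketched as below. The helper names and the seed are assumptions for illustration; the summary only states that pixels are read in scan-line order and that pMNIST applies one fixed random permutation to every image.<br />

```python
import random


def to_pixel_sequence(image):
    """Flatten an H x W image row by row (scan-line order)."""
    return [pixel for row in image for pixel in row]


def make_permutation(n, seed=0):
    """Draw one fixed random order of n positions (seed is an assumption)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def permute_sequence(seq, perm):
    """Reorder a pixel sequence with a previously drawn permutation."""
    return [seq[i] for i in perm]
```

For a 28×28 image the sequence has 784 steps, and the same permutation must be reused for all training, validation and test images.<br />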
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
[[File:33_mnist.PNG|center|thumb|800px| This figure displays a visualization of the means of the diagonal channels of the tLSTM memory cells per task. The columns indicate the time steps and the rows indicate the diagonal locations. The values are normalized between 0 and 1.]]<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. The authors then validated the model<br />
on a variety of tasks, showing its potential over other popular approaches.<br />
<br />
= Critique(to be edited) =<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=35860stat946w18/Tensorized LSTMs2018-03-28T15:26:18Z<p>H5tahir: /* Accuracy Analysis */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, widening the LSTM usually introduces additional parameters, while adding layers increases the time required for model training and evaluation. As an alternative, the paper <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> has proposed a model based on LSTM called the Tensorized LSTM, in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
The paper also presents experiments conducted on five challenging sequence learning tasks that show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = \{x_1,x_2, ···, x_t\}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. An RNN learns how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history <math>x_{1:t}</math> (all inputs from the initial time-step up to time-step t, before the prediction is made; illustration given below).<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state <math>h_t</math> is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
<math>W^h∈R^{(R+M)×M}</math> maps the concatenated vector back to dimension M, so each hidden state passed to the next step has dimension M; <math> a_t ∈R^M </math> is the hidden activation, and <math>\Phi(·)</math> is the element-wise "tanh" function. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that <math>\Phi</math> ("Phi") is the element-wise non-linearity that generates the next '''hidden state''', while <math>\varphi</math> ("Curly Phi") is applied to generate the '''output'''.<br />
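<br />
A minimal sketch of one step of equations (1) to (4) is given below, using row vectors stored as Python lists. The function name is hypothetical, <math>\Phi</math> is tanh as stated above, and the output function <math>\varphi</math> is taken to be the identity purely for illustration. The output is computed from the M-dimensional hidden state, which is what the stated shape of <math>W^y</math> requires.<br />

```python
import math


def rnn_step(x_t, h_prev, W_h, b_h, W_y, b_y):
    """One vanilla RNN step.

    x_t: length R, h_prev: length M, W_h: (R + M) x M, b_h: length M,
    W_y: M x S, b_y: length S.
    """
    h_cat = x_t + h_prev                      # eq. (1): concatenation, length R + M
    M, S = len(b_h), len(b_y)
    a_t = [sum(h_cat[i] * W_h[i][j] for i in range(len(h_cat))) + b_h[j]
           for j in range(M)]                 # eq. (2)
    h_t = [math.tanh(a) for a in a_t]         # eq. (3)
    y_t = [sum(h_t[i] * W_y[i][j] for i in range(M)) + b_y[j]
           for j in range(S)]                 # eq. (4) with identity output function
    return h_t, y_t
```

Calling this function in a loop over t, feeding each returned hidden state back in as `h_prev`, unrolls the network over the input sequence.<br />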
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
However, one shortfall of RNNs is vanishing/exploding gradients. This shortfall is especially significant when modelling long-range dependencies. One alternative is to apply LSTMs (Long Short-Term Memories), which alleviate these problems by employing memory cells to preserve information for longer, and adopting gating mechanisms to modulate the information flow. Since LSTMs are successful in sequence modelling, it is natural to consider how to increase the complexity of the model to accommodate more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
The capacity of a network can be considered to consist of two components: the '''width''' (the amount of information handled in parallel) and the '''depth''' (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, which enables more flexible parameter sharing so that the network can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
Go through the methodology.<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cut down the parameter number for RNNs, since compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>H_t∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. One can also compare <math>H_t</math> to the hidden state of a Stacked RNN (sRNN), shown in the figure below. <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L = 3 is shown in the figure below.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} \text{if } p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} \text{if } p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi(A_t) \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L-1, P} W^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. The receptive field of <math>y_t</math> must cover only the current and previous inputs <math>x_{1:t}</math> (see the skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b^x</math>''' connect the input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b^h</math>''' convolve across layers<br />
<br />
'''3. <math> W^y</math> and <math>b^y</math>''' produce the output at each stage<br />
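<br />
To make the parameter-sharing argument concrete, the following sketch counts transition parameters from the shapes given above, excluding the output layer <math>W^y, b^y</math> in both cases. The function names and the example sizes are assumptions chosen only for illustration.<br />

```python
def rnn_transition_params(R, M):
    """Vector RNN: W^h in R^{(R+M) x M} plus b^h in R^M."""
    return (R + M) * M + M


def trnn_transition_params(R, M, K):
    """tRNN: W^x (R x M), b^x (M), W^h (K x M x M), b^h (M).

    The tensor size P does not appear, because the kernel is shared
    across all P locations.
    """
    return R * M + M + K * M * M + M


# Example sizes (hypothetical): input size 10, 100 channels, kernel size 3.
R, M, K, P = 10, 100, 3, 4
wide_rnn = rnn_transition_params(R, P * M)   # vector RNN with P * M hidden units
shared = trnn_transition_params(R, M, K)     # tRNN with a P x M hidden state
print(wide_rnn, shared)
```

Both networks carry P·M hidden values per time step, but only the vector RNN's count grows (quadratically) as the state is widened; increasing P leaves the tRNN count unchanged.<br />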
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
These are very similar to the tRNN case; the main differences are in the memory cells of the tLSTM (<math>C_t</math>):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>G_t</math>:''' Activation of new content<br />
<br />
2. '''<math>I_t</math>:''' Input gate<br />
<br />
3. '''<math>F_t</math>:''' Forget gate<br />
<br />
4. '''<math>O_t</math>:''' Output gate<br />
<br />
The graph below illustrates the tLSTM without memory cell convolution:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve the tLSTM and capture long-range dependencies from multiple directions, we additionally introduce a novel '''Memory Cell Convolution''', by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>\{W^h, b^h\}</math> has additional <math>\langle K \rangle</math> output channels to generate the activation <math>A^q_t ∈ R^{P×\langle K \rangle}</math> for the dynamic kernel bank <math>Q_t∈R^{P × \langle K \rangle}</math>, <math>q_{t,p}∈R^{\langle K \rangle}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Note that the paper also employs a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>, which can also stabilize the value of memory cells and help to prevent vanishing/exploding gradients. The process is illustrated below:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of the models in the tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundamentals:''' For each configuration, fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. One can also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters drawn from an alphabet of 205 different characters, including letters, XML markup and special symbols. We model the dataset at the character level, and try to predict the next character of the input sequence.<br />
<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: Wikipedia performance]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We have two tasks on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
<br />
[[File:33_mnist.PNG|center|thumb|800px|Test ]]<br />
<br />
<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. The authors then validated the model<br />
on a variety of tasks, showing its potential over other popular approaches.<br />
<br />
= Critique(to be edited) =<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=35859stat946w18/Tensorized LSTMs2018-03-28T15:24:44Z<p>H5tahir: /* Accuracy Analysis */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, widening the LSTM usually introduces additional parameters, while adding layers increases the time required for model training and evaluation. As an alternative, the paper <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> has proposed a model based on LSTM called the Tensorized LSTM, in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
The paper also presents experiments conducted on five challenging sequence learning tasks that show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = \{x_1,x_2, ···, x_t\}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. An RNN learns how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history <math>x_{1:t}</math> (all inputs from the initial time-step up to time-step t, before the prediction is made; illustration given below).<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state <math>h_t</math> is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
Here <math>W^h∈R^{(R+M)×M}</math> is the weight matrix (its shape guarantees that each new hidden state again has dimension M), <math>b^h ∈R^M</math> is the bias, <math>a_t ∈R^M</math> is the hidden activation, and <math>\Phi(·)</math> is the element-wise "tanh" function. Finally, the output <math>y_t</math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that <math>\Phi</math> is the element-wise non-linearity that produces the next '''hidden state''', while <math>\varphi</math> is applied to generate the '''output'''.<br />
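<br />
As a minimal sketch of Eqs. (1)-(4), the plain-Python code below performs the hidden-state update and the output computation for toy sizes. The helper names (<code>vecmat</code>, <code>rnn_step</code>, <code>rnn_output</code>), the constant weights and the choice of the identity as the output function are illustrative assumptions, not taken from the paper. The output is computed from <math>h_t</math>, which matches the stated shape of <math>W^y</math> (M×S).<br />

```python
import math

def vecmat(v, W):
    """Row vector v (length n) times matrix W (n x m) -> list of length m."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def rnn_step(x_t, h_prev, W_h, b_h):
    """One hidden-state update following Eqs. (1)-(3)."""
    h_cat = x_t + h_prev                                      # Eq. (1): length R + M
    a_t = [a + b for a, b in zip(vecmat(h_cat, W_h), b_h)]    # Eq. (2)
    return [math.tanh(a) for a in a_t]                        # Eq. (3): element-wise tanh

def rnn_output(h_t, W_y, b_y):
    """Eq. (4) with the identity as output function (an assumption)."""
    return [a + b for a, b in zip(vecmat(h_t, W_y), b_y)]

# Toy sizes (illustrative): R = 2 inputs, M = 3 hidden units, S = 1 output.
R, M, S = 2, 3, 1
W_h = [[0.1] * M for _ in range(R + M)]
b_h = [0.0] * M
W_y = [[1.0] for _ in range(M)]
b_y = [0.0]

h = [0.0] * M
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h = rnn_step(x, h, W_h, b_h)      # the same weights are reused at every step
y = rnn_output(h, W_y, b_y)
```

Because <math>W^h</math> has (R+M)×M entries, doubling M roughly quadruples the number of recurrent weights, which is the scaling problem the tensorized model addresses.<br />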
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
However, one shortfall of RNNs is the vanishing/exploding gradient problem, which becomes more significant when modelling long-range dependencies. One alternative is the LSTM (Long Short-Term Memory), which alleviates these problems by employing memory cells to preserve information for longer and by adopting gating mechanisms to modulate the information flow. Since LSTMs are successful in sequence modelling, it is natural to consider how to increase the complexity of the model to accommodate more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
The capacity of a network can be viewed as consisting of two components: the '''width''' (the amount of information handled in parallel) and the '''depth''' (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To '''deepen''' the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduces a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, the paper makes the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, which enables more flexible parameter sharing so that the network can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' Extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help prevent vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
The methodology is presented in three parts.<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
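<br />
The simplest instance of this definition is a plain reshape. The sketch below, with a hypothetical helper <code>tensorize</code>, folds a vector of 12 elements into a matrix and into a third-order tensor using row-major order. It is only meant to illustrate the idea of moving from lower-order to higher-order data and is not the paper's specific construction.<br />

```python
def tensorize(vec, shape):
    """Reshape a flat list into a nested list with the given shape (row-major)."""
    size = 1
    for d in shape:
        size *= d
    if size != len(vec):
        raise ValueError("shape does not match vector length")
    if len(shape) == 1:
        return list(vec)
    step = len(vec) // shape[0]          # number of elements in each sub-tensor
    return [tensorize(vec[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

v = list(range(12))                      # first-order data: a vector
matrix = tensorize(v, (3, 4))            # second-order result: 3 x 4
cube = tensorize(v, (2, 3, 2))           # third-order result: 2 x 3 x 2
```

The number of elements is unchanged; only the index structure becomes richer, which is what later allows parameters to be shared along a chosen dimension.<br />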
<br />
[[File:VecTsor.png|320px|center||Figure 3: Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cut down the parameter number for RNNs since, compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>H_t∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. One can also compare <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see the figure below). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L = 3 is shown in the figure below.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
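<br />
The output delay can be made concrete with a small bookkeeping sketch. The hypothetical function <code>output_schedule</code> below maps each input step t to the step t+L−1 at which <math>y_t</math> is read out. How the inputs are padded after the last real input is not specified here, so the sketch only counts steps.<br />

```python
def output_schedule(T, L):
    """Map each input step t (1-based) to the step whose hidden state yields y_t."""
    delay = L - 1                        # number of delayed time-steps for depth L
    return {t: t + delay for t in range(1, T + 1)}

schedule = output_schedule(T=5, L=3)
# The recurrence has to run until the last output has been read out.
total_steps = max(schedule.values())
```

For T = 5 and L = 3 the recurrence runs for 7 steps instead of 5, so the extra cost of depth is L−1 additional steps per sequence rather than a factor of L per step.<br />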
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t-1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> a location in the tensor. The channel vector <math>h^{cat}_{t-1, p}∈R^M</math> at location p of <math>H^{cat}_{t-1}</math> (the p-th row of <math>H^{cat}_{t-1}</math>) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} \text{if } p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} \text{if } p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi(A_t) \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L-1, P} W^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. The depth L has to be chosen so that the receptive field of <math>y_t</math> only covers the current and previous inputs <math>x_{1:t}</math> (check the skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
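<br />
The tRNN update in Eqs. (5)-(8) can be sketched in plain Python as follows. The padding and alignment of the convolution are assumptions made for this illustration: the kernel size K is odd, the kernel is centred on the previous state of the same location, and positions outside the tensor are treated as zeros. The function names and toy weights are hypothetical.<br />

```python
import math

def vecmat(v, W):
    """Row vector v (length n) times matrix W (n x m) -> list of length m."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def trnn_step(x_t, H_prev, W_x, b_x, W_h, b_h):
    """One tRNN update. H_prev holds P channel vectors of length M; W_h is K x M x M."""
    P, K = len(H_prev), len(W_h)
    # Eqs. (5)-(6): place the projected input on top of the previous hidden state.
    top = [a + b for a, b in zip(vecmat(x_t, W_x), b_x)]
    H_cat = [top] + H_prev                       # P + 1 locations
    H_new = []
    for p in range(P):
        a = list(b_h)                            # Eq. (7): start from the kernel bias
        for k in range(K):
            j = p + 1 - (K - 1) // 2 + k         # location of H_cat read by tap k
            if 0 <= j < len(H_cat):              # zero padding (assumed)
                a = [ai + ci for ai, ci in zip(a, vecmat(H_cat[j], W_h[k]))]
        H_new.append([math.tanh(ai) for ai in a])    # Eq. (8)
    return H_new

# Toy configuration (illustrative): R = 2, M = 2, P = 3, K = 3.
R, M, P, K = 2, 2, 3, 3
W_x = [[1.0, 0.0], [0.0, 1.0]]
b_x = [0.0, 0.0]
W_h = [[[0.5, 0.0], [0.0, 0.5]] for _ in range(K)]   # the same kernel is used at every location
b_h = [0.0, 0.0]

H = [[0.0] * M for _ in range(P)]
history = []
for x in ([1.0, 0.0], [0.0, 0.0], [0.0, 0.0]):       # a single impulse, then zeros
    H = trnn_step(x, H, W_x, b_x, W_h, b_h)
    history.append(H)
```

In this toy setting the impulse fed at the first step reaches the bottom location only at the third update, which is why the corresponding output has to be read with a delay.<br />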
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b^x</math>''' connect the input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b^h</math>''' form the convolution kernel applied across layers<br />
<br />
'''3. <math> W^y</math> and <math>b^y</math>''' produce the output at each time-step<br />
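<br />
The effect of parameter sharing can be checked by counting parameters. The sketch below compares the input and transition parameters listed above with those of a vanilla RNN whose hidden vector has the same total size P·M. The function names and the toy sizes are illustrative assumptions, and the output layer is left out of both counts.<br />

```python
def vanilla_rnn_params(R, hidden):
    """W^h of shape (R + hidden) x hidden plus a bias of length hidden, as in Eq. (2)."""
    return (R + hidden) * hidden + hidden

def trnn_params(R, M, K):
    """W^x (R x M) and b^x (M), plus the kernel W^h (K x M x M) and b^h (M)."""
    return (R * M + M) + (K * M * M + M)

R, M, K = 4, 8, 3
for P in (1, 4, 8):
    # The tRNN count does not depend on the tensor size P.
    print(P, trnn_params(R, M, K), vanilla_rnn_params(R, P * M))
```

With these sizes the tRNN count stays at 240 for every P, while the vanilla RNN grows from 104 to 1184 to 4416 parameters as P goes from 1 to 4 to 8.<br />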
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to the standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensor update in Eqs. (7)-(8) with:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
Here <math>\sigma(·)</math> is the element-wise sigmoid function. These equations are very similar to the tRNN case; the main difference lies in the memory cells of the tLSTM, <math>C_t</math>:<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>G_t</math>:''' Activation of new content<br />
<br />
2. '''<math>I_t</math>:''' Input gate<br />
<br />
3. '''<math>F_t</math>:''' Forget gate<br />
<br />
4. '''<math>O_t</math>:''' Output gate<br />
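<br />
Given the four activation tensors produced by the convolution in Eq. (10), the gating in Eqs. (11)-(13) is purely element-wise. The following sketch applies it to P×M nested lists. The function name <code>tlstm_cell_update</code> and the toy values are assumptions for illustration.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tlstm_cell_update(A_g, A_i, A_f, A_o, C_prev):
    """Apply Eqs. (11)-(13) element-wise; all arguments are P x M nested lists."""
    C_new, H_new = [], []
    for ag, ai, af, ao, c in zip(A_g, A_i, A_f, A_o, C_prev):
        c_row, h_row = [], []
        for g, i, f, o, c_old in zip(ag, ai, af, ao, c):
            # Eq. (12): new content scaled by the input gate plus old memory scaled by the forget gate.
            c_t = math.tanh(g) * sigmoid(i) + c_old * sigmoid(f)
            c_row.append(c_t)
            # Eq. (13): squashed memory exposed through the output gate.
            h_row.append(math.tanh(c_t) * sigmoid(o))
        C_new.append(c_row)
        H_new.append(h_row)
    return C_new, H_new

# Closed input gate and open forget gate: the memory cell is carried over almost unchanged.
C_new, H_new = tlstm_cell_update(
    A_g=[[1.0]], A_i=[[-20.0]], A_f=[[20.0]], A_o=[[0.0]], C_prev=[[0.7]]
)
```

The example shows the mechanism by which the gates let a memory cell preserve information across steps.<br />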
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To capture long-range dependencies from multiple directions, the tLSTM is further extended with a novel '''Memory Cell Convolution''', by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t), \varsigma(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = \text{reshape}(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>\{W^h, b^h\}</math> has additional <math>\langle K \rangle</math> output channels to generate the activation <math>A^q_t ∈ R^{P×\langle K \rangle}</math> for the dynamic kernel bank <math>Q_t∈R^{P × \langle K \rangle}</math>, <math>q_{t,p}∈R^{\langle K \rangle}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Note that the paper also employs a softmax function <math>\varsigma(·)</math> to normalize the channel dimension of <math>Q_t</math>, which also stabilizes the value of the memory cells and helps to prevent vanishing/exploding gradients. The process is illustrated below:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
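<br />
A minimal sketch of the memory cell convolution in Eqs. (15)-(17) for the 2D case is given below. Each location p gets its own kernel, obtained by a softmax over the K activations at that location, and the kernel mixes the memory cells of neighbouring locations channel by channel. The centred alignment, the zero padding at the borders and the function names are assumptions of this illustration.<br />

```python
import math

def softmax(v):
    m = max(v)                                   # subtract the maximum for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def memory_cell_conv(C_prev, A_q):
    """C_prev is P x M (memory cells), A_q is P x K (kernel activations)."""
    P, M, K = len(C_prev), len(C_prev[0]), len(A_q[0])
    C_conv = []
    for p in range(P):
        w = softmax(A_q[p])                      # Eqs. (15)-(16): dynamic kernel for location p
        row = [0.0] * M
        for k in range(K):
            j = p - (K - 1) // 2 + k             # neighbouring location (centred, assumed)
            if 0 <= j < P:                       # zero padding at the borders (assumed)
                for m in range(M):
                    row[m] += w[k] * C_prev[j][m]    # Eq. (17): one input/output channel
        C_conv.append(row)
    return C_conv

C_prev = [[1.0], [2.0], [3.0]]
uniform = memory_cell_conv(C_prev, [[0.0, 0.0, 0.0]] * 3)       # equal weights on neighbours
peaked = memory_cell_conv(C_prev, [[-50.0, 50.0, -50.0]] * 3)   # almost all weight on the centre
```

Because the softmax weights are non-negative and sum to one, each convolved value is a weighted average of neighbouring cells and cannot exceed the largest magnitude in <math>C_{t-1}</math>, which is consistent with the stabilizing effect described above.<br />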
<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of the models in the tLSTM family (useful for reading the results below):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Setup:''' For each configuration, the parameter number is fixed and the tensor size is increased to see whether the performance of tLSTM can be boosted without increasing the parameter number. The experiment also investigates how the runtime is affected by the depth, where the runtime is measured as the average GPU milliseconds spent by a forward and backward pass over one time-step of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters drawn from a set of 205 distinct characters, including letters, XML markup and special symbols. The dataset is modelled at the character level, with the task of predicting the next character of the input sequence.<br />
<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: Wikipedia performance]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. There are two tasks on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
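<br />
The two input formats can be sketched as follows. The code uses a synthetic 28×28 grid in place of a real digit, and the helper names and the seed value are arbitrary choices for this illustration. The essential point is that the permutation is drawn once and then reused for every image.<br />

```python
import random

def to_sequence(image):
    """Flatten a 2D image (list of rows) into a scan-line pixel sequence."""
    return [pixel for row in image for pixel in row]

def make_permutation(n, seed=0):
    """A fixed random order; the same seed gives the same order for every image."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

def permute(sequence, order):
    return [sequence[i] for i in order]

# Stand-in for a 28 x 28 digit: each pixel stores its own scan-line index.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
seq = to_sequence(image)                 # sequential MNIST input: 784 time-steps
order = make_permutation(len(seq))
pseq = permute(seq, order)               # pMNIST input: same pixels, fixed shuffled order
```

In the permuted version, pixels that were adjacent in the image are generally far apart in the sequence, which is why the task requires longer-range dependencies.<br />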
<br />
[[File:33_mnist.PNG |800px|center|Test]]<br />
<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. The model was then validated on a variety of tasks, showing its potential over other popular approaches.<br />
<br />
= Critique(to be edited) =<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. "Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning" (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=35858stat946w18/Tensorized LSTMs2018-03-28T15:24:11Z<p>H5tahir: /* Accuracy Analysis */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, widening usually introduces additional parameters, while adding layers increases the time required for model training and evaluation. As an alternative, the paper "Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning" proposes an LSTM-based model called the Tensorized LSTM, in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
The paper also presents experiments on five challenging sequence learning tasks that show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step <math>t ∈ \{1, ..., T\}</math> given an observed input sequence <math>x_{1:t} = \{x_1, x_2, ···, x_t\}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. An RNN learns to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history <math>x_{1:t}</math>, that is, all inputs from the initial time-step up to time-step t (illustration given below). The input and the previous hidden state are first concatenated:<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state <math>h_t</math> is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
Here <math>W^h∈R^{(R+M)×M}</math> is the weight matrix (its shape guarantees that each new hidden state again has dimension M), <math>b^h ∈R^M</math> is the bias, <math>a_t ∈R^M</math> is the hidden activation, and <math>\Phi(·)</math> is the element-wise "tanh" function. Finally, the output <math>y_t</math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that <math>\Phi</math> is the element-wise non-linearity that produces the next '''hidden state''', while <math>\varphi</math> is applied to generate the '''output'''.<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
However, one shortfall of RNNs is the vanishing/exploding gradient problem, which becomes more significant when modelling long-range dependencies. One alternative is the LSTM (Long Short-Term Memory), which alleviates these problems by employing memory cells to preserve information for longer and by adopting gating mechanisms to modulate the information flow. Since LSTMs are successful in sequence modelling, it is natural to consider how to increase the complexity of the model to accommodate more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
The capacity of a network can be viewed as consisting of two components: the '''width''' (the amount of information handled in parallel) and the '''depth''' (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To '''deepen''' the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduces a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, the paper makes the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, which enables more flexible parameter sharing so that the network can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' Extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help prevent vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
The methodology is presented in three parts.<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cut down the parameter number for RNNs since, compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>H_t∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. One can also compare <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see the figure below). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L = 3 is shown in the figure below.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t-1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> a location in the tensor. The channel vector <math>h^{cat}_{t-1, p}∈R^M</math> at location p of <math>H^{cat}_{t-1}</math> (the p-th row of <math>H^{cat}_{t-1}</math>) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} \text{if } p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} \text{if } p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi(A_t) \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L-1, P} W^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. The depth L has to be chosen so that the receptive field of <math>y_t</math> only covers the current and previous inputs <math>x_{1:t}</math> (check the skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b^x</math>''' connect the input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b^h</math>''' form the convolution kernel applied across layers<br />
<br />
'''3. <math> W^y</math> and <math>b^y</math>''' produce the output at each time-step<br />
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to the standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensor update in Eqs. (7)-(8) with:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
Here <math>\sigma(·)</math> is the element-wise sigmoid function. These equations are very similar to the tRNN case; the main difference lies in the memory cells of the tLSTM, <math>C_t</math>:<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>G_t</math>:''' Activation of new content<br />
<br />
2. '''<math>I_t</math>:''' Input gate<br />
<br />
3. '''<math>F_t</math>:''' Forget gate<br />
<br />
4. '''<math>O_t</math>:''' Output gate<br />
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve the tLSTM and capture long-range dependencies from multiple directions, the paper introduces a novel '''Memory Cell Convolution''', by which the memory cells can have a larger receptive field (figure provided below).<br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t), \varsigma(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>\{W^h, b^h\}</math> has additional ⟨K⟩ output channels to generate the activation <math>A^q_t ∈ R^{P×\langle K \rangle}</math> for the dynamic kernel bank <math>Q_t∈R^{P × \langle K \rangle}</math>, <math>q_{t,p}∈R^{\langle K \rangle}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Note that the paper also employs a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>, which can also stabilize the value of the memory cells and help to prevent vanishing/exploding gradients. The figure below illustrates the process:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of the models in the tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundamentals:''' For each configuration, the parameter number is fixed and the tensor size is increased to see whether the performance of the tLSTM can be boosted without increasing the parameter number. The paper also investigates how the runtime is affected by the depth, where the runtime is measured as the average GPU milliseconds spent on a forward and backward pass over one time-step of a single example.<br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters drawn from a set of 205 distinct characters, including letters, XML markup and special symbols. We model the dataset at the character level and try to predict the next character of the input sequence.<br />
<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: WifiPerf]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We have two tasks on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
<br />
[[File:33_mnist.PNG |800px|center||Figure 5: MNIST| Test]]<br />
<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. The model was then validated on a variety of tasks, showing its potential over other popular approaches.<br />
<br />
= Critique(to be edited) =<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=35857stat946w18/Tensorized LSTMs2018-03-28T15:23:53Z<p>H5tahir: /* Accuracy Analysis */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, widening an LSTM usually introduces additional parameters, while adding layers increases the time required for model training and evaluation. As an alternative, the paper <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> has proposed a model based on LSTM called the Tensorized LSTM, in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''.<br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
Also, the paper presents experiments on five challenging sequence learning tasks that show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = \{x_1, x_2, \cdots, x_t\}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. An RNN learns how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history <math>x_{1:t}</math> (all inputs from the initial time-step up to time-step t; illustration given below).<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state <math>h_t</math> is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
Here <math>W^h∈R^{(R+M)×M}</math> guarantees that each updated hidden state has dimension M, <math>b^h ∈R^M</math> is the bias, <math> a_t ∈R^M </math> is the hidden activation, and <math>\Phi(·)</math> is the element-wise "tanh" function. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that <math>\Phi</math> ("Phi") is the element-wise non-linearity that generates the next '''hidden state''', while <math>\varphi</math> ("Curly Phi") is applied to generate the '''output'''.<br />
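<br />
The recurrence in Eqs. (1)-(4) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from the paper: the function name <code>rnn_step</code>, the sizes and the random weights are assumptions, the output function is taken to be the identity, and the output is computed from <math>h_t</math> so that the shapes agree with <math>W^y∈R^{M×S}</math>.<br />

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, b_h, W_y, b_y):
    """One step of a vanilla RNN (illustrative sketch of Eqs. (1)-(4))."""
    h_cat = np.concatenate([x_t, h_prev])  # Eq. (1): shape (R + M,)
    a_t = h_cat @ W_h + b_h                # Eq. (2): shape (M,)
    h_t = np.tanh(a_t)                     # Eq. (3): element-wise tanh
    y_t = h_t @ W_y + b_y                  # Eq. (4), identity output function (assumption)
    return h_t, y_t

# Hypothetical sizes: input R, hidden M, output S.
R, M, S = 3, 5, 2
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(R + M, M))
b_h = np.zeros(M)
W_y = rng.normal(scale=0.1, size=(M, S))
b_y = np.zeros(S)

h = np.zeros(M)                            # initial hidden state
for x in rng.normal(size=(4, R)):          # a toy sequence of 4 time-steps
    h, y = rnn_step(x, h, W_h, b_h, W_y, b_y)
```

The same hidden state <code>h</code> is threaded through every step, which is how the network summarizes the whole input history.<br />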
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
However, one shortfall of the RNN is vanishing/exploding gradients, which is especially significant when modelling long-range dependencies. One alternative is to apply LSTMs (Long Short-Term Memory networks), which alleviate these problems by employing memory cells to preserve information for longer and adopting gating mechanisms to modulate the information flow. Since the LSTM is successful in sequence modelling, it is natural to consider how to increase the complexity of the model to accommodate more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
The capacity of a network can be considered to consist of two components: the '''width''' (the amount of information handled in parallel) and the '''depth''' (the number of computation steps).<br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, which enables more flexible parameter sharing so that the network can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
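<br />
The quadratic growth of the parameter number with the layer width, and the linear growth with the number of stacked layers, can be checked with a short count. The sketch assumes the standard LSTM parameterization (four gates, each with a weight matrix of shape (R+M)×M and a bias of size M); the function names are illustrative and not taken from the paper.<br />

```python
def lstm_param_count(R, M):
    """Parameters of one standard LSTM layer with input size R and M hidden units."""
    return 4 * ((R + M) * M + M)

def stacked_lstm_param_count(R, M, num_layers):
    """Stacked LSTM: the first layer reads the input, later layers read M-dim states."""
    return lstm_param_count(R, M) + (num_layers - 1) * lstm_param_count(M, M)

# Doubling the width M roughly quadruples the parameter number.
for M in (100, 200, 400):
    print(M, lstm_param_count(10, M), stacked_lstm_param_count(10, M, 3))
```

With an input size of 10, going from 100 to 200 units raises the single-layer count from 44400 to 168800, which is the cost that parameter sharing in the tLSTM is meant to avoid.<br />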
<br />
= Method =<br />
<br />
Go through the methodology.<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' In an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements; this is known as tensor factorization.<br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cut down the parameter number for RNNs, since compared with factorization, it has the following advantages:<br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors offer better:<br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>H_t∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. One can also compare <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see figure below).<br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L = 3 is shown in the figure below.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} \text{if } p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} \text{if } p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi(A_t) \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L-1, P} W^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. The delay <math>L−1</math> must be chosen so that the receptive field of <math>y_t</math> covers only the current and previous inputs <math>x_{1:t}</math> (see the skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b^x</math>''' connect the input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b^h</math>''' convolve across layers<br />
<br />
'''3. <math> W^y</math> and <math>b^y</math>''' produce the output at each time-step<br />
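<br />
A minimal NumPy sketch of one tRNN update, Eqs. (5)-(8), may help make the cross-layer convolution concrete. The zero-padding scheme (chosen so that the P+1 rows of the concatenated state map to P output rows, with the kernel aligned on the same layer) is an assumption of this sketch, as are the function names; the paper's exact boundary handling is not reproduced here.<br />

```python
import numpy as np

def cross_layer_conv(H_cat, W_h, b_h):
    """Convolve along the layer axis. H_cat: (P+1, M_in), W_h: (K, M_in, M_out)."""
    K = W_h.shape[0]
    P = H_cat.shape[0] - 1
    assert K >= 2, "this sketch assumes a kernel size of at least 2"
    # Zero padding (assumption) so that exactly P output locations are produced.
    pad_top, pad_bottom = K // 2 - 1, K - 1 - K // 2
    padded = np.pad(H_cat, ((pad_top, pad_bottom), (0, 0)))
    # Output row p mixes K neighbouring rows with the same shared kernel.
    A = np.stack([np.einsum("km,kmo->o", padded[p:p + K], W_h) for p in range(P)])
    return A + b_h

def trnn_step(x_t, H_prev, W_x, b_x, W_h, b_h):
    top = x_t @ W_x + b_x                              # Eq. (5): projected input
    H_cat = np.vstack([top[None, :], H_prev])          # Eqs. (5)-(6): shape (P+1, M)
    return np.tanh(cross_layer_conv(H_cat, W_h, b_h))  # Eqs. (7)-(8)

# Hypothetical sizes: input R, channels M, tensor size P, kernel size K.
R, M, P, K = 4, 6, 3, 3
rng = np.random.default_rng(0)
W_x, b_x = rng.normal(scale=0.1, size=(R, M)), np.zeros(M)
W_h, b_h = rng.normal(scale=0.1, size=(K, M, M)), np.zeros(M)

H = np.zeros((P, M))
for x in rng.normal(size=(5, R)):
    H = trnn_step(x, H, W_x, b_x, W_h, b_h)
```

Note that the kernel holds K·M·M weights regardless of P, so increasing the tensor size widens the state without adding parameters. With K = 2 the window covers only the layer above and the same layer, which corresponds to the skewed sRNN without feedback.<br />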
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to the standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
These are similar to the tRNN case; the main difference lies in the memory cells of the tLSTM (<math>C_t</math>):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>G_t</math>:''' Activation of new content<br />
<br />
2. '''<math>I_t</math>:''' Input gate<br />
<br />
3. '''<math>F_t</math>:''' Forget gate<br />
<br />
4. '''<math>O_t</math>:''' Output gate<br />
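<br />
Given the stacked activations of Eq. (10), the gating and cell update of Eqs. (11)-(13) reduce to a few element-wise operations. The sketch below assumes the four activations are concatenated along the channel axis into an array of shape (P, 4M); that layout and the function names are choices made for illustration only.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tlstm_update(A, C_prev):
    """A: (P, 4*M) activations from Eq. (10); C_prev: (P, M) previous memory cells."""
    A_g, A_i, A_f, A_o = np.split(A, 4, axis=1)
    # Eq. (11): new content and the input, forget and output gates.
    G, I, F, O = np.tanh(A_g), sigmoid(A_i), sigmoid(A_f), sigmoid(A_o)
    C = G * I + C_prev * F    # Eq. (12): element-wise memory cell update
    H = np.tanh(C) * O        # Eq. (13): gated hidden state
    return H, C

# Hypothetical sizes: tensor size P and channel size M.
P, M = 3, 4
rng = np.random.default_rng(0)
H, C = tlstm_update(rng.normal(size=(P, 4 * M)), rng.normal(size=(P, M)))
```

Every location p of the tensor has its own gates, so different layers can retain or overwrite their memory independently.<br />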
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve the tLSTM and capture long-range dependencies from multiple directions, the paper introduces a novel '''Memory Cell Convolution''', by which the memory cells can have a larger receptive field (figure provided below).<br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t), \varsigma(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>\{W^h, b^h\}</math> has additional ⟨K⟩ output channels to generate the activation <math>A^q_t ∈ R^{P×\langle K \rangle}</math> for the dynamic kernel bank <math>Q_t∈R^{P × \langle K \rangle}</math>, <math>q_{t,p}∈R^{\langle K \rangle}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Note that the paper also employs a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>, which can also stabilize the value of the memory cells and help to prevent vanishing/exploding gradients. The figure below illustrates the process:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
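<br />
The memory cell convolution of Eqs. (15)-(17) can be sketched as follows for the 2D case. The sketch assumes zero padding along the layer axis and applies the kernel generated at location p to every channel at that location; the function names and padding choice are illustrative rather than taken from the paper.<br />

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def memory_cell_conv(C_prev, A_q):
    """C_prev: (P, M) memory cells; A_q: (P, K) activations for the kernel bank."""
    P, M = C_prev.shape
    K = A_q.shape[1]
    Q = softmax(A_q, axis=1)                  # Eq. (15): one normalized kernel per location
    pad_top = (K - 1) // 2                    # zero padding along the layer axis (assumption)
    C_pad = np.pad(C_prev, ((pad_top, K - 1 - pad_top), (0, 0)))
    windows = np.stack([C_pad[p:p + K] for p in range(P)])   # (P, K, M)
    # Eqs. (16)-(17): location p mixes its K neighbouring cells with its own kernel.
    return np.einsum("pk,pkm->pm", Q, windows)

# Hypothetical sizes: tensor size P, channel size M, kernel size K.
P, M, K = 4, 3, 3
rng = np.random.default_rng(0)
C_prev = rng.normal(size=(P, M))
C_conv = memory_cell_conv(C_prev, rng.normal(size=(P, K)))
```

Because each kernel is a softmax output, every convolved cell is a convex combination of neighbouring cells (and zeros from the padding), so its magnitude cannot exceed the largest magnitude in <math>C_{t-1}</math>. This is one way to see the stabilizing effect mentioned above.<br />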
<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of the models in the tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundamentals:''' For each configuration, the parameter number is fixed and the tensor size is increased to see whether the performance of the tLSTM can be boosted without increasing the parameter number. The paper also investigates how the runtime is affected by the depth, where the runtime is measured as the average GPU milliseconds spent on a forward and backward pass over one time-step of a single example.<br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters drawn from a set of 205 distinct characters, including letters, XML markup and special symbols. We model the dataset at the character level and try to predict the next character of the input sequence.<br />
<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: WifiPerf]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We have two tasks on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
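<br />
The two input formats can be sketched with NumPy. Random arrays stand in for the actual MNIST images here, and the variable names are illustrative; the only point is how a 28×28 image becomes a 784-step sequence and how one fixed permutation is reused for every image.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((8, 28, 28))             # stand-in for a batch of MNIST digits

# Sequential MNIST: read pixels row by row, one pixel per time-step.
seq = images.reshape(len(images), 784, 1)    # (batch, time-steps, input size)

# Permuted MNIST: draw one random order once and apply it to all images.
perm = rng.permutation(784)
pseq = seq[:, perm, :]
```

Since the permutation destroys the local structure of the scan-line order, neighbouring pixels can end up hundreds of time-steps apart, which is why pMNIST requires longer-range dependencies.<br />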
<br />
[[File:33_mnist.PNG |800px|center||Figure 5: MNIST]]<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. The model was then validated on a variety of tasks, showing its potential over other popular approaches.<br />
<br />
= Critique(to be edited) =<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=35856stat946w18/Tensorized LSTMs2018-03-28T15:23:42Z<p>H5tahir: /* Accuracy Analysis */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers (illustrations will be provided later). <br />
<br />
<br />
However, widening an LSTM usually introduces additional parameters, while adding layers increases the time required for model training and evaluation. As an alternative, the paper <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> has proposed a model based on LSTM called the Tensorized LSTM, in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''.<br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
Also, the paper presents experiments on five challenging sequence learning tasks that show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = \{x_1, x_2, \cdots, x_t\}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. An RNN learns how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history <math>x_{1:t}</math> (all inputs from the initial time-step up to time-step t; illustration given below).<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state <math>h_t</math> is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
Here <math>W^h∈R^{(R+M)×M}</math> guarantees that each updated hidden state has dimension M, <math>b^h ∈R^M</math> is the bias, <math> a_t ∈R^M </math> is the hidden activation, and <math>\Phi(·)</math> is the element-wise "tanh" function. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that <math>\Phi</math> ("Phi") is the element-wise non-linearity that generates the next '''hidden state''', while <math>\varphi</math> ("Curly Phi") is applied to generate the '''output'''.<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
However, one shortfall of the RNN is vanishing/exploding gradients, which is especially significant when modelling long-range dependencies. One alternative is to apply LSTMs (Long Short-Term Memory networks), which alleviate these problems by employing memory cells to preserve information for longer and adopting gating mechanisms to modulate the information flow. Since the LSTM is successful in sequence modelling, it is natural to consider how to increase the complexity of the model to accommodate more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
The capacity of a network can be considered to consist of two components: the '''width''' (the amount of information handled in parallel) and the '''depth''' (the number of computation steps).<br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, which enables more flexible parameter sharing so that the network can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
Go through the methodology.<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' In an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements; this is known as tensor factorization.<br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cut down the parameter number for RNNs since, compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors since, compared with vectors, tensors offer better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>H_t∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. One can also compare <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see the figure below). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
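<br />
The scalability argument above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a kernel of size K shared along the P dimension (as described in Part 2) and compares it with a plain RNN whose flattened hidden state has P·M units. The function names are hypothetical and not from the paper.<br />

```python
def dense_transition_params(hidden_size: int) -> int:
    """Weights plus biases of a fully-connected hidden-to-hidden transition."""
    return hidden_size * hidden_size + hidden_size


def shared_kernel_params(K: int, M: int) -> int:
    """Weights plus biases of a size-K kernel with M input and M output channels."""
    return K * M * M + M


M, K = 100, 3  # illustrative channel size and kernel size
for P in (1, 4, 16):
    # The dense count grows quadratically with P*M; the shared count ignores P.
    print(P, dense_transition_params(P * M), shared_kernel_params(K, M))
```

Under these assumptions, increasing the tensor size P widens the network while the shared parameter count stays fixed.<br />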
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L = 3 is shown in the figure below.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> a location in the tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} \text{if } p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} \text{if } p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi{(A_t)} \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L-1, P} W^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. The depth and kernel size must be chosen so that the receptive field of <math>y_t</math> covers only the current and previous inputs <math>x_{1:t}</math> (see the skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b^x</math>''' connect the input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b^h</math>''' convolve across layers<br />
<br />
'''3. <math> W^y</math> and <math>b^y</math>''' produce the output at each stage<br />
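<br />
The update in equations (5)-(8) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes <math>\Phi</math> is tanh, a kernel size K ≥ 2, and zero padding of K−2 rows at the bottom of the concatenated state so that the convolution yields exactly P locations. The function name <code>trnn_step</code> is hypothetical.<br />

```python
import numpy as np


def trnn_step(H_prev, x_t, Wx, bx, Wh, bh):
    """One tRNN update: H_prev is (P, M), x_t is (R,), Wh is (K, M, M_out)."""
    P, M = H_prev.shape
    K = Wh.shape[0]
    # Eq. (5)-(6): project the input and place it on top of the previous state.
    top = x_t @ Wx + bx
    H_cat = np.vstack([top[None, :], H_prev])          # (P + 1, M)
    # Assumed padding: K - 2 zero rows at the bottom, giving P output locations.
    H_pad = np.vstack([H_cat, np.zeros((K - 2, M))])   # (P + K - 1, M)
    A = np.empty((P, Wh.shape[2]))
    for p in range(P):
        window = H_pad[p:p + K]                        # (K, M)
        # Eq. (7): cross-layer convolution with the shared kernel.
        A[p] = np.einsum("km,kmo->o", window, Wh) + bh
    # Eq. (8): element-wise nonlinearity (tanh assumed here).
    return np.tanh(A)
```

With this padding, a single step lets <math>x_t</math> influence only the top location of <math>H_t</math>; the information reaches the bottom location only after further time steps, which is the delay described above.<br />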
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
These are very similar to the tRNN case; the main difference is in the memory cells of the tLSTM (<math>C_t</math>):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>G_t</math>:''' Activation of new content<br />
<br />
2. '''<math>I_t</math>:''' Input gate<br />
<br />
3. '''<math>F_t</math>:''' Forget gate<br />
<br />
4. '''<math>O_t</math>:''' Output gate<br />
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
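<br />
Equations (11)-(13) can be sketched as below, given the four activation tensors from equation (10). This is an illustrative sketch that assumes <math>\Phi</math> is tanh and <math>σ</math> is the logistic sigmoid; the function names are hypothetical.<br />

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def tlstm_cell(A_g, A_i, A_f, A_o, C_prev):
    """Gate and memory-cell update; every argument has shape (P, M)."""
    G = np.tanh(A_g)                                    # new content, eq. (11)
    I, F, O = sigmoid(A_i), sigmoid(A_f), sigmoid(A_o)  # gates, eq. (11)
    C = G * I + C_prev * F                              # eq. (12)
    H = np.tanh(C) * O                                  # eq. (13)
    return H, C
```

For example, when the input gate is closed and the forget gate is open, the memory cell is carried over almost unchanged.<br />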
<br />
To further improve the tLSTM and capture long-range dependencies from multiple directions, the paper introduces a novel '''Memory Cell Convolution''', by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>\{W^h, b^h\}</math> has additional <math>\langle K \rangle</math> output channels to generate the activation <math>A^q_t ∈ R^{P×\langle K \rangle}</math> for the dynamic kernel bank <math>Q_t∈R^{P × \langle K \rangle}</math>, <math>q_{t,p}∈R^{\langle K \rangle}</math> is the vectorized adaptive kernel at location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. The paper also employs a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>, which stabilizes the values of the memory cells and helps to prevent vanishing/exploding gradients. The process is illustrated below:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
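<br />
The memory cell convolution in equations (15)-(17) can be sketched for the 2D case as follows. This is a rough illustration under stated assumptions: the kernel is centred on each location with zero padding at the borders (the padding scheme is an assumption, not taken from the summary above), and <code>memory_cell_convolution</code> is a hypothetical name.<br />

```python
import numpy as np


def memory_cell_convolution(C_prev, A_q):
    """C_prev: (P, M) memory cells; A_q: (P, K) activations for the dynamic kernels."""
    P, M = C_prev.shape
    K = A_q.shape[1]
    # Eq. (15): softmax over the kernel dimension, so each kernel sums to 1.
    Q = np.exp(A_q - A_q.max(axis=1, keepdims=True))
    Q /= Q.sum(axis=1, keepdims=True)
    # Assumed zero padding so that the kernel is centred on location p.
    pad = K // 2
    C_pad = np.pad(C_prev, ((pad, K - 1 - pad), (0, 0)))
    C_conv = np.zeros((P, M))
    for p in range(P):
        # Eq. (16)-(17): a location-specific kernel applied to every channel.
        C_conv[p] = Q[p] @ C_pad[p:p + K]
    return C_conv
```

Because each softmax-normalized kernel forms a convex combination of neighbouring cells, the convolved values cannot exceed the largest magnitude in <math>C_{t-1}</math>, which is one way to read the stabilizing effect mentioned above.<br />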
<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of the models in the tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Setup:''' For each configuration, the parameter number is fixed and the tensor size is increased to see whether the performance of the tLSTM can be boosted without increasing the parameter number. The effect of depth on runtime is also investigated, where runtime is measured as the average GPU milliseconds spent on a forward and backward pass over one timestep of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters drawn from 205 distinct characters, including letters, XML markup and special symbols. We model the dataset at the character level and try to predict the next character of the input sequence.<br />
<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: Wikipedia performance]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We have two tasks on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
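<br />
The two input formats can be sketched as below. This is a minimal illustration using a random array in place of a real MNIST image; <code>to_pixel_sequence</code> is a hypothetical helper, and the random seed is arbitrary.<br />

```python
import numpy as np


def to_pixel_sequence(image, permutation=None):
    """Flatten a 28x28 image into a 784-step sequence with one pixel per step."""
    seq = np.asarray(image, dtype=np.float32).reshape(-1, 1)  # scan-line order
    if permutation is not None:
        seq = seq[permutation]  # pMNIST: the same fixed order for every image
    return seq


rng = np.random.default_rng(0)
perm = rng.permutation(28 * 28)   # drawn once and reused for the whole dataset
image = rng.random((28, 28))      # stand-in for a real digit image
seq = to_pixel_sequence(image)
pseq = to_pixel_sequence(image, perm)
```

The permutation destroys local pixel structure, so neighbouring pixels in the image can end up hundreds of time steps apart in the sequence.<br />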
<br />
[[File:33_mnist.PNG |480px|center||Figure 5: MNIST]]<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. The model was then validated<br />
on a variety of tasks, showing its potential over other popular approaches.<br />
<br />
= Critique(to be edited) =<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. "Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning" (2017)<br />
#Ali Ghodsi, "Deep Learning: STAT 946 - Winter 2018"</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:33_mnist.PNG&diff=35855File:33 mnist.PNG2018-03-28T15:23:08Z<p>H5tahir: </p>
<hr />
<div></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=35854Training And Inference with Integers in Deep Neural Networks2018-03-28T15:03:43Z<p>H5tahir: /* Bitwidth of Errors */</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manner of tasks, but it is common for these networks to be complicated, requiring large amounts of energy-intensive memory and floating-point operations. Therefore, in order to use state-of-the-art networks in applications where energy is limited or hardware packaging is constrained, such as anything not connected to the power grid, the energy costs must be reduced while preserving as much performance as practical.<br />
<br />
Most existing methods focus on reducing the energy requirements of inference rather than training, typically by compressing a trained model. Since training with SGD requires accumulation, training usually has a higher precision demand than inference. This paper proposes a framework to reduce complexity during both training and inference through the use of integers instead of floats. The authors address how to quantize all operations and operands, and examine the bitwidth requirement for SGD computation and accumulation. Using integers instead of floats saves energy because integer operations are more efficient than floating-point operations (see the table below). Also, dedicated deep learning hardware that uses integer operations already exists (such as the first generation of the Google TPU), so understanding the best way to use integers is well-motivated.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V [1]<br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works to train DNNs on binary weights and activations [2] add noise to weights and activations as a form of regularization. The use of high-precision accumulation is required for SGD optimization since real-valued gradients are obtained from real-valued variables. Ternary weight networks (TWN) [3] and Trained ternary quantization (TTQ) [9] offer more expressive ability than binary weight networks by constraining the weights to be ternary-valued {-1,0,1} using two symmetric thresholds.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
DoReFa-Net quantizes gradients to low-bitwidth floating-point numbers with discrete states in the backward pass. To reduce the overhead of gradient synchronization in distributed training, the TernGrad method quantizes the gradient updates to ternary values. In both works the weights are still stored and updated with float32, and the quantization of batch normalization and its derivative is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precision in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operator to reduce bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating-point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math><br />
<br />
Note that this function is only used when simulating integer operations on floating-point hardware; on native integer hardware, this is done automatically.<br />
<br />
In addition to this quantization function, a distribution scaling factor is used in some quantization operators to preserve as much variance as possible when quantizing. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
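<br />
The linear mapping and the shift function above can be simulated in floating point as follows. This is a sketch for illustration only; the function names are hypothetical, and NumPy's <code>round</code> (which rounds halves to the nearest even integer) is assumed to be an acceptable stand-in for the rounding in the formula.<br />

```python
import numpy as np


def sigma(k):
    """Minimum step size for a k-bit signed integer representation."""
    return 2.0 ** (1 - k)


def quantize(x, k):
    """Q(x, k): snap to a grid of spacing sigma(k), then clip to [-1 + sigma, 1 - sigma]."""
    s = sigma(k)
    return np.clip(s * np.round(np.asarray(x, dtype=float) / s), -1 + s, 1 - s)


def shift(x):
    """Shift(x): the power of two nearest to x on a log scale (x must be positive)."""
    return 2.0 ** np.round(np.log2(x))


print(quantize([0.3, -2.0, 0.9], k=2))  # k = 2 leaves only the values -0.5, 0, 0.5
print(shift(3.0), shift(0.3))
```

With k = 2 the grid has three levels, which corresponds to the ternary weights used for inference in the experiments.<br />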
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating-point math, and to remove the extra memory requirement of batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method based on MSRA [4].<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, and <math>U</math> denotes the uniform distribution. The original initialization method is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size defined above. This prevents weights from being initialised to all zeros when the bitwidth is low or the fan-in number is high.<br />
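<br />
A small sketch of this initialization is given below. The value <code>beta=1.5</code> is only an illustrative choice satisfying <math>\beta > 1</math>, and <code>wage_init</code> is a hypothetical name.<br />

```python
import numpy as np


def wage_init(shape, n_in, k_w, beta=1.5, rng=None):
    """Sample W ~ U(-L, L) with L = max(sqrt(6 / n_in), beta * sigma(k_w))."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = 2.0 ** (1 - k_w)                        # minimum step size
    limit = max(np.sqrt(6.0 / n_in), beta * sigma)  # enforce the minimum width
    return rng.uniform(-limit, limit, size=shape), limit


W, L = wage_init((1000,), n_in=10000, k_w=2, rng=np.random.default_rng(0))
print(L, np.sqrt(6.0 / 10000))
```

In this example the unmodified limit is about 0.024, far below half the step size of 0.5 for 2-bit weights, so every weight would round to zero; the enforced minimum limit of 0.75 avoids that.<br />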
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The authors note that the magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net [5]) addresses this by using an affine transform to map the error to the range <math>[-1, 1]</math>, applying quantization, and then applying the inverse transform. However, the authors claim that this approach still requires float32, and that the magnitude of the error is unimportant; what matters is its orientation. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantize:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Here <math>max\{|e|\}</math> is the largest absolute value among all elements of the error in a layer. Note that this discards any error elements smaller than the minimum step size.<br />
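<br />
The error quantization can be sketched as follows, simulated in floating point. The helper functions are repeated so that the block is self-contained; all function names are hypothetical and the example values are made up for illustration.<br />

```python
import numpy as np


def sigma(k):
    return 2.0 ** (1 - k)


def quantize(x, k):
    s = sigma(k)
    return np.clip(s * np.round(np.asarray(x, dtype=float) / s), -1 + s, 1 - s)


def shift(x):
    return 2.0 ** np.round(np.log2(x))


def quantize_error(e, k_e):
    """Q_E(e): rescale by a power of two near max|e|, then quantize to k_e bits."""
    e = np.asarray(e, dtype=float)
    return quantize(e / shift(np.max(np.abs(e))), k_e)


e = np.array([0.001, -0.004, 0.0005])  # small raw errors with differing magnitudes
print(quantize_error(e, k_e=8))
```

Only the relative sizes and signs of the error elements survive; the overall magnitude is divided out by the power-of-two scale.<br />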
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
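<br />
The gradient quantization and weight update can be sketched as below. This is a floating-point simulation under stated assumptions: the gradient has at least one non-zero element, and the Bernoulli draw is implemented by comparing a uniform sample with the fractional part. All function names are hypothetical.<br />

```python
import numpy as np


def sigma(k):
    return 2.0 ** (1 - k)


def shift(x):
    return 2.0 ** np.round(np.log2(x))


def quantize_gradient(g, k_g, eta, rng):
    """Q_G(g): rescale, then stochastically round to whole multiples of sigma(k_g)."""
    g = np.asarray(g, dtype=float)
    g_s = eta * g / shift(np.max(np.abs(g)))   # shifted gradient, in units of steps
    magnitude = np.abs(g_s)
    lower = np.floor(magnitude)
    # Round up with probability equal to the fractional part (Bernoulli draw).
    round_up = rng.random(g_s.shape) < (magnitude - lower)
    return sigma(k_g) * np.sign(g_s) * (lower + round_up)


def update_weights(W, delta_W, k_g):
    """Apply the discrete increments and clip to the representable range."""
    s = sigma(k_g)
    return np.clip(W - delta_W, -1 + s, 1 - s)
```

For a gradient of [0.3, -0.1] with <math>\eta = 2</math>, the shifted gradient is [2.4, -0.8], so the first weight moves by 2 or 3 steps and the second by 0 or 1 step in the opposite direction; on average the update equals the unrounded value.<br />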
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for weights, activations, gradients, and errors. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. The authors argue that the bitwidth for activations and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant [6]<br />
<br />
SVHN & CIFAR10: VGG variant [7]<br />
<br />
ImageNet: AlexNet variant [8]<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The 2-8-8-8 configuration converges comparably to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below and the error density for a single layer is compared with the Vanilla network.<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
[[File:32_error.png|center|thumb|520x522px|Histogram of errors for the vanilla network and the WAGE network. After being quantized and shifted in each layer, the error is reshaped and so most orientation information is retained. ]]<br />
<br />
=== Bitwidth of Gradients ===<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math> and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
Here, C denotes 12 bits (hexadecimal) and BN refers to batch normalization being added.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there is no multiplication during inference. However, this does not remove the need for multiplication during training. A 2-2-8-8 configuration would remove it, but it is difficult to train and detrimental to accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work.<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. Future work may further improve compression and memory requirements.<br />
== References ==<br />
<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.<br />
# Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=35853Training And Inference with Integers in Deep Neural Networks2018-03-28T15:03:05Z<p>H5tahir: /* Bitwidth of Errors */</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manner of tasks, but it is common for these networks to be complicated, requiring large amounts of energy-intensive memory and floating-point operations. Therefore, in order to use state-of-the-art networks in applications where energy is limited or hardware packaging is constrained, such as anything not connected to the power grid, the energy costs must be reduced while preserving as much performance as practical.<br />
<br />
Most existing methods focus on reducing the energy requirements during inference rather than training. Since training with SGD requires accumulation, training usually has higher precision demand than inference. Most of the existing methods focus on how to compress a model for inference, rather than during training. This paper proposes a framework to reduce complexity both during training and inference through the use of integers instead of floats. They address how to quantize all operations and operands as well as examining the bitwidth requirement for SGD computation & accumulation. Using integers instead of floats results in energy-savings because integer operations are more efficient than floating point (see the table below). Also, there already exists dedicated hardware for deep learning that uses integer operations (such as the 1st generation of Google TPU) so understanding the best way to use integers is well-motivated.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V [1]<br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing works to train DNNs on binary weights and activations [2] add noise to weights and activations as a form of regularization. The use of high-precision accumulation is required for SGD optimization since real-valued gradients are obtained from real-valued variables. Ternary weight networks (TWN) [3] and Trained ternary quantization (TTQ) [9] offer more expressive ability than binary weight networks by constraining the weights to be ternary-valued {-1,0,1} using two symmetric thresholds.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
The DoReFa-Net quantizes gradients to low-bandwidth floating point numbers with discrete states in the backwards pass. In order to reduce the overhead of gradient synchronization in distributed training the TernGrad method quantizes the gradient updates to ternary values. In both works the weights are still stored and updated with float32, and the quantization of batch normalization and its derivative is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precision in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operators to reduce bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math><br />
<br />
Note that this function is only using when simulating integer operations on floating-point hardware, on native integer hardware, this is done automatically. In addition to this quantization function.<br />
<br />
A distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating point math, and to remove the extra memory requirement with batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method base on MSRA [4].<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number, <math>U</math> denotes uniform distribution. The original<math>\eta</math> initialization method is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size see already. This prevents weights being initialised to all-zeros in the case where the bitwidth is low, or the fan-in number is high.<br />
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net [5]) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, apply quantization, and then applying the inverse transform. However, the authors claim that this approach still requires using float32, and that the magnitude of the error is unimportant: rather it is the orientation of the error. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantise:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Max is the element-wise maximum. Note that this discards any error elements less than the minimum step size.<br />
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate. It is an integer power of 2. The shifted gradients are represented in units of minimum step sizes <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased) stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Error bits. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. They authors argue that the bitwidth for activation and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weight bitwidth is 8. For inference the weights are ternarized.<br />
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant [6]<br />
<br />
SVHN & CIFAR10: VGG variant [7]<br />
<br />
ImageNet: AlexNet variant [8]<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The convergence of the 2-8-8-8 has comparable convergence to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against bitwidth below<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
[[File:32_error.png|center|thumb|520x522px|Histogram of errors for Vanilla network and Wage network. After being quantized and shifted each layer, the error is reshaped and so most orientation information is retained. ]]<br />
<br />
=== Bitwidth of Gradients ===<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math>and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
Here, C denotes 12 bits (Hexidecimal) and BN refers to batch normalization being added.<br />
<br />
== Discussion ==<br />
The authors have a few areas they believe this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means there aren't any multiplication during inference. However, this does not remove the requirement for multiplication during training. 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to the accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. Future work may further improve compression and memory requirements.<br />
== References ==<br />
<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.<br />
# Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Training_And_Inference_with_Integers_in_Deep_Neural_Networks&diff=35852Training And Inference with Integers in Deep Neural Networks2018-03-28T15:02:54Z<p>H5tahir: /* Bitwidth of Errors */</p>
<hr />
<div>== Introduction ==<br />
<br />
Deep neural networks have enjoyed much success in all manner of tasks, but these networks are commonly complicated, requiring large amounts of energy-intensive memory and floating-point operations. Therefore, in order to use state-of-the-art networks in applications where energy is limited or where hardware packaging is constrained, such as anything not connected to the power grid, the energy costs must be reduced while preserving as much performance as practical.<br />
<br />
Most existing methods focus on reducing the energy requirements during inference rather than training, that is, on how to compress a model for inference. Since training with SGD requires accumulation, training usually demands higher precision than inference. This paper proposes a framework to reduce complexity during both training and inference through the use of integers instead of floats. The authors address how to quantize all operations and operands, and examine the bitwidth requirements for SGD computation and accumulation. Using integers instead of floats results in energy savings because integer operations are more efficient than floating-point operations (see the table below). Also, dedicated hardware for deep learning that uses integer operations already exists (such as the 1st generation of Google TPU), so understanding the best way to use integers is well-motivated.<br />
{| class="wikitable"<br />
|+Rough Energy Costs in 45nm 0.9V [1]<br />
!<br />
! colspan="2" |Energy(pJ)<br />
! colspan="2" |Area(<math>\mu m^2</math>)<br />
|-<br />
!Operation<br />
!MUL<br />
!ADD<br />
!MUL<br />
!ADD<br />
|-<br />
|8-bit INT<br />
|0.2<br />
|0.03<br />
|282<br />
|36<br />
|-<br />
|16-bit FP<br />
|1.1<br />
|0.4<br />
|1640<br />
|1360<br />
|-<br />
|32-bit FP<br />
|3.7<br />
|0.9<br />
|7700<br />
|4184<br />
|}<br />
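<br />
To make the comparison concrete, the following sketch computes the energy of one multiply-accumulate (MAC) from the table values. Treating a MAC as exactly one MUL plus one ADD is an assumption made here for illustration, and the names <code>ENERGY_PJ</code>, <code>mac_energy</code>, and <code>saving_factor</code> are hypothetical.<br />

```python
# Energy per operation in picojoules, copied from the table above (45nm, 0.9V).
ENERGY_PJ = {
    "8-bit INT": {"MUL": 0.2, "ADD": 0.03},
    "16-bit FP": {"MUL": 1.1, "ADD": 0.4},
    "32-bit FP": {"MUL": 3.7, "ADD": 0.9},
}


def mac_energy(fmt):
    """Energy of one MAC, assumed here to be one MUL plus one ADD."""
    costs = ENERGY_PJ[fmt]
    return costs["MUL"] + costs["ADD"]


def saving_factor(baseline, candidate):
    """How many times more energy a baseline MAC uses than a candidate MAC."""
    return mac_energy(baseline) / mac_energy(candidate)


for fmt in ENERGY_PJ:
    print(f"{fmt}: {mac_energy(fmt):.2f} pJ per MAC")
print(f"32-bit FP vs 8-bit INT: {saving_factor('32-bit FP', '8-bit INT'):.1f}x")
```

Under that assumption, a 32-bit floating-point MAC costs about 20 times the energy of an 8-bit integer MAC.<br />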
The authors call the framework WAGE because they consider how best to handle the '''W'''eights, '''A'''ctivations, '''G'''radients, and '''E'''rrors separately.<br />
<br />
== Related Work ==<br />
<br />
=== Weight and Activation ===<br />
Existing work on training DNNs with binary weights and activations [2] adds noise to weights and activations as a form of regularization. High-precision accumulation is still required for SGD optimization, since real-valued gradients are obtained from real-valued variables. Ternary weight networks (TWN) [3] and trained ternary quantization (TTQ) [9] offer more expressive ability than binary weight networks by constraining the weights to be ternary-valued {-1,0,1} using two symmetric thresholds.<br />
<br />
=== Gradient Computation and Accumulation ===<br />
DoReFa-Net [5] quantizes gradients to low-bitwidth floating-point numbers with discrete states in the backward pass. To reduce the overhead of gradient synchronization in distributed training, the TernGrad method quantizes the gradient updates to ternary values. In both works the weights are still stored and updated with float32, and the quantization of batch normalization and its derivative is ignored.<br />
<br />
== WAGE Quantization ==<br />
The core idea of the proposed method is to constrain the following to low-bitwidth integers on each layer:<br />
* '''W:''' weight in inference<br />
* '''a:''' activation in inference<br />
* '''e:''' error in backpropagation<br />
* '''g:''' gradient in backpropagation<br />
[[File:p32fig1.PNG|center|thumb|800px|Four operators QW (·), QA(·), QG(·), QE(·) added in WAGE computation dataflow to reduce precision, bitwidth of signed integers are below or on the right of arrows, activations are included in MAC for concision.]]<br />
The error and gradient are defined as:<br />
<br />
<math>e^i = \frac{\partial L}{\partial a^i}, g^i = \frac{\partial L}{\partial W^i}</math><br />
<br />
where L is the loss function.<br />
<br />
The precisions in bits of the errors, activations, gradients, and weights are <math>k_E</math>, <math>k_A</math>, <math>k_G</math>, and <math>k_W</math> respectively. As shown in the above figure, each quantity also has a quantization operator to reduce the bitwidth increases caused by multiply-accumulate (MAC) operations. Also, note that since this is a layer-by-layer approach, each layer may be followed or preceded by a layer with different precision, or even a layer using floating-point math.<br />
<br />
=== Shift-Based Linear Mapping and Stochastic Mapping ===<br />
The proposed method makes use of a linear mapping where continuous, unbounded values are discretized for each bitwidth <math>k</math> with a uniform spacing of<br />
<br />
<math>\sigma(k) = 2^{1-k}, k \in Z_+ </math><br />
With this, the full quantization function is<br />
<br />
<math>Q(x,k) = Clip\left \{ \sigma(k) \cdot round\left [ \frac{x}{\sigma(k)} \right ], -1 + \sigma(k), 1 - \sigma(k) \right \}</math><br />
<br />
Note that this function is only used when simulating integer operations on floating-point hardware; on native integer hardware, this is done automatically.<br />
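<br />
As a minimal sketch of the step size and quantization function above, the following pure-Python code simulates <math>Q(x,k)</math> on floats. The helper names <code>sigma</code>, <code>clip</code>, and <code>quantize</code> are illustrative and not taken from the paper. Python's built-in <code>round</code> resolves exact ties to the nearest even integer, which may differ from the rounding rule the authors used.<br />

```python
def sigma(k):
    """Minimum step size for a k-bit value: sigma(k) = 2^(1-k)."""
    return 2.0 ** (1 - k)


def clip(x, lo, hi):
    """Saturate x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def quantize(x, k):
    """Q(x, k): snap x to a multiple of sigma(k), then clip to [-1+sigma, 1-sigma]."""
    s = sigma(k)
    # Note: Python's round() sends exact ties to the nearest even integer.
    return clip(s * round(x / s), -1.0 + s, 1.0 - s)


# Every value a 2-bit quantity can take, found by sweeping x over [-2, 2].
levels = sorted({quantize(x / 100.0, 2) for x in range(-200, 201)})
print(levels)
```

With <math>k = 2</math> the only representable values are -0.5, 0, and 0.5, which is why a 2-bit weight is ternary.<br />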
<br />
A distribution scaling factor is used in some quantization operators to preserve as much variance as possible when applying the quantization function above. The scaling factor is defined below.<br />
<br />
<math>Shift(x) = 2^{round(log_2(x))}</math><br />
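<br />
The scaling factor can be sketched in a few lines. The function name <code>shift</code> and the error raised for non-positive input are choices made for this example only. Because the result is always a power of two, dividing by it corresponds to a bit shift on integer hardware.<br />

```python
import math


def shift(x):
    """Shift(x) = 2^round(log2(x)): the power of two nearest to x on a log scale."""
    if x <= 0:
        raise ValueError("shift() is only defined for positive x")
    return 2.0 ** round(math.log2(x))


for value in (0.003, 0.3, 3.0, 100.0):
    print(value, shift(value), value / shift(value))
```

After dividing a positive value by its own shift, the result lies between <math>1/\sqrt2</math> and <math>\sqrt2</math>.<br />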
<br />
Finally, stochastic rounding is substituted for small or real-valued updates during gradient accumulation.<br />
<br />
A visual representation of these operations is below.<br />
[[File:p32fig2.PNG|center|thumb|800px|Quantization methods used in WAGE. The notation <math>P, x, \lfloor \cdot \rfloor, \lceil \cdot \rceil</math> denotes probability, vector, floor and ceil, respectively. <math>Shift(\cdot)</math> refers to distribution shifting with a certain argument]]<br />
<br />
=== Weight Initialization ===<br />
In this work, batch normalization is simplified to a constant scaling layer in order to sidestep the problem of normalizing outputs without floating-point math, and to remove the extra memory requirement of batch normalization. As such, some care must be taken when initializing weights. The authors use a modified initialization method based on MSRA [4].<br />
<br />
<math>W \thicksim U(-L, +L),L = max \left \{ \sqrt{6/n_{in}}, L_{min} \right \}, L_{min} = \beta \sigma</math><br />
<br />
<math>n_{in}</math> is the layer fan-in number and <math>U</math> denotes the uniform distribution. The original MSRA initialization method is modified by adding the condition that the distribution width should be at least <math>\beta \sigma</math>, where <math>\beta</math> is a constant greater than 1 and <math>\sigma</math> is the minimum step size defined above. This prevents weights from being initialized to all zeros in the case where the bitwidth is low or the fan-in number is high.<br />
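<br />
The modified initialization can be sketched as follows. The value <code>beta=1.5</code> is only an illustrative default (the text above requires just that it be greater than 1), and the function names and the <code>seed</code> parameter are hypothetical.<br />

```python
import math
import random


def sigma(k):
    """Minimum step size for a k-bit value."""
    return 2.0 ** (1 - k)


def init_limit(fan_in, k_w, beta=1.5):
    """L = max(sqrt(6 / n_in), beta * sigma(k_W)); beta=1.5 is an assumed value."""
    return max(math.sqrt(6.0 / fan_in), beta * sigma(k_w))


def init_weights(fan_in, fan_out, k_w, beta=1.5, seed=0):
    """Draw a fan_out x fan_in weight matrix from U(-L, +L)."""
    rng = random.Random(seed)
    limit = init_limit(fan_in, k_w, beta)
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


print(init_limit(6, 8))    # small fan-in, 8 bits: the MSRA limit dominates
print(init_limit(600, 2))  # large fan-in, 2 bits: the lower bound dominates
```

For a fan-in of 600 and 2-bit weights, the unmodified limit is about 0.1, so every weight would round to zero at a step size of 0.5. The lower bound of 0.75 lets some weights reach the nonzero levels.<br />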
<br />
=== Quantization Details ===<br />
<br />
==== Weight <math>Q_W(\cdot)</math> ====<br />
<math>W_q = Q_W(W) = Q(W, k_W)</math><br />
<br />
The quantization operator is simply the quantization function previously introduced. <br />
<br />
==== Activation <math>Q_A(\cdot)</math> ====<br />
The authors say that the variance of the weights passed through this function will be scaled compared to the variance of the weights as initialized. To prevent this effect from blowing up the network outputs, they introduce a scaling factor <math>\alpha</math>. Notice that it is constant for each layer.<br />
<br />
<math>\alpha = max \left \{ Shift(L_{min} / L), 1 \right \}</math><br />
<br />
The quantization operator is then<br />
<br />
<math>a_q = Q_A(a) = Q(a/\alpha, k_A)</math><br />
<br />
The scaling factor approximates batch normalization.<br />
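<br />
A sketch of the activation operator follows. It assumes that <math>L</math> in the formula for <math>\alpha</math> refers to the unmodified MSRA limit <math>\sqrt{6/n_{in}}</math>, since with the clamped limit the ratio <math>L_{min}/L</math> could never exceed 1 and <math>\alpha</math> would always equal 1. That reading, the value <code>beta=1.5</code>, and all function names are assumptions of this example.<br />

```python
import math


def sigma(k):
    """Minimum step size for a k-bit value."""
    return 2.0 ** (1 - k)


def shift(x):
    """Nearest power of two on a log scale."""
    return 2.0 ** round(math.log2(x))


def quantize(x, k):
    """Q(x, k) from the linear mapping section."""
    s = sigma(k)
    return max(-1.0 + s, min(1.0 - s, s * round(x / s)))


def activation_scale(fan_in, k_w, beta=1.5):
    """alpha = max(Shift(L_min / L), 1), with L taken as the plain MSRA limit (assumed)."""
    l_min = beta * sigma(k_w)
    msra_limit = math.sqrt(6.0 / fan_in)
    return max(shift(l_min / msra_limit), 1.0)


def quantize_activation(a, alpha, k_a):
    """Q_A(a) = Q(a / alpha, k_A)."""
    return quantize(a / alpha, k_a)


alpha = activation_scale(600, 2)
print(alpha, quantize_activation(4.0, alpha, 8))
```

Since <math>\alpha</math> depends only on the fan-in and the weight bitwidth, it can be computed once per layer before training.<br />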
<br />
==== Error <math>Q_E(\cdot)</math> ====<br />
The authors note that the magnitude of the error can vary greatly, and that a previous approach (DoReFa-Net [5]) solves the issue by using an affine transform to map the error to the range <math>[-1, 1]</math>, applying quantization, and then applying the inverse transform. However, the authors claim that this approach still requires float32, and that the magnitude of the error is unimportant; rather, it is the orientation of the error that matters. Thus, they only scale the error distribution to the range <math>\left [ -\sqrt2, \sqrt2 \right ]</math> and quantize:<br />
<br />
<math>e_q = Q_E(e) = Q(e/Shift(max\{|e|\}), k_E)</math><br />
<br />
Here <math>max\{|e|\}</math> is the largest absolute value among all elements of the layer's error <math>e</math>. Note that this discards any error elements less than the minimum step size.<br />
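<br />
The error operator can be sketched as below, treating a layer's error as a flat list of floats. The function names and the handling of an all-zero error are assumptions of this example.<br />

```python
import math


def sigma(k):
    """Minimum step size for a k-bit value."""
    return 2.0 ** (1 - k)


def shift(x):
    """Nearest power of two on a log scale."""
    return 2.0 ** round(math.log2(x))


def quantize(x, k):
    """Q(x, k) from the linear mapping section."""
    s = sigma(k)
    return max(-1.0 + s, min(1.0 - s, s * round(x / s)))


def quantize_error(errors, k_e):
    """Q_E(e) = Q(e / Shift(max|e|), k_E), applied to every element."""
    peak = max(abs(e) for e in errors)
    if peak == 0.0:
        # Assumed convention: an all-zero error stays all zero.
        return [0.0 for _ in errors]
    scale = shift(peak)
    return [quantize(e / scale, k_e) for e in errors]


print(quantize_error([0.003, -0.0012, 0.00001], 8))
```

In this example the smallest element is rounded to zero, while the two larger elements keep their sign and relative size.<br />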
<br />
==== Gradient <math>Q_G(\cdot)</math> ====<br />
Similar to the activations and errors, the gradients are rescaled:<br />
<br />
<math>g_s = \eta \cdot g/Shift(max\{|g|\})</math><br />
<br />
<math> \eta </math> is a shift-based learning rate, which is an integer power of 2. The shifted gradients are represented in units of the minimum step size <math> \sigma(k) </math>. When reducing the bitwidth of the gradients (remember that the gradients are coming out of a MAC operation, so the bitwidth may have increased), stochastic rounding is used as a substitute for small gradient accumulation.<br />
<br />
<math>\Delta W = Q_G(g) = \sigma(k_G) \cdot sgn(g_s) \cdot \left \{ \lfloor | g_s | \rfloor + Bernoulli(|g_s|<br />
- \lfloor | g_s | \rfloor) \right \}</math><br />
<br />
This randomly rounds the result of the MAC operation up or down to the nearest quantization level for the given gradient bitwidth. The weights are updated with the resulting discrete increments:<br />
<br />
<math>W_{t+1} = Clip \left \{ W_t - \Delta W_t, -1 + \sigma(k_G), 1 - \sigma(k_G) \right \}</math><br />
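<br />
The gradient operator and the weight update can be sketched together. The function names, the explicit random generator argument, and the handling of an all-zero gradient are assumptions of this example rather than details from the paper.<br />

```python
import math
import random


def sigma(k):
    """Minimum step size for a k-bit value."""
    return 2.0 ** (1 - k)


def shift(x):
    """Nearest power of two on a log scale."""
    return 2.0 ** round(math.log2(x))


def stochastic_round(x, rng):
    """Round |x| down, then add 1 with probability equal to the dropped fraction."""
    magnitude = abs(x)
    base = math.floor(magnitude)
    bump = 1 if rng.random() < magnitude - base else 0
    return math.copysign(base + bump, x)


def quantize_gradient(grads, lr, k_g, rng):
    """Delta W = sigma(k_G) * stochastic_round(lr * g / Shift(max|g|))."""
    peak = max(abs(g) for g in grads)
    if peak == 0.0:
        # Assumed convention: no update when the gradient is all zero.
        return [0.0 for _ in grads]
    scale = shift(peak)
    return [sigma(k_g) * stochastic_round(lr * g / scale, rng) for g in grads]


def update_weights(weights, deltas, k_g):
    """W_{t+1} = Clip(W_t - Delta W_t, -1 + sigma(k_G), 1 - sigma(k_G))."""
    s = sigma(k_g)
    return [max(-1.0 + s, min(1.0 - s, w - d)) for w, d in zip(weights, deltas)]


rng = random.Random(0)
deltas = quantize_gradient([0.5, -0.25], 2.0, 8, rng)
print(deltas, update_weights([0.0, 0.9921875], deltas, 8))
```

Stochastic rounding is unbiased: averaged over many draws, the rounded value equals the unrounded one, so small gradients still contribute to the weights over time.<br />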
<br />
=== Miscellaneous ===<br />
To train WAGE networks, the authors used pure SGD exclusively because more complicated techniques such as Momentum or RMSProp increase memory consumption and are complicated by the rescaling that happens within each quantization operator.<br />
<br />
The quantization and stochastic rounding are a form of regularization.<br />
<br />
The authors didn't use a traditional softmax with cross-entropy loss for the experiments because there does not yet exist a softmax layer for low-bit integers. Instead, they use a sum of squared error loss. This works for tasks with a small number of categories, but does not scale well.<br />
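<br />
A minimal sketch of a sum-of-squared-error loss is shown below. The one-hot target encoding, the absence of a 1/2 factor, and the function name <code>sse_loss</code> are assumptions of this example, since the summary does not give the exact form used in the paper.<br />

```python
def sse_loss(outputs, label):
    """Sum of squared errors against a one-hot target, with its gradient."""
    targets = [1.0 if i == label else 0.0 for i in range(len(outputs))]
    loss = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    # Derivative of the loss with respect to each output.
    grad = [2.0 * (o - t) for o, t in zip(outputs, targets)]
    return loss, grad


print(sse_loss([0.5, 0.0, 0.5], 0))
```

Neither the loss nor its gradient needs an exponential or a division, which is what makes it usable without floating-point math.<br />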
<br />
== Experiments ==<br />
For all experiments, the default layer bitwidth configuration is 2-8-8-8 for Weights, Activations, Gradients, and Errors. The weight bitwidth is set to 2 because that results in ternary weights, and therefore no multiplication during inference. The authors argue that the bitwidths for activations and errors should be the same because the computation graph for each is similar and might use the same hardware. During training, the weights are kept at a bitwidth of 8; for inference, they are ternarized.<br />
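<br />
The claim that ternary weights need no multiplication can be illustrated with a dot product. Here the weights are written as integers in units of the 2-bit step size, that is -1, 0, or +1, and the function name <code>ternary_dot</code> is hypothetical.<br />

```python
def ternary_dot(weights, activations):
    """Dot product with weights in {-1, 0, +1}, using only additions and subtractions."""
    total = 0
    for w, a in zip(weights, activations):
        if w == 1:
            total += a
        elif w == -1:
            total -= a
        elif w != 0:
            raise ValueError("weights must be -1, 0, or +1")
    return total


weights = [1, 0, -1, 1]
activations = [3, 5, 2, 7]
print(ternary_dot(weights, activations))
```

Each weight only selects whether an activation is added, subtracted, or skipped.<br />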
<br />
=== Implementation Details ===<br />
MNIST: Network is LeNet-5 variant [6]<br />
<br />
SVHN & CIFAR10: VGG variant [7]<br />
<br />
ImageNet: AlexNet variant [8]<br />
{| class="wikitable"<br />
|+Test or validation error rates (%) in previous works and WAGE on multiple datasets. Opt denotes gradient descent optimizer, withM means SGD with momentum, BN represents batch normalization, 32 bit refers to float32, and ImageNet top-k format: top1/top5.<br />
!Method<br />
!<math>k_W</math><br />
!<math>k_A</math><br />
!<math>k_G</math><br />
!<math>k_E</math><br />
!Opt<br />
!BN<br />
!MNIST<br />
!SVHN<br />
!CIFAR10<br />
!ImageNet<br />
|-<br />
|BC<br />
|1<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|1.29<br />
|2.30<br />
|9.90<br />
|<br />
|-<br />
|BNN<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes <br />
|0.96<br />
|2.53<br />
|10.15<br />
|<br />
|-<br />
|BWN<br />
|1<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|<br />
|<br />
|<br />
|43.2/20.6<br />
|-<br />
|XNOR<br />
|1<br />
|1<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|<br />
|55.8/30.8<br />
|-<br />
|TWN<br />
|2<br />
|32<br />
|32<br />
|32<br />
|withM<br />
|yes<br />
|0.65<br />
|<br />
|7.44<br />
|'''34.7/13.8'''<br />
|-<br />
|TTQ<br />
|2<br />
|32<br />
|32<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|6.44<br />
|42.5/20.3<br />
|-<br />
|DoReFa<br />
|8<br />
|8<br />
|32<br />
|8<br />
|Adam<br />
|yes<br />
|<br />
|2.30<br />
|<br />
|47.0/<br />
|-<br />
|TernGrad<br />
|32<br />
|32<br />
|2<br />
|32<br />
|Adam<br />
|yes<br />
|<br />
|<br />
|14.36<br />
|42.4/19.5<br />
|-<br />
|WAGE<br />
|2<br />
|8<br />
|8<br />
|8<br />
|SGD<br />
|no<br />
|'''0.40'''<br />
|'''1.92'''<br />
|'''6.78'''<br />
|51.6/27.8<br />
|}<br />
<br />
=== Training Curves and Regularization ===<br />
The authors compare the 2-8-8-8 WAGE configuration introduced above, a 2-8-f-f (meaning float32) configuration, and a completely floating point version on CIFAR10. The test error is plotted against epoch. For training these networks, the learning rate is divided by 8 at the 200th epoch and again at the 250th epoch.<br />
[[File:p32fig3.PNG|center|thumb|800px|Training curves of WAGE variations and a vanilla CNN on CIFAR10]]<br />
The 2-8-8-8 configuration converges comparably to the vanilla CNN and outperforms the 2-8-f-f variant. The authors speculate that this is because the extra discretization acts as a regularizer.<br />
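<br />
Because the learning rate is a power of two, dividing it by 8 amounts to lowering its exponent by 3. The sketch below assumes zero-based epoch numbering with the drop taking effect at the milestone epoch itself. The starting exponent is a free parameter of this example, not a value from the paper.<br />

```python
def lr_exponent(epoch, base_exponent, milestones=(200, 250), step=3):
    """log2 of the learning rate; it drops by `step` (a division by 8) per milestone."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_exponent - step * passed


def learning_rate(epoch, base_exponent):
    """Shift-based learning rate, always an integer power of 2."""
    return 2.0 ** lr_exponent(epoch, base_exponent)


for epoch in (0, 199, 200, 250):
    print(epoch, learning_rate(epoch, 3))
```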
<br />
=== Bitwidth of Errors ===<br />
The CIFAR10 test accuracy is plotted against the error bitwidth <math>k_E</math> below.<br />
[[File:p32fig4.PNG|center|thumb|520x522px|The 10 run accuracies of different <math>k_E</math>]]<br />
<br />
[[File:32_error.png|center|thumb|520x522px|Histogram of errors for the vanilla network and the WAGE network. After being quantized and shifted at each layer, the error is reshaped, and so most orientation information is retained. ]]<br />
<br />
=== Bitwidth of Gradients ===<br />
{| class="wikitable"<br />
|+Test error rates (%) on CIFAR10 with different <math>k_G</math><br />
!<math>k_G</math><br />
!2<br />
!3<br />
!4<br />
!5<br />
!6<br />
!7<br />
!8<br />
!9<br />
!10<br />
!11<br />
!12<br />
|-<br />
|error<br />
|54.22<br />
|51.57<br />
|28.22<br />
|18.01<br />
|11.48<br />
|7.61<br />
|6.78<br />
|6.63<br />
|6.43<br />
|6.55<br />
|6.57<br />
|}<br />
The authors also examined the effect of bitwidth on the ImageNet implementation.<br />
<br />
{| class="wikitable"<br />
|+Top-5 error rates (%) on ImageNet with different <math>k_G</math> and <math>k_E</math><br />
!Pattern<br />
!vanilla<br />
!28ff-BN<br />
!28ff<br />
!28f8<br />
!28C8<br />
!288C<br />
!2888<br />
|-<br />
|error<br />
|19.29<br />
|20.67<br />
|24.14<br />
|23.92<br />
|26.88<br />
|28.06<br />
|27.82<br />
|}<br />
Here, C denotes 12 bits (hexadecimal) and BN indicates that batch normalization was added.<br />
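<br />
As a reading aid for the pattern names in the table, the following sketch decodes a pattern string into bitwidths. It assumes the four characters follow the order weights, activations, gradients, errors (as the name WAGE suggests), that each character is a hexadecimal digit, and that <code>f</code> stands for float32. The function name <code>decode_pattern</code> is made up for this illustration.<br />

```python
def decode_pattern(pattern):
    """Decode a pattern such as '28C8' or '28ff-BN' into per-signal bitwidths.

    Assumed order of the four characters: W (weights), A (activations),
    G (gradients), E (errors). 'f' is read as float32, not as hex 15.
    """
    digits = pattern.split("-")[0]  # drop a suffix such as "-BN"
    names = ("W", "A", "G", "E")
    if len(digits) != len(names):
        raise ValueError("expected four characters, got %r" % digits)
    widths = {}
    for name, ch in zip(names, digits):
        widths[name] = "float32" if ch.lower() == "f" else int(ch, 16)
    return widths


if __name__ == "__main__":
    for p in ("28ff-BN", "28C8", "288C", "2888"):
        print(p, decode_pattern(p))
```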
<br />
== Discussion ==<br />
The authors identify a few areas in which this approach could be improved.<br />
<br />
'''MAC Operation:''' The 2-8-8-8 configuration was chosen because the low weight bitwidth means that no multiplications are needed during inference. However, this does not remove the requirement for multiplication during training. A 2-2-8-8 configuration satisfies this requirement, but it is difficult to train and detrimental to accuracy.<br />
<br />
'''Non-linear Quantization:''' The linear mapping used in this approach is simple, but there might be a more effective mapping. For example, a logarithmic mapping could be more effective if the weights and activations have a log-normal distribution.<br />
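<br />
To make the contrast between a linear and a logarithmic mapping concrete, here is a minimal sketch. Both functions are illustrative assumptions rather than the paper's exact quantizers: the linear one uses a uniform grid with step <math>2^{1-k}</math> clipped inside (-1, 1), and the logarithmic one is a hypothetical scheme that keeps the sign and rounds the magnitude to the nearest power of two, with <math>2^{k-1}</math> exponent levels.<br />

```python
import math


def quantize_linear(x, k):
    """Uniform quantization with step 2**(1 - k), clipped inside (-1, 1)."""
    step = 2.0 ** (1 - k)
    q = step * round(x / step)
    return max(-1.0 + step, min(1.0 - step, q))


def quantize_log(x, k):
    """Hypothetical logarithmic quantizer: sign times a power of two.

    The exponent is rounded to an integer and limited to 2**(k - 1) levels,
    from 0 down to -(2**(k - 1) - 1). Zero is passed through unchanged.
    """
    if x == 0:
        return 0.0
    levels = 2 ** (k - 1)
    exponent = round(math.log2(abs(x)))
    exponent = max(-(levels - 1), min(0, exponent))
    return math.copysign(2.0 ** exponent, x)


if __name__ == "__main__":
    for value in (0.01, 0.3, -0.6):
        print(value, quantize_linear(value, 3), quantize_log(value, 3))
```

With 3 bits the linear grid maps 0.01 to 0, while the logarithmic grid keeps it at 0.125, which illustrates why a logarithmic mapping may preserve more information when most values are small.<br />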
<br />
'''Normalization:''' Normalization layers (softmax, batch normalization) were not used in this paper. Quantized versions are an area of future work.<br />
<br />
== Conclusion ==<br />
<br />
A framework for training and inference without the use of floating-point representation is presented. Future work may further improve compression and memory requirements.<br />
== References ==<br />
<br />
# Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017-03-27). [http://arxiv.org/abs/1703.09039 "Efficient Processing of Deep Neural Networks: A Tutorial and Survey"]. arXiv:1703.09039 [cs].<br />
# Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre (2015-11-01). [http://arxiv.org/abs/1511.00363 "BinaryConnect: Training Deep Neural Networks with binary weights during propagations"]. arXiv:1511.00363 [cs].<br />
# Li, Fengfu; Zhang, Bo; Liu, Bin (2016-05-16). [http://arxiv.org/abs/1605.04711 "Ternary Weight Networks"]. arXiv:1605.04711 [cs].<br />
# He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-02-06). [http://arxiv.org/abs/1502.01852 "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"]. arXiv:1502.01852 [cs].<br />
# Zhou, Shuchang; Wu, Yuxin; Ni, Zekun; Zhou, Xinyu; Wen, He; Zou, Yuheng (2016-06-20). [http://arxiv.org/abs/1606.06160 "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients"]. arXiv:1606.06160 [cs].<br />
# Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (November 1998). [http://ieeexplore.ieee.org/document/726791/?reload=true "Gradient-based learning applied to document recognition"]. Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. ISSN 0018-9219.<br />
# Simonyan, Karen; Zisserman, Andrew (2014-09-04). [http://arxiv.org/abs/1409.1556 "Very Deep Convolutional Networks for Large-Scale Image Recognition"]. arXiv:1409.1556 [cs].<br />
# Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q., eds. [http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf Advances in Neural Information Processing Systems 25 (PDF)]. Curran Associates, Inc. pp. 1097–1105.<br />
# Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:32_error.png&diff=35851File:32 error.png2018-03-28T14:57:32Z<p>H5tahir: </p>
<hr />
<div></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35394Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T16:31:21Z<p>H5tahir: /* Coarse Level Features Needed For Classification */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks (MSDNets) are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
* efficient networks that do not do well on hard examples, or<br />
* large networks that do well on all examples but require a large amount of resources.<br />
<br />
To be efficient across all levels of difficulty, MSDNets propose a structure that can output accurate classifications under varying computational budgets. The two settings used to evaluate the network are:<br />
* Anytime Prediction: What is the best prediction the network can provide when suddenly prompted?<br />
* Budget Batch Predictions: Given a maximum amount of computational resources, how well does the network do on the batch?<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains a smaller network to match a teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets draw on concepts from a number of existing networks:<br />
* Neural fabrics, among others, are used to quickly establish a low-resolution feature map, which is integral for classification.<br />
* Deeply supervised nets introduced the incorporation of multiple classifiers throughout the network.<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
MSDNets aim to provide efficient classification at varying computational costs by creating one network that outputs results at different depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
[[File:paper29 fig3.png | 700px|thumb|center]]<br />
<br />
Coarse-level features are needed to gain context of the scene. In typical CNN-based networks, the features propagate from fine to coarse. Classifiers added to the early, fine-featured layers do not output accurate predictions due to the lack of context.<br />
<br />
Figure 3 depicts the relative accuracies of the intermediate classifiers and shows that the accuracy of a classifier is highly correlated with its position in the network. It is easy to see, specifically in the case of ResNet, that the classifiers improve in a staircase pattern. All of the experiments were performed on the CIFAR-100 dataset, and it can be seen that the intermediate classifiers perform worse than the final classifiers, highlighting the problem caused by the lack of coarse-level features early on.<br />
<br />
To address this issue, MSDNets propose an architecture that uses multi-scale feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse-level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior layer, providing opportunities to learn new features that prior layers have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of an MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special mini-CNN that quickly fills all required scales with features. The following layers are generated through convolutions of the previous layers and scales.<br />
<br />
Each output at a given scale <math>s</math> is given by the convolution of all prior outputs of the same scale and the strided convolution of all prior outputs from the previous scale. <br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is averaged over a set of training samples. The weights can be determined from a budget of computational power, but results show that setting all of them to 1 is also acceptable.<br />
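<br />
A minimal sketch of this weighted multi-classifier loss is given below. It assumes the logistic loss of each classifier is the cross-entropy of its predicted class probabilities, and it uses plain Python lists instead of a deep learning framework. The names <code>cross_entropy</code> and <code>msdnet_loss</code> are chosen for this illustration only.<br />

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def msdnet_loss(batch_probs, labels, weights=None):
    """Average over samples of the weighted sum of per-classifier losses.

    batch_probs[n][k] is the probability vector produced by classifier k
    for sample n. `weights` defaults to 1 for every classifier.
    """
    num_classifiers = len(batch_probs[0])
    if weights is None:
        weights = [1.0] * num_classifiers
    total = 0.0
    for sample_probs, label in zip(batch_probs, labels):
        total += sum(
            w * cross_entropy(p, label) for w, p in zip(weights, sample_probs)
        )
    return total / len(labels)


if __name__ == "__main__":
    # One sample, two classifiers, true class 1.
    probs = [[[0.5, 0.5], [0.25, 0.75]]]
    print(msdnet_loss(probs, [1]))
```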
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it spends less of the budget on easy examples in order to allow more time to be spent on hard ones. <br />
To facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must hold, where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier, and <math>B</math> is the total computational budget. Assuming that all classifiers have the same base exit probability, <math>q</math>, the resulting <math>q_k</math> can be used to find the thresholds.<br />
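<br />
The budget constraint can be explored numerically with the sketch below. It assumes the exit probabilities take the geometric form <math>q_k = z(1-q)^{k-1}q</math>, with <math>z</math> normalizing them to sum to 1, and it finds by bisection the smallest <math>q</math> whose expected per-sample cost fits the budget. The costs in the example are made-up numbers, and translating <math>q_k</math> into confidence thresholds on held-out data is not shown.<br />

```python
def exit_probabilities(q, num_classifiers):
    """q_k proportional to q * (1 - q)**(k - 1), normalized to sum to 1."""
    raw = [q * (1.0 - q) ** k for k in range(num_classifiers)]
    z = 1.0 / sum(raw)
    return [z * r for r in raw]


def expected_cost(q, costs):
    """Expected per-sample cost: sum over classifiers of q_k * C_k."""
    probs = exit_probabilities(q, len(costs))
    return sum(p * c for p, c in zip(probs, costs))


def solve_exit_rate(costs, budget_per_sample, tol=1e-9):
    """Smallest q in (0, 1) with expected cost within the per-sample budget.

    Assumes `costs` is increasing, so the expected cost decreases in q.
    `budget_per_sample` corresponds to B / |D_test|.
    """
    lo, hi = 1e-9, 1.0 - 1e-9
    if expected_cost(lo, costs) <= budget_per_sample:
        return lo  # budget is loose even when exits are nearly uniform
    if expected_cost(hi, costs) > budget_per_sample:
        raise ValueError("budget is below the cost of the first classifier")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_cost(mid, costs) <= budget_per_sample:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    costs = [1.0, 2.0, 3.0, 4.0]  # hypothetical cumulative costs C_k
    q = solve_exit_rate(costs, budget_per_sample=2.0)
    print(q, exit_probabilities(q, len(costs)), expected_cost(q, costs))
```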
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100, ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet, are compared with MSDNet. <br />
<br />
When evaluating on ImageNet, ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction, MSDNets are shown to be highly accurate with very little budget, and they remain above the alternative methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budgeted batch classification, three MSDNets are designed with classifiers set up for varying ranges of budget constraints. On both dataset options, the MSDNets exceed all alternative methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and evaluation scenarios were well designed, and according to independent reviews, the results were reproducible. Where the paper could improve is in explaining how to implement the threshold; it does not clearly explain how the validation set can be used to set the threshold value.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Multi-scale_Dense_Networks_for_Resource_Efficient_Image_Classification&diff=35393Multi-scale Dense Networks for Resource Efficient Image Classification2018-03-23T16:26:56Z<p>H5tahir: /* Coarse Level Features Needed For Classification */</p>
<hr />
<div>= Introduction = <br />
<br />
Multi-Scale Dense Networks (MSDNets) are designed to address the growing demand for efficient object recognition. The issue with existing recognition networks is that they are either:<br />
* efficient networks that do not do well on hard examples, or<br />
* large networks that do well on all examples but require a large amount of resources.<br />
<br />
To be efficient across all levels of difficulty, MSDNets propose a structure that can output accurate classifications under varying computational budgets. The two settings used to evaluate the network are:<br />
* Anytime Prediction: What is the best prediction the network can provide when suddenly prompted?<br />
* Budget Batch Predictions: Given a maximum amount of computational resources, how well does the network do on the batch?<br />
<br />
= Related Networks =<br />
<br />
== Computationally Efficient Networks ==<br />
<br />
Existing methods for refining an accurate network to be more efficient include weight pruning, quantization of weights (during or after training), and knowledge distillation, which trains a smaller network to match a teacher network.<br />
<br />
== Resource Efficient Networks == <br />
<br />
Unlike the above, resource efficient concepts consider limited resources as a part of the structure/loss.<br />
Examples of work in this area include: <br />
* Efficient variants to existing state of the art networks<br />
* Gradient boosted decision trees, which incorporate computational limitations into the training<br />
* Fractal nets<br />
* Adaptive computation time method<br />
<br />
== Related architectures ==<br />
<br />
MSDNets draw on concepts from a number of existing networks:<br />
* Neural fabrics, among others, are used to quickly establish a low-resolution feature map, which is integral for classification.<br />
* Deeply supervised nets introduced the incorporation of multiple classifiers throughout the network.<br />
* The feature concatenation method from DenseNets allows the later classifiers to not be disrupted by the weight updates from earlier classifiers.<br />
<br />
= Multi-Scale Dense Networks =<br />
<br />
== Integral Contributions ==<br />
<br />
MSDNets aim to provide efficient classification at varying computational costs by creating one network that outputs results at different depths. While this may seem trivial, as intermediate classifiers can be inserted into any existing network, two major problems arise.<br />
<br />
=== Coarse Level Features Needed For Classification ===<br />
<br />
[[File:paper29 fig3.png | 700px|thumb|center]]<br />
<br />
Coarse-level features are needed to gain context of the scene. In typical CNN-based networks, the features propagate from fine to coarse. Classifiers added to the early, fine-featured layers do not output accurate predictions due to the lack of context.<br />
<br />
To address this issue, MSDNets propose an architecture that uses multi-scale feature maps. The network is quickly formed to contain a set number of scales ranging from fine to coarse. These scales are propagated throughout, so that for the length of the network there are always coarse-level features for classification and fine features for learning more difficult representations.<br />
<br />
=== Training of Early Classifiers Interferes with Later Classifiers ===<br />
<br />
When training a network containing intermediate classifiers, the training of early classifiers will cause the early layers to focus on features for that classifier. These learned features may not be as useful to the later classifiers and degrade their accuracy.<br />
<br />
MSDNets use dense connectivity to avoid this issue. By concatenating all prior layers to learn future layers, the gradient propagation is spread throughout the available features. This allows later layers to not be reliant on any single prior layer, providing opportunities to learn new features that prior layers have ignored.<br />
<br />
== Architecture ==<br />
<br />
[[File:MSDNet_arch.png | 700px|thumb|center|Left: the MSDNet architecture. Right: example calculations for each output given 3 scales and 4 layers.]]<br />
<br />
The architecture of an MSDNet is a structure of convolutions with a set number of layers and a set number of scales. Layers allow the network to build on the previous information to generate more accurate predictions, while the scales allow the network to maintain coarse level features throughout.<br />
<br />
The first layer is a special mini-CNN that quickly fills all required scales with features. The following layers are generated through convolutions of the previous layers and scales.<br />
<br />
Each output at a given scale <math>s</math> is given by the convolution of all prior outputs of the same scale and the strided convolution of all prior outputs from the previous scale. <br />
<br />
The classifiers are run on the concatenation of all of the coarsest outputs from the preceding layers.<br />
<br />
=== Loss Function ===<br />
<br />
The loss is calculated as a weighted sum of each classifier's logistic loss. The weighted loss is averaged over a set of training samples. The weights can be determined from a budget of computational power, but results show that setting all of them to 1 is also acceptable.<br />
<br />
=== Computational Limit Inclusion ===<br />
<br />
When running in a budgeted batch scenario, the network attempts to provide the best overall accuracy. To do this with a set limit on computational resources, it spends less of the budget on easy examples in order to allow more time to be spent on hard ones. <br />
To facilitate this, the classifiers are designed to exit when the confidence of the classification exceeds a preset threshold. To determine the threshold for each classifier, <math>|D_{test}|\sum_{k}(q_k C_k) \leq B </math> must hold, where <math>|D_{test}|</math> is the total number of test samples, <math>C_k</math> is the computational requirement to get an output from the <math>k</math>th classifier, <math>q_k </math> is the probability that a sample exits at the <math>k</math>th classifier, and <math>B</math> is the total computational budget. Assuming that all classifiers have the same base exit probability, <math>q</math>, the resulting <math>q_k</math> can be used to find the thresholds.<br />
<br />
= Experiments = <br />
<br />
When evaluating on CIFAR-10 and CIFAR-100, ensembles and multi-classifier versions of ResNets and DenseNets, as well as FractalNet, are compared with MSDNet. <br />
<br />
When evaluating on ImageNet, ensembles and individual versions of ResNets and DenseNets are compared with MSDNets.<br />
<br />
== Anytime Prediction ==<br />
<br />
In anytime prediction, MSDNets are shown to be highly accurate with very little budget, and they remain above the alternative methods as the budget increases.<br />
<br />
[[File:MSDNet_anytime.png | 700px|thumb|center|Accuracy of the anytime classification models.]]<br />
<br />
== Budget Batch ==<br />
<br />
For budgeted batch classification, three MSDNets are designed with classifiers set up for varying ranges of budget constraints. On both dataset options, the MSDNets exceed all alternative methods with a fraction of the budget required.<br />
<br />
[[File:MSDNet_budgetbatch.png | 700px|thumb|center|Accuracy of the budget batch classification models.]]<br />
<br />
= Critique = <br />
<br />
The problem formulation and evaluation scenarios were well designed, and according to independent reviews, the results were reproducible. Where the paper could improve is in explaining how to implement the threshold; it does not clearly explain how the validation set can be used to set the threshold value.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:paper29_fig3.png&diff=35392File:paper29 fig3.png2018-03-23T16:26:21Z<p>H5tahir: </p>
<hr />
<div></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=35391Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-03-23T16:18:41Z<p>H5tahir: /* Entanglement of Pitch and Timbre */</p>
<hr />
<div>= Introduction =<br />
The authors point out that most synthesized notes are produced by hand-designed instruments, which modify pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. They argue that this is limiting and propose a data-driven approach to audio synthesis. To train such a data-hungry model, the authors highlight the need for a large dataset, much like ImageNet, but for music. <br />
<br />
= Contributions =<br />
To solve the problem highlighted above the authors propose two main contributions of their paper: <br />
* A WaveNet-style autoencoder that learns to encode temporal data over long-term audio structures without requiring external conditioning<br />
* NSynth: a large dataset of musical notes inspired by the emergence of large image datasets<br />
<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
The authors accomplish this by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>; the input is then shifted and fed into the decoder, which reproduces the input. The resulting probability distribution is: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N p(x_i | x_1, \dots , x_{i-1}, f(x))<br />
\end{align}<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram shows the encoder as a 30-layer network in which each node is a ReLU nonlinearity followed by a non-causal (NC) dilated convolution. The resulting convolution has 128 channels, all fed into another ReLU nonlinearity, which is fed into a 1x1 convolution before being downsampled with average pooling to produce a 16-dimensional <math>Z </math> embedding. Each <math>Z </math> encoding covers a specific temporal resolution, which the authors tuned to 32 ms. This means that there are 125 16-dimensional <math>Z </math> encodings for each 4-second note in the NSynth dataset (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. <br />
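<br />
The frame arithmetic and the nearest-neighbour upsampling step can be sketched as follows. This is a minimal illustration with plain Python lists; the constant and function names are chosen here and are not taken from the released code.<br />

```python
SAMPLE_RATE = 16000  # Hz, as stated for the NSynth notes
HOP_MS = 32          # temporal resolution of one embedding frame

# Number of audio samples covered by one embedding frame.
HOP_SAMPLES = SAMPLE_RATE * HOP_MS // 1000


def num_frames(num_samples, hop=HOP_SAMPLES):
    """Number of embedding frames for a clip of `num_samples` samples."""
    return num_samples // hop


def upsample_nearest(embedding, hop=HOP_SAMPLES):
    """Nearest-neighbour upsampling: repeat each frame `hop` times.

    `embedding` is a list of frames (each a list of floats). The repeated
    entries refer to the same frame object, which is fine for reading.
    """
    return [frame for frame in embedding for _ in range(hop)]


if __name__ == "__main__":
    print(HOP_SAMPLES)                  # samples per frame
    print(num_frames(4 * SAMPLE_RATE))  # frames in a 4-second note
    print(upsample_nearest([[1.0], [2.0]], hop=3))
```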
<br />
== Baseline: Spectral Autoencoder ==<br />
Unable to find an alternative fully deep model against which to compare their proposed WaveNet autoencoder, the authors built a strong baseline of their own: a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layers deep. Each layer has 4x4 kernels with 2x2 strides, followed by a leaky ReLU (0.1) and batch normalization. The final hidden vector (Z) was set to 1984 dimensions to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
The authors attempted to train the baseline on multiple inputs: raw waveforms, FFT, and the log magnitude of the spectrum, finding the latter to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase was most effective. A final heuristic used to increase the accuracy of the baseline was weighting the mean squared error (MSE) loss, starting at 10 for 0 Hz and decreasing linearly to 1 at 4000 Hz and above. This is reasonable, as the fundamental frequencies of most instruments are found at lower frequencies. <br />
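<br />
The frequency weighting of the MSE loss can be written down directly from the description above. The sketch below is an illustration only: the function names are made up, and averaging over frequency bins is an assumption about how the weighted errors are combined.<br />

```python
def mse_weight(freq_hz, low_weight=10.0, high_weight=1.0, cutoff_hz=4000.0):
    """Weight that falls linearly from 10 at 0 Hz to 1 at 4000 Hz and above."""
    if freq_hz >= cutoff_hz:
        return high_weight
    return low_weight + (high_weight - low_weight) * (freq_hz / cutoff_hz)


def weighted_mse(pred, target, freqs_hz):
    """Frequency-weighted mean squared error over spectrum bins.

    pred, target and freqs_hz are equal-length lists: predicted magnitude,
    reference magnitude and centre frequency of each bin.
    """
    total = sum(
        mse_weight(f) * (p - t) ** 2 for p, t, f in zip(pred, target, freqs_hz)
    )
    return total / len(freqs_hz)


if __name__ == "__main__":
    for f in (0.0, 2000.0, 4000.0, 8000.0):
        print(f, mse_weight(f))
```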
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder asynchronously for 1,800,000 epochs with a batch size of 8 and a learning rate of 1e-4, whereas the WaveNet models were trained synchronously for 250,000 epochs with a batch size of 32 and a decaying learning rate ranging from 2e-4 to 6e-6.<br />
<br />
= The NSynth Dataset =<br />
The NSynth dataset has 306,043 unique musical notes, all 4 seconds in length and sampled at 16,000 Hz. The dataset consists of 1006 different instruments playing an average of 65.4 different pitches across an average of 4.75 different velocities. Averages are reported because not all instruments can reach all 88 MIDI pitches or all 5 velocities desired by the authors. The dataset has the following split: a training set with 289,205 notes, a validation set with 12,678 notes, and a test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There are 3 classes: ‘acoustic’, ‘electronic’ and ‘synthetic’<br />
* Family - The family of instruments that produced each note. There are 11 classes, including ‘bass’, ‘brass’ and ‘vocal’<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth.<br />
<br />
<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a pictorial representation, as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis, as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend that all readers listen to the created notes, which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even taking this into consideration, the authors do show pictorially the differences between the two proposed algorithms and the original note, as it is hard to publish a paper with sound included. To do so, they plot each note using the constant-Q transform (CQT), which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
=== Qualitative Comparison ===<br />
For the glockenspiel, the WaveNet autoencoder is able to reproduce the magnitude and phase of the fundamental frequency (A and C in figure 2) and the attack (B in figure 2) of the instrument, whereas the baseline autoencoder introduces non-existent harmonics (D in figure 2). The flugelhorn, on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet, while not perfect, is able to reproduce the vibrato (I and J in figure 2) across multiple frequencies, which results in a natural-sounding note. The baseline not only fails to do this but also adds extra noise (K in figure 2). The authors do add that the WaveNet produces some strikes (L in figure 2); however, they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison, the authors trained a separate multi-task classifier to predict the pitch or quality of a note. The reconstructions from both the baseline and the WaveNet were then fed to this classifier. As seen in table 1, WaveNet significantly outperformed the baseline in both metrics, posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation, the authors reconstructed notes from linear interpolations in Z space between different instruments and compared these to the superposition of the original two instruments. Not surprisingly, the models fuse aspects of both instruments during the recreation. The authors claim, however, that WaveNet produces much more realistic-sounding results. <br />
To support their claim, the authors point to WaveNet's ability to create dynamic mixing of overtones in time, even jumping to higher harmonics (A in figure 3), capturing the timbre and dynamics of both the bass and flute. This can be seen once again in (B in figure 3), where WaveNet adds additional harmonics as well as sub-harmonics to the original flute note. <br />
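<br />
Linear interpolation in Z space amounts to a frame-by-frame weighted average of two embeddings, which would then be passed to the decoder. The sketch below shows only the interpolation step, using plain lists. The function name and the mixing parameter <code>alpha</code> are illustrative choices.<br />

```python
def interpolate_embeddings(z_a, z_b, alpha=0.5):
    """Frame-wise linear interpolation between two embeddings.

    z_a and z_b are lists of frames (lists of floats) with the same shape.
    alpha = 0 returns z_a, alpha = 1 returns z_b.
    """
    if len(z_a) != len(z_b):
        raise ValueError("embeddings must have the same number of frames")
    return [
        [(1.0 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(z_a, z_b)
    ]


if __name__ == "__main__":
    bass = [[0.0, 2.0]]   # toy one-frame embeddings, not real data
    flute = [[2.0, 4.0]]
    print(interpolate_embeddings(bass, flute, alpha=0.5))
```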
<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space, the authors constructed a classifier whose accuracy was expected to drop if the representations of pitch and timbre are disentangled, as it relies heavily on the pitch information. This is demonstrated by the first two rows of table 2, where WaveNet relies more strongly on pitch than the baseline algorithm. The authors provide a more qualitative demonstration in figure 4. They show a situation in which a classifier may be confused: a note with a pitch of +12 is almost exactly the same as the original apart from the emergence of sub-harmonics.<br />
<br />
Further insight into the relationship between pitch and timbre can be gained by studying the trend in the network embeddings across pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88-note range at velocity 127. It can be noted from the figure that each instrument has a unique separation into two or more registers over which the embeddings of notes with different pitches are similar. This is expected, since instrumental dynamics and timbre vary dramatically over the range of the instrument.<br />
<br />
= Future Directions =<br />
<br />
One significant area in which the authors claim great improvement is needed is the large memory requirement of their algorithm. Because of it, the current WaveNet must rely on downsampling and is therefore unable to fully capture the global context. <br />
<br />
= Open Source Code base =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches&diff=35253MarrNet: 3D Shape Reconstruction via 2.5D Sketches2018-03-22T18:56:54Z<p>H5tahir: /* 2.5D Sketch Recovery */</p>
<hr />
<div>= Introduction =<br />
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.<br />
<br />
[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]<br />
<br />
In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images, and also enforces re-projection consistency between the 3D shape and the estimated sketches. The two-step approach makes the network more robust to differences in object texture, material, lighting, and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth and surface normal maps, the authors design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework.<br />
<br />
The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations for real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from the domain adaptation problem, since rendered images are imperfect imitations of real ones.<br />
<br />
Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.<br />
<br />
The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.<br />
<br />
= Related Work =<br />
<br />
== 2.5D Sketch Recovery ==<br />
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.<br />
<br />
[[File:2-5d_example.PNG|700px|thumb|center|Results from the paper: Learning Non-Lambertian Object Intrinsics across ShapeNet Categories. The results show that neural networks can be trained to recover 2.5D information from an image. The top row predicts the albedo and the bottom row predicts the shading. It can be observed that the results are still blurry and the fine details are not fully recovered.]]<br />
<br />
== Single Image 3D Reconstruction ==<br />
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.<br />
<br />
== 2D-3D Consistency ==<br />
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, with the use of depth and silhouettes, as well as some papers enforcing differentiable 2D-3D constraints for joint training of deep networks. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.<br />
<br />
= Approach =<br />
The 3D structure is recovered from a single RGB view using three steps, shown in Figure 1. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step, shown in Figure 2, estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.<br />
<br />
[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]<br />
<br />
== 2.5D Sketch Estimation ==<br />
The first step takes a 2D RGB image and predicts the surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information. A ResNet-18 encoder-decoder network is used, with the encoder taking a 256 x 256 RGB image, producing 8 x 8 x 512 feature maps. The decoder is four sets of 5 x 5 convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.<br />
<br />
== 3D Shape Estimation ==<br />
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem. The network architecture is inspired by the TL network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.<br />
<br />
== Re-projection Consistency ==<br />
The third step consists of a depth re-projection loss and a surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] \; \forall x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, and <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal at that position. Orthographic projection is used.<br />
<br />
[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]<br />
<br />
=== Depths ===<br />
The voxel at the estimated depth, <math>v_{x, y, d_{x, y}}</math>, should be 1, while all voxels in front of it should be 0. The projected depth loss is defined as follows:<br />
<br />
<math><br />
L_{depth}(x, y, z)=<br />
\left\{<br />
\begin{array}{ll}<br />
v^2_{x, y, z}, & z < d_{x, y} \\<br />
(1 - v_{x, y, z})^2, & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
<math><br />
\frac{\partial L_{depth}(x, y, z)}{\partial v_{x, y, z}} =<br />
\left\{<br />
\begin{array}{ll}<br />
2v_{x, y, z}, & z < d_{x, y} \\<br />
2(v_{x, y, z} - 1), & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
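When <math>d_{x, y} = \infty</math>, the line of sight does not hit the object: every voxel along it satisfies <math>z < d_{x, y}</math>, so all of them should be 0. This special case corresponds to the silhouette criterion.<br />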
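<br />
The piecewise definition above can be sketched for a single line of sight as follows. This is a minimal illustration rather than the authors' implementation: the function name <code>depth_loss_along_ray</code>, the list-based voxel layout, and the use of an integer voxel index for the depth are all assumptions made for the example.<br />
<br />
```python
import math


def depth_loss_along_ray(voxels, depth):
    """Depth re-projection loss and gradient for one (x, y) line of sight.

    voxels: occupancies v_{x,y,z} in [0, 1], indexed by z (hypothetical layout).
    depth:  integer voxel index d_{x,y}, or math.inf if the ray misses the object.
    Returns (total loss, list of dL/dv for each z).
    """
    loss = 0.0
    grads = [0.0] * len(voxels)
    for z, v in enumerate(voxels):
        if z < depth:
            # Voxels in front of the surface should be empty.
            loss += v ** 2
            grads[z] = 2.0 * v
        elif z == depth:
            # The voxel at the estimated depth should be occupied.
            loss += (1.0 - v) ** 2
            grads[z] = 2.0 * (v - 1.0)
        # Voxels behind the surface (z > depth) are unconstrained.
    return loss, grads


# Usage: a ray with the surface at index 2, and a ray that misses the object.
hit_loss, hit_grads = depth_loss_along_ray([0.5, 0.0, 0.5, 1.0], 2)
miss_loss, miss_grads = depth_loss_along_ray([0.5, 0.0, 0.5, 1.0], math.inf)
```
<br />
With an infinite depth every voxel falls into the first branch, which reproduces the silhouette behaviour without a separate code path.<br />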
<br />
=== Surface Normals ===<br />
Since the vectors <math>n_{x} = (0, -n_{c}, n_{b})</math> and <math>n_{y} = (-n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized (divided by <math>n_{c}</math>) to obtain <math>n'_{x} = (0, -1, n_{b}/n_{c})</math> and <math>n'_{y} = (-1, 0, n_{a}/n_{c})</math>, which lie on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal loss tries to guarantee that the voxels at <math>(x, y, z) \pm n'_{x}</math> and <math>(x, y, z) \pm n'_{y}</math> are 1, to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.<br />
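<br />
As a quick check of the orthogonality claim (a worked verification added for illustration, using only the vectors defined above), the dot products with the normal vanish term by term:<br />
<br />
<math>
n_{x} \cdot n_{x, y} = 0 \cdot n_{a} + (-n_{c}) \, n_{b} + n_{b} \, n_{c} = 0,
\qquad
n_{y} \cdot n_{x, y} = (-n_{c}) \, n_{a} + 0 \cdot n_{b} + n_{a} \, n_{c} = 0
</math><br />
<br />
Dividing each vector by <math>n_{c}</math> (which assumes <math>n_{c} \neq 0</math>) only rescales it, so the rescaled vectors remain orthogonal to the normal and therefore still lie in the surface plane.<br />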
<br />
The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:<br />
<br />
<math><br />
L_{normal}(x, y, z) =<br />
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + <br />
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2<br />
</math><br />
<br />
Gradients along x are:<br />
<br />
<math><br />
\frac{\partial L_{normal}(x, y, z)}{\partial v_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{\partial L_{normal}(x, y, z)}{\partial v_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)<br />
</math><br />
<br />
Gradients along y are similar to x.<br />
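<br />
The surface normal loss and its gradients can be sketched in the same spirit. This is only an illustration under stated assumptions: the summary does not say how the fractional offsets <math>n_a/n_c</math> and <math>n_b/n_c</math> are mapped to voxel indices, so the sketch simply rounds them to the nearest index; the function name, the dictionary-based voxel grid, and the <code>silhouette</code> set are likewise hypothetical, and <math>n_c \neq 0</math> is assumed.<br />
<br />
```python
def normal_loss(voxels, silhouette, x, y, z, normal):
    """Surface normal re-projection loss at the surface voxel (x, y, z).

    voxels:     dict mapping integer (x, y, z) to occupancy in [0, 1] (hypothetical layout).
    silhouette: set of (x, y) positions inside the estimated silhouette.
    normal:     estimated normal (n_a, n_b, n_c); n_c is assumed to be non-zero.
    Returns (loss, dict of dL/dv for each constrained voxel).
    """
    n_a, n_b, n_c = normal
    # Neighbours at (x, y, z) +/- n'_x and (x, y, z) +/- n'_y.
    # Rounding the fractional z offset is an assumption of this sketch.
    targets = [
        (x, y - 1, round(z + n_b / n_c)),
        (x, y + 1, round(z - n_b / n_c)),
        (x - 1, y, round(z + n_a / n_c)),
        (x + 1, y, round(z - n_a / n_c)),
    ]
    loss = 0.0
    grads = {}
    for tx, ty, tz in targets:
        # Only constrain voxels that exist and fall inside the silhouette.
        if (tx, ty) not in silhouette or (tx, ty, tz) not in voxels:
            continue
        v = voxels[(tx, ty, tz)]
        loss += (1.0 - v) ** 2
        grads[(tx, ty, tz)] = 2.0 * (v - 1.0)
    return loss, grads
```
<br />
Each constrained neighbour contributes the same <math>(1 - v)^2</math> term, so the gradients along y take the same form as those along x, only at different voxel positions.<br />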
<br />
= Training =<br />
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.<br />
<br />
For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.<br />
<br />
Reprojection consistency loss is used to fine-tune the 3D estimation on real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well but have an unrealistic 3D appearance due to overfitting.<br />
<br />
Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.<br />
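<br />
The fine-tuning scheme described above (frozen decoder, SGD with learning rate 0.001 and momentum 0.9) can be sketched on scalar stand-in parameters. This is a toy illustration, not the authors' training code: the function name, the dictionary of named parameters, and the particular momentum convention (velocity accumulates raw gradients, the learning rate is applied at the update) are assumptions.<br />
<br />
```python
def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9, frozen=()):
    """Apply one in-place SGD-with-momentum update, skipping frozen parameters.

    params, grads, velocity: dicts keyed by parameter name (hypothetical layout).
    frozen: names that must not be updated, e.g. the 3D decoder.
    """
    for name in params:
        if name in frozen:
            continue  # frozen parameters keep the learned shape prior
        velocity[name] = momentum * velocity[name] + grads[name]
        params[name] -= lr * velocity[name]


# Usage: only the encoder moves; the decoder stays at its pre-trained value.
params = {"encoder": 1.0, "decoder": 1.0}
velocity = {"encoder": 0.0, "decoder": 0.0}
for _ in range(40):  # 40 iterations per image, as in the text
    grads = {"encoder": 2.0, "decoder": 2.0}  # placeholder gradients
    sgd_momentum_step(params, grads, velocity, frozen=("decoder",))
```
<br />
Freezing the decoder restricts fine-tuning to the latent code the encoder produces, which is why the learned shape prior is preserved.<br />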
<br />
= Evaluation =<br />
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets.<br />
<br />
== ShapeNet ==<br />
Synthesized images of 6,778 chairs from ShapeNet are rendered from 20 random viewpoints. The chairs are placed in front of random backgrounds from the SUN dataset, and the RGB, depth, normal, and silhouette images are rendered using the physics-based renderer Mitsuba for more realistic images.<br />
<br />
=== Method ===<br />
MarrNet is trained without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation.<br />
<br />
=== Results ===<br />
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. Quantitatively, the full model also achieves 0.57 IoU, higher than the direct prediction baseline.<br />
<br />
[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]<br />
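<br />
For reference, the IoU (intersection over union) score quoted above can be computed on voxel grids roughly as follows. This is a minimal sketch: the function name, the flat-list voxel layout, and the 0.5 binarization threshold are assumptions, since the summary does not state how predictions are thresholded.<br />
<br />
```python
def voxel_iou(pred, truth, threshold=0.5):
    """Intersection over union between a predicted and a ground truth voxel grid.

    pred:  flat sequence of predicted occupancies in [0, 1].
    truth: flat sequence of ground truth occupancies (0 or 1).
    threshold: binarization cut-off for predictions (an assumption of this sketch).
    """
    intersection = 0
    union = 0
    for p, t in zip(pred, truth):
        p_on = p >= threshold
        t_on = bool(t)
        intersection += p_on and t_on
        union += p_on or t_on
    # Two empty grids are treated as a perfect match.
    return intersection / union if union else 1.0
```
<br />
Because the score only counts occupied voxels, a blurry "mean shape" that covers most of the ground truth can score well even when fine details are missing.<br />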
<br />
== PASCAL 3D+ ==<br />
The PASCAL 3D+ dataset provides rough 3D models for objects in real-life images.<br />
<br />
=== Method ===<br />
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.<br />
<br />
=== Results ===<br />
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.<br />
<br />
[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]<br />
<br />
Additional comparisons are made with the state-of-the-art method (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. However, the authors claim that the IoU metric is sub-optimal for three reasons. First, there is no emphasis on details since the metric prefers models that predict mean shapes consistently. Second, all possible scales are searched during the IoU computation, making it less efficient. Third, PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, and computing IoU with these shapes is not very informative. Instead, human studies are conducted, in which MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time over the ground truth. This suggests that MarrNet produces visually convincing shapes, and also highlights that the ground truth shapes are of limited quality.<br />
<br />
[[File:human_studies.png|600px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show how often (in percent) humans preferred the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]<br />
<br />
<br />
[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]<br />
<br />
Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.<br />
<br />
[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]<br />
<br />
== IKEA ==<br />
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.<br />
<br />
=== Results ===<br />
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real-life scenarios. Human studies show that MarrNet reconstructions are preferred 61% of the time over those of 3D-VAE-GAN.<br />
<br />
[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]<br />
<br />
== Other Data ==<br />
MarrNet is also applied to cars and airplanes. As shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.<br />
<br />
[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]<br />
<br />
MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.<br />
<br />
[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]<br />
<br />
= Commentary =<br />
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.<br />
<br />
As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.<br />
<br />
As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair armrests. As well, a closer look at some results suggests that fine-tuning only the 3D encoder does not transfer well to unseen objects, since the shape priors have already been learned by the decoder.<br />
<br />
= Conclusion =<br />
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.<br />
<br />
= References =<br />
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches&diff=35251MarrNet: 3D Shape Reconstruction via 2.5D Sketches2018-03-22T18:50:29Z<p>H5tahir: /* 2.5D Sketch Recovery */</p>
<hr />
<div>= Introduction =<br />
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.<br />
<br />
[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]<br />
<br />
In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images, and also enforces re-projection consistency between the 3D shape and the estimated sketches. The two-step approach makes the network more robust to differences in object texture, material, lighting, and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth and surface normal maps, the authors design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework.<br />
<br />
The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations for real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from the domain adaptation problem, since rendered images are imperfect imitations of real ones.<br />
<br />
Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.<br />
<br />
The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.<br />
<br />
= Related Work =<br />
<br />
== 2.5D Sketch Recovery ==<br />
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.<br />
<br />
== Single Image 3D Reconstruction ==<br />
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.<br />
<br />
== 2D-3D Consistency ==<br />
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, with the use of depth and silhouettes, as well as some papers enforcing differentiable 2D-3D constraints for joint training of deep networks. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.<br />
<br />
= Approach =<br />
The 3D structure is recovered from a single RGB view using three steps, shown in Figure 1. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step, shown in Figure 2, estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.<br />
<br />
[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]<br />
<br />
== 2.5D Sketch Estimation ==<br />
The first step takes a 2D RGB image and predicts the surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information. A ResNet-18 encoder-decoder network is used, with the encoder taking a 256 x 256 RGB image, producing 8 x 8 x 512 feature maps. The decoder is four sets of 5 x 5 convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.<br />
<br />
== 3D Shape Estimation ==<br />
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem. The network architecture is inspired by the TL network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.<br />
<br />
== Re-projection Consistency ==<br />
The third step consists of a depth re-projection loss and a surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] \; \forall x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, and <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal at that position. Orthographic projection is used.<br />
<br />
[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]<br />
<br />
=== Depths ===<br />
The voxel at the estimated depth, <math>v_{x, y, d_{x, y}}</math>, should be 1, while all voxels in front of it should be 0. The projected depth loss is defined as follows:<br />
<br />
<math><br />
L_{depth}(x, y, z)=<br />
\left\{<br />
\begin{array}{ll}<br />
v^2_{x, y, z}, & z < d_{x, y} \\<br />
(1 - v_{x, y, z})^2, & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
<math><br />
\frac{\partial L_{depth}(x, y, z)}{\partial v_{x, y, z}} =<br />
\left\{<br />
\begin{array}{ll}<br />
2v_{x, y, z}, & z < d_{x, y} \\<br />
2(v_{x, y, z} - 1), & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
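When <math>d_{x, y} = \infty</math>, the line of sight does not hit the object: every voxel along it satisfies <math>z < d_{x, y}</math>, so all of them should be 0. This special case corresponds to the silhouette criterion.<br />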
<br />
=== Surface Normals ===<br />
Since the vectors <math>n_{x} = (0, -n_{c}, n_{b})</math> and <math>n_{y} = (-n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be normalized (divided by <math>n_{c}</math>) to obtain <math>n'_{x} = (0, -1, n_{b}/n_{c})</math> and <math>n'_{y} = (-1, 0, n_{a}/n_{c})</math>, which lie on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal loss tries to guarantee that the voxels at <math>(x, y, z) \pm n'_{x}</math> and <math>(x, y, z) \pm n'_{y}</math> are 1, to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.<br />
<br />
The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:<br />
<br />
<math><br />
L_{normal}(x, y, z) =<br />
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + <br />
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2<br />
</math><br />
<br />
Gradients along x are:<br />
<br />
<math><br />
\frac{\partial L_{normal}(x, y, z)}{\partial v_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{\partial L_{normal}(x, y, z)}{\partial v_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)<br />
</math><br />
<br />
Gradients along y are similar to x.<br />
<br />
= Training =<br />
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.<br />
<br />
For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.<br />
<br />
Reprojection consistency loss is used to fine-tune the 3D estimation on real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well but have an unrealistic 3D appearance due to overfitting.<br />
<br />
Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.<br />
<br />
= Evaluation =<br />
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets.<br />
<br />
== ShapeNet ==<br />
Synthesized images of 6,778 chairs from ShapeNet are rendered from 20 random viewpoints. The chairs are placed in front of random backgrounds from the SUN dataset, and the RGB, depth, normal, and silhouette images are rendered using the physics-based renderer Mitsuba for more realistic images.<br />
<br />
=== Method ===<br />
MarrNet is trained without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation.<br />
<br />
=== Results ===<br />
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. Quantitatively, the full model also achieves 0.57 IoU, higher than the direct prediction baseline.<br />
<br />
[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]<br />
<br />
== PASCAL 3D+ ==<br />
The PASCAL 3D+ dataset provides rough 3D models for objects in real-life images.<br />
<br />
=== Method ===<br />
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.<br />
<br />
=== Results ===<br />
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.<br />
<br />
[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]<br />
<br />
Additional comparisons are made with the state-of-the-art method (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. However, the authors claim that the IoU metric is sub-optimal for three reasons. First, there is no emphasis on details since the metric prefers models that predict mean shapes consistently. Second, all possible scales are searched during the IoU computation, making it less efficient. Third, PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, and computing IoU with these shapes is not very informative. Instead, human studies are conducted, in which MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time over the ground truth. This suggests that MarrNet produces visually convincing shapes, and also highlights that the ground truth shapes are of limited quality.<br />
<br />
[[File:human_studies.png|600px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show how often (in percent) humans preferred the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]<br />
<br />
<br />
[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]<br />
<br />
Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.<br />
<br />
[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]<br />
<br />
== IKEA ==<br />
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.<br />
<br />
=== Results ===<br />
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real-life scenarios. Human studies show that MarrNet reconstructions are preferred 61% of the time over those of 3D-VAE-GAN.<br />
<br />
[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]<br />
<br />
== Other Data ==<br />
MarrNet is also applied to cars and airplanes. As shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.<br />
<br />
[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]<br />
<br />
MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.<br />
<br />
[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]<br />
<br />
= Commentary =<br />
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.<br />
<br />
As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.<br />
<br />
As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair arm rests. As well, looking more carefully at some results, it seems that fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder.<br />
<br />
= Conclusion =<br />
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.<br />
<br />
= References =<br />
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:2-5d_example.PNG&diff=35250File:2-5d example.PNG2018-03-22T18:50:20Z<p>H5tahir: </p>
<hr />
<div></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MarrNet:_3D_Shape_Reconstruction_via_2.5D_Sketches&diff=35247MarrNet: 3D Shape Reconstruction via 2.5D Sketches2018-03-22T18:46:18Z<p>H5tahir: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
Humans are able to quickly recognize 3D shapes from images, even in spite of drastic differences in object texture, material, lighting, and background.<br />
<br />
[[File:marrnet_intro_image.png|700px|thumb|center|Objects in real images. The appearance of the same shaped object varies based on colour, texture, lighting, background, etc. However, the 2.5D sketches (e.g. depth or normal maps) of the object remain constant, and can be seen as an abstraction of the object which is used to reconstruct the 3D shape.]]<br />
<br />
In this work, the authors propose a novel end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape from images, and also enforces reprojection consistency between the 3D shape and the estimated sketch. The two-step approach makes the network more robust to differences in object texture, material, lighting and background. Based on the idea from [Marr, 1982] that human 3D perception relies on recovering 2.5D sketches, which include depth and surface normal maps, the authors design an end-to-end trainable pipeline which they call MarrNet. MarrNet first estimates depth, normal maps, and silhouette, followed by a 3D shape. MarrNet uses an encoder-decoder structure for the sub-components of the framework. <br />
<br />
The authors claim several unique advantages to their method. Single image 3D reconstruction is a highly under-constrained problem, requiring strong prior knowledge of object shapes. As well, accurate 3D object annotations for real images are not common, and many previous approaches rely on purely synthetic data. However, most of these methods suffer from domain adaptation problems due to imperfect rendering.<br />
<br />
Using 2.5D sketches can alleviate the challenges of domain transfer. It is straightforward to generate perfect object surface normals and depths using a graphics engine. Since 2.5D sketches contain only depth, surface normal, and silhouette information, the second step of recovering 3D shape can be trained purely from synthetic data. As well, the introduction of differentiable constraints between 2.5D sketches and 3D shape makes it possible to fine-tune the system, even without any annotations.<br />
<br />
The framework is evaluated on both synthetic objects from ShapeNet, and real images from PASCAL 3D+, showing good qualitative and quantitative performance in 3D shape reconstruction.<br />
<br />
= Related Work =<br />
<br />
== 2.5D Sketch Recovery ==<br />
Researchers have explored recovering 2.5D information from shading, texture, and colour images in the past. More recently, the development of depth sensors has led to the creation of large RGB-D datasets, and papers on estimating depth, surface normals, and other intrinsic images using deep networks. While this method employs 2.5D estimation, the final output is a full 3D shape of an object.<br />
<br />
== Single Image 3D Reconstruction ==<br />
The development of large-scale shape repositories like ShapeNet has allowed for the development of models encoding shape priors for single image 3D reconstruction. These methods normally regress voxelized 3D shapes, relying on synthetic data or 2D masks for training. The formulation in the paper tackles domain adaptation better, since the network can be fine-tuned on images without any annotations.<br />
<br />
== 2D-3D Consistency ==<br />
Intuitively, the 3D shape can be constrained to be consistent with 2D observations. This idea has been explored for decades, with the use of depth and silhouettes, as well as some papers enforcing differentiable 2D-3D constraints for joint training of deep networks. In this work, this idea is exploited to develop differentiable constraints for consistency between the 2.5D sketches and 3D shape.<br />
<br />
= Approach =<br />
The 3D structure is recovered from a single RGB view using three steps, shown in Figure 1. The first step estimates 2.5D sketches, including depth, surface normal, and silhouette of the object. The second step, shown in Figure 2, estimates a 3D voxel representation of the object. The third step uses a reprojection consistency function to enforce the 2.5D sketch and 3D structure alignment.<br />
<br />
[[File:marrnet_model_components.png|700px|thumb|center|MarrNet architecture. 2.5D sketches of normals, depths, and silhouette are first estimated. The sketches are then used to estimate the 3D shape. Finally, re-projection consistency is used to ensure consistency between the sketch and 3D output.]]<br />
<br />
== 2.5D Sketch Estimation ==<br />
The first step takes a 2D RGB image and predicts the surface normal, depth, and silhouette of the object. The goal is to estimate intrinsic object properties from the image, while discarding non-essential information. A ResNet-18 encoder-decoder network is used, with the encoder taking a 256 x 256 RGB image, producing 8 x 8 x 512 feature maps. The decoder is four sets of 5 x 5 convolutional and ReLU layers, followed by four sets of 1 x 1 convolutional and ReLU layers. The output is 256 x 256 resolution depth, surface normal, and silhouette images.<br />
<br />
== 3D Shape Estimation ==<br />
The second step estimates a voxelized 3D shape using the 2.5D sketches from the first step. The focus here is for the network to learn the shape prior that can explain the input well, and can be trained on synthetic data without suffering from the domain adaptation problem. The network architecture is inspired by the TL network, and 3D-VAE-GAN, with an encoder-decoder structure. The normal and depth image, masked by the estimated silhouette, are passed into 5 sets of convolutional, ReLU, and pooling layers, followed by two fully connected layers, with a final output width of 200. The 200-dimensional vector is passed into a decoder of 5 convolutional and ReLU layers, outputting a 128 x 128 x 128 voxelized estimate of the input.<br />
<br />
== Re-projection Consistency ==<br />
The third step consists of a depth re-projection loss and surface normal re-projection loss. Here, <math>v_{x, y, z}</math> represents the value at position <math>(x, y, z)</math> in a 3D voxel grid, with <math>v_{x, y, z} \in [0, 1] \, \forall x, y, z</math>. <math>d_{x, y}</math> denotes the estimated depth at position <math>(x, y)</math>, and <math>n_{x, y} = (n_a, n_b, n_c)</math> denotes the estimated surface normal. Orthographic projection is used.<br />
<br />
[[File:marrnet_reprojection_consistency.png|700px|thumb|center|Reprojection consistency for voxels. Left and middle: criteria for depth and silhouettes. Right: criterion for surface normals]]<br />
<br />
=== Depths ===<br />
The voxel <math>v_{x, y, d_{x, y}}</math> at the estimated depth should be 1, while all voxels in front of it should be 0. The projected depth loss is defined as follows:<br />
<br />
<math><br />
L_{depth}(x, y, z)=<br />
\left\{<br />
\begin{array}{ll}<br />
v^2_{x, y, z}, & z < d_{x, y} \\<br />
(1 - v_{x, y, z})^2, & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
<math><br />
\frac{\partial L_{depth}(x, y, z)}{\partial v_{x, y, z}} =<br />
\left\{<br />
\begin{array}{ll}<br />
2v_{x, y, z}, & z < d_{x, y} \\<br />
2(v_{x, y, z} - 1), & z = d_{x, y} \\<br />
0, & z > d_{x, y} \\<br />
\end{array}<br />
\right.<br />
</math><br />
<br />
When <math>d_{x, y} = \infty</math> (the ray does not hit the object), every voxel along the ray counts as being in front of the surface, so all of them should be 0.<br />
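<br />
The piecewise depth loss and its gradient can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes the depth map has already been converted to integer voxel indices along <math>z</math>, and it uses the hypothetical convention that a depth of -1 marks rays with infinite depth.<br />

```python
import numpy as np

def depth_reprojection_loss(voxels, depth):
    """Sum of L_depth over the grid, plus its gradient w.r.t. the voxels.

    voxels: float array (X, Y, Z) with values in [0, 1]; z indexes distance
            from the camera along each orthographic ray.
    depth:  int array (X, Y) holding the voxel index of the visible surface,
            or -1 where the ray misses the object (infinite depth). Integer
            indices and the -1 marker are assumptions of this sketch.
    """
    num_z = voxels.shape[2]
    z = np.arange(num_z)[None, None, :]
    # Map "infinite" depth past the end of the grid: every voxel is in front.
    d = np.where(depth < 0, num_z, depth)[:, :, None]
    in_front = z < d
    on_surface = z == d
    # v^2 in front of the surface, (1 - v)^2 on it, 0 behind it.
    loss = np.where(in_front, voxels ** 2, 0.0) + np.where(on_surface, (1.0 - voxels) ** 2, 0.0)
    grad = np.where(in_front, 2.0 * voxels, 0.0) + np.where(on_surface, 2.0 * (voxels - 1.0), 0.0)
    return loss.sum(), grad
```

Voxels behind the surface receive zero loss and zero gradient, so the sketch leaves the occluded part of the shape unconstrained, as in the formula above.<br />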
<br />
=== Surface Normals ===<br />
Since the vectors <math>n_{x} = (0, -n_{c}, n_{b})</math> and <math>n_{y} = (-n_{c}, 0, n_{a})</math> are orthogonal to the normal vector <math>n_{x, y} = (n_{a}, n_{b}, n_{c})</math>, they can be rescaled by <math>1/n_{c}</math> to obtain <math>n'_{x} = (0, -1, n_{b}/n_{c})</math> and <math>n'_{y} = (-1, 0, n_{a}/n_{c})</math>, which lie on the estimated surface plane at <math>(x, y, z)</math>. The projected surface normal loss tries to guarantee that the voxels at <math>(x, y, z) \pm n'_{x}</math> and <math>(x, y, z) \pm n'_{y}</math> are 1, to match the estimated normal. The constraints are only applied when the target voxels are inside the estimated silhouette.<br />
<br />
The projected surface normal loss is defined as follows, with <math>z = d_{x, y}</math>:<br />
<br />
<math><br />
L_{normal}(x, y, z) =<br />
(1 - v_{x, y-1, z+\frac{n_b}{n_c}})^2 + (1 - v_{x, y+1, z-\frac{n_b}{n_c}})^2 + <br />
(1 - v_{x-1, y, z+\frac{n_a}{n_c}})^2 + (1 - v_{x+1, y, z-\frac{n_a}{n_c}})^2<br />
</math><br />
<br />
Gradients along x are:<br />
<br />
<math><br />
\frac{\partial L_{normal}(x, y, z)}{\partial v_{x-1, y, z+\frac{n_a}{n_c}}} = 2(v_{x-1, y, z+\frac{n_a}{n_c}}-1)<br />
</math><br />
and<br />
<math><br />
\frac{\partial L_{normal}(x, y, z)}{\partial v_{x+1, y, z-\frac{n_a}{n_c}}} = 2(v_{x+1, y, z-\frac{n_a}{n_c}}-1)<br />
</math><br />
<br />
Gradients along y are similar to x.<br />
<br />
= Training =<br />
The 2.5D and 3D estimation components are first pre-trained separately on synthetic data from ShapeNet, and then fine-tuned on real images.<br />
<br />
For pre-training, the 2.5D sketch estimator is trained on synthetic ShapeNet depth, surface normal, and silhouette ground truth, using an L2 loss. The 3D estimator is trained with ground truth voxels using a cross-entropy loss.<br />
<br />
The reprojection consistency loss is used to fine-tune the 3D estimation on real images, using the predicted depth, normals, and silhouette. A straightforward implementation leads to shapes that explain the 2.5D sketches well, but that have an unrealistic 3D appearance due to overfitting.<br />
<br />
Instead, the decoder of the 3D estimator is fixed, and only the encoder is fine-tuned. The model is fine-tuned separately on each image for 40 iterations, which takes up to 10 seconds on the GPU. Without fine-tuning, testing time takes around 100 milliseconds. SGD is used for optimization with batch size of 4, learning rate of 0.001, and momentum of 0.9.<br />
<br />
= Evaluation =<br />
Qualitative and quantitative results are provided using different variants of the framework. The framework is evaluated on both synthetic and real images on three datasets.<br />
<br />
== ShapeNet ==<br />
Synthesized images of 6,778 chairs from ShapeNet are rendered from 20 random viewpoints. The chairs are placed in front of random backgrounds from the SUN dataset, and the RGB, depth, normal, and silhouette images are rendered using the physics-based renderer Mitsuba for more realistic images.<br />
<br />
=== Method ===<br />
MarrNet is trained without the final fine-tuning stage, since 3D shapes are available. A baseline is created that directly predicts the 3D shape using the same 3D shape estimator architecture with no 2.5D sketch estimation.<br />
<br />
=== Results ===<br />
The baseline output is compared to the full framework, and the figure below shows that MarrNet provides model outputs with more details and smoother surfaces than the baseline. Quantitatively, the full model also achieves 0.57 IoU, higher than the direct prediction baseline.<br />
<br />
[[File:marrnet_shapenet_results.png|700px|thumb|center|ShapeNet results.]]<br />
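<br />
For reference, the IoU between a predicted and a ground truth voxel grid can be computed as in the sketch below. The binarization threshold of 0.5 is an assumption made for illustration; the summary does not state how the authors binarize their predictions.<br />

```python
import numpy as np

def voxel_iou(pred, target, threshold=0.5):
    """Intersection-over-union between two voxel occupancy grids.

    pred, target: arrays of the same shape with occupancy values in [0, 1].
    threshold:    hypothetical cut-off used to binarize both grids.
    """
    p = pred >= threshold
    t = target >= threshold
    union = np.logical_or(p, t).sum()
    if union == 0:
        # Both grids are empty; treat them as identical.
        return 1.0
    return float(np.logical_and(p, t).sum() / union)
```

Because only occupied-voxel counts enter the ratio, a prediction that captures the coarse volume of a chair can score well even if thin parts are missing.<br />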
<br />
== PASCAL 3D+ ==<br />
Rough 3D models are provided from real-life images.<br />
<br />
=== Method ===<br />
Each module is pre-trained on the ShapeNet dataset, and then fine-tuned on the PASCAL 3D+ dataset. Three variants of the model are tested. The first is trained using ShapeNet data only with no fine-tuning. The second is fine-tuned without fixing the decoder. The third is fine-tuned with a fixed decoder.<br />
<br />
=== Results ===<br />
The figure below shows the results of the ablation study. The model trained only on synthetic data provides reasonable estimates. However, fine-tuning without fixing the decoder leads to impossible shapes from certain views. The third model keeps the shape prior, providing more details in the final shape.<br />
<br />
[[File:marrnet_pascal_3d_ablation.png|600px|thumb|center|Ablation studies using the PASCAL 3D+ dataset.]]<br />
<br />
Additional comparisons are made with the state-of-the-art method (DRC) on the provided ground truth shapes. MarrNet achieves 0.39 IoU, while DRC achieves 0.34. However, the authors claim that the IoU metric is sub-optimal for three reasons. First, there is no emphasis on details since the metric prefers models that predict mean shapes consistently. Second, all possible scales are searched during the IoU computation, making it less efficient. Third, PASCAL 3D+ only has rough annotations, with only 10 CAD chair models for all images, and computing IoU with these shapes is not very informative. Instead, human studies are conducted and MarrNet reconstructions are preferred 74% of the time over DRC, and 42% of the time over ground truth. This suggests that MarrNet produces plausible shapes, and also highlights that the ground truth shapes are only rough approximations.<br />
<br />
[[File:human_studies.png|600px|thumb|center|Human preferences on chairs in PASCAL 3D+ (Xiang et al. 2014). The numbers show the percentage of how often humans preferred the 3D shape from DRC (state-of-the-art), MarrNet, or GT.]]<br />
<br />
<br />
[[File:marrnet_pascal_3d_drc_comparison.png|600px|thumb|center|Comparison between DRC and MarrNet results.]]<br />
<br />
Several failure cases are shown in the figure below. Specifically, the framework does not seem to work well on thin structures.<br />
<br />
[[File:marrnet_pascal_3d_failure_cases.png|500px|thumb|center|Failure cases on PASCAL 3D+. The algorithm cannot recover thin structures.]]<br />
<br />
<br />
== IKEA ==<br />
This dataset contains images of IKEA furniture, with accurate 3D shape and pose annotations. Objects are often heavily occluded or truncated.<br />
<br />
=== Results ===<br />
Qualitative results are shown in the figure below. The model is shown to deal with mild occlusions in real life scenarios. Human studies show that MarrNet reconstructions are preferred 61% of the time to 3D-VAE-GAN.<br />
<br />
[[File:marrnet_ikea_results.png|700px|thumb|center|Results on chairs in the IKEA dataset, and comparison with 3D-VAE-GAN.]]<br />
<br />
== Other Data ==<br />
MarrNet is also applied to cars and airplanes. As shown below, smaller details such as the horizontal stabilizer and rear-view mirrors are recovered.<br />
<br />
[[File:marrnet_airplanes_and_cars.png|700px|thumb|center|Results on airplanes and cars from the PASCAL 3D+ dataset, and comparison with DRC.]]<br />
<br />
MarrNet is also jointly trained on three object categories, and successfully recovers the shapes of different categories. Results are shown in the figure below.<br />
<br />
[[File:marrnet_multiple_categories.png|700px|thumb|center|Results when trained jointly on all three object categories (cars, airplanes, and chairs).]]<br />
<br />
= Commentary =<br />
Qualitatively, the results look quite impressive. The 2.5D sketch estimation seems to distill the useful information for more realistic looking 3D shape estimation. The disentanglement of 2.5D and 3D estimation steps also allows for easier training and domain adaptation from synthetic data.<br />
<br />
As the authors mention, the IoU metric is not very descriptive, and most of the comparisons in this paper are only qualitative, mainly being human preference studies. A better quantitative evaluation metric would greatly help in making an unbiased comparison between different results.<br />
<br />
As seen in several of the results, the network does not deal well with objects that have thin structures, which is particularly noticeable with many of the chair arm rests. As well, looking more carefully at some results, it seems that fine-tuning only the 3D encoder does not seem to transfer well to unseen objects, since shape priors have already been learned by the decoder.<br />
<br />
= Conclusion =<br />
The proposed MarrNet employs a novel model to estimate 2.5D sketches for 3D shape reconstruction. The sketches are shown to improve the model’s performance, and make it easy to adapt to images across different domains and categories. Differentiable loss functions are created such that the model can be fine-tuned end-to-end on images without ground truth. The experiments show that the model performs well, and human studies show that the results are preferred over other methods.<br />
<br />
= References =<br />
# David Marr. Vision: A computational investigation into the human representation and processing of visual information. W. H. Freeman and Company, 1982.<br />
# Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.<br />
# Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In NIPS, 2016b.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/IMPROVING_GANS_USING_OPTIMAL_TRANSPORT&diff=35222stat946w18/IMPROVING GANS USING OPTIMAL TRANSPORT2018-03-22T16:56:51Z<p>H5tahir: /* Conclusion */</p>
<hr />
<div>== Introduction ==<br />
Learning models that generate media such as images, video, audio and text, known as generative modeling, has recently become very popular. One of the main benefits of such an approach is that generative models can be trained on unlabeled data, which is readily available. Therefore, generative networks have a huge potential in the field of deep learning.<br />
<br />
Generative Adversarial Networks (GANs) are powerful generative models. A GAN model consists of a generator and a discriminator or critic. The generator is a neural network which is trained to generate data having a distribution matched with the distribution of the real data. The critic is also a neural network, which is trained to separate the generated data from the real data. A loss function that measures the distribution distance between the generated data and the real one is important to train the generator.<br />
<br />
Optimal transport theory offers another way to measure distances between distributions: it evaluates the distance between the generated data and the training data based on a metric, which provides another objective for generator training. Its main advantage over the distance measurement in GANs is a closed-form solution that makes the training process tractable. However, it can also lead to inconsistent statistical estimation, because the gradients are biased when mini-batches are used (Bellemare et al., 2017).<br />
<br />
This paper presents a variant of GANs named OT-GAN, which incorporates a discriminative metric called 'Mini-batch Energy Distance' into its critic in order to overcome the issue of biased gradients.<br />
<br />
== GANs and Optimal Transport ==<br />
<br />
===Generative Adversarial Nets===<br />
The original GAN is reviewed first. Its objective function is: <br />
<br />
[[File:equation1.png|700px]]<br />
<br />
The goal of GANs is to train the generator g and the discriminator d to find a pair (g,d) that achieves a Nash equilibrium (such that neither of them can reduce its cost without changing the other's parameters). However, training may fail to converge, since the generator and the discriminator are trained with gradient descent techniques.<br />
<br />
===Wasserstein Distance (Earth-Mover Distance)===<br />
<br />
In order to solve the problem of convergence failure, Arjovsky et al. (2017) suggested the Wasserstein distance (Earth-Mover distance), based on optimal transport theory.<br />
<br />
[[File:equation2.png|600px]]<br />
<br />
where <math> \Pi (p,g) </math> is the set of all joint distributions <math> \gamma (x,y) </math> with marginals <math> p(x) </math> (real data) and <math> g(y) </math> (generated data). <math> c(x,y) </math> is a cost function; Arjovsky et al. used the Euclidean distance. <br />
<br />
The Wasserstein distance can be interpreted as the minimum cost of moving probability mass so that the generator distribution <math> g(y) </math> matches the real data distribution <math> p(x) </math>.<br />
<br />
Computing the Wasserstein distance exactly is intractable. The Wasserstein GAN (W-GAN) provides an approximate solution by recasting the optimal transport problem in its Kantorovich-Rubinstein dual formulation over the set of 1-Lipschitz functions. A neural network can then be used to obtain an estimate.<br />
<br />
[[File:equation3.png|600px]]<br />
<br />
W-GAN helps to solve the unstable training process of original GAN and it can solve the optimal transport problem approximately, but it is still intractable.<br />
<br />
===Sinkhorn Distance===<br />
Genevay et al. (2017) proposed using the primal formulation of optimal transport, instead of the dual formulation, for generative modeling. They introduced the Sinkhorn distance, which is a smoothed generalization of the Wasserstein distance.<br />
[[File: equation4.png|600px]]<br />
<br />
It introduces an entropy restriction (<math> \beta </math>) on the set of joint distributions <math> \Pi_{\beta} (p,g) </math>. This distance can be evaluated on mini-batches of data <math> X, Y</math>, each consisting of <math> K </math> vectors <math> x, y</math>. The <math> (i, j) </math>-th entry of the cost matrix <math> C </math> can be interpreted as the cost of transporting <math> x_i </math> in mini-batch <math> X </math> to <math> y_j </math> in mini-batch <math>Y </math>. The resulting distance is:<br />
<br />
[[File: equation5.png|550px]]<br />
<br />
where <math> M </math> is a <math> K \times K </math> matrix with positive entries that represents a joint distribution <math> \gamma (x,y) </math> over the two mini-batches. Each row and each column of <math> M </math> sums to 1. <br />
<br />
This mini-batch Sinkhorn distance is not only fully tractable but also capable of solving the instability problem of GANs. However, it is not a valid metric over probability distributions when taking the expectation of <math> \mathcal{W}_{c} </math>, and the gradients are biased when the mini-batch size is fixed.<br />
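<br />
A rough sketch of how such a mini-batch distance can be computed is given below, using the standard Sinkhorn matrix-scaling iterations. Several details are assumptions made for illustration: the entropy restriction is replaced by a regularization weight (the hypothetical parameter <code>reg</code>), the cost is the plain Euclidean distance, and the helper names are not taken from the paper.<br />

```python
import numpy as np

def pairwise_euclidean(x, y):
    """Cost matrix C with C[i, j] = ||x_i - y_j|| for batches of shape (K, d)."""
    diff = x[:, None, :] - y[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def sinkhorn_distance(cost, reg=1.0, num_iters=200):
    """Entropy-smoothed transport cost between two mini-batches of equal size.

    cost:      (K, K) cost matrix.
    reg:       hypothetical regularization weight standing in for the
               entropy restriction; smaller values give sharper matchings
               but need more iterations.
    Returns (distance, M), where the rows and columns of M sum to 1.
    """
    kernel = np.exp(-cost / reg)
    u = np.ones(cost.shape[0])
    v = np.ones(cost.shape[1])
    for _ in range(num_iters):
        # Alternately rescale rows and columns towards unit marginals.
        u = 1.0 / (kernel @ v)
        v = 1.0 / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan
```

For example, <code>sinkhorn_distance(pairwise_euclidean(X, Y))</code> returns both the smoothed transport cost and the soft matching <math> M </math> between the two batches.<br />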
<br />
===Energy Distance (Cramer Distance)===<br />
In order to solve the above problem, Bellemare et al. proposed Energy distance:<br />
<br />
[[File: equation6.png|700px]]<br />
<br />
where <math> x, x' </math> and <math> y, y'</math> are independent samples from data distribution <math> p </math> and generator distribution <math> g </math>, respectively. Based on the energy distance, Cramer GAN minimizes this distance metric when training the generator.<br />
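<br />
As an illustration, the squared energy distance can be estimated from two sets of samples as sketched below. The sketch assumes the standard form <math> 2E\|x-y\| - E\|x-x'\| - E\|y-y'\| </math> and a simple plug-in estimator that averages over all sample pairs; the function name is hypothetical.<br />

```python
import numpy as np

def energy_distance_sq(x, y):
    """Plug-in estimate of 2 E||x - y|| - E||x - x'|| - E||y - y'||.

    x: samples from the data distribution, shape (N, d).
    y: samples from the generator distribution, shape (M, d).
    Averages run over all pairs, including i == j, which makes the
    estimate slightly biased but keeps the sketch short.
    """
    def mean_pairwise(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diff, axis=-1).mean()

    return float(2.0 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y))
```

The estimate is zero when both sample sets coincide and grows as the two sets move apart.<br />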
<br />
==Mini-Batch Energy Distance==<br />
Salimans et al. (2016) observed that GANs become more powerful when they work with distributions over mini-batches <math> g(X), p(X) </math> rather than distributions over individual images. The distance measure is therefore defined over mini-batches.<br />
<br />
===Generalized Energy Distance===<br />
The generalized energy distance allows the use of non-Euclidean distance functions d. It is also valid for mini-batches, which is considered better than working with individual data points.<br />
<br />
[[File: equation7.png|670px]]<br />
<br />
As in the energy distance, <math> X, X' </math> and <math> Y, Y'</math> are independent samples from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively; in the generalized energy distance they may also be mini-batches. <math> D_{GED}(p,g) </math> is a metric whenever <math> d </math> is a metric. Thus, by the triangle inequality of <math> d </math>, <math> D(p,g) \geq 0,</math> and <math> D(p,g)=0 </math> when <math> p=g </math>.<br />
<br />
===Mini-Batch Energy Distance===<br />
As <math> d </math> is free to choose, the authors proposed the Mini-batch Energy Distance, which uses the entropy-regularized Wasserstein distance as <math> d </math>. <br />
<br />
[[File: equation8.png|650px]]<br />
<br />
where <math> X, X' </math> and <math> Y, Y'</math> are independently sampled mini-batches from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. This distance metric combines the energy distance with the primal form of optimal transport over mini-batch distributions <math> g(Y) </math> and <math> p(X) </math>. Inside the generalized energy distance, the Sinkhorn distance is a valid metric between mini-batches. By adding the <math> - \mathcal{W}_c (Y,Y')</math> and <math> \mathcal{W}_c (X,Y)</math> terms to equation (5) and using the energy distance, the objective becomes statistically consistent (meaning the objective converges to the true parameter value for large sample sizes) and mini-batch gradients are unbiased.<br />
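<br />
Under the assumption that the squared mini-batch energy distance takes the form <math> 2E[\mathcal{W}_c(X,Y)] - E[\mathcal{W}_c(X,X')] - E[\mathcal{W}_c(Y,Y')] </math>, a single-sample estimate from four mini-batches could look like the sketch below. The Euclidean cost and the <code>reg</code> parameter are placeholders chosen to keep the example self-contained, and all batches are assumed to have the same size.<br />

```python
import numpy as np

def sinkhorn(x, y, reg=1.0, num_iters=200):
    """Entropy-smoothed transport cost between equal-sized batches (K, d)."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    kernel = np.exp(-cost / reg)
    u = np.ones(len(x))
    v = np.ones(len(y))
    for _ in range(num_iters):
        u = 1.0 / (kernel @ v)
        v = 1.0 / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())

def minibatch_energy_distance_sq(x, x2, y, y2, reg=1.0):
    """Single-sample estimate of 2 W(X, Y) - W(X, X') - W(Y, Y').

    x, x2: two independent mini-batches of real data.
    y, y2: two independent mini-batches of generated data.
    """
    return 2.0 * sinkhorn(x, y, reg) - sinkhorn(x, x2, reg) - sinkhorn(y, y2, reg)
```

The two subtracted terms play the same role as the <math> \|x-x'\| </math> and <math> \|y-y'\| </math> terms in the ordinary energy distance: they remove the part of the transport cost that comes purely from sampling noise within each distribution.<br />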
<br />
==Optimal Transport GAN (OT-GAN)==<br />
<br />
The proposed mini-batch energy distance depends on the transport cost function <math>c(x,y)</math>. One possibility would be to choose <math>c</math> to be some fixed function over vectors, like Euclidean distance, but the authors found this to perform poorly in preliminary experiments. For simple fixed cost functions like Euclidean distance, there exist many bad distributions <math>g</math> in higher dimensions for which the mini-batch energy distance is close to zero, so that it is difficult to tell <math>p</math> and <math>g</math> apart if the sample size is not big enough. To solve this the authors propose learning the cost function adversarially, so that it can adapt to the generator distribution <math>g</math> and thereby become more discriminative. <br />
<br />
In practice, in order to secure statistical efficiency (i.e. being able to tell <math>p</math> and <math>g</math> apart without requiring an enormous sample size when their distance is close to zero), the authors suggested using the cosine distance between vectors <math> v_\eta (x) </math> and <math> v_\eta (y) </math>, where <math> v_\eta </math> is a deep neural network that maps the mini-batch data to a learned latent space. The transportation cost is:<br />
<br />
[[File: euqation9.png|370px]]<br />
<br />
where the <math> v_\eta </math> is chosen to maximize the resulting minibatch energy distance.<br />
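<br />
Assuming the cosine distance has the usual form <math> 1 - \frac{v_\eta(x) \cdot v_\eta(y)}{\|v_\eta(x)\| \|v_\eta(y)\|} </math>, the cost matrix between two mini-batches can be sketched as follows. The inputs are taken to be the critic outputs <math> v_\eta(x) </math> and <math> v_\eta(y) </math>, already computed by some network; the function name and the small <code>eps</code> guard are illustrative choices.<br />

```python
import numpy as np

def cosine_cost_matrix(vx, vy, eps=1e-12):
    """Cost matrix C with C[i, j] = 1 - cos(vx_i, vy_j).

    vx: critic features of the first mini-batch, shape (K, d).
    vy: critic features of the second mini-batch, shape (K, d).
    eps: guard against division by zero for all-zero feature vectors.
    """
    vx_unit = vx / np.maximum(np.linalg.norm(vx, axis=1, keepdims=True), eps)
    vy_unit = vy / np.maximum(np.linalg.norm(vy, axis=1, keepdims=True), eps)
    return 1.0 - vx_unit @ vy_unit.T
```

The cost only depends on the direction of the feature vectors, so it ranges from 0 (same direction) to 2 (opposite directions) regardless of their scale.<br />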
<br />
Unlike the practice when using the original GANs, the generator was trained more often than the critic, which keeps the cost function from degenerating. The resulting generator in OT-GAN has a well-defined and statistically consistent objective throughout the training process.<br />
<br />
The algorithm is defined below. Gradients are not backpropagated through the minimization over the matching <math> M </math>, which is justified by the envelope theorem. Stochastic gradient descent is used as the optimization method in Algorithm 1 below, although other optimizers are also possible. In fact, Adam was used in the experiments. <br />
<br />
[[File: al.png|600px]]<br />
<br />
<br />
[[File: al_figure.png|600px]]<br />
<br />
==Experiments==<br />
<br />
In order to demonstrate the superior performance of OT-GAN, the authors compared it with the original GAN and other popular models in four experiments: dataset recovery, a CIFAR-10 test, an ImageNet test, and a conditional image synthesis test.<br />
<br />
===Mixture of Gaussian Dataset===<br />
OT-GAN has a statistically consistent objective, unlike the original GAN (DC-GAN), so the generator does not update in a wrong direction even if the signal provided by the cost function is poor. To demonstrate this advantage, the authors compared OT-GAN with the original GAN loss (DAN-S) on a simple task: recovering all 8 modes of a mixture of 8 Gaussians whose means are arranged in a circle. MLPs with ReLU activation functions were used. The critic was only updated for 15K iterations, and the generator distribution was then tracked for another 25K iterations. The results showed that the original GAN suffers mode collapse after the discriminator is fixed, while OT-GAN recovers all 8 modes of the Gaussian mixture.<br />
<br />
[[File: 5_1.png|600px]]<br />
<br />
===CIFAR-10===<br />
<br />
The CIFAR-10 dataset was then used to inspect the effect of batch size on the training process and the image quality. OT-GAN and four other methods were compared using the inception score as the criterion. Figure 3 shows how the inception score (y-axis) changes as the number of iterations increases, for four batch sizes (200, 800, 3200 and 8000). The results show that a larger batch size, which is more likely to cover more modes of the data distribution, leads to a more stable model with a higher inception score. However, a large batch size also requires a high-performance computational environment. The sample quality of all 5 methods, run with a batch size of 8000, is compared in Table 1, where OT-GAN has the best score.<br />
<br />
[[File: 5_2.png|600px]]<br />
<br />
===ImageNet Dogs===<br />
<br />
In order to investigate the performance of OT-GAN when generating high-quality images, the dog subset of ImageNet (128 x 128) was used to train the model. Figure 6 shows that OT-GAN produces fewer nonsensical images and has a higher inception score compared to DC-GAN. <br />
<br />
[[File: 5_3.png|600px]]<br />
<br />
<br />
To analyze mode collapse in GANs, the authors trained both types of GANs for a large number of epochs. They found that DCGAN shows mode collapse after as few as 900 epochs. They trained OT-GAN for 13000 epochs and saw no evidence of mode collapse or reduced diversity in the samples. Samples can be viewed in Figure 9.<br />
<br />
[[File: ModelCollapseImageNetDogs.png|600px]]<br />
<br />
===Conditional Generation of Birds===<br />
<br />
The last experiment compared OT-GAN with three popular GAN models on text-to-image generation, demonstrating its performance on conditional image synthesis. As shown in Table 2, OT-GAN achieved a higher inception score than the other three models. <br />
<br />
[[File: 5_4.png|600px]]<br />
<br />
The results above were obtained with a conditional version of '''Algorithm 1''', generalized to include conditional information <math>s</math> such as a text description of an image. The modified algorithm is outlined in '''Algorithm 2'''.<br />
<br />
[[File: paper23_alg2.png|600px]]<br />
<br />
==Conclusion==<br />
<br />
In this paper, the OT-GAN method was proposed based on optimal transport theory. A distance metric that combines the primal form of optimal transport with the energy distance was presented for realizing OT-GAN. The results showed OT-GAN to be uniquely stable when trained with large mini-batches, and state-of-the-art results were achieved on some datasets. One of the advantages of OT-GAN over other GAN models is that it stays on the correct track with an unbiased gradient even if training of the critic is stopped or the critic presents a weak cost signal. The performance of OT-GAN is maintained as the batch size increases, though the computational cost has to be taken into consideration.<br />
<br />
==Critique==<br />
<br />
The paper presents a variant of GANs by defining a new distance metric based on the primal form of optimal transport and the mini-batch energy distance. Its stability was demonstrated in four experiments that compare OT-GAN with other popular methods. However, limitations in computational efficiency were not discussed much. Furthermore, section 2 of the paper lacks an explanation of why mini-batches rather than single vectors are used as input when applying the Sinkhorn distance. The explanation in section 4 of how M is chosen to minimize <math> \mathcal{W}_c </math> is also confusing. Lastly, the paper lacks a side-by-side comparison with existing GAN variants, so readers may feel that it jumps from one algorithm to another without the necessary explanations.<br />
<br />
==Reference==<br />
Salimans, Tim, Han Zhang, Alec Radford, and Dimitris Metaxas. "Improving GANs using optimal transport." (2018).</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/IMPROVING_GANS_USING_OPTIMAL_TRANSPORT&diff=35220stat946w18/IMPROVING GANS USING OPTIMAL TRANSPORT2018-03-22T16:54:31Z<p>H5tahir: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
Recently, the problem of learning models that generate media such as images, video, audio and text has become very popular; it is called Generative Modeling. One of the main benefits of such an approach is that generative models can be trained on unlabeled data, which is readily available. Therefore, generative networks have huge potential in the field of deep learning.<br />
<br />
Generative Adversarial Networks (GANs) are powerful generative models. A GAN model consists of a generator and a discriminator or critic. The generator is a neural network which is trained to generate data having a distribution matched with the distribution of the real data. The critic is also a neural network, which is trained to separate the generated data from the real data. A loss function that measures the distribution distance between the generated data and the real one is important to train the generator.<br />
<br />
Optimal transport theory offers another approach to measuring distances between distributions: it evaluates the distance between the generated data and the training data based on a metric, which provides another method for training the generator. Its main advantage over the distance measurement in GANs is its closed-form solution, which makes the training process tractable. However, it can also lead to inconsistent statistical estimation, because the gradients are biased when mini-batches are used (Bellemare et al., 2017).<br />
<br />
This paper presents a GAN variant named OT-GAN, which incorporates a discriminative metric called 'Mini-batch Energy Distance' into its critic in order to overcome the issue of biased gradients.<br />
<br />
== GANs and Optimal Transport ==<br />
<br />
===Generative Adversarial Nets===<br />
The original GAN is reviewed first. Its objective function is: <br />
<br />
[[File:equation1.png|700px]]<br />
<br />
The goal of GANs is to train the generator g and the discriminator d to find a pair (g,d) that achieves a Nash equilibrium (such that neither of them can reduce its cost without changing the other's parameters). However, training may fail to converge, since the generator and the discriminator are trained with gradient descent techniques.<br />
<br />
===Wasserstein Distance (Earth-Mover Distance)===<br />
<br />
To address the problem of convergence failure, Arjovsky et al. (2017) suggested using the Wasserstein distance (Earth-Mover distance), which is based on optimal transport theory.<br />
<br />
[[File:equation2.png|600px]]<br />
<br />
where <math> \Pi (p,g) </math> is the set of all joint distributions <math> \gamma (x,y) </math> with marginals <math> p(x) </math> (real data) and <math> g(y) </math> (generated data). <math> c(x,y) </math> is a cost function; Arjovsky et al. used the Euclidean distance in their paper. <br />
<br />
The Wasserstein distance can be interpreted as the minimum cost of moving probability mass from the generator distribution <math> g(y) </math> so that it matches the real data distribution <math> p(x) </math>.<br />
<br />
Computing the Wasserstein distance exactly is intractable. The proposed Wasserstein GAN (W-GAN) provides an approximate solution by converting the optimal transport problem into its Kantorovich-Rubinstein dual formulation over the set of 1-Lipschitz functions. A neural network can then be used to obtain an estimate.<br />
<br />
[[File:equation3.png|600px]]<br />
<br />
W-GAN helps stabilize the training process of the original GAN and solves the optimal transport problem approximately, but computing the exact distance remains intractable.<br />
<br />
===Sinkhorn Distance===<br />
Genevay et al. (2017) proposed using the primal formulation of optimal transport instead of the dual formulation for generative modeling. They introduced the Sinkhorn distance, which is a smoothed generalization of the Wasserstein distance.<br />
[[File: equation4.png|600px]]<br />
<br />
It introduces an entropy restriction (<math> \beta </math>) on the joint distribution <math> \Pi_{\beta} (p,g) </math>. This distance can be generalized to mini-batches of data <math> X ,Y</math>, each consisting of <math> K </math> vectors <math> x, y</math>. The <math> (i, j) </math>-th entry of the cost matrix <math> C </math> can be interpreted as the cost of transporting <math> x_i </math> in mini-batch <math> X </math> to <math> y_j </math> in mini-batch <math>Y </math>. The resulting distance is:<br />
<br />
[[File: equation5.png|550px]]<br />
<br />
where <math> M </math> is a <math> K \times K </math> matrix with positive entries that plays the role of the joint distribution <math> \gamma (x,y) </math>. Each row and each column of <math> M </math> sums to 1. <br />
<br />
This mini-batch Sinkhorn distance is not only fully tractable but also capable of solving the instability problem of GANs. However, it is not a valid metric over probability distributions when taking the expectation of <math> \mathcal{W}_{c} </math>, and the gradients are biased when the mini-batch size is fixed.<br />
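<br />
The mini-batch computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: it uses the common regularization-weight form of the Sinkhorn iterations (rather than an explicit entropy constraint <math> \beta </math>), a Euclidean cost, and the hypothetical names <code>sinkhorn_matching</code>, <code>sinkhorn_distance</code>, <code>reg</code> and <code>n_iters</code>.<br />

```python
import numpy as np

def sinkhorn_matching(C, reg=0.5, n_iters=500):
    """Approximate a soft matching M for a K x K cost matrix C.

    reg (entropy regularization weight) and n_iters are illustrative
    choices, not values taken from the paper.
    """
    kernel = np.exp(-C / reg)            # Gibbs kernel of the cost matrix
    u = np.ones(C.shape[0])
    v = np.ones(C.shape[1])
    for _ in range(n_iters):
        u = 1.0 / (kernel @ v)           # rescale so that rows sum to 1
        v = 1.0 / (kernel.T @ u)         # rescale so that columns sum to 1
    return u[:, None] * kernel * v[None, :]

def sinkhorn_distance(X, Y, reg=0.5, n_iters=500):
    """Return (sum_ij M_ij * C_ij, M) for mini-batches X and Y of K vectors each."""
    # Pairwise Euclidean cost; entry (i, j) is the cost of moving x_i to y_j.
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    M = sinkhorn_matching(C, reg, n_iters)
    return float(np.sum(M * C)), M

# Example: two small mini-batches of 2-D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
Y = rng.normal(size=(4, 2)) + 1.0
dist, M = sinkhorn_distance(X, Y)
print("distance:", dist)
print("column sums:", M.sum(axis=0))
```

Smaller values of <code>reg</code> give a matching closer to the unregularized optimal transport plan but need more iterations to converge.<br />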
<br />
===Energy Distance (Cramer Distance)===<br />
To solve the above problem, Bellemare et al. (2017) proposed the Energy distance:<br />
<br />
[[File: equation6.png|700px]]<br />
<br />
where <math> x, x' </math> and <math> y, y'</math> are independent samples from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. Based on the Energy distance, the Cramer GAN minimizes this distance metric when training the generator.<br />
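<br />
As a rough illustration, the energy distance can be estimated from two sets of samples. The sketch below assumes the usual squared form <math> 2\,\mathbb{E}\|x-y\| - \mathbb{E}\|x-x'\| - \mathbb{E}\|y-y'\| </math> with the Euclidean norm, and uses a simple plug-in estimate over all pairs (which is biased for finite samples); the function names are hypothetical.<br />

```python
import numpy as np

def mean_pairwise_distance(A, B):
    """Average Euclidean distance over all pairs (a, b) with a in A and b in B."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return D.mean()

def squared_energy_distance(X, Y):
    """Plug-in estimate of 2 E||x - y|| - E||x - x'|| - E||y - y'||.

    X holds samples from the data distribution p, Y samples from the
    generator distribution g. Pairs (x, x) are included, so this is a
    simplified, biased estimate.
    """
    return (2.0 * mean_pairwise_distance(X, Y)
            - mean_pairwise_distance(X, X)
            - mean_pairwise_distance(Y, Y))

# Example: samples from the same distribution versus a shifted one.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
Y_same = rng.normal(size=(300, 2))
Y_shifted = rng.normal(size=(300, 2)) + 3.0
print("same distribution:   ", squared_energy_distance(X, Y_same))
print("shifted distribution:", squared_energy_distance(X, Y_shifted))
```

The estimate should be close to zero when both sample sets come from the same distribution and clearly larger when they do not.<br />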
<br />
==Mini-Batch Energy Distance==<br />
Salimans et al. (2016) observed that a GAN working with distributions over mini-batches <math> g(X), p(X) </math> is more powerful than one working with distributions over individual images. The distance measure is therefore defined over mini-batches.<br />
<br />
===Generalized Energy Distance===<br />
The generalized energy distance allows non-Euclidean distance functions d to be used. It is also valid for mini-batches, which is considered better than working with individual data points.<br />
<br />
[[File: equation7.png|670px]]<br />
<br />
As in the definition of the Energy distance, <math> X, X' </math> and <math> Y, Y'</math> can be independent samples from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. In the Generalized energy distance, <math> X, X' </math> and <math> Y, Y'</math> can also be mini-batches. <math> D_{GED}(p,g) </math> is a metric whenever <math> d </math> is a metric. Thus, by the triangle inequality of <math> d </math>, <math> D(p,g) \geq 0,</math> and <math> D(p,g)=0 </math> when <math> p=g </math>.<br />
<br />
===Mini-Batch Energy Distance===<br />
Since <math> d </math> can be chosen freely, the authors proposed the Mini-batch Energy Distance, which uses the entropy-regularized Wasserstein distance as <math> d </math>. <br />
<br />
[[File: equation8.png|650px]]<br />
<br />
where <math> X, X' </math> and <math> Y, Y'</math> are independently sampled mini-batches from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. This distance metric combines the energy distance with the primal form of optimal transport over mini-batch distributions <math> g(Y) </math> and <math> p(X) </math>. Inside the generalized energy distance, the Sinkhorn distance is a valid metric between mini-batches. By adding the <math> - \mathcal{W}_c (Y,Y')</math> and <math> \mathcal{W}_c (X,Y)</math> terms to equation (5) and using the energy distance, the objective becomes statistically consistent (meaning that the objective converges to the true parameter value for large sample sizes) and the mini-batch gradients are unbiased.<br />
<br />
==Optimal Transport GAN (OT-GAN)==<br />
<br />
The proposed mini-batch energy distance depends on the transport cost function <math>c(x,y)</math>. One possibility would be to choose c to be some fixed function over vectors, like Euclidean distance, but the authors found this to perform poorly in preliminary experiments. For simple fixed cost functions like Euclidean distance, there exist many bad distributions <math>g</math> in higher dimensions for which the mini-batch energy distance is zero, so that it is difficult to tell <math>p</math> and <math>g</math> apart if the sample size is not big enough. To solve this, the authors propose learning the cost function adversarially, so that it can adapt to the generator distribution <math>g</math> and thereby become more discriminative. <br />
<br />
In practice, to secure statistical efficiency (i.e. being able to tell <math>p</math> and <math>g</math> apart without requiring an enormous sample size when their distance is close to zero), the authors suggested using the cosine distance between vectors <math> v_\eta (x) </math> and <math> v_\eta (y) </math>, where <math> v_\eta </math> is a deep neural network that maps the mini-batch data to a learned latent space. The transportation cost is:<br />
<br />
[[File: euqation9.png|370px]]<br />
<br />
where <math> v_\eta </math> is chosen to maximize the resulting mini-batch energy distance.<br />
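<br />
A minimal sketch of such a cost matrix is given below. It assumes the usual definition of cosine distance, <math> 1 - \frac{v_\eta(x) \cdot v_\eta(y)}{\|v_\eta(x)\| \|v_\eta(y)\|} </math>, and stands in for the learned critic <math> v_\eta </math> with a hypothetical one-layer map <code>critic_features</code>; in OT-GAN the critic is a deep network trained adversarially.<br />

```python
import numpy as np

def critic_features(X, W):
    """Hypothetical stand-in for v_eta: one linear layer followed by tanh."""
    return np.tanh(X @ W)

def cosine_cost_matrix(VX, VY, eps=1e-12):
    """Entry (i, j) is 1 - cos(angle between VX[i] and VY[j])."""
    VX = VX / (np.linalg.norm(VX, axis=1, keepdims=True) + eps)
    VY = VY / (np.linalg.norm(VY, axis=1, keepdims=True) + eps)
    return 1.0 - VX @ VY.T

# Example: cost matrix between two mini-batches in the latent space.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))          # illustrative critic weights
X = rng.normal(size=(4, 5))          # mini-batch of real data
Y = rng.normal(size=(4, 5))          # mini-batch of generated data
C = cosine_cost_matrix(critic_features(X, W), critic_features(Y, W))
print(C.shape)
```

Each entry of the resulting matrix lies between 0 (same direction in latent space) and 2 (opposite directions), and the matrix can be used as the cost <math> C </math> in the Sinkhorn distance.<br />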
<br />
Unlike in the original GANs, the generator is trained more often than the critic, which keeps the cost function from degenerating. The resulting generator in OT-GAN has a well-defined and statistically consistent objective throughout the training process.<br />
<br />
The algorithm is defined below. Gradients are not backpropagated through the computation of the matching <math> M </math>, which is justified by the envelope theorem. Stochastic gradient descent is used as the optimization method in Algorithm 1 below, although other optimizers are also possible. In fact, Adam was used in the experiments. <br />
<br />
[[File: al.png|600px]]<br />
<br />
<br />
[[File: al_figure.png|600px]]<br />
<br />
==Experiments==<br />
<br />
To demonstrate the superior performance of OT-GAN, the authors compared it with the original GAN and other popular models in four experiments: dataset recovery, the CIFAR-10 test, the ImageNet test, and the conditional image synthesis test.<br />
<br />
===Mixture of Gaussian Dataset===<br />
OT-GAN has a statistically consistent objective, unlike the original GAN (DC-GAN), so the generator does not update in a wrong direction even when the cost function provides it with a poor signal. To demonstrate this advantage, the authors compared OT-GAN with the original GAN loss (DAN-S) on a simple task: recovering all 8 modes of a mixture of 8 Gaussians whose means are arranged in a circle. MLPs with ReLU activation functions were used in this task. The critic was updated for only 15K iterations, after which it was fixed and the generator distribution was tracked for another 25K iterations. The results showed that the original GAN experienced mode collapse after the discriminator was fixed, while OT-GAN recovered all 8 modes of the Gaussian mixture.<br />
<br />
[[File: 5_1.png|600px]]<br />
<br />
===CIFAR-10===<br />
<br />
The CIFAR-10 dataset was then used to inspect the effect of batch size on the training process and on image quality. OT-GAN and four other methods were compared using the "inception score" as the criterion. Figure 3 shows how the inception score (y-axis) changes as the number of iterations increases. Scores for four different batch sizes (200, 800, 3200 and 8000) were compared. The results show that a larger batch size, which is more likely to cover more modes of the data distribution, leads to a more stable model with a higher inception score. However, a large batch size also requires a high-performance computational environment. The sample quality of all 5 methods, run with a batch size of 8000, is compared in Table 1, where OT-GAN has the best score.<br />
<br />
[[File: 5_2.png|600px]]<br />
<br />
===ImageNet Dogs===<br />
<br />
To investigate the performance of OT-GAN on high-quality images, the dog subset of ImageNet (128*128) was used to train the model. Figure 6 shows that OT-GAN produces fewer nonsensical images and achieves a higher inception score than DC-GAN. <br />
<br />
[[File: 5_3.png|600px]]<br />
<br />
<br />
To analyze mode collapse in GANs, the authors trained both types of GANs for a large number of epochs. They found that DCGAN shows mode collapse after as few as 900 epochs. They trained OT-GAN for 13000 epochs and saw no evidence of mode collapse or reduced diversity in the samples. Samples can be viewed in Figure 9.<br />
<br />
[[File: ModelCollapseImageNetDogs.png|600px]]<br />
<br />
===Conditional Generation of Birds===<br />
<br />
The last experiment compared OT-GAN with three popular GAN models on text-to-image generation, demonstrating its performance on conditional image synthesis. As shown in Table 2, OT-GAN achieved a higher inception score than the other three models. <br />
<br />
[[File: 5_4.png|600px]]<br />
<br />
The results above were obtained with a conditional version of '''Algorithm 1''', generalized to include conditional information <math>s</math> such as a text description of an image. The modified algorithm is outlined in '''Algorithm 2'''.<br />
<br />
[[File: paper23_alg2.png|600px]]<br />
<br />
==Conclusion==<br />
<br />
In this paper, the OT-GAN method was proposed based on optimal transport theory. A distance metric that combines the primal form of optimal transport with the energy distance was presented for realizing OT-GAN. One of the advantages of OT-GAN over other GAN models is that it stays on the correct track with an unbiased gradient even if training of the critic is stopped or the critic presents a weak cost signal. The performance of OT-GAN is maintained as the batch size increases, though the computational cost has to be taken into consideration.<br />
<br />
==Critique==<br />
<br />
The paper presents a variant of GANs by defining a new distance metric based on the primal form of optimal transport and the mini-batch energy distance. Its stability was demonstrated in four experiments that compare OT-GAN with other popular methods. However, limitations in computational efficiency were not discussed much. Furthermore, section 2 of the paper lacks an explanation of why mini-batches rather than single vectors are used as input when applying the Sinkhorn distance. The explanation in section 4 of how M is chosen to minimize <math> \mathcal{W}_c </math> is also confusing. Lastly, the paper lacks a side-by-side comparison with existing GAN variants, so readers may feel that it jumps from one algorithm to another without the necessary explanations.<br />
<br />
==Reference==<br />
Salimans, Tim, Han Zhang, Alec Radford, and Dimitris Metaxas. "Improving GANs using optimal transport." (2018).</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=End-to-End_Differentiable_Adversarial_Imitation_Learning&diff=35219End-to-End Differentiable Adversarial Imitation Learning2018-03-22T16:43:12Z<p>H5tahir: /* Experiments */</p>
<hr />
<div>= Introduction =<br />
The ability to imitate an expert policy is very beneficial in the case of automating human demonstrated tasks. Assuming that a sequence of state action pairs (trajectories) of an expert policy are available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to solve the problem of imitating a policy; they are Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its action can affect future state distribution. The second method using IRL involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.<br />
<br />
To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. The generator is guided as it tries to produce samples on the correct side of the discriminator's decision boundary hyper-plane, as seen in Figure 1. This idea was used by (Ho & Ermon, 2016) in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. In a model-free setup, the agent cannot predict the next state and reward before it takes each action, since the transition function from state A to state B is not learned. <br />
<br />
The disadvantage of the model-free approach comes to light when training stochastic policies. The presence of stochastic elements breaks the flow of information (gradients) from one neural network to the other, thus prohibiting the use of backpropagation. In this situation, a standard solution is to use gradient estimation (Williams, 1992). This tends to suffer from high variance, resulting in a need for larger sample sizes as well as variance reduction methods. This paper proposes a model-based imitation learning algorithm (MGAIL), in which information propagates from the guiding neural network (D) to the generative model (G), which in this case represents the policy <math>\pi</math> that is to be trained. This is achieved by two steps: (1) learning a forward model that approximates the environment’s dynamics (2) building an end-to-end differentiable computation graph that spans over multiple time-steps. The gradient in such a graph carries information from future states to earlier time-steps, helping the policy to account for compounding errors.<br />
<br />
<br />
[[File:GeneratorFollowingDiscriminator.png|center]]<br />
<br />
Figure 1: '''Illustration of GANs.''' The generative model follows the discriminating hyper-plane defined by the discriminator. Eventually, G will produce patterns similar to the expert patterns.<br />
<br />
= Background =<br />
== Markov Decision Process ==<br />
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple <math>(S, A, P, r, \rho_0, \gamma)</math> where <math>S</math> is the set of states, <math>A</math> is a set of actions, <math>P : S \times A \times S \rightarrow [0, 1]</math> is the transition probability distribution, <math>r : (S \times A) \rightarrow \mathbb{R}</math> is the reward function, <math>\rho_0 : S \rightarrow [0, 1]</math> is the distribution over initial states, and <math>\gamma \in (0, 1)</math> is the discount factor. Let <math>\pi</math> denote a stochastic policy <math>\pi : S \times A \rightarrow [0, 1]</math>, <math>R(\pi)</math> denote its expected discounted reward: <math>E_\pi R = E_\pi [\sum_{t=0}^T \gamma^t r_t]</math> and <math>\tau</math> denote a trajectory of states and actions <math>\tau = \{s_0, a_0, s_1, a_1, ...\}</math>.<br />
<br />
== Imitation Learning ==<br />
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be obtained using any supervised learning (SL) algorithm, but the policy's prediction affects future state distributions; this violates the independent and identically distributed (i.i.d) assumption made by most SL algorithms. This process is susceptible to compounding errors since a slight deviation in the learner's behavior can lead to different state distributions not encountered by the expert policy. <br />
<br />
This issue was overcome through the use of the Forward Training (FT) algorithm which trains a non-stationary policy iteratively over time. At each time step a new policy is trained on the state distribution induced by the previously trained policies <math>\pi_0</math>, <math>\pi_1</math>, ...<math>\pi_{t-1}</math>. This is continued till the end of the time horizon to obtain a policy that can mimic the expert policy. This requirement to train a policy at each time step till the end makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This shortcoming is resolved using the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy: <math> \pi_t = \pi_{t-1} + \alpha (1 - \alpha)^{t-1}(\hat{\pi}_t - \pi_0)</math>, with <math>\pi_0</math> following expert's policy at the start of training.<br />
<br />
== Generative Adversarial Networks ==<br />
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:<br />
<br />
\begin{align} <br />
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}}\; \mathbb{E}_{x\sim p_E}[log\; D(x)]\ +\ \mathbb{E}_{z\sim p_z}[log(1 - D(G(z)))]<br />
\end{align}<br />
<br />
In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.<br />
<br />
GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses an objective function similar to that of GANs, but the expert distribution in GAIL represents the joint distribution over state-action tuples:<br />
<br />
\begin{align} <br />
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}}\; \mathbb{E}_{\pi}[log\; D(s,a)]\ +\ \mathbb{E}_{\pi_E}[log(1 - D(s,a))] - \lambda H(\pi)<br />
\end{align}<br />
<br />
where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.<br />
<br />
This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [log\; D(s,a)] </math>.<br />
<br />
The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score function methods such as REINFORCE to obtain an unbiased gradient estimation:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]<br />
\end{align}<br />
<br />
where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:<br />
<br />
\begin{align}<br />
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]<br />
\end{align}<br />
<br />
<br />
REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.<br />
<br />
= Algorithm =<br />
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.<br />
<br />
== The discriminator network ==<br />
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in \{\pi_E, \pi\} </math>.<br />
<br />
The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this and applying Bayes' theorem, we can rearrange and factor <math> D(s,a) </math> to obtain:<br />
<br />
\begin{aligned}<br />
D(s,a) &= p(\pi|s,a) \\<br />
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\<br />
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\<br />
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\<br />
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\<br />
\end{aligned}<br />
<br />
Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:<br />
<br />
\begin{aligned}<br />
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}<br />
\end{aligned}<br />
<br />
to get the final expression for <math> D(s,a) </math>:<br />
\begin{aligned}<br />
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}<br />
\end{aligned}<br />
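<br />
The factorization above can be checked numerically on a small discrete example. The sketch below uses made-up probabilities over two states and two actions (all numbers and function names are illustrative assumptions) and compares the Bayes form of <math> D(s,a) </math> with the factored form <math> 1 / (1 + \varphi(s,a)\psi(s)) </math>.<br />

```python
import numpy as np

def discriminator_bayes(p_s_exp, p_a_exp, p_s_pol, p_a_pol):
    """D(s,a) = p(s,a|pi) / (p(s,a|pi) + p(s,a|pi_E)), assuming equal priors."""
    joint_expert = p_s_exp[:, None] * p_a_exp    # p(s, a | pi_E)
    joint_policy = p_s_pol[:, None] * p_a_pol    # p(s, a | pi)
    return joint_policy / (joint_policy + joint_expert)

def discriminator_factored(p_s_exp, p_a_exp, p_s_pol, p_a_pol):
    """D(s,a) = 1 / (1 + phi(s,a) * psi(s))."""
    phi = p_a_exp / p_a_pol                      # policy likelihood ratio
    psi = (p_s_exp / p_s_pol)[:, None]           # state distribution likelihood ratio
    return 1.0 / (1.0 + phi * psi)

# Illustrative distributions: rows are states, columns are actions.
p_s_exp = np.array([0.6, 0.4])                   # p(s | pi_E)
p_s_pol = np.array([0.3, 0.7])                   # p(s | pi)
p_a_exp = np.array([[0.9, 0.1], [0.2, 0.8]])     # p(a | s, pi_E)
p_a_pol = np.array([[0.5, 0.5], [0.5, 0.5]])     # p(a | s, pi)

D_bayes = discriminator_bayes(p_s_exp, p_a_exp, p_s_pol, p_a_pol)
D_factored = discriminator_factored(p_s_exp, p_a_exp, p_s_pol, p_a_pol)
print(np.allclose(D_bayes, D_factored))
```

Both forms should agree for every state-action pair, since the second is only an algebraic rearrangement of the first.<br />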
<br />
<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induced by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action <math> a </math> under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can be obtained from the partial derivatives of <math> D(s,a) </math>, which is why these derivatives are proposed to be used for training policies (see following sections):<br />
<br />
\begin{aligned}<br />
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\end{aligned}<br />
<br />
== Backpropagating through stochastic units ==<br />
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.<br />
<br />
=== Continuous Action Distributions ===<br />
In the case of continuous action policies, re-parameterization was used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution <math> \mathcal{N}(\mu_{\theta} (s), \sigma_{\theta}^2 (s))</math>, where the mean and variance are given by some deterministic functions <math>\mu_{\theta}</math> and <math>\sigma_{\theta}</math>, then the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim \mathcal{N}(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}<br />
\end{align}<br />
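<br />
The estimator above can be illustrated with a toy example. The sketch below assumes a scalar state and action, a hypothetical logistic discriminator with fixed weights, a policy mean <math> \mu_\theta(s) = \theta s </math> and a fixed standard deviation; none of these choices come from the paper. It computes the re-parameterized Monte-Carlo gradient and compares it with a finite-difference estimate that reuses the same noise samples.<br />

```python
import numpy as np

def D(s, a, w_s=0.5, w_a=-1.0):
    """Hypothetical discriminator: a logistic function of state and action."""
    return 1.0 / (1.0 + np.exp(-(w_s * s + w_a * a)))

def grad_a_D(s, a, w_s=0.5, w_a=-1.0):
    """Analytic derivative of D with respect to the action a."""
    d = D(s, a, w_s, w_a)
    return d * (1.0 - d) * w_a

def sample_action(theta, s, xi, sigma=0.3):
    """Re-parameterized action: a = mu_theta(s) + xi * sigma, with mu_theta(s) = theta * s."""
    return theta * s + sigma * xi

def reparam_gradient(theta, s, xi, sigma=0.3):
    """Monte-Carlo estimate of d/dtheta E[D(s, a)] using the chain rule through a."""
    a = sample_action(theta, s, xi, sigma)
    da_dtheta = s                      # derivative of theta * s + sigma * xi w.r.t. theta
    return float(np.mean(grad_a_D(s, a) * da_dtheta))

# Compare with a central finite difference that reuses the same noise xi.
rng = np.random.default_rng(0)
xi = rng.standard_normal(5000)
theta, s, h = 0.8, 1.5, 1e-4
estimate = reparam_gradient(theta, s, xi)
finite_diff = (np.mean(D(s, sample_action(theta + h, s, xi)))
               - np.mean(D(s, sample_action(theta - h, s, xi)))) / (2 * h)
print(estimate, finite_diff)
```

Because the noise <math> \xi </math> is sampled independently of <math> \theta </math>, the gradient can flow through the sampled action, which is the property the re-parameterization relies on.<br />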
<br />
=== Categorical Action Distributions ===<br />
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumbel-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:<br />
<br />
\begin{align}<br />
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]\textrm{, where } g_i \sim Gumbel(0, 1).<br />
\end{align}<br />
<br />
Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumbel-Max trick:<br />
<br />
\begin{align}<br />
a_{softmax} = \frac{\exp[\frac{1}{\tau}(g_i + \log \pi(a_i|s))]}{\sum_{j=1}^{N}\exp[\frac{1}{\tau}(g_j + \log \pi(a_j|s))]}<br />
\end{align}<br />
<br />
<br />
In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax resulting in a low bias, but high variance; vice versa when the <math> \tau </math> is large.<br />
<br />
To interact with the environment, the authors apply argmax over <math> a_{softmax} </math> to obtain a single “pure” action. In the backward pass, the continuous approximation is used instead, through the estimate <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.<br />
<br />
== Backpropagating through a Forward model ==<br />
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 2. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:<br />
<br />
\begin{align}<br />
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\<br />
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )<br />
\end{align}<br />
<br />
<br />
Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagates back in time and influences the actions of policies in earlier times. This is summarized in Figure 3.<br />
<br />
[[File:modelFree_blockDiagram.PNG|400px|center]]<br />
<br />
Figure 2: Block diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math>, which is fed to a stochastic sampling unit. An action <math> a </math> is sampled and, together with <math> s </math>, is presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.<br />
<br />
[[File:modelBased_blockDiagram.PNG|700px|center]]<br />
<br />
Figure 3: Block diagram of model-based adversarial imitation learning. <br />
<br />
Figure 3 describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action <math> a_t </math> is sampled. For example, in the continuous case, this is done using the re-parametrization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim N(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> is composed of (a) the error message <math> \delta_a </math> (Green), which propagates through the differentiable approximation of the sampling process, and (b) the error message <math> \delta_s </math> (Blue) of future time-steps, which propagates back through the differentiable forward model.<br />
<br />
== MGAIL Algorithm ==<br />
Shalev-Shwartz et al. (2016) and Heess et al. (2015) built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:<br />
<br />
\begin{align}<br />
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]<br />
\end{align}<br />
<br />
<br />
Using the results from Heess et al. (2015), this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s') </math> transitions:<br />
<br />
\begin{align}<br />
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\<br />
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]<br />
\end{align}<br />
<br />
The policy gradient <math> \nabla_\theta J </math> is calculated by applying the two equations above recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.<br />
<br />
[[File:MGAIL_alg.PNG]]<br />
<br />
== Forward Model Structure ==<br />
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled from different distributions, first need to be represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Second, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A gated recurrent unit (GRU, a simpler variant of the LSTM model) layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better and more stable results compared to the standard forward model based on a feed-forward neural network. The comparison is presented in Figure 4.<br />
<br />
[[File:performance_comparison.PNG]]<br />
<br />
Figure 4: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).<br />
<br />
= Experiments =<br />
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid), which are modeled by the MuJoCo physics simulator (Todorov et al., 2012), contain second-order dynamics, and use direct torque control. Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015). A different number of trajectories is used to train the expert for each task, but all trajectories are of length 1000.<br />
The discriminator and generator (policy) networks contain two hidden layers with ReLU non-linearity and are trained using the ADAM optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them. A comparison between the basic forward model and the more advanced forward model is also made and described in the previous section of this summary. The two models compared are shown below.<br />
<br />
[[File:baram17_forward.PNG]]<br />
<br />
[[File:mgail_test_results_1.PNG]]<br />
<br />
[[File:mgail_test_results.PNG]]<br />
<br />
Table 1. Policy performance, boldface indicates better results, <math> \pm </math> represents one standard deviation.<br />
<br />
= Discussion =<br />
This paper presented a model-based algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model, which could be difficult in certain domains. Learning the system dynamics directly from raw images is one line of future work. Another is to address the violation of the i.i.d. data assumption on which supervised learning algorithms rely. This problem arises because the discriminator and forward models are trained in a supervised learning fashion using data sampled from a dynamic distribution. The authors tried a solution proposed by another paper, which is to reset the learning rate several times during the training period, but it did not result in significant improvements.<br />
<br />
= Source =<br />
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.<br />
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.<br />
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).<br />
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.<br />
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:baram17_forward.PNG&diff=35218File:baram17 forward.PNG2018-03-22T16:42:47Z<p>H5tahir: </p>
<hr />
<div></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=End-to-End_Differentiable_Adversarial_Imitation_Learning&diff=35217End-to-End Differentiable Adversarial Imitation Learning2018-03-22T16:40:45Z<p>H5tahir: /* Experiments */</p>
<hr />
<div>= Introduction =<br />
The ability to imitate an expert policy is very beneficial when automating human-demonstrated tasks. Assuming that a sequence of state-action pairs (trajectories) of an expert policy is available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to the problem of imitating a policy: Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its actions can affect the future state distribution. The second method, IRL, involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.<br />
<br />
To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. The generator is guided as it tries to produce samples on the correct side of the discriminator's decision boundary hyper-plane, as seen in Figure 1. This idea was used by (Ho & Ermon, 2016) in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. A model-free setup is one in which the agent cannot predict the next state and reward before it takes each action, since the transition function from one state to the next is not learned. <br />
<br />
The disadvantage of the model-free approach comes to light when training stochastic policies. The presence of stochastic elements breaks the flow of information (gradients) from one neural network to the other, thus prohibiting the use of backpropagation. In this situation, a standard solution is to use gradient estimation (Williams, 1992). This tends to suffer from high variance, resulting in a need for larger sample sizes as well as variance reduction methods. This paper proposes a model-based imitation learning algorithm (MGAIL), in which information propagates from the guiding neural network (D) to the generative model (G), which in this case represents the policy <math>\pi</math> that is to be trained. This is achieved by two steps: (1) learning a forward model that approximates the environment’s dynamics (2) building an end-to-end differentiable computation graph that spans over multiple time-steps. The gradient in such a graph carries information from future states to earlier time-steps, helping the policy to account for compounding errors.<br />
<br />
<br />
[[File:GeneratorFollowingDiscriminator.png|center]]<br />
<br />
Figure 1: '''Illustration of GANs.''' The generative model follows the discriminating hyper-plane defined by the discriminator. Eventually, G will produce patterns similar to the expert patterns.<br />
<br />
= Background =<br />
== Markov Decision Process ==<br />
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple <math>(S, A, P, r, \rho_0, \gamma)</math> where <math>S</math> is the set of states, <math>A</math> is a set of actions, <math>P : S \times A \times S \rightarrow [0, 1]</math> is the transition probability distribution, <math>r : S \times A \rightarrow \mathbb{R}</math> is the reward function, <math>\rho_0 : S \rightarrow [0, 1]</math> is the distribution over initial states, and <math>\gamma \in (0, 1)</math> is the discount factor. Let <math>\pi</math> denote a stochastic policy <math>\pi : S \times A \rightarrow [0, 1]</math>, <math>R(\pi)</math> denote its expected discounted reward: <math>E_\pi R = E_\pi [\sum_{t=0}^T \gamma^t r_t]</math> and <math>\tau</math> denote a trajectory of states and actions <math>\tau = \{s_0, a_0, s_1, a_1, ...\}</math>.<br />
<br />
== Imitation Learning ==<br />
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be obtained using any supervised learning (SL) algorithm, but the policy's prediction affects future state distributions; this violates the independent and identically distributed (i.i.d) assumption made by most SL algorithms. This process is susceptible to compounding errors since a slight deviation in the learner's behavior can lead to different state distributions not encountered by the expert policy. <br />
<br />
This issue was overcome through the use of the Forward Training (FT) algorithm, which trains a non-stationary policy iteratively over time. At each time step a new policy is trained on the state distribution induced by the previously trained policies <math>\pi_0</math>, <math>\pi_1</math>, ..., <math>\pi_{t-1}</math>. This is continued until the end of the time horizon to obtain a policy that can mimic the expert policy. The requirement to train a policy at each time step makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This shortcoming is resolved using the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy: <math> \pi_t = \pi_{t-1} + \alpha (1 - \alpha)^{t-1}(\hat{\pi}_t - \pi_0)</math>, with <math>\pi_0</math> following the expert's policy at the start of training.<br />
<br />
== Generative Adversarial Networks ==<br />
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:<br />
<br />
\begin{align} <br />
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}}\; \mathbb{E}_{x\sim p_E}[\log D(x)]\ +\ \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))]<br />
\end{align}<br />
<br />
In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.<br />
<br />
GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses similar objective functions like GANs, but the expert distribution in GAIL represents the joint distribution over state action tuples:<br />
<br />
\begin{align} <br />
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}}\; \mathbb{E}_{\pi}[\log D(s,a)]\ +\ \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi)<br />
\end{align}<br />
<br />
where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.<br />
<br />
This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [\log D(s,a)] </math>.<br />
<br />
The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score functions such as REINFORCE to obtain an unbiased gradient estimation:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]<br />
\end{align}<br />
<br />
where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:<br />
<br />
\begin{align}<br />
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]<br />
\end{align}<br />
<br />
<br />
REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.<br />
<br />
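The variance issue can be illustrated with a minimal sketch. It assumes a one-step problem with no state, a Gaussian policy <math> a \sim \mathcal{N}(\theta, \sigma^2) </math> and a toy discriminator <math> D(a) = \mathrm{sigmoid}(w a) </math>, so that <math> Q </math> reduces to <math> \log D(a) </math>. The function names and the weight <code>w</code> are hypothetical and are not taken from the paper.<br />

```python
import math
import random


def log_d(a, w=1.0):
    # log D(a) for a toy discriminator D(a) = sigmoid(w * a); w is hypothetical.
    return -math.log(1.0 + math.exp(-w * a))


def d_log_d(a, w=1.0):
    # Exact derivative of log D(a) with respect to a.
    return w * (1.0 - 1.0 / (1.0 + math.exp(-w * a)))


def reinforce_samples(theta, sigma, num_samples, seed=0):
    """Per-sample score-function terms: grad_theta log pi(a) * log D(a)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        a = rng.gauss(theta, sigma)          # a ~ N(theta, sigma^2)
        score = (a - theta) / sigma ** 2     # d/dtheta log N(a; theta, sigma^2)
        samples.append(score * log_d(a))
    return samples


def pathwise_samples(theta, sigma, num_samples, seed=0):
    """Per-sample exact derivatives of log D(theta + sigma * xi), for reference."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        a = theta + sigma * rng.gauss(0.0, 1.0)
        samples.append(d_log_d(a))           # da/dtheta = 1
    return samples


def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)
```

Under these assumptions both sample sets estimate the same gradient, but the per-sample variance of the score-function terms is much larger than that of the exact derivatives, which is the behaviour the discriminator-gradient approach is meant to avoid.<br />
<br />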
= Algorithm =<br />
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.<br />
<br />
== The discriminator network ==<br />
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in \{\pi_E, \pi\} </math>.<br />
<br />
The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this and applying Bayes' theorem, we can rearrange and factor <math> D(s,a) </math> to obtain:<br />
<br />
\begin{aligned}<br />
D(s,a) &= p(\pi|s,a) \\<br />
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\<br />
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\<br />
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\<br />
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\<br />
\end{aligned}<br />
<br />
Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:<br />
<br />
\begin{aligned}<br />
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}<br />
\end{aligned}<br />
<br />
to get the final expression for <math> D(s,a) </math>:<br />
\begin{aligned}<br />
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}<br />
\end{aligned}<br />
<br />
<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induced by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action <math> a </math> under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can be obtained from the partial derivatives of <math> D(s,a) </math>, which is why these derivatives are proposed to be used for training policies (see following sections):<br />
<br />
\begin{aligned}<br />
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\end{aligned}<br />
<br />
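As a sanity check on the two partial derivatives above, the following sketch compares them with finite differences. The particular forms of <math> \varphi </math> and <math> \psi </math> are hypothetical and were chosen only because their derivatives are easy to write by hand; they are not the ratios of any real policy.<br />

```python
import math


# Hypothetical smooth likelihood ratios (illustration only).
def phi(s, a):
    return math.exp(0.5 * a - 0.2 * s * a)


def psi(s):
    return 1.0 + s * s


def discriminator(s, a):
    # D(s, a) = 1 / (1 + phi(s, a) * psi(s))
    return 1.0 / (1.0 + phi(s, a) * psi(s))


def grad_a_discriminator(s, a):
    phi_a = (0.5 - 0.2 * s) * phi(s, a)      # d phi / d a
    return -phi_a * psi(s) / (1.0 + phi(s, a) * psi(s)) ** 2


def grad_s_discriminator(s, a):
    phi_s = -0.2 * a * phi(s, a)             # d phi / d s
    psi_s = 2.0 * s                          # d psi / d s
    numerator = phi_s * psi(s) + phi(s, a) * psi_s
    return -numerator / (1.0 + phi(s, a) * psi(s)) ** 2


def finite_difference(fn, x, eps=1e-6):
    # Central difference approximation of d fn / d x.
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)
```

Evaluating <code>grad_a_discriminator</code> and <code>grad_s_discriminator</code> at any point and comparing them with <code>finite_difference</code> applied to <code>discriminator</code> should give matching values up to numerical error.<br />
<br />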
== Backpropagating through stochastic units ==<br />
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.<br />
<br />
=== Continuous Action Distributions ===<br />
In the case of continuous action policies, re-parameterization was used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution <math> \mathcal{N}(\mu_{\theta} (s), \sigma_{\theta}^2 (s))</math>, where the mean and variance are given by some deterministic functions <math>\mu_{\theta}</math> and <math>\sigma_{\theta}</math>, then the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim N(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}<br />
\end{align}<br />
<br />
=== Categorical Action Distributions ===<br />
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumbel-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:<br />
<br />
\begin{align}<br />
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]\textrm{, where } g_i \sim Gumbel(0, 1).<br />
\end{align}<br />
<br />
Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumbel-Max trick:<br />
<br />
\begin{align}<br />
a_{softmax} = \frac{\exp[\frac{1}{\tau}(g_i + \log \pi(a_i|s))]}{\sum_{j=1}^{N}\exp[\frac{1}{\tau}(g_j + \log \pi(a_j|s))]}<br />
\end{align}<br />
<br />
<br />
In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax resulting in a low bias, but high variance; vice versa when the <math> \tau </math> is large.<br />
<br />
To interact with the environment, the authors apply argmax over <math> a_{softmax} </math> to obtain a single “pure” action. In the backward pass, the continuous approximation is used instead, through the estimate <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.<br />
<br />
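The sampling procedure described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and the clamping of the uniform sample is an added numerical safeguard rather than part of the method.<br />

```python
import math
import random


def sample_gumbel(rng):
    # Gumbel(0, 1) by inverse transform: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def gumbel_softmax(probs, tau, rng):
    """Relaxed one-hot sample from a categorical distribution with given probs."""
    logits = [(sample_gumbel(rng) + math.log(p)) / tau for p in probs]
    top = max(logits)                                # subtract max for stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_action(soft_sample):
    # The "pure" action used to act in the environment (forward pass only).
    return max(range(len(soft_sample)), key=lambda i: soft_sample[i])
```

Since dividing by <math> \tau > 0 </math> does not change which entry is largest, <code>hard_action</code> applied to a relaxed sample is exactly a Gumbel-Max sample, so its empirical frequencies should approach the class probabilities.<br />
<br />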
== Backpropagating through a Forward model ==<br />
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 2. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:<br />
<br />
\begin{align}<br />
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\<br />
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )<br />
\end{align}<br />
<br />
<br />
Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagates back in time and influences the actions of policies in earlier times. This is summarized in Figure 3.<br />
<br />
[[File:modelFree_blockDiagram.PNG|400px|center]]<br />
<br />
Figure 2: Block diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math>, which is fed to a stochastic sampling unit. An action <math> a </math> is sampled and, together with <math> s </math>, is presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.<br />
<br />
[[File:modelBased_blockDiagram.PNG|700px|center]]<br />
<br />
Figure 3: Block diagram of model-based adversarial imitation learning. <br />
<br />
Figure 3 describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action <math> a_t </math> is sampled. For example, in the continuous case, this is done using the re-parametrization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim N(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> is composed of (a) the error message <math> \delta_a </math> (Green), which propagates through the differentiable approximation of the sampling process, and (b) the error message <math> \delta_s </math> (Blue) of future time-steps, which propagates back through the differentiable forward model.<br />
<br />
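The difference between discarding and keeping <math> \nabla_sD </math> can be seen in a one-step toy example. The sketch below assumes a deterministic scalar policy <math> a = \theta s </math> (sampling noise omitted), a hypothetical linear forward model, and a hypothetical sigmoid discriminator; none of these are the networks used in the paper.<br />

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def policy(theta, s):
    # Deterministic stand-in for the policy; sampling noise is omitted.
    return theta * s


def forward_model(s, a):
    # Hypothetical linear dynamics f(s, a).
    return 0.9 * s + 0.5 * a


def discriminator(s, a):
    # Hypothetical discriminator D(s, a).
    return sigmoid(0.3 * s - 0.7 * a)


def d_at_step_one(theta, s0):
    s1 = forward_model(s0, policy(theta, s0))
    return discriminator(s1, policy(theta, s1))


def gradients_at_step_one(theta, s0):
    a0 = policy(theta, s0)
    s1 = forward_model(s0, a0)
    a1 = policy(theta, s1)
    d = discriminator(s1, a1)
    d_s, d_a = 0.3 * d * (1.0 - d), -0.7 * d * (1.0 - d)
    da0 = s0                      # d a_0 / d theta (s_0 does not depend on theta)
    ds1 = 0.5 * da0               # f_a * d a_0 / d theta; the f_s term is zero here
    da1 = s1 + theta * ds1        # direct effect plus effect through s_1
    model_free = d_a * s1         # s_1 treated as a fixed input
    model_based = d_a * da1 + d_s * ds1
    return model_free, model_based
```

In this toy setting only the model-based value agrees with a numerical derivative of <code>d_at_step_one</code>; the model-free value misses the contribution that flows through the state.<br />
<br />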
== MGAIL Algorithm ==<br />
Shalev-Shwartz et al. (2016) and Heess et al. (2015) built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:<br />
<br />
\begin{align}<br />
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]<br />
\end{align}<br />
<br />
<br />
Using the results from Heess et al. (2015), this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s') </math> transitions:<br />
<br />
\begin{align}<br />
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\<br />
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]<br />
\end{align}<br />
<br />
The policy gradient <math> \nabla_\theta J </math> is calculated by applying the two equations above recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.<br />
<br />
[[File:MGAIL_alg.PNG]]<br />
<br />
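The backward recursion for <math> J_s </math> and <math> J_\theta </math> can be sketched for a deterministic toy problem, where the expectations disappear. The policy, forward model, and discriminator below are hypothetical scalar stand-ins, not the networks from the paper, and the sketch is not the MGAIL algorithm itself, only the gradient recursion it relies on.<br />

```python
import math


def policy(theta, s):
    return theta * s                          # pi_theta(s); pi_s = theta, pi_theta = s


def forward_model(s, a):
    return 0.9 * s + 0.5 * a                  # f(s, a); f_s = 0.9, f_a = 0.5


def discriminator(s, a):
    return 1.0 / (1.0 + math.exp(-(0.3 * s - 0.7 * a)))


def rollout(theta, s0, horizon):
    """States and actions for t = 0..horizon under the forward model."""
    states, actions = [], []
    s = s0
    for _ in range(horizon + 1):
        a = policy(theta, s)
        states.append(s)
        actions.append(a)
        s = forward_model(s, a)
    return states, actions


def objective(theta, s0, horizon, gamma):
    states, actions = rollout(theta, s0, horizon)
    return sum(gamma ** t * discriminator(s, a)
               for t, (s, a) in enumerate(zip(states, actions)))


def policy_gradient(theta, s0, horizon, gamma):
    states, actions = rollout(theta, s0, horizon)
    j_s, j_theta = 0.0, 0.0                   # gradients beyond the horizon are zero
    for s, a in zip(reversed(states), reversed(actions)):
        d = discriminator(s, a)
        d_s, d_a = 0.3 * d * (1.0 - d), -0.7 * d * (1.0 - d)
        pi_s, pi_theta = theta, s
        f_s, f_a = 0.9, 0.5
        # Both updates use the values of j_s and j_theta from step t + 1.
        new_j_theta = d_a * pi_theta + gamma * (j_s * f_a * pi_theta + j_theta)
        new_j_s = d_s + d_a * pi_s + gamma * j_s * (f_s + f_a * pi_s)
        j_s, j_theta = new_j_s, new_j_theta
    return j_theta
```

Running the recursion from the last time step back to the first yields a value that should match a numerical derivative of <code>objective</code> with respect to <code>theta</code>.<br />
<br />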
== Forward Model Structure ==<br />
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled from different distributions, first need to be represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Second, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A gated recurrent unit (GRU, a simpler variant of the LSTM model) layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better and more stable results compared to the standard forward model based on a feed-forward neural network. The comparison is presented in Figure 4.<br />
<br />
[[File:performance_comparison.PNG]]<br />
<br />
Figure 4: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).<br />
<br />
= Experiments =<br />
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid), which are modeled by the MuJoCo physics simulator (Todorov et al., 2012), contain second-order dynamics, and use direct torque control. Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015). A different number of trajectories is used to train the expert for each task, but all trajectories are of length 1000.<br />
The discriminator and generator (policy) networks contain two hidden layers with ReLU non-linearity and are trained using the ADAM optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them. A comparison between the basic forward model and the more advanced forward model is also made and described in the previous sections of this summary.<br />
<br />
[[File:mgail_test_results_1.PNG]]<br />
<br />
[[File:mgail_test_results.PNG]]<br />
<br />
Table 1. Policy performance, boldface indicates better results, <math> \pm </math> represents one standard deviation.<br />
<br />
= Discussion =<br />
This paper presented a model-based algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model, since this could be difficult in certain domains. Learning the system dynamics directly from raw images is one line of future work. Another is to address the violation of the i.i.d. assumption that supervised learning algorithms rely on. This problem arises because the discriminator and forward models are trained in a supervised fashion using data sampled from a changing distribution. The authors tried a solution proposed by another paper, which is to reset the learning rate several times during training, but it did not result in significant improvements.<br />
<br />
= Source =<br />
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.<br />
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.<br />
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).<br />
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.<br />
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=End-to-End_Differentiable_Adversarial_Imitation_Learning&diff=35216End-to-End Differentiable Adversarial Imitation Learning2018-03-22T16:35:53Z<p>H5tahir: /* Experiments */</p>
<hr />
<div>= Introduction =<br />
The ability to imitate an expert policy is very beneficial in the case of automating human demonstrated tasks. Assuming that a sequence of state action pairs (trajectories) of an expert policy are available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to solve the problem of imitating a policy; they are Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its action can affect future state distribution. The second method using IRL involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.<br />
<br />
To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. The generator is guided as it tries to produce samples on the correct side of the discriminator's decision boundary hyper-plane, as seen in Figure 1. This idea was used by Ho & Ermon (2016) in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. A model-free setup is one where the agent cannot predict the next state and reward before it takes each action, since the transition function from one state to the next is not learned. <br />
<br />
The disadvantage of the model-free approach comes to light when training stochastic policies. The presence of stochastic elements breaks the flow of information (gradients) from one neural network to the other, thus prohibiting the use of backpropagation. In this situation, a standard solution is to use gradient estimation (Williams, 1992). This tends to suffer from high variance, resulting in a need for larger sample sizes as well as variance reduction methods. This paper proposes a model-based imitation learning algorithm (MGAIL), in which information propagates from the guiding neural network (D) to the generative model (G), which in this case represents the policy <math>\pi</math> that is to be trained. This is achieved by two steps: (1) learning a forward model that approximates the environment’s dynamics (2) building an end-to-end differentiable computation graph that spans over multiple time-steps. The gradient in such a graph carries information from future states to earlier time-steps, helping the policy to account for compounding errors.<br />
<br />
<br />
[[File:GeneratorFollowingDiscriminator.png|center]]<br />
<br />
Figure 1: '''Illustration of GANs.''' The generative model follows the discriminating hyper-plane defined by the discriminator. Eventually, G will produce patterns similar to the expert patterns.<br />
<br />
= Background =<br />
== Markov Decision Process ==<br />
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple <math>(S, A, P, r, \rho_0, \gamma)</math> where <math>S</math> is the set of states, <math>A</math> is a set of actions, <math>P : S \times A \times S \rightarrow [0, 1]</math> is the transition probability distribution, <math>r : (S \times A) \rightarrow \mathbb{R}</math> is the reward function, <math>\rho_0 : S \rightarrow [0, 1]</math> is the distribution over initial states, and <math>\gamma \in (0, 1)</math> is the discount factor. Let <math>\pi</math> denote a stochastic policy <math>\pi : S \times A \rightarrow [0, 1]</math>, <math>R(\pi)</math> denote its expected discounted reward: <math>E_\pi R = E_\pi [\sum_{t=0}^T \gamma^t r_t]</math> and <math>\tau</math> denote a trajectory of states and actions <math>\tau = \{s_0, a_0, s_1, a_1, ...\}</math>.<br />
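<br />
As a small illustration of the discounted reward <math>\sum_{t} \gamma^t r_t</math> defined above, the following sketch evaluates it for a finite list of rewards. The function name and the example rewards are hypothetical and chosen only for illustration.<br />

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Iterate backwards so each step is a single multiply-add (Horner's rule).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```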
<br />
== Imitation Learning ==<br />
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be done using any supervised learning (SL) algorithm, but the policy's predictions affect future state distributions; this violates the independent and identically distributed (i.i.d.) assumption made by most SL algorithms. This process is susceptible to compounding errors, since a slight deviation in the learner's behavior can lead to state distributions not encountered by the expert policy. <br />
<br />
This issue was overcome through the use of the Forward Training (FT) algorithm, which trains a non-stationary policy iteratively over time. At each time step a new policy is trained on the state distribution induced by the previously trained policies <math>\pi_0</math>, <math>\pi_1</math>, ...<math>\pi_{t-1}</math>. This is continued until the end of the time horizon to obtain a policy that can mimic the expert policy. The requirement to train a policy at each time step makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This shortcoming is resolved by the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy: <math> \pi_t = \pi_{t-1} + \alpha (1 - \alpha)^{t-1}(\hat{\pi}_t - \pi_0)</math>, with <math>\pi_0</math> following the expert's policy at the start of training.<br />
<br />
== Generative Adversarial Networks ==<br />
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:<br />
<br />
\begin{align} <br />
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}}\; \mathbb{E}_{x\sim p_E}[\log D(x)]\ +\ \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))]<br />
\end{align}<br />
<br />
In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.<br />
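<br />
The value of the two-player objective above can be estimated from samples of the discriminator's outputs. The sketch below is only illustrative: the function name and the example discriminator outputs are assumptions, not taken from the paper.<br />

```python
import numpy as np


def gan_value(d_expert, d_generated):
    """Monte-Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    d_expert = np.asarray(d_expert, dtype=float)
    d_generated = np.asarray(d_generated, dtype=float)
    return np.mean(np.log(d_expert)) + np.mean(np.log(1.0 - d_generated))


# A discriminator that outputs 0.5 everywhere cannot tell the sources apart.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the sources well attains a higher value.
confident = gan_value([0.9, 0.9], [0.1, 0.1])
```

The discriminator tries to increase this value while the generator tries to decrease it; the "confused" case corresponds to the end of the game described above.<br />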
<br />
GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses an objective function similar to that of GANs, but the expert distribution in GAIL represents the joint distribution over state-action tuples:<br />
<br />
\begin{align} <br />
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}}\; \mathbb{E}_{\pi}[\log D(s,a)]\ +\ \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi)<br />
\end{align}<br />
<br />
where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.<br />
<br />
This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [\log D(s,a)] </math>.<br />
<br />
The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score-function methods such as REINFORCE to obtain an unbiased gradient estimate:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]<br />
\end{align}<br />
<br />
where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:<br />
<br />
\begin{align}<br />
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]<br />
\end{align}<br />
<br />
<br />
REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.<br />
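<br />
The variance issue can be seen on a one-dimensional toy problem. The sketch below compares the score-function (REINFORCE) estimator with a pathwise estimator that differentiates through the sample, which is the kind of estimator the Algorithm section relies on. The toy function standing in for <math> log\; D </math>, the Gaussian policy, and all constants are assumptions made for illustration.<br />

```python
import numpy as np


def log_d_toy(a):
    """Toy stand-in for log D(s, a) at a fixed state: log sigmoid(a)."""
    return -np.logaddexp(0.0, -a)


rng = np.random.default_rng(0)
theta, sigma, n = 0.0, 1.0, 100_000
xi = rng.standard_normal(n)
a = theta + sigma * xi                      # actions a ~ N(theta, sigma^2)

# Score-function samples: grad_theta log pi(a) * log D(a).
score_samples = (a - theta) / sigma**2 * log_d_toy(a)

# Pathwise samples: d/da log sigmoid(a) = 1 - sigmoid(a), and da/dtheta = 1.
pathwise_samples = 1.0 - 1.0 / (1.0 + np.exp(-a))

score_mean, score_var = score_samples.mean(), score_samples.var()
path_mean, path_var = pathwise_samples.mean(), pathwise_samples.var()
```

Both estimators target the same gradient, so their sample means should agree closely, while the per-sample variance of the score-function estimator is noticeably larger in this toy setting.<br />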
<br />
= Algorithm =<br />
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.<br />
<br />
== The discriminator network ==<br />
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in \{\pi_E, \pi\} </math>.<br />
<br />
The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this and applying Bayes' theorem, we can rearrange and factor <math> D(s,a) </math> to obtain:<br />
<br />
\begin{aligned}<br />
D(s,a) &= p(\pi|s,a) \\<br />
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\<br />
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\<br />
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\<br />
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\<br />
\end{aligned}<br />
<br />
Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:<br />
<br />
\begin{aligned}<br />
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}<br />
\end{aligned}<br />
<br />
to get the final expression for <math> D(s,a) </math>:<br />
\begin{aligned}<br />
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}<br />
\end{aligned}<br />
<br />
<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induced by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action <math> a </math> under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can be obtained from the partial derivatives of <math> D(s,a) </math>, which is why the paper proposes using these derivatives to train policies (see following sections):<br />
<br />
\begin{aligned}<br />
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\end{aligned}<br />
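<br />
The factorization <math> D(s,a) = 1/(1 + \varphi(s,a)\psi(s)) </math> derived in this subsection can be checked numerically on a small discrete example. The two joint distributions below are made-up numbers (assumptions for illustration), with two states and two actions.<br />

```python
import numpy as np

# Hypothetical joint distributions p(s, a | pi) and p(s, a | pi_E)
# over 2 states (rows) and 2 actions (columns); each sums to 1.
p_pi = np.array([[0.10, 0.30], [0.20, 0.40]])
p_exp = np.array([[0.25, 0.25], [0.40, 0.10]])

# Bayes-optimal discriminator with equal priors: D = p(pi | s, a).
d_direct = p_pi / (p_pi + p_exp)

# State marginals and action conditionals under each policy.
ps_pi, ps_exp = p_pi.sum(axis=1), p_exp.sum(axis=1)
pa_pi = p_pi / ps_pi[:, None]
pa_exp = p_exp / ps_exp[:, None]

phi = pa_exp / pa_pi             # policy likelihood ratio, shape (2, 2)
psi = ps_exp / ps_pi             # state likelihood ratio, shape (2,)
d_factored = 1.0 / (1.0 + phi * psi[:, None])
```

Both ways of computing the discriminator output give the same table, which is the content of the derivation above.<br />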
<br />
== Backpropagating through stochastic units ==<br />
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.<br />
<br />
=== Continuous Action Distributions ===<br />
In the case of continuous action policies, re-parameterization was used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution <math> \mathcal{N}(\mu_{\theta} (s), \sigma_{\theta}^2 (s))</math>, where the mean and standard deviation are given by some deterministic functions <math>\mu_{\theta}</math> and <math>\sigma_{\theta}</math>, the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim \mathcal{N}(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}<br />
\end{align}<br />
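<br />
A minimal sketch of this Monte-Carlo estimator follows, under stated assumptions: the discriminator is replaced by a toy differentiable function, the policy mean is <math> \mu_\theta(s) = \theta s </math> with a fixed standard deviation, and all constants are hypothetical. The estimate is compared against a finite difference computed on the same noise samples.<br />

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def d_toy(s, a):
    """Toy differentiable discriminator (an assumption, not the paper's network)."""
    return sigmoid(a - s)


def reparam_grad(theta, s, sigma, xi):
    """Estimate d/dtheta E[D(s, a)] with a = theta * s + xi * sigma."""
    a = theta * s + xi * sigma          # reparameterized action samples
    d = d_toy(s, a)
    grad_a = d * (1.0 - d)              # dD/da for the sigmoid above
    da_dtheta = s                       # d(theta * s)/dtheta
    return np.mean(grad_a * da_dtheta)


rng = np.random.default_rng(0)
xi = rng.standard_normal(1000)          # xi ~ N(0, 1), reused below
theta, s, sigma, eps = 0.3, 2.0, 0.5, 1e-5

estimate = reparam_grad(theta, s, sigma, xi)
# Central finite difference on the same noise samples, for comparison.
plus = np.mean(d_toy(s, (theta + eps) * s + xi * sigma))
minus = np.mean(d_toy(s, (theta - eps) * s + xi * sigma))
finite_diff = (plus - minus) / (2.0 * eps)
```

Because the noise <math> \xi </math> is sampled independently of <math> \theta </math>, the sampled action is a deterministic, differentiable function of <math> \theta </math>, which is what allows the gradient to pass through the sampling step.<br />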
<br />
=== Categorical Action Distributions ===<br />
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumbel-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:<br />
<br />
\begin{align}<br />
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]\textrm{, where } g_i \sim Gumbel(0, 1).<br />
\end{align}<br />
<br />
Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumbel-Max trick:<br />
<br />
\begin{align}<br />
a_{softmax} = \frac{exp[\frac{1}{\tau}(g_i + log\ \pi(a_i|s))]}{\sum_{j=1}^{N}exp[\frac{1}{\tau}(g_j + log\ \pi(a_j|s))]}<br />
\end{align}<br />
<br />
<br />
In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax resulting in a low bias, but high variance; vice versa when the <math> \tau </math> is large.<br />
<br />
The authors use <math> a_{softmax} </math> to interact with the environment; argmax is applied over <math> a_{softmax} </math> to obtain a single “pure” action, but the continuous approximation is used in the backward pass using the estimation: <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.<br />
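<br />
The sampling procedure described above can be sketched as follows. This is an illustrative NumPy version of the forward pass only (no gradients); the function name, class probabilities, and temperature are assumptions.<br />

```python
import numpy as np


def gumbel_softmax(probs, tau, rng):
    """Return (relaxed sample, hard action index) for class probabilities `probs`."""
    probs = np.asarray(probs, dtype=float)
    # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)).
    u = rng.uniform(low=1e-12, high=1.0, size=probs.shape)
    g = -np.log(-np.log(u))
    logits = (g + np.log(probs)) / tau
    logits -= logits.max()                       # numerical stability
    soft = np.exp(logits) / np.exp(logits).sum()
    return soft, int(np.argmax(soft))


rng = np.random.default_rng(0)
probs = [0.7, 0.2, 0.1]
soft, action = gumbel_softmax(probs, tau=0.5, rng=rng)

# Empirical check: hard actions should follow `probs` (Gumbel-Max trick).
counts = np.zeros(3)
for _ in range(20000):
    counts[gumbel_softmax(probs, tau=0.5, rng=rng)[1]] += 1
freqs = counts / counts.sum()
```

Taking the argmax of the relaxed sample recovers the Gumbel-Max action, so the hard actions are distributed according to the class probabilities regardless of the temperature; the temperature only affects how close the relaxed vector is to one-hot.<br />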
<br />
== Backpropagating through a Forward model ==<br />
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 2. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:<br />
<br />
\begin{align}<br />
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\<br />
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )<br />
\end{align}<br />
<br />
<br />
Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagates back in time and influences the actions of policies at earlier times. This is summarized in Figure 3.<br />
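<br />
The recursion above can be illustrated on a scalar toy system. In this sketch the forward model is linear, the policy is deterministic (<math> a = \theta s </math>), and the discriminator is a toy function; all of these are assumptions chosen so that the chain rule can be written out by hand and compared against a finite difference.<br />

```python
import numpy as np

A, B = 0.9, 0.5      # hypothetical linear forward model f(s, a) = A*s + B*a


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def rollout_d(theta, s0, steps):
    """Return D(s_t, a_t) after `steps` forward-model transitions."""
    s = s0
    for _ in range(steps):
        s = A * s + B * (theta * s)   # a = theta * s, then s' = f(s, a)
    return sigmoid(theta * s - s)     # toy D(s, a) = sigmoid(a - s)


def rollout_grad(theta, s0, steps):
    """Total derivative dD(s_t, a_t)/dtheta via the chain rule through f."""
    s, ds = s0, 0.0                   # ds = ds_t/dtheta; s_0 does not depend on theta
    for _ in range(steps):
        a, da = theta * s, s + theta * ds
        s, ds = A * s + B * a, A * ds + B * da   # f_s = A, f_a = B
    a, da = theta * s, s + theta * ds
    d = sigmoid(a - s)
    d_a, d_s = d * (1.0 - d), -d * (1.0 - d)     # partials of the toy D
    return d_a * da + d_s * ds


theta, s0, eps = 0.4, 1.5, 1e-6
analytic = rollout_grad(theta, s0, steps=3)
numeric = (rollout_d(theta + eps, s0, 3) - rollout_d(theta - eps, s0, 3)) / (2 * eps)
```

The term `d_s * ds` is the contribution that a model-free approach discards: it exists only because the state is expressed as a differentiable function of earlier actions through the forward model.<br />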
<br />
[[File:modelFree_blockDiagram.PNG|400px|center]]<br />
<br />
Figure 2: Block-diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math> which is fed to a stochastic sampling unit. An action <math> a </math> is sampled, and together with <math> s </math> are presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.<br />
<br />
[[File:modelBased_blockDiagram.PNG|700px|center]]<br />
<br />
Figure 3: Block diagram of model-based adversarial imitation learning. <br />
<br />
Figure 3 describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action <math> a_t </math> is sampled. For example, in the continuous case, this is done using the re-parameterization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim N(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> is comprised of (a) the error message <math> \delta_a </math> (Green), which propagates smoothly through the differentiable approximation of the sampling process, and (b) the error message <math> \delta_s </math> (Blue) of future time-steps, which propagates back through the differentiable forward model.<br />
<br />
== MGAIL Algorithm ==<br />
Shalev-Shwartz et al. (2016) and Heess et al. (2015) built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:<br />
<br />
\begin{align}<br />
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]<br />
\end{align}<br />
<br />
<br />
Using the results from Heess et al. (2015) this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s') </math> transitions:<br />
<br />
\begin{align}<br />
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\<br />
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]<br />
\end{align}<br />
<br />
The policy gradient <math> \nabla_\theta J </math> is calculated by applying the two equations above (for <math> J_s </math> and <math> J_\theta </math>) recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.<br />
<br />
[[File:MGAIL_alg.PNG]]<br />
<br />
== Forward Model Structure ==<br />
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled from different distributions, first need to be represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Additionally, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A gated recurrent unit (GRU, a simpler variant of the LSTM model) layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better and more stable results compared to the standard forward model based on a feed-forward neural network. The comparison is presented in Figure 4.<br />
<br />
[[File:performance_comparison.PNG]]<br />
<br />
Figure 4: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).<br />
<br />
= Experiments =<br />
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid), which are modeled by the MuJoCo physics simulator (Todorov et al., 2012). Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015). A different number of trajectories is used to train the expert for each task, but all trajectories are of length 1000.<br />
The discriminator and generator (policy) networks contain two hidden layers with ReLU non-linearity and are trained using the ADAM optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them.<br />
<br />
[[File:mgail_test_results_1.PNG]]<br />
<br />
[[File:mgail_test_results.PNG]]<br />
<br />
Table 1. Policy performance, boldface indicates better results, <math> \pm </math> represents one standard deviation.<br />
<br />
= Discussion =<br />
This paper presented a model-based algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model, since this could be difficult in certain domains. Learning the system dynamics directly from raw images is one line of future work. Another is to address the violation of the i.i.d. assumption that supervised learning algorithms rely on. This problem arises because the discriminator and forward models are trained in a supervised fashion using data sampled from a changing distribution. The authors tried a solution proposed by another paper, which is to reset the learning rate several times during training, but it did not result in significant improvements.<br />
<br />
= Source =<br />
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.<br />
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.<br />
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).<br />
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.<br />
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=End-to-End_Differentiable_Adversarial_Imitation_Learning&diff=35215End-to-End Differentiable Adversarial Imitation Learning2018-03-22T16:35:41Z<p>H5tahir: /* Experiments */</p>
<hr />
<div>= Introduction =<br />
The ability to imitate an expert policy is very beneficial in the case of automating human demonstrated tasks. Assuming that a sequence of state action pairs (trajectories) of an expert policy are available, a new policy can be trained that imitates the expert without having access to the original reward signal used by the expert. There are two main approaches to solve the problem of imitating a policy; they are Behavioural Cloning (BC) and Inverse Reinforcement Learning (IRL). BC directly learns the conditional distribution of actions over states in a supervised fashion by training on single time-step state-action pairs. The disadvantage of BC is that the training requires large amounts of expert data, which is hard to obtain. In addition, an agent trained using BC is unaware of how its action can affect future state distribution. The second method using IRL involves recovering a reward signal under which the expert is uniquely optimal; the main disadvantage is that it’s an ill-posed problem.<br />
<br />
To address the problem of imitating an expert policy, techniques based on Generative Adversarial Networks (GANs) have been proposed in recent years. GANs use a discriminator to guide the generative model towards producing patterns like those of the expert. The generator is guided as it tries to produce samples on the correct side of the discriminator's decision boundary hyper-plane, as seen in Figure 1. This idea was used by Ho & Ermon (2016) in their work titled Generative Adversarial Imitation Learning (GAIL) to imitate an expert policy in a model-free setup. A model-free setup is one where the agent cannot predict the next state and reward before it takes each action, since the transition function from one state to the next is not learned. <br />
<br />
The disadvantage of the model-free approach comes to light when training stochastic policies. The presence of stochastic elements breaks the flow of information (gradients) from one neural network to the other, thus prohibiting the use of backpropagation. In this situation, a standard solution is to use gradient estimation (Williams, 1992). This tends to suffer from high variance, resulting in a need for larger sample sizes as well as variance reduction methods. This paper proposes a model-based imitation learning algorithm (MGAIL), in which information propagates from the guiding neural network (D) to the generative model (G), which in this case represents the policy <math>\pi</math> that is to be trained. This is achieved by two steps: (1) learning a forward model that approximates the environment’s dynamics (2) building an end-to-end differentiable computation graph that spans over multiple time-steps. The gradient in such a graph carries information from future states to earlier time-steps, helping the policy to account for compounding errors.<br />
<br />
<br />
[[File:GeneratorFollowingDiscriminator.png|center]]<br />
<br />
Figure 1: '''Illustration of GANs.''' The generative model follows the discriminating hyper-plane defined by the discriminator. Eventually, G will produce patterns similar to the expert patterns.<br />
<br />
= Background =<br />
== Markov Decision Process ==<br />
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple <math>(S, A, P, r, \rho_0, \gamma)</math> where <math>S</math> is the set of states, <math>A</math> is a set of actions, <math>P : S \times A \times S \rightarrow [0, 1]</math> is the transition probability distribution, <math>r : (S \times A) \rightarrow \mathbb{R}</math> is the reward function, <math>\rho_0 : S \rightarrow [0, 1]</math> is the distribution over initial states, and <math>\gamma \in (0, 1)</math> is the discount factor. Let <math>\pi</math> denote a stochastic policy <math>\pi : S \times A \rightarrow [0, 1]</math>, <math>R(\pi)</math> denote its expected discounted reward: <math>E_\pi R = E_\pi [\sum_{t=0}^T \gamma^t r_t]</math> and <math>\tau</math> denote a trajectory of states and actions <math>\tau = \{s_0, a_0, s_1, a_1, ...\}</math>.<br />
<br />
== Imitation Learning ==<br />
A common technique for performing imitation learning is to train a policy <math> \pi </math> that minimizes some loss function <math> l(s, \pi(s)) </math> with respect to a discounted state distribution encountered by the expert: <math> d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s_t) </math>. This can be done using any supervised learning (SL) algorithm, but the policy's predictions affect future state distributions; this violates the independent and identically distributed (i.i.d.) assumption made by most SL algorithms. This process is susceptible to compounding errors, since a slight deviation in the learner's behavior can lead to state distributions not encountered by the expert policy. <br />
<br />
This issue was overcome through the use of the Forward Training (FT) algorithm, which trains a non-stationary policy iteratively over time. At each time step a new policy is trained on the state distribution induced by the previously trained policies <math>\pi_0</math>, <math>\pi_1</math>, ...<math>\pi_{t-1}</math>. This is continued until the end of the time horizon to obtain a policy that can mimic the expert policy. The requirement to train a policy at each time step makes the FT algorithm impractical for cases where the time horizon is very large or undefined. This shortcoming is resolved by the Stochastic Mixing Iterative Learning (SMILe) algorithm. SMILe trains a stochastic stationary policy over several iterations under the trajectory distribution induced by the previously trained policy: <math> \pi_t = \pi_{t-1} + \alpha (1 - \alpha)^{t-1}(\hat{\pi}_t - \pi_0)</math>, with <math>\pi_0</math> following the expert's policy at the start of training.<br />
<br />
== Generative Adversarial Networks ==<br />
GANs learn a generative model that can fool the discriminator by using a two-player zero-sum game:<br />
<br />
\begin{align} <br />
\underset{G}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}}\; \mathbb{E}_{x\sim p_E}[\log D(x)]\ +\ \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))]<br />
\end{align}<br />
<br />
In the above equation, <math> p_E </math> represents the expert distribution and <math> p_z </math> represents the input noise distribution from which the input to the generator is sampled. The generator produces patterns and the discriminator judges if the pattern was generated or from the expert data. When the discriminator cannot distinguish between the two distributions the game ends and the generator has learned to mimic the expert. GANs rely on basic ideas such as binary classification and algorithms such as backpropagation in order to learn the expert distribution.<br />
<br />
GAIL applies GANs to the task of imitating an expert policy in a model-free approach. GAIL uses an objective function similar to that of GANs, but the expert distribution in GAIL represents the joint distribution over state-action tuples:<br />
<br />
\begin{align} <br />
\underset{\pi}{\operatorname{argmin}}\; \underset{D\in (0,1)}{\operatorname{argmax}}\; \mathbb{E}_{\pi}[\log D(s,a)]\ +\ \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi)<br />
\end{align}<br />
<br />
where <math> H(\pi) \triangleq \mathbb{E}_{\pi}[-log\: \pi(a|s)]</math> is the entropy.<br />
<br />
This problem cannot be solved using the standard methods described for GANs because the generator in GAIL represents a stochastic policy. The exact form of the first term in the above equation is given by: <math> \mathbb{E}_{s\sim \rho_\pi(s)}\mathbb{E}_{a\sim \pi(\cdot |s)} [\log D(s,a)] </math>.<br />
<br />
The two-player game now depends on the stochastic properties (<math> \theta </math>) of the policy, and it is unclear how to differentiate the above equation with respect to <math> \theta </math>. This problem can be overcome using score functions such as REINFORCE to obtain an unbiased gradient estimation:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi} [log\; D(s,a)] \cong \hat{\mathbb{E}}_{\tau_i}[\nabla_\theta\; log\; \pi_\theta(a|s)Q(s,a)]<br />
\end{align}<br />
<br />
where <math> Q(\hat{s},\hat{a}) </math> is the score function of the gradient:<br />
<br />
\begin{align}<br />
Q(\hat{s},\hat{a}) = \hat{\mathbb{E}}_{\tau_i}[log\; D(s,a) | s_0 = \hat{s}, a_0 = \hat{a}]<br />
\end{align}<br />
<br />
<br />
REINFORCE gradients suffer from high variance which makes them difficult to work with even after applying variance reduction techniques. In order to better understand the changes required to fool the discriminator we need access to the gradients of the discriminator network, which can be obtained from the Jacobian of the discriminator. This paper demonstrates the use of a forward model along with the Jacobian of the discriminator to train a policy, without using high-variance gradient estimations.<br />
<br />
= Algorithm =<br />
This section first analyzes the characteristics of the discriminator network, then describes how a forward model can enable policy imitation through GANs. Lastly, the model based adversarial imitation learning algorithm is presented.<br />
<br />
== The discriminator network ==<br />
The discriminator network is trained to predict the conditional distribution: <math> D(s,a) = p(y|s,a) </math> where <math> y \in \{\pi_E, \pi\} </math>.<br />
<br />
The discriminator is trained on an even distribution of expert and generated examples; hence <math> p(\pi) = p(\pi_E) = \frac{1}{2} </math>. Given this and applying Bayes' theorem, we can rearrange and factor <math> D(s,a) </math> to obtain:<br />
<br />
\begin{aligned}<br />
D(s,a) &= p(\pi|s,a) \\<br />
& = \frac{p(s,a|\pi)p(\pi)}{p(s,a|\pi)p(\pi) + p(s,a|\pi_E)p(\pi_E)} \\<br />
& = \frac{p(s,a|\pi)}{p(s,a|\pi) + p(s,a|\pi_E)} \\<br />
& = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} \\<br />
& = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}} \\<br />
\end{aligned}<br />
<br />
Define <math> \varphi(s,a) </math> and <math> \psi(s) </math> to be:<br />
<br />
\begin{aligned}<br />
\varphi(s,a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}, \psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}<br />
\end{aligned}<br />
<br />
to get the final expression for <math> D(s,a) </math>:<br />
\begin{aligned}<br />
D(s,a) = \frac{1}{1 + \varphi(s,a)\cdot \psi(s)}<br />
\end{aligned}<br />
<br />
<math> \varphi(s,a) </math> represents a policy likelihood ratio, and <math> \psi(s) </math> represents a state distribution likelihood ratio. Based on these expressions, the paper states that the discriminator makes its decisions by answering two questions. The first question relates to the state distribution: what is the likelihood of encountering state <math> s </math> under the distribution induced by <math> \pi_E </math> vs <math> \pi </math>? The second question is about behavior: given a state <math> s </math>, how likely is action <math> a </math> under <math> \pi_E </math> vs <math> \pi </math>? The desired change in state is given by <math> \psi_s \equiv \partial \psi / \partial s </math>; this information can be obtained from the partial derivatives of <math> D(s,a) </math>, which is why these derivatives are proposed for training policies (see following sections):<br />
<br />
\begin{aligned}<br />
\nabla_aD &= - \frac{\varphi_a(s,a)\psi(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\nabla_sD &= - \frac{\varphi_s(s,a)\psi(s) + \varphi(s,a)\psi_s(s)}{(1 + \varphi(s,a)\psi(s))^2} \\<br />
\end{aligned}<br />
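<br />
The two partial derivatives above can be sanity-checked numerically. The sketch below is only an illustration: the particular forms chosen for <math> \varphi </math> and <math> \psi </math> are hypothetical stand-ins (they are not taken from the paper), and the check simply compares the closed-form expressions with central finite differences of <math> D(s,a) = 1/(1 + \varphi\psi) </math>.<br />

```python
import math

def phi(s, a):
    # Hypothetical policy likelihood ratio p(a|s, pi_E) / p(a|s, pi).
    return math.exp(0.5 * a - 0.2 * s)

def psi(s):
    # Hypothetical state likelihood ratio p(s|pi_E) / p(s|pi).
    return 1.0 + s * s

def discriminator(s, a):
    # D(s, a) = 1 / (1 + phi(s, a) * psi(s)), as in the derivation above.
    return 1.0 / (1.0 + phi(s, a) * psi(s))

def analytic_gradients(s, a):
    # Closed-form derivatives of the hypothetical phi and psi.
    phi_a = 0.5 * phi(s, a)
    phi_s = -0.2 * phi(s, a)
    psi_s = 2.0 * s
    denom = (1.0 + phi(s, a) * psi(s)) ** 2
    grad_a = -phi_a * psi(s) / denom
    grad_s = -(phi_s * psi(s) + phi(s, a) * psi_s) / denom
    return grad_a, grad_s

def numeric_gradients(s, a, h=1e-5):
    # Central finite differences of D with respect to a and s.
    grad_a = (discriminator(s, a + h) - discriminator(s, a - h)) / (2 * h)
    grad_s = (discriminator(s + h, a) - discriminator(s - h, a)) / (2 * h)
    return grad_a, grad_s

if __name__ == "__main__":
    print(analytic_gradients(0.7, -0.3))
    print(numeric_gradients(0.7, -0.3))
```

Note that <math> \nabla_s D </math> contains the term with <math> \psi_s </math>, which is the quantity a model-free method never gets to use.<br />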
<br />
== Backpropagating through stochastic units ==<br />
There is interest in training stochastic policies because stochasticity encourages exploration for Policy Gradient methods. This is a problem for algorithms that build differentiable computation graphs where the gradients flow from one component to another since it is unclear how to backpropagate through stochastic units. The following subsections show how to estimate the gradients of continuous and categorical stochastic elements for continuous and discrete action domains respectively.<br />
<br />
=== Continuous Action Distributions ===<br />
In the case of continuous action policies, re-parameterization is used to enable computing the derivatives of stochastic models. Assuming that the stochastic policy has a Gaussian distribution <math> \mathcal{N}(\mu_{\theta} (s), \sigma_{\theta}^2 (s))</math>, where the mean and standard deviation are given by some deterministic functions <math>\mu_{\theta}</math> and <math>\sigma_{\theta}</math>, the policy <math> \pi </math> can be written as <math> \pi_\theta(a|s) = \mu_\theta(s) + \xi \sigma_\theta(s) </math>, where <math> \xi \sim \mathcal{N}(0,1) </math>. This way, the authors are able to get a Monte-Carlo estimator of the derivative of the expected value of <math> D(s, a) </math> with respect to <math> \theta </math>:<br />
<br />
\begin{align}<br />
\nabla_\theta\mathbb{E}_{\pi(a|s)}D(s,a) = \mathbb{E}_{\rho (\xi )}\nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s) \cong \frac{1}{M}\sum_{i=1}^{M} \nabla_a D(s,a) \nabla_\theta \pi_\theta(a|s)\Bigr|_{\substack{\xi=\xi_i}}<br />
\end{align}<br />
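<br />
To make the difference between the re-parameterized estimator and a score-function (REINFORCE-style) estimator concrete, here is a minimal sketch for a one-dimensional Gaussian policy at a fixed state. Everything in it is an assumption made for illustration: the toy discriminator is a plain sigmoid of the action, the only parameter is the mean <math> \mu </math>, and the score-function estimator uses the immediate value of <math> D </math> rather than the <math> Q </math> function from the earlier section.<br />

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def toy_discriminator(a):
    # Hypothetical stand-in for D(s, a) at a fixed state s.
    return sigmoid(a)

def toy_discriminator_grad_a(a):
    # Derivative of the sigmoid with respect to the action.
    d = sigmoid(a)
    return d * (1.0 - d)

def gradient_estimates(mu, sigma, num_samples, seed=0):
    """Per-sample estimates of d/dmu E[D(a)] with a = mu + xi * sigma."""
    rng = random.Random(seed)
    reparam, score = [], []
    for _ in range(num_samples):
        xi = rng.gauss(0.0, 1.0)
        a = mu + xi * sigma
        # Re-parameterization: grad_a D * da/dmu, where da/dmu = 1.
        reparam.append(toy_discriminator_grad_a(a))
        # Score function: D(a) * d log N(a; mu, sigma^2) / dmu.
        score.append(toy_discriminator(a) * (a - mu) / sigma ** 2)
    return reparam, score

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

if __name__ == "__main__":
    rp, sf = gradient_estimates(mu=0.3, sigma=0.5, num_samples=20000)
    print("reparameterized: mean", mean(rp), "variance", variance(rp))
    print("score function:  mean", mean(sf), "variance", variance(sf))
```

Both estimators target the same gradient, but in this toy setting the per-sample variance of the re-parameterized one is much smaller, which is the motivation for using <math> \nabla_a D </math> directly.<br />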
<br />
=== Categorical Action Distributions ===<br />
In the case of discrete action domains, the paper uses categorical re-parameterization with Gumbel-Softmax. This method relies on the Gumbel-Max trick which is a method for drawing samples from a categorical distribution with class probabilities <math> \pi(a_1|s),\pi(a_2|s),...,\pi(a_N|s) </math>:<br />
<br />
\begin{align}<br />
a_{argmax} = \underset{i}{argmax}[g_i + log\ \pi(a_i|s)]\textrm{, where } g_i \sim Gumbel(0, 1).<br />
\end{align}<br />
<br />
Gumbel-Softmax provides a differentiable approximation of the samples obtained using the Gumbel-Max trick:<br />
<br />
\begin{align}<br />
a_{softmax} = \frac{exp[\frac{1}{\tau}(g_i + log\ \pi(a_i|s))]}{\sum_{j=1}^{N}exp[\frac{1}{\tau}(g_j + log\ \pi(a_j|s))]}<br />
\end{align}<br />
<br />
<br />
In the above equation, the hyper-parameter <math> \tau </math> (temperature) trades bias for variance. When <math> \tau </math> gets closer to zero, the softmax operator acts like argmax, resulting in low bias but high variance; the opposite holds when <math> \tau </math> is large.<br />
<br />
The authors use <math> a_{softmax} </math> to interact with the environment; argmax is applied over <math> a_{softmax} </math> to obtain a single “pure” action, but the continuous approximation is used in the backward pass using the estimation: <math> \nabla_\theta\; a_{argmax} \approx \nabla_\theta\; a_{softmax} </math>.<br />
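<br />
The sampling procedure described above can be sketched in a few lines. This is a standard-library illustration only (no automatic differentiation, so it shows the forward pass but not the backward approximation); the function names are chosen here for readability and do not come from the paper.<br />

```python
import math
import random

def sample_gumbel(rng):
    # Gumbel(0, 1) sample via inverse transform: g = -log(-log(u)).
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))

def gumbel_softmax(probs, tau, rng):
    """Relaxed one-hot sample for class probabilities `probs` (all > 0)."""
    logits = [(sample_gumbel(rng) + math.log(p)) / tau for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def hard_action(soft_sample):
    # argmax over the relaxed sample gives the single "pure" action.
    return max(range(len(soft_sample)), key=lambda i: soft_sample[i])

if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.7, 0.2, 0.1]
    for tau in (0.1, 1.0, 10.0):
        soft = gumbel_softmax(probs, tau, rng)
        print(tau, [round(v, 3) for v in soft], hard_action(soft))
```

Because the softmax is monotone, the argmax of the relaxed sample coincides with the Gumbel-Max sample for any temperature, so the hard actions follow the original categorical distribution while the relaxed vector carries the gradient.<br />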
<br />
== Backpropagating through a Forward model ==<br />
The above subsections presented the means for extracting the partial derivative <math> \nabla_aD </math>. The main contribution of this paper is incorporating the use of <math> \nabla_sD </math>. In a model-free approach the state <math> s </math> is treated as a fixed input, therefore <math> \nabla_sD </math> is discarded. This is illustrated in Figure 2. This work uses a model-based approach which makes incorporating <math> \nabla_sD </math> more involved. In the model-based approach, a state <math> s_t </math> can be written as a function of the previous state action pair: <math> s_t = f(s_{t-1}, a_{t-1}) </math>, where <math> f </math> represents the forward model. Using the forward model and the law of total derivatives we get:<br />
<br />
\begin{align}<br />
\nabla_\theta D(s_t,a_t)\Bigr|_{\substack{s=s_t, a=a_t}} &= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_t}} \\<br />
&= \frac{\partial D}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_t}} + \frac{\partial D}{\partial s}\left (\frac{\partial f}{\partial s}\frac{\partial s}{\partial \theta}\Bigr|_{\substack{s=s_{t-1}}} + \frac{\partial f}{\partial a}\frac{\partial a}{\partial \theta}\Bigr|_{\substack{a=a_{t-1}}} \right )<br />
\end{align}<br />
<br />
<br />
Using this formula, the error regarding deviations of future states <math> (\psi_s) </math> propagates back in time and influences the actions of policies at earlier times. This is summarized in Figure 3.<br />
<br />
[[File:modelFree_blockDiagram.PNG|400px|center]]<br />
<br />
Figure 2: Block diagram of the model-free approach: given a state <math> s </math>, the policy outputs <math> \mu </math>, which is fed to a stochastic sampling unit. An action <math> a </math> is sampled and, together with <math> s </math>, is presented to the discriminator network. In the backward phase, the error message <math> \delta_a </math> is blocked at the stochastic sampling unit. From there, a high-variance gradient estimation is used (<math> \delta_{HV} </math>). Meanwhile, the error message <math> \delta_s </math> is flushed.<br />
<br />
[[File:modelBased_blockDiagram.PNG|700px|center]]<br />
<br />
Figure 3: Block diagram of model-based adversarial imitation learning. <br />
<br />
Figure 3 describes the computation graph for training the policy (i.e. G). The discriminator network D is fixed at this stage and is trained separately. At time <math> t </math> of the forward pass, <math> \pi </math> outputs a distribution over actions: <math> \mu_t = \pi(s_t) </math>, from which an action <math> a_t </math> is sampled. For example, in the continuous case, this is done using the re-parametrization trick: <math> a_t = \mu_t + \xi \cdot \sigma </math>, where <math> \xi \sim \mathcal{N}(0,1) </math>. The next state <math> s_{t+1} = f(s_t, a_t) </math> is computed using the forward model (which is also trained separately), and the entire process repeats for time <math> t+1 </math>. In the backward pass, the gradient of <math> \pi </math> comprises (a) the error message <math> \delta_a </math> (green), which propagates through the differentiable approximation of the sampling process, and (b) the error message <math> \delta_s </math> (blue) of future time steps, which propagates back through the differentiable forward model.<br />
<br />
== MGAIL Algorithm ==<br />
Shalev-Shwartz et al. (2016) and Heess et al. (2015) built a multi-step computation graph for describing the familiar policy gradient objective; in this case it is given by:<br />
<br />
\begin{align}<br />
J(\theta) = \mathbb{E}\left [ \sum_{t=0}^{T} \gamma ^t D(s_t,a_t)|\theta\right ]<br />
\end{align}<br />
<br />
<br />
Using the results from Heess et al. (2015), this paper demonstrates how to differentiate <math> J(\theta) </math> over a trajectory of <math>(s,a,s') </math> transitions:<br />
<br />
\begin{align}<br />
J_s &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_s + D_a \pi_s + \gamma J'_{s'}(f_s + f_a \pi_s) \right] \\<br />
J_\theta &= \mathbb{E}_{p(a|s)}\mathbb{E}_{p(s'|s,a)}\left [ D_a \pi_\theta + \gamma (J'_{s'} f_a \pi_\theta + J'_\theta) \right]<br />
\end{align}<br />
<br />
The policy gradient <math> \nabla_\theta J </math> is calculated by applying the two equations above (equations 12 and 13 in the paper) recursively for <math> T </math> iterations. The MGAIL algorithm is presented below.<br />
<br />
[[File:MGAIL_alg.PNG]]<br />
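<br />
The backward recursion for <math> J_s </math> and <math> J_\theta </math> can be illustrated on a small deterministic toy problem, where the expectations disappear and the result can be compared against a finite-difference derivative of <math> J(\theta) </math>. The sketch below is not the authors' implementation: the linear forward model, the scalar policy <math> a = \theta s </math> and the sigmoid-shaped discriminator are all hypothetical choices made so that every partial derivative is available in closed form.<br />

```python
import math

# Hypothetical linear forward model: s' = F_S * s + F_A * a.
F_S, F_A = 0.9, 0.5

def discriminator(s, a):
    # Hypothetical differentiable D(s, a) with values in (0, 1).
    return 1.0 / (1.0 + math.exp(-(0.3 * s - a * a)))

def discriminator_grads(s, a):
    d = discriminator(s, a)
    return 0.3 * d * (1.0 - d), -2.0 * a * d * (1.0 - d)  # D_s, D_a

def rollout(theta, s0, T):
    # Deterministic policy a = theta * s unrolled through the forward model.
    states, actions, s = [], [], s0
    for _ in range(T + 1):
        a = theta * s
        states.append(s)
        actions.append(a)
        s = F_S * s + F_A * a
    return states, actions

def objective(theta, s0, T, gamma):
    states, actions = rollout(theta, s0, T)
    return sum(gamma ** t * discriminator(s, a)
               for t, (s, a) in enumerate(zip(states, actions)))

def policy_gradient(theta, s0, T, gamma):
    """Apply the J_s / J_theta recursion backwards from t = T to t = 0."""
    states, actions = rollout(theta, s0, T)
    J_s, J_theta = 0.0, 0.0  # J' terms are zero beyond the horizon
    for t in reversed(range(T + 1)):
        s, a = states[t], actions[t]
        D_s, D_a = discriminator_grads(s, a)
        pi_s, pi_theta = theta, s  # partials of a = theta * s
        # The right-hand side uses the J' values from step t + 1.
        J_s, J_theta = (
            D_s + D_a * pi_s + gamma * J_s * (F_S + F_A * pi_s),
            D_a * pi_theta + gamma * (J_s * F_A * pi_theta + J_theta),
        )
    return J_theta

if __name__ == "__main__":
    theta, s0, T, gamma, h = 0.4, 1.0, 5, 0.95, 1e-5
    numeric = (objective(theta + h, s0, T, gamma)
               - objective(theta - h, s0, T, gamma)) / (2 * h)
    print(policy_gradient(theta, s0, T, gamma), numeric)
```

In the stochastic setting of the paper the same recursion is applied to sampled transitions, with the re-parameterized or Gumbel-Softmax actions supplying <math> \pi_\theta </math> and <math> \pi_s </math>.<br />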
<br />
== Forward Model Structure ==<br />
The stability of the learning process depends on the prediction accuracy of the forward model, but learning an accurate forward model is challenging by itself. The authors propose methods for improving the performance of the forward model based on two aspects of its functionality. First, the forward model should learn to use the action as an operator over the state space. To accomplish this, the actions and states, which are sampled from different distributions, first need to be represented in a shared space. This is done by encoding the state and action using two separate neural networks and combining their outputs to form a single vector. Additionally, multiple previous states are used to predict the next state by representing the environment as an <math> n^{th} </math> order MDP. A gated recurrent unit (GRU, a simpler variant of the LSTM model) layer is incorporated into the state encoder to enable recurrent connections from previous states. Using these modifications, the model is able to achieve better and more stable results compared to the standard forward model based on a feed-forward neural network. The comparison is presented in Figure 4.<br />
<br />
[[File:performance_comparison.PNG]]<br />
<br />
Figure 4: Performance comparison between a basic forward model (Blue), and the advanced forward model (Green).<br />
<br />
= Experiments =<br />
The proposed algorithm is evaluated on three discrete control tasks (Cartpole, Mountain-Car, Acrobot), and five continuous control tasks (Hopper, Walker, Half-Cheetah, Ant, and Humanoid), which are modeled by the MuJoCo physics simulator (Todorov et al., 2012). Expert policies are trained using the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015). A different number of trajectories is used to train the expert for each task, but all trajectories are of length 1000.<br />
The discriminator and generator (policy) networks contain two hidden layers with ReLU non-linearity and are trained using the Adam optimizer. The total reward received over a period of <math> N </math> steps using BC, GAIL and MGAIL is presented in Table 1. The proposed algorithm achieved the highest reward for most environments while exhibiting performance comparable to the expert over all of them.<br />
<br />
[[File:mgail_test_results_1.PNG]]<br />
[[File:mgail_test_results.PNG]]<br />
<br />
Table 1. Policy performance, boldface indicates better results, <math> \pm </math> represents one standard deviation.<br />
<br />
= Discussion =<br />
This paper presented a model-based algorithm for imitation learning. It demonstrated how a forward model can be used to train policies using the exact gradient of the discriminator network. A downside of this approach is the need to learn a forward model, since this could be difficult in certain domains. Learning the system dynamics directly from raw images is considered one line of future work. Another direction for future work is to address the violation of the fundamental assumption made by supervised learning algorithms, which requires the data to be i.i.d. This problem arises because the discriminator and forward models are trained in a supervised learning fashion using data sampled from a dynamic distribution. The authors tried a solution proposed by another paper, which is to reset the learning rate several times during the training period, but it did not result in significant improvements.<br />
<br />
= Source =<br />
# Baram, Nir, et al. "End-to-end differentiable adversarial imitation learning." International Conference on Machine Learning. 2017.<br />
# Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.<br />
# Shalev-Shwartz, Shai, et al. "Long-term planning by short-term prediction." arXiv preprint arXiv:1602.01580 (2016).<br />
# Heess, Nicolas, et al. "Learning continuous control policies by stochastic value gradients." Advances in Neural Information Processing Systems. 2015.<br />
# Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:mgail_test_results_1.PNG&diff=35214File:mgail test results 1.PNG2018-03-22T16:35:22Z<p>H5tahir: </p>
<hr />
<div></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=PointNet%2B%2B:_Deep_Hierarchical_Feature_Learning_on_Point_Sets_in_a_Metric_Space&diff=35213PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space2018-03-22T16:29:19Z<p>H5tahir: /* Experiments */</p>
<hr />
<div>= Introduction =<br />
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.<br />
<br />
[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]<br />
<br />
<br />
Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:<br />
<br />
# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> orderings in which the same point cloud can be represented.<br />
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.<br />
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points. <br />
<br />
Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud. <br />
<br />
[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]<br />
<br />
= Review of PointNet =<br />
<br />
The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024-dimensional vector. Then, using a max pool layer, a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to each point from the "nx64" layer, and these points are processed by an MLP to compute a semantic category score for each point.<br />
<br />
The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.<br />
<br />
[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]<br />
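<br />
The role of the max pool as a symmetric function can be demonstrated with a very small sketch. The per-point feature function below is a hand-written, hypothetical placeholder for the learned MLP; the only point of the example is that pooling with a channel-wise maximum gives the same global signature for every ordering of the input points.<br />

```python
import random

def point_feature(point):
    # Hypothetical stand-in for the shared per-point MLP.
    x, y, z = point
    return [x + y + z, x * x + y * y + z * z, max(x, y, z)]

def global_signature(points):
    # Channel-wise max over all per-point features (symmetric in the points).
    features = [point_feature(p) for p in points]
    return [max(channel) for channel in zip(*features)]

if __name__ == "__main__":
    rng = random.Random(0)
    cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(8)]
    shuffled = cloud[:]
    rng.shuffle(shuffled)
    print(global_signature(cloud) == global_signature(shuffled))
```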
<br />
= PointNet++ =<br />
<br />
The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet applies a single max pool over all of its points, information such as the local interaction between points is lost.<br />
<br />
== Problem Statement ==<br />
<br />
There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn a function that takes <math>X</math> as the input and outputs either a class label for the whole set or a per-point label for each member of <math>M</math>.<br />
<br />
== Method ==<br />
<br />
=== High Level Overview ===<br />
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]<br />
<br />
The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used, and at each level of the hierarchy a set of points is processed and abstracted to a new set with fewer points, i.e.,<br />
<br />
\begin{aligned}<br />
\text{Input at each level: } N \times (d + c) \text{ matrix}<br />
\end{aligned}<br />
<br />
where <math>N</math> is the number of points, <math>d</math> is the dimension of the point coordinates <math>(x,y,z)</math> and <math>c</math> is the dimension of the feature representation of each point, and<br />
<br />
\begin{aligned}<br />
\text{Output at each level: } N' \times (d + c') \text{ matrix}<br />
\end{aligned}<br />
<br />
where <math>N'</math> is the new (smaller) number of points and <math>c'</math> is the dimension of the new feature vector.<br />
<br />
<br />
Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.<br />
<br />
=== Sampling Layer ===<br />
<br />
The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.<br />
<br />
To select these points, farthest point sampling is used: <math>\hat{x}_j</math> is chosen as the point most distant from the already selected set <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. Compared with random sampling, this gives better coverage of the entire point cloud.<br />
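<br />
Farthest point sampling as described above can be sketched as a simple greedy loop. This is a plain standard-library illustration, assuming Euclidean distance and an arbitrary starting index; it is not the GPU implementation used by the authors.<br />

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Return the indices of m centroids chosen greedily from `points`."""
    if not 0 < m <= len(points):
        raise ValueError("m must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest already-selected centroid.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < m:
        # Pick the point that is farthest from the selected set.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

if __name__ == "__main__":
    cloud = [(0, 0), (0.1, 0), (10, 0), (5, 0), (0, 0.1)]
    print(farthest_point_sampling(cloud, 3))
```

Each iteration costs one pass over the points, so selecting <math>m</math> centroids from <math>n</math> points takes on the order of <math>nm</math> distance evaluations in this sketch.<br />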
<br />
=== Grouping Layer ===<br />
<br />
The objective of the grouping layer is to form local regions around each centroid by grouping points near the selected centroids. The input is a point set of size <math>N \times (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.<br />
<br />
Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that is the same size for all regions at a hierarchical level.<br />
<br />
To determine which points belong to a group, a ball query is used; all points within a radius of the centroid are grouped. This is advantageous over k-nearest-neighbour search because it guarantees a fixed region scale, which is important when learning local structure.<br />
<br />
=== PointNet Layer ===<br />
<br />
After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.<br />
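<br />
The grouping step and the change to a local coordinate frame can be sketched together. The helper names are hypothetical, and the sketch ignores details of a practical implementation such as capping the number of points per group.<br />

```python
import math

def ball_query(points, centroid, radius):
    # All points whose Euclidean distance to the centroid is within the radius.
    return [p for p in points if math.dist(p, centroid) <= radius]

def to_local_frame(group, centroid):
    # Translate each point so that the centroid becomes the origin.
    return [tuple(c - cc for c, cc in zip(p, centroid)) for p in group]

if __name__ == "__main__":
    cloud = [(0, 0, 0), (1, 0, 0), (3, 0, 0)]
    centroid = (1, 0, 0)
    group = ball_query(cloud, centroid, radius=1.5)
    print(to_local_frame(group, centroid))
```

After the translation, the PointNet applied to each group only sees relative positions, so the same local shape produces the same input regardless of where it sits in the point cloud.<br />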
<br />
=== Robust Feature Learning under Non-Uniform Sampling Density ===<br />
<br />
The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.<br />
<br />
The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated. This method, however, is computationally expensive because for each region it always applies PointNet to all points. On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, the left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated with a vector that is created by using PointNet on all the points in the level below. The second vector can be weighted more heavily if the first vector is computed from only a few points.<br />
<br />
<br />
[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]<br />
<br />
== Point Cloud Segmentation ==<br />
<br />
If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.<br />
<br />
=== Distance-based Interpolation ===<br />
<br />
Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.<br />
<br />
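To propagate features, an inverse distance weighted average based on the <math>k</math> nearest neighbors is used, where each neighbor is weighted by the inverse of its distance raised to the power <math>p</math>. The authors use <math>p=2</math> and <math>k=3</math>.<br />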
<br />
[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]<br />
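<br />
The interpolation step can be sketched as follows, assuming the usual inverse-distance weights <math>w_i = 1/d(x, x_i)^p</math> normalized over the <math>k</math> nearest neighbors. The small constant <code>eps</code> is an addition made here to avoid division by zero when a query coincides with a known point.<br />

```python
import math

def interpolate_features(query, known_points, known_features, k=3, p=2, eps=1e-8):
    """Inverse-distance-weighted average of the k nearest known features."""
    nearest = sorted(
        (math.dist(query, pt), idx) for idx, pt in enumerate(known_points)
    )[:k]
    weights = [1.0 / (d ** p + eps) for d, _ in nearest]
    total = sum(weights)
    dim = len(known_features[0])
    return [
        sum(w * known_features[idx][c] for w, (_, idx) in zip(weights, nearest)) / total
        for c in range(dim)
    ]

if __name__ == "__main__":
    points = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (10, 10, 10)]
    features = [[1.0], [3.0], [5.0], [100.0]]
    print(interpolate_features((1, 0, 0), points, features))
```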
<br />
=== Skip-connections ===<br />
<br />
In addition, skip connections are used (see the PointNet++ architecture diagram). The features from the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.<br />
<br />
== Experiments ==<br />
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labelling, and classification in non-Euclidean space.<br />
<br />
=== Point Set Classification in Euclidean Metric Space ===<br />
<br />
The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.<br />
<br />
[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]<br />
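<br />
The conversion of a digit image to a 2D point cloud described above can be sketched as follows. The maximum intensity of 255 and the orientation of the vertical axis are assumptions made for this illustration; the text only specifies normalization to <math>[0, 1]</math>, the 0.5 threshold, and an origin at the centre of the image.<br />

```python
def image_to_point_cloud(image, threshold=0.5, max_intensity=255.0):
    """Convert a 2D grid of pixel intensities into a list of (x, y) points."""
    height, width = len(image), len(image[0])
    center_y, center_x = (height - 1) / 2, (width - 1) / 2
    points = []
    for row_idx, row in enumerate(image):
        for col_idx, value in enumerate(row):
            # Keep only pixels whose normalized intensity exceeds the threshold.
            if value / max_intensity > threshold:
                # Origin at the image centre; y grows upwards (assumed).
                points.append((col_idx - center_x, center_y - row_idx))
    return points

if __name__ == "__main__":
    tiny = [[200, 0, 0],
            [0, 255, 0],
            [0, 0, 10]]
    print(image_to_point_cloud(tiny))
```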
<br />
In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below.<br />
<br />
[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]<br />
<br />
An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points were reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points.<br />
<br />
[[File:paper28_fig4_chair.png | 300px|thumb|center|An example showing the reduction of points visually. At 256 points, the points making up the object are very sparse; however, the accuracy is reduced by less than 1%]][[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]<br />
<br />
=== Semantic Scene Labelling ===<br />
<br />
The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.<br />
<br />
[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]<br />
<br />
To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.<br />
<br />
[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]<br />
<br />
=== Classification in Non-Euclidean Metric Space ===<br />
<br />
[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]<br />
<br />
Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.<br />
<br />
[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]<br />
<br />
=== Feature Visualization ===<br />
The figure below visualizes what is learned by just the first layer kernels of the network. The model is trained on a dataset that mostly consisted of furniture, which explains the lines, corners, and planes visible in the visualization. Visualization is performed by creating a voxel grid in space and only aggregating point sets that activate specific neurons the most.<br />
<br />
[[File:26_8.PNG | 500px|thumb|center|Pointclouds learned from first layer kernels (red is near, blue is far)]]<br />
<br />
== Critique ==<br />
<br />
It seems clear that PointNet lacks the ability to capture local context between points. PointNet++ seems to be an important extension, but the improvements in the experimental results seem small. Some computational efficiency experiments would have been nice, for example, the processing speed of the network and the computational efficiency of MRG over MSG.<br />
<br />
== Code ==<br />
<br />
Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2 <br />
<br />
<br />
=Sources=<br />
1. Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017<br />
<br />
2. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:26_8.PNG&diff=35212File:26 8.PNG2018-03-22T16:24:45Z<p>H5tahir: </p>
<hr />
<div></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=35211Do Deep Neural Networks Suffer from Crowding2018-03-22T16:15:27Z<p>H5tahir: /* Models */</p>
<hr />
<div>= Introduction =<br />
Ever since the rise of deep networks, a tremendous amount of research and effort has been put into making machines capable of recognizing objects the same way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is another visual effect experienced by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it; this is a very common real-life experience. This paper focuses on studying the impact of crowding on Deep Neural Networks (DNNs) by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNNs) and a multi-scale eccentricity-dependent model, which is an extension of the DCNN inspired by the retina. In this model, the receptive field size of the convolutional filters grows with increasing distance from the center of the image, called the eccentricity, as explained below. The authors focus on the dependence of crowding on image factors such as flanker configuration, target-flanker similarity, and target eccentricity, and on premature pooling in particular.<br />
<br />
= Models =<br />
Two types of models are considered: deep convolutional neural networks and eccentricity-dependent models. Motivated by several hypotheses that pooling is the cause of crowding in human perception, the paper investigates the effects of pooling on the recognition of crowded images through these two network types. <br />
<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is the main consideration, and hence three different configurations are investigated: <br />
<br />
1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
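
The feature map sizes quoted for the three configurations can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming unpadded ("valid") 5x5 convolutions and unpadded 3x3 pooling; the summary does not state the padding, so this is an assumption that happens to match the listed numbers. The function names are hypothetical.

```python
def out_size(size, kernel, stride=1):
    """Side length after an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1


def feature_map_sizes(input_size, pool_strides, global_pool_at_end):
    """Square feature map side after each conv + pool stage.

    pool_strides: stride of the 3x3 max-pooling in each of the three stages.
    global_pool_at_end: if True, the last stage pools over the whole map.
    """
    sizes = [input_size]
    size = input_size
    for stage, stride in enumerate(pool_strides):
        size = out_size(size, kernel=5)  # 5x5 convolution, stride 1
        is_last = stage == len(pool_strides) - 1
        if is_last and global_pool_at_end:
            size = 1  # max-pool over everything that remains
        else:
            size = out_size(size, kernel=3, stride=stride)  # 3x3 max-pooling
        sizes.append(size)
    return sizes


no_total_pooling = feature_map_sizes(60, [1, 1, 1], global_pool_at_end=False)
progressive_pooling = feature_map_sizes(60, [2, 2, 2], global_pool_at_end=True)
at_end_pooling = feature_map_sizes(60, [1, 1, 1], global_pool_at_end=True)

print(no_total_pooling)     # [60, 54, 48, 42]
print(progressive_pooling)  # [60, 27, 11, 1]
print(at_end_pooling)       # [60, 54, 48, 1]
```

Under this assumption, each 5x5 convolution removes 4 pixels and each stride-1 3x3 pooling removes 2 more, which explains the decrease of 6 per stage in the no total pooling case.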
<br />
===What is the problem in CNNs?===<br />
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. Even more importantly, CNNs rely on data augmentation to achieve transformation invariance, which requires a lot of additional processing.<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. It emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.<br />
[[File:EDM.png|2000x450px|center]]<br />
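
To make the multi-scale cropping concrete, the sketch below lists the side lengths of 11 centered crops that grow by a factor of <math>\sqrt{2}</math>. It assumes the factor applies to the crop side length and that the smallest crop is 60 pixels wide; neither is stated explicitly in this summary. Under these assumptions the largest crop comes out at 1920 pixels, which matches the image size given in the experiments section. The function name is hypothetical.

```python
import math


def crop_side_lengths(smallest=60.0, n_scales=11, factor=math.sqrt(2)):
    """Side lengths (in pixels) of centered crops, smallest first."""
    return [smallest * factor ** i for i in range(n_scales)]


sizes = crop_side_lengths()

# Every crop is resized to 60x60, so larger crops are sampled more coarsely.
downsampling = [side / 60.0 for side in sizes]

for i, (side, ds) in enumerate(zip(sizes, downsampling), start=1):
    print(f"scale {i:2d}: crop side {side:7.1f} px, downsampled by {ds:5.2f}x")
```

The smallest crop keeps the full resolution of the central region, while the largest crop covers the whole image at a much lower resolution, which is the inverted-pyramid sampling described above.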
<br />
The architecture of this model is the same as the previous DCNN model, with the only change being that the filters are applied to each of the scales; because the weights are shared across scales, the number of parameters remains the same as in the DCNN models. The authors perform spatial pooling (the aforementioned ''At end pooling'' is used here) and scale pooling, which reduces the number of scales by taking the maximum value at corresponding locations in the feature maps across multiple scales. Scale pooling has three configurations: (1) at the beginning, 11-1-1-1-1, in which all the different scales are pooled together after the first layer, (2) progressively, 11-7-5-3-1, and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
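
Scale pooling as described here is an element-wise maximum across the scale dimension. The following is a minimal NumPy sketch of the simplest case, pooling all 11 scales into one; the array layout (scales, height, width, channels) and the random feature values are assumptions made for illustration, and the progressive 11-7-5-3-1 grouping is not reproduced because the summary does not say which scales are pooled together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps: 11 scales, 8x8 spatial positions, 32 channels.
features = rng.standard_normal((11, 8, 8, 32))

# Scale pooling: keep, for each location and channel, the largest response
# found at that location across all scales.
pooled = features.max(axis=0, keepdims=True)

print(features.shape, "->", pooled.shape)
```

Because the maximum is taken per location and per channel, the spatial layout of the feature map is preserved and only the scale dimension collapses.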
<br />
===Contrast Normalization===<br />
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments and its Set-Up =<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the flanker are referred to as ''a'' and ''x'' respectively, in the four configurations below: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image than the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object (xax).<br />
<br />
Training is done using backpropagation with images of size <math>1920 px^2</math> containing embedded target objects and flankers of size <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels on either side of the target (xax), with the target having translational variance. The tests are evaluated on (i) the DCNN with at end pooling, and (ii) the eccentricity-dependent model with 11-11-11-11-1 scale pooling, at end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.<br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the same flanker configuration as in training, models are better at recognizing objects in clutter than isolated objects, for all image locations.
* If the target-flanker spacing is changed, the models perform worse.
* The eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image.
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.<br />
[[File:result2.png|750x400px|center]]<br />
In addition to the evaluation of DCNNs at a constant target eccentricity of 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. The same conclusions as for the 240-pixel target eccentricity hold: the closer the flanker is to the target, the more accuracy decreases. Also, when the target is close to the image boundary, recognition is poor because boundary effects erode information about the target.
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
===DCNN Observations===<br />
* The recognition gets worse with the increase in the number of flankers.<br />
* Convolutional networks are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps in learning invariance.<br />
* Flankers similar to the target object help in recognition since they don't activate the convolutional filters more.
* notMNIST flankers lead to more crowding since they have many more edges and white image pixels, which activate the convolutional layers more.
<br />
===Eccentricity Model===<br />
The set-up is the same as explained earlier.<br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales contribute equally and hence the eccentricity dependence is removed.
* Early pooling is harmful since it might discard, very early, information that would be useful to the network.
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on the DCNN and the eccentricity model, with and without contrast normalization, using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
* Only the eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.
<br />
=Conclusions=<br />
We often assume that training the network with data similar to the test data will achieve good results in a general scenario too, but that is not the case: the model trained with flankers did not give ideal results for the target objects.<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': Adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
<br />
=Critique=<br />
This paper only examines how flankers affect target recognition, that is, how crowding can affect recognition; it does not propose anything novel in terms of architecture to handle this type of crowding. The eccentricity-based model does well only when the target is placed at the center of the image, but windowing over the frames instead of taking crops starting from the middle might help.<br />
<br />
=References=<br />
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=35209Do Deep Neural Networks Suffer from Crowding2018-03-22T16:11:48Z<p>H5tahir: /* Experiments and its Set-Up */</p>
<hr />
<div>= Introduction =<br />
Since the rise of deep networks, a tremendous amount of research effort has gone into making machines capable of recognizing objects the way humans do. Humans can recognize objects in a way that is invariant to scale, translation, and clutter. Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This is a very common real-life experience. This paper studies the impact of crowding on Deep Neural Networks (DNNs) by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNNs) and a multi-scale eccentricity-dependent model, which is an extension of the DCNN inspired by the retina. In this model, the receptive field size of the convolutional filters grows with increasing distance from the center of the image, called the eccentricity, as explained below. The authors focus on the dependence of crowding on image factors such as flanker configuration, target-flanker similarity, and target eccentricity, and on premature pooling in particular.<br />
<br />
= Models =<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is the main consideration, and hence three different configurations are investigated: <br />
<br />
1. '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
2. '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
3. '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
<br />
===What is the problem in CNNs?===<br />
CNNs fall short in explaining human perceptual invariance. First, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. Even more importantly, CNNs rely on data augmentation to achieve transformation invariance, which requires a lot of additional processing.<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. It emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space.<br />
[[File:EDM.png|2000x450px|center]]<br />
<br />
The architecture of this model is the same as the previous DCNN model, with the only change being that the filters are applied to each of the scales; because the weights are shared across scales, the number of parameters remains the same as in the DCNN models. The authors perform spatial pooling (the aforementioned ''At end pooling'' is used here) and scale pooling, which reduces the number of scales by taking the maximum value at corresponding locations in the feature maps across multiple scales. Scale pooling has three configurations: (1) at the beginning, 11-1-1-1-1, in which all the different scales are pooled together after the first layer, (2) progressively, 11-7-5-3-1, and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
<br />
===Contrast Normalization===<br />
Since we have multiple scales of an input image, in some experiments, we perform normalization such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments and its Set-Up =<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, notMNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the flanker are referred to as ''a'' and ''x'' respectively, in the four configurations below: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image than the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object (xax).<br />
<br />
Training is done using backpropagation with images of size <math>1920 px^2</math> containing embedded target objects and flankers of size <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels on either side of the target (xax), with the target having translational variance. The tests are evaluated on (i) the DCNN with at end pooling, and (ii) the eccentricity-dependent model with 11-11-11-11-1 scale pooling, at end spatial pooling and contrast normalization. The test data has different flanker configurations as described above.<br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the same flanker configuration as in training, models are better at recognizing objects in clutter than isolated objects, for all image locations.
* If the target-flanker spacing is changed, the models perform worse.
* The eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image.
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before.<br />
[[File:result2.png|750x400px|center]]<br />
In addition to the evaluation of DCNNs at a constant target eccentricity of 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. The same conclusions as for the 240-pixel target eccentricity hold: the closer the flanker is to the target, the more accuracy decreases. Also, when the target is close to the image boundary, recognition is poor because boundary effects erode information about the target.
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
===DCNN Observations===<br />
* The recognition gets worse with the increase in the number of flankers.<br />
* Convolutional networks are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps in learning invariance.<br />
* Flankers similar to the target object help in recognition since they don't activate the convolutional filters more.
* notMNIST flankers lead to more crowding since they have many more edges and white image pixels, which activate the convolutional layers more.
<br />
===Eccentricity Model===<br />
The set-up is the same as explained earlier.<br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales contribute equally and hence the eccentricity dependence is removed.
* Early pooling is harmful since it might discard, very early, information that would be useful to the network.
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on the DCNN and the eccentricity model, with and without contrast normalization, using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
* Only the eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target.
<br />
=Conclusions=<br />
We often assume that training the network with data similar to the test data will achieve good results in a general scenario too, but that is not the case: the model trained with flankers did not give ideal results for the target objects.<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': Adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
<br />
=Critique=<br />
This paper only examines how flankers affect target recognition, that is, how crowding can affect recognition; it does not propose anything novel in terms of architecture to handle this type of crowding. The eccentricity-based model does well only when the target is placed at the center of the image, but windowing over the frames instead of taking crops starting from the middle might help.<br />
<br />
=References=<br />
1) Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
2) Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
3) J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Spherical_CNNs&diff=35208Spherical CNNs2018-03-22T15:52:19Z<p>H5tahir: /* Molecular Atomization */</p>
<hr />
<div>= Introduction =<br />
Convolutional Neural Networks (CNNs), or network architectures involving CNNs, are the current state of the art for learning 2D image processing tasks such as semantic segmentation and object detection. CNNs work well in large part due to the property of being translationally equivariant. This property allows a network trained to detect a certain type of object to still detect the object even if it is translated to another position in the image. However, this does not correspond well to spherical signals since projecting a spherical signal onto a plane will result in distortions, as demonstrated in Figure 1. There are many different types of spherical projections onto a 2D plane, as most people know from the various types of world maps, none of which provide all the necessary properties for rotation-invariant learning.<br />
<br />
[[File:paper26-fig1.png|center]]<br />
<br />
= Notation =<br />
Below are listed several important terms:<br />
* '''Unit Sphere''' <math>S^2</math> is defined as a sphere where all of its points are at distance 1 from the origin. The unit sphere can be parameterized by the spherical coordinates <math>\alpha \in [0, 2\pi]</math> and <math>\beta \in [0, \pi]</math>. This is a two-dimensional manifold with respect to <math>\alpha</math> and <math>\beta</math>.<br />
* '''<math>S^2</math> Sphere''' The two-dimensional surface of a 3D sphere.<br />
* '''Spherical Signals''' In the paper, spherical images and filters are modeled as continuous functions <math>f : S^2 \to \mathbb{R}^K</math>, where K is the number of channels. Just as RGB images have 3 channels, a spherical signal can have numerous channels describing the data. Examples of channels that were used can be found in the experiments section.<br />
* '''Rotations - SO(3)''' The group of 3D rotations on an <math>S^2</math> sphere. Sometimes called the "special orthogonal group". In this paper the ZYZ-Euler parameterization is used to represent SO(3) rotations with <math>\alpha, \beta</math>, and <math>\gamma</math>. Any rotation can be broken down into first a rotation (<math>\alpha</math>) about the Z-axis, then a rotation (<math>\beta</math>) about the new Y-axis (Y'), followed by a rotation (<math>\gamma</math>) about the new Z axis (Z"). [In the rest of this paper, to integrate functions on SO(3), the authors use a rotationally invariant probability measure on the Borel subsets of SO(3). This measure is an example of a Haar measure. Haar measures generalize the idea of rotationally invariant probability measures to general topological groups. For more on Haar measures, see (Feldman 2002) ]<br />
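
The ZYZ-Euler parameterization can be checked numerically. The following is a minimal NumPy sketch, assuming the usual convention that the three successive rotations about the moving axes compose as <math>R = R_z(\alpha) R_y(\beta) R_z(\gamma)</math>; the helper names are hypothetical and the angles are arbitrary.

```python
import numpy as np


def rot_z(theta):
    """Rotation by theta about the Z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rot_y(theta):
    """Rotation by theta about the Y axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def zyz(alpha, beta, gamma):
    """Rotation matrix from ZYZ Euler angles (assumed composition order)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_z(gamma)


R = zyz(0.3, 1.1, -0.7)

# Members of SO(3) are orthogonal matrices with determinant +1.
is_orthogonal = np.allclose(R.T @ R, np.eye(3))
determinant = np.linalg.det(R)

# With gamma = 0 the rotation carries the north pole to the point on the
# unit sphere with spherical coordinates (alpha, beta).
north_pole = np.array([0.0, 0.0, 1.0])
point = zyz(0.3, 1.1, 0.0) @ north_pole

print(is_orthogonal, round(determinant, 6), point)
```

The last check illustrates why <math>\alpha</math> and <math>\beta</math> double as the spherical coordinates of the unit sphere, while <math>\gamma</math> is the extra degree of freedom that distinguishes SO(3) from <math>S^2</math>.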
<br />
= Related Work =<br />
The related work presented in this paper is very brief, in large part due to the novelty of spherical CNNs and the length of the rest of the paper. The authors enumerate numerous papers which attempt to exploit larger groups of symmetries such as the translational symmetries of CNNs but do not go into specific details for any of these attempts. They do state that all the previous works are limited to discrete groups with the exception of SO(2)-steerable networks.<br />
The authors also mention that previous works exist that analyze spherical images but that these do not have an equivariant architecture. They claim that Spherical CNNs are "the first to achieve equivariance to a continuous, non-commutative group (SO(3))". They also claim to be the first to use the generalized Fourier transform for efficient computation of group correlation.<br />
<br />
= Correlations on the Sphere and Rotation Group =<br />
Spherical correlation is like planar correlation except instead of translation, there is rotation. The definitions for each are provided as follows:<br />
<br />
'''Planar correlation''' The value of the output feature map at translation <math>\small x ∈ Z^2</math> is computed as an inner product between the input feature map and a filter, shifted by <math>\small x</math>.<br />
<br />
'''Spherical correlation''' The value of the output feature map evaluated at rotation <math>\small R ∈ SO(3)</math> is computed as an inner product between the input feature map and a filter, rotated by <math>\small R</math>.<br />
<br />
'''Rotation of Spherical Signals''' The paper introduces the rotation operator <math>L_R</math>. The rotation operator rotates a function (which allows us to rotate the spherical filters) by evaluating it at <math>R^{-1}x</math>, i.e. <math>[L_R f](x) = f(R^{-1}x)</math>. With this definition we have the property that <math>L_{RR'} = L_R L_{R'}</math>.<br />
<br />
'''Inner Products''' The inner product of spherical signals is simply the integral summation on the vector space over the entire sphere.<br />
<br />
<math>\langle\psi , f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (x)dx</math><br />
<br />
<math>dx</math> here is SO(3) rotation invariant and is equivalent to <math>d \alpha \sin(\beta) d \beta / 4 \pi </math> in spherical coordinates. This comes from the ZYZ-Euler parameterization where any rotation can be broken down into first a rotation about the Z-axis, then a rotation about the new Y-axis (Y'), followed by a rotation about the new Z axis (Z"). More details on this are given in Appendix A in the paper.<br />
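As a quick sanity check of the normalized measure <math>d \alpha \sin(\beta) d \beta / 4 \pi</math>, the sketch below integrates over the sphere with a simple midpoint rule. It is an illustration under stated assumptions, not the quadrature used in the paper: the function name <code>sphere_integral</code> and the grid sizes are hypothetical, and <math>\alpha</math> is taken as longitude and <math>\beta</math> as colatitude.

```python
import numpy as np

def sphere_integral(func, n_alpha=256, n_beta=256):
    """Midpoint-rule approximation of the integral of func(alpha, beta)
    over S^2 with respect to d(alpha) sin(beta) d(beta) / (4 pi)."""
    d_alpha = 2.0 * np.pi / n_alpha
    d_beta = np.pi / n_beta
    alpha = (np.arange(n_alpha) + 0.5) * d_alpha   # longitude in [0, 2 pi)
    beta = (np.arange(n_beta) + 0.5) * d_beta      # colatitude in [0, pi]
    a, b = np.meshgrid(alpha, beta, indexing="ij")
    weights = np.sin(b) * d_alpha * d_beta / (4.0 * np.pi)
    return float(np.sum(func(a, b) * weights))

# The measure is normalized, so the constant function 1 integrates to about 1.
total = sphere_integral(lambda a, b: np.ones_like(a))
# The average of z^2 = cos(beta)^2 over the sphere is 1/3.
mean_z_squared = sphere_integral(lambda a, b: np.cos(b) ** 2)
```

The division by <math>4 \pi</math> is what turns the surface area element into a probability measure on the sphere.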
<br />
By this definition, the invariance of the inner product is then guaranteed for any rotation <math>R ∈ SO(3)</math>. In other words, when subjected to rotations, the volume under a spherical heightmap does not change. The following equations show that <math>L_R</math> has a distinct adjoint (<math>L_{R^{-1}}</math>) and that <math>L_R</math> is unitary and thus preserves orthogonality and distances.<br />
<br />
<math>\langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
::::<math>= \int_{S^2} \sum_{k=1}^K \psi_k (x)f_k (Rx)dx</math><br />
<br />
::::<math>= \langle \psi , L_{R^{-1}} f \rangle</math><br />
<br />
'''Spherical Correlation''' With the above knowledge the definition of spherical correlation of two signals <math>f</math> and <math>\psi</math> is:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi \,, f \rangle = \int_{S^2} \sum_{k=1}^K \psi_k (R^{-1} x)f_k (x)dx</math><br />
<br />
The output of the above equation is a function on SO(3). This can be thought of as for each rotation combination of <math>\alpha , \beta , \gamma </math> there is a different volume under the correlation. The authors make a point of noting that previous work by Driscoll and Healey only ensures circular symmetries about the Z axis and their new formulation ensures symmetry about any rotation.<br />
<br />
'''Rotation of SO(3) Signals''' The first layer of Spherical CNNs take a function on the sphere (<math>S^2</math>) and output a function on SO(3). Therefore, if a Spherical CNN with more than one layer is going to be built there needs to be a way to find the correlation between two signals on SO(3). The authors then generalize the rotation operator (<math>L_R</math>) to encompass acting on signals from SO(3). This new definition of <math>L_R</math> is as follows: (where <math>R^{-1}Q</math> is a composition of rotations, i.e. multiplication of rotation matrices)<br />
<br />
<math>[L_Rf](Q)=f(R^{-1} Q)</math><br />
<br />
'''Rotation Group Correlation''' The correlation of two signals (<math>f,\psi</math>) on SO(3) with K channels is defined as the following:<br />
<br />
<math>[\psi \star f](R) = \langle L_R \psi , f \rangle = \int_{SO(3)} \sum_{k=1}^K \psi_k (R^{-1} Q)f_k (Q)dQ</math><br />
<br />
where dQ is the invariant measure on SO(3), which in ZYZ-Euler angles is <math>d \alpha \sin(\beta) d \beta d \gamma / 8 \pi^2 </math>. A complete derivation of this can be found in Appendix A.<br />
<br />
'''Equivariance''' The equivariance for the rotation group correlation is similarly demonstrated. A layer is equivariant if for some operator <math>T_R</math>, <math>\Phi \circ L_R = T_R \circ \Phi</math>, and: <br />
<br />
<math>[\psi \star [L_Qf]](R) = \langle L_R \psi , L_Qf \rangle = \langle L_{Q^{-1} R} \psi , f \rangle = [\psi \star f](Q^{-1}R) = [L_Q[\psi \star f]](R) </math>.<br />
<br />
= Implementation with GFFT =<br />
The authors leverage the Generalized Fourier Transform (GFT) and Generalized Fast Fourier Transform (GFFT) algorithms to compute the correlations outlined in the previous section. The Fast Fourier Transform (FFT) can compute correlations and convolutions efficiently by means of the Fourier theorem, which states that a correlation in the spatial domain becomes a product of Fourier coefficients in the spectral domain. The Fourier transform can be generalized to <math>S^2</math> and SO(3) and is then called the GFT. The GFT is a linear projection of a function onto orthogonal basis functions. The basis functions are a set of irreducible unitary representations for a group (such as for <math>S^2</math> or SO(3)). For <math>S^2</math> the basis functions are the spherical harmonics <math>Y_m^l(x)</math>. For SO(3) these basis functions are called the Wigner D-functions <math>D_{mn}^l(R)</math>. For both sets of functions the indices are restricted to <math>l\geq0</math> and <math>-l \leq m,n \leq l</math>. The Wigner D-functions are also orthogonal so the Fourier coefficients can be computed by the inner product with the Wigner D-functions (See Appendix C for complete proof). The Wigner D-functions are complete which means that any function (which is well behaved) on SO(3) can be expressed as a linear combination of the Wigner D-functions. The GFT of a function on SO(3) is thus:<br />
<br />
<math>\hat{f^l} = \int_X f(x) D^l(x)dx</math><br />
<br />
where <math>\hat{f}</math> represents the Fourier coefficients. For <math>S^2</math> we have the same equation but with the basis functions <math>Y^l</math>.<br />
<br />
The inverse SO(3) Fourier transform is:<br />
<br />
<math>f(R)=[\mathcal{F}^{-1} \hat{f}](R) = \sum_{l=0}^b (2l + 1) \sum_{m=-l}^l \sum_{n=-l}^l \hat{f_{mn}^l} D_{mn}^l(R) </math><br />
<br />
The bandwidth b represents the maximum frequency and is related to the resolution of the spatial grid. Kostelec and Rockmore are referenced for more knowledge on this topic.<br />
<br />
The authors give proofs (Appendix D) that the SO(3) correlation satisfies the Fourier theorem and the <math>S^2</math> correlation of spherical signals can be computed by the outer products of the <math>S^2</math>-FTs (Shown in Figure 2).<br />
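To make the role of the Fourier theorem concrete, the sketch below checks it in the simplest setting: circular correlation of two real signals on a 1D periodic grid, computed once directly and once with the ordinary FFT. This is a commutative analog chosen for illustration only; it does not implement the <math>S^2</math> or SO(3) transforms from the paper, and the function names are hypothetical.

```python
import numpy as np

def circular_correlation_direct(psi, f):
    """Direct O(n^2) correlation: out[x] = sum_y psi[(y - x) mod n] * f[y]."""
    n = len(f)
    # np.roll(psi, x)[y] == psi[(y - x) mod n], i.e. the filter shifted by x.
    return np.array([np.dot(np.roll(psi, x), f) for x in range(n)])

def circular_correlation_fft(psi, f):
    """Same correlation in O(n log n) via the Fourier theorem.

    For real psi, the transform of the correlation is
    conj(FFT(psi)) * FFT(f).
    """
    spectrum = np.conj(np.fft.fft(psi)) * np.fft.fft(f)
    return np.real(np.fft.ifft(spectrum))

rng = np.random.default_rng(0)
psi = rng.standard_normal(64)
f = rng.standard_normal(64)
direct = circular_correlation_direct(psi, f)
fast = circular_correlation_fft(psi, f)
max_abs_difference = float(np.max(np.abs(direct - fast)))
```

The two results should agree up to floating-point error. In the spherical case the pointwise product of coefficients is replaced by matrix or outer products per frequency <math>l</math>, as described above.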
<br />
[[File:paper26-fig2.png|center]]<br />
<br />
The GFFT algorithm details are taken from Kostelec and Rockmore. The authors claim they have the first automatically differentiable implementation of the GFT for <math>S^2</math> and SO(3). The authors do not provide any run time comparisons for real time applications (they only mention that the FFT can be computed in <math>O(n \log n)</math> time) or any comparisons of training times with and without the GFFT. However, they do provide the source code of their implementation at: https://github.com/jonas-koehler/s2cnn.<br />
<br />
= Experiments =<br />
The authors provide several experiments. The first set of experiments are designed to show the numerical stability and accuracy of the outlined methods. The second group of experiments demonstrates how the algorithms can be applied to current problem domains.<br />
<br />
==Equivariance Error==<br />
In this experiment the authors try to show experimentally that their theory of equivariance holds. Equivariance was proven only for the continuous case, so they had doubts about whether it would hold in practice because of potential discretization artifacts. If equivariance does not hold, the weight sharing scheme becomes less effective. The experiment is set up by first testing the equivariance of the SO(3) correlation at different resolutions. 500 random rotations and feature maps (with 10 channels) are sampled. They then calculate the approximation error <math>\small\Delta = \dfrac{1}{n} \sum_{i=1}^n std(L_{R_i} \Phi(f_i) - \Phi(L_{R_i} f_i))/std(\Phi(f_i))</math><br />
Note: The authors do not mention what the std function is; however, it is likely the standard deviation function, as 'std' is the command for standard deviation in MATLAB.<br />
<math>\Phi</math> is a composition of SO(3) correlation layers with filters which have been randomly initialized. The authors mention that they were expecting <math>\Delta</math> to be zero in the case of perfect equivariance. This is because, as proven earlier, the two terms in the difference <math>\small L_{R_i} \Phi(f_i) - \Phi(L_{R_i} f_i)</math> are equal in the continuous case. The results are shown in Figure 3. <br />
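The error measure <math>\Delta</math> can be sketched as below. This is an illustration under assumptions, not the authors' experiment: <code>std</code> is taken to be the standard deviation over all entries (as guessed in the note above), and a toy layer (1D circular correlation followed by ReLU) with circular shifts stands in for the SO(3) layers and rotations. The names <code>equivariance_error</code> and <code>toy_layer</code> are hypothetical.

```python
import numpy as np

def equivariance_error(rotated_outputs, outputs_of_rotated, outputs):
    """Mean over samples of std(L_R Phi(f) - Phi(L_R f)) / std(Phi(f))."""
    ratios = [
        np.std(a - b) / np.std(c)
        for a, b, c in zip(rotated_outputs, outputs_of_rotated, outputs)
    ]
    return float(np.mean(ratios))

def toy_layer(f, psi):
    """Stand-in for Phi: circular correlation with psi, then ReLU."""
    corr = np.real(np.fft.ifft(np.conj(np.fft.fft(psi)) * np.fft.fft(f)))
    return np.maximum(corr, 0.0)

rng = np.random.default_rng(0)
psi = rng.standard_normal(32)                      # random filter
signals = [rng.standard_normal(32) for _ in range(5)]
shifts = [int(s) for s in rng.integers(0, 32, size=5)]

outputs = [toy_layer(f, psi) for f in signals]                  # Phi(f)
rotated_outputs = [np.roll(o, s) for o, s in zip(outputs, shifts)]  # L_R Phi(f)
outputs_of_rotated = [
    toy_layer(np.roll(f, s), psi) for f, s in zip(signals, shifts)
]                                                               # Phi(L_R f)
delta = equivariance_error(rotated_outputs, outputs_of_rotated, outputs)
```

In this toy setting shifts are exact on the grid, so <code>delta</code> is zero up to floating-point error. On the sphere, rotating a sampled feature map requires interpolation, which is where a nonzero error can enter.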
<br />
[[File:paper26-fig3.png|center]]<br />
<br />
<math>\Delta</math> only grows with the resolution and the number of layers when there is no activation function. With ReLU activation the error stays roughly constant across resolutions. The authors conclude that the error must therefore come from the feature map rotation rather than from the network layers, since this rotation is exact only for bandlimited functions.<br />
<br />
==MNIST Data==<br />
The experiment using MNIST data was created by projecting MNIST digits onto a sphere using stereographic projection to create the resulting images as seen in Figure 4.<br />
<br />
[[File:paper26-fig4.png|center]]<br />
<br />
The authors created two datasets, one with the projected digits and the other with the same projected digits which were then subjected to a random rotation. The spherical CNN architecture used was <math>\small S^2</math>conv-ReLU-SO(3)conv-ReLU-FC-softmax and was attempted with bandwidths of 30,10,6 and 20,40,10 channels for each layer respectively. This model was compared to a baseline CNN with layers conv-ReLU-conv-ReLU-FC-softmax with 5x5 filters, 32,64,10 channels and stride of 3. For comparison this leads to approximately 68K parameters for the baseline and 58K parameters for the spherical CNN. Results can be seen in Table 1. It is clear from the results that the spherical CNN architecture made the network rotationally invariant. Performance on the rotated set is almost identical to the non-rotated set. This is true even when trained on the non-rotated set and tested on the rotated set. Compare this to the non-spherical architecture which becomes unusable when rotating the digits.<br />
<br />
[[File:paper26-tab1.png|center]]<br />
<br />
==SHREC17==<br />
The SHREC dataset contains 3D models from the ShapeNet dataset which are classified into categories. It consists of a regularly aligned dataset and a rotated dataset. The models from the SHREC17 dataset were projected onto a sphere by means of raycasting. Different properties of the objects obtained from the raycast of the original model and the convex hull of the model make up the different channels which are input into the spherical CNN.<br />
<br />
<br />
[[File:paper26-fig5.png|center]]<br />
<br />
<br />
The network architecture used is an initial <math>\small S^2</math>conv-BN-ReLU block which is followed by two SO(3)conv-BN-ReLU blocks. The output is then fed into a MaxPool-BN block then a linear layer to the output for final classification. The architecture for this experiment has ~1.4M parameters, far exceeding the scale of the spherical CNNs in the other experiments.<br />
<br />
This architecture achieves state of the art results on the SHREC17 tasks. The model places 2nd or 3rd in all categories but was not submitted as the SHREC17 task is closed. Table 2 shows the comparison of results with the top 3 submissions in each category. In the table, P@N stands for precision, R@N stands for recall, F1@N stands for F-score, mAP stands for mean average precision, and NDCG stands for normalized discounted cumulative gain in relevance based on whether the category and subcategory labels are predicted correctly. The authors claim the results show empirical proof of the usefulness of spherical CNNs. They elaborate that this is largely due to the fact that most architectures on the SHREC17 competition are highly specialized whereas their model is fairly general.<br />
<br />
<br />
[[File:paper26-tab2.png|center]]<br />
<br />
==Molecular Atomization==<br />
In this experiment a spherical CNN is implemented with an architecture resembling that of ResNet. They use the QM7 dataset which has the task of predicting atomization energy of molecules. The positions and charges given in the dataset are projected onto the sphere using potential functions. This is done as follows. First, for each atom, a sphere is defined around its position with the radius of the sphere kept uniform across all atoms. The radius is chosen as the minimal radius so no intersections between atoms occur in the training set. Finally, using potential functions, a T channel spherical signal is produced for each atom in the molecule as shown in the figure below. A summary of their results is shown in Table 3 along with some of the spherical CNN architecture details. It shows the different RMSE obtained from different methods. The results from this final experiment also seem to be promising as the network the authors present achieves the second best score. They also note that the first place method grows exponentially with the number of atoms per molecule so is unlikely to scale well.<br />
<br />
[[File:paper26-tab3.png|center]]<br />
<br />
[[File:paper26-f6.png|center]]<br />
<br />
= Conclusions =<br />
This paper presents a novel architecture called Spherical CNNs. The paper defines <math>\small S^2</math> and SO(3) cross correlations, shows the theory behind their rotational equivariance for continuous functions, and demonstrates that the equivariance also holds approximately in the discrete case. An effective GFFT algorithm was implemented and evaluated on two very different datasets with close to state of the art results, demonstrating that there are practical applications to Spherical CNNs.<br />
<br />
For future work the authors believe that improvements can be obtained by generalizing the algorithms to the SE(3) group (SE(3) simply adds translations in 3D space to the SO(3) group). The authors also briefly mention their excitement for applying Spherical CNNs to omnidirectional vision such as in drones and autonomous cars. They state that there is very little publicly available omnidirectional image data which could be why they did not conduct any experiments in this area.<br />
<br />
= Commentary =<br />
The reviews on Spherical CNNs are very positive and it is ranked in the top 1% of papers submitted to ICLR 2018. Positive points are the novelty of the architecture, the wide variety of experiments performed, and the writing. One critique of the original submission is that the related works section only lists, instead of describing, previous methods and that a description of the methods would have provided more clarity. The authors have since expanded the section; however, I found that it is still limited, which the authors attribute to length limitations. Another critique is that the evaluation does not provide enough depth. For example, it would have been great to see an example of omnidirectional vision for spherical networks. However, this is to be expected as it is just the introduction of spherical CNNs and more work is sure to come.<br />
<br />
= Source Code =<br />
Source code is available at:<br />
https://github.com/jonas-koehler/s2cnn<br />
<br />
= Sources =<br />
* T. Cohen et al. Spherical CNNs, 2018.<br />
* J. Feldman. Haar Measure. http://www.math.ubc.ca/~feldman/m606/haar.pdf<br />
* P. Kostelec, D. Rockmore. FFTs on the Rotation Group, 2008.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:paper26-f6.png&diff=35207File:paper26-f6.png2018-03-22T15:52:14Z<p>H5tahir: </p>
<hr />
<div></div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation&diff=34439stat946w18/MaskRNN: Instance Level Video Object Segmentation2018-03-16T18:28:21Z<p>H5tahir: /* MaskRNN: Multiple Instance Level Segmentation */</p>
<hr />
<div>== Introduction ==<br />
Deep Learning has produced state of the art results in many computer vision tasks like image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification assigns a label to the image based on the prominent objects. Object localization is the task of finding the objects’ location in the frame. Object segmentation involves providing a pixel map which represents the pixel wise location of the objects in the image. Semantic segmentation attempts to segment the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences.<br />
<br />
There are 2 different types of video object segmentation: unsupervised and semi-supervised. In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video. In the semi-supervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only tracking the required objects. In this paper we look at a semi-supervised video object segmentation technique.<br />
<br />
== Background Papers ==<br />
Video object segmentation has been performed using spatio-temporal graphs and deep learning. The graph based methods construct 3D spatio-temporal graphs in order to model the inter- and intra-frame relationships of pixels or superpixels in a video. Hence they are computationally slower than deep learning methods and are unable to run in real time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). The following is a brief description of the new techniques introduced by these papers for the semi-supervised video object segmentation task.<br />
<br />
=== OSVOS (One-Shot Video Object Segmentation) ===<br />
<br />
[[File:OSVOS.jpg | 1000px]]<br />
<br />
This paper introduces the technique of frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from an image classification task. This network is then converted into a fully convolutional network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.<br />
<br />
During testing, the trained VGG-16 FCN is fine-tuned using the first frame of the video using the ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.<br />
<br />
=== MaskTrack (Learning Video Object Segmentation from Static Images) ===<br />
<br />
[[File:MaskTrack.jpg | 500px]]<br />
<br />
MaskTrack takes the output of the previous frame to improve its predictions to generate the segmentation mask for the next frame. Thus the input to the network is 4 channel wide (3 RGB channels from the frame at time (t) + 1 binary segmentation mask from frame (t-1)). The output of the network is the binary segmentation mask for frame at time (t). Using the binary segmentation mask (referred to as guided object segmentation in the paper), the network is able to use some temporal information from previous frame to improve its segmentation mask prediction for the next frame.<br />
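The 4-channel input described above can be sketched as a simple array concatenation. This is only an illustration of the data layout: the function name <code>build_masktrack_input</code>, the channels-last layout and the frame size are assumptions, not details taken from the MaskTrack paper.

```python
import numpy as np

def build_masktrack_input(frame_rgb, previous_mask):
    """Stack the RGB frame at time t with the binary mask from time t-1.

    frame_rgb:     (H, W, 3) array, the current frame.
    previous_mask: (H, W) array of 0/1 values, the mask predicted for t-1.
    Returns an (H, W, 4) array (channels-last layout is an assumption).
    """
    mask_channel = previous_mask.astype(frame_rgb.dtype)[..., np.newaxis]
    return np.concatenate([frame_rgb, mask_channel], axis=-1)

frame = np.zeros((48, 64, 3), dtype=np.float32)   # placeholder frame
mask = np.zeros((48, 64), dtype=np.uint8)
mask[10:20, 15:30] = 1                            # placeholder previous mask
network_input = build_masktrack_input(frame, mask)
```

The network then predicts the mask for time t from this stacked input, so the previous mask acts as a rough prior on where the object is.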
<br />
The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.<br />
<br />
A parallel ConvNet network is used to generate predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using averaging operation to generate the final predicted segmented mask.<br />
<br />
== Dataset ==<br />
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. DAVIS-2016 dataset provides video sequences with only one segment mask for all salient objects. DAVIS-2017 improves the ground truth data by providing segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation mask for all salient objects in the video sequence. These datasets try to recreate real-life scenarios like occlusions, low resolution videos, background clutter, motion blur, fast motion etc.<br />
<br />
== MaskRNN: Introduction ==<br />
Most techniques mentioned above don’t work directly on instance level segmentation of the objects through the video sequence. The above approaches focus on segmenting each frame individually and use additional information (mask propagation and optical flow) from the preceding frame to make predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.<br />
<br />
== MaskRNN: Overview ==<br />
In a video sequence <math>I = \{I_1, I_2, \ldots, I_T\}</math>, the sequence of T frames is given as input to the network, where the video sequence contains N salient objects. The ground truth for the first frame, <math>y^*_1</math>, is also provided for the N salient objects.<br />
In this paper, the problem is formulated as a time dependency problem: using a recurrent neural network, the prediction for the previous frame influences the prediction for the next frame. The approach also computes the optical flow between frames and uses that as an input to the neural network. The optical flow is also used to align the output of the predicted mask. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for N deep nets, one for each of the N objects.”[1 - MaskRNN] Each deep net is made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to alleviate outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. For N objects, there are N deep nets which predict the mask for each salient object. The predictions are then merged into a single prediction using an argmax operation at test time.<br />
<br />
== MaskRNN: Multiple Instance Level Segmentation ==<br />
<br />
[[File:2ObjectSeg.jpg | 850px]]<br />
<br />
Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN], the problem is converted into multiple binary segmentation problems. A separate segmentation mask is predicted for each salient object, so each prediction is a binary segmentation problem. The binary segments are combined using an argmax operation where each pixel is assigned to the object with the largest predicted probability.<br />
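The argmax merging step can be sketched as follows. This is an illustration, not the authors' code: the function name <code>merge_instance_masks</code> is hypothetical, and the rule for labeling a pixel as background (no object reaching a probability of 0.5) is an assumption added here so that the example is complete.

```python
import numpy as np

def merge_instance_masks(probabilities, background_threshold=0.5):
    """Merge N per-object probability maps into one label map.

    probabilities: (N, H, W) array, one binary-segmentation output per object.
    Returns an (H, W) integer map: 0 for background, k for object k (1-based).
    The background threshold is an assumption made for this sketch.
    """
    best_object = np.argmax(probabilities, axis=0)
    best_probability = np.max(probabilities, axis=0)
    labels = best_object + 1
    labels[best_probability < background_threshold] = 0
    return labels

# Two objects on a 2x2 image.
probabilities = np.array([
    [[0.9, 0.2], [0.1, 0.4]],   # object 1
    [[0.3, 0.8], [0.2, 0.6]],   # object 2
])
labels = merge_instance_masks(probabilities)
```

Here the top-left pixel goes to object 1, the right column to object 2, and the bottom-left pixel to the background because neither object is confident there.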
<br />
=== MaskRNN: Binary Segmentation Network ===<br />
<br />
[[File:MaskRNNDeepNet.jpg | 850px]]<br />
<br />
The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The deep net consists of 2 networks: a binary segmentation network and an object localization network. The binary segmentation network is split into two streams: an appearance stream and a flow stream. The input of the appearance stream is the RGB frame at time t and the warped prediction of the binary segmentation mask from time (t-1). The warping function uses the optical flow between frame (t-1) and frame (t) to generate a new binary segmentation mask for frame (t). The input to the flow stream is the concatenation of the optical flow magnitude between frames (t-1) to (t) and frames (t) to (t+1) and the warped prediction of the segmentation mask from frame (t-1). The magnitude of the optical flow is replicated into an RGB format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The outputs of the flow stream and the appearance stream are linearly combined and a sigmoid function is applied to the result to generate the binary mask for the ith object. All parts of the network are fully differentiable and thus it can be trained end to end.<br />
<br />
=== MaskRNN: Object Localization Network: ===<br />
The object localization network generates a bounding box of the salient object in the frame. It uses a technique similar to the Faster R-CNN method of object localization: ROI pooling is performed on the features of the region proposals (the bounding box proposals here), and the pooled features are passed through fully connected layers to perform regression. This bounding box is enlarged by a factor of 1.25 and combined with the output of the binary segmentation network. Only the segment mask inside the bounding box is used for prediction, and the pixels outside of the bounding box are set to zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the ROI-pooling layer to generate the predicted bounding box.<br />
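The step of enlarging the bounding box by 1.25 and discarding mask pixels outside it can be sketched as below. This is a minimal illustration under assumptions: boxes are given as <code>(x0, y0, x1, y1)</code> in pixel coordinates with exclusive upper bounds, the enlargement is taken about the box center, and the function names are hypothetical.

```python
import numpy as np

def enlarge_box(box, scale, height, width):
    """Scale a box about its center and clip it to the image."""
    x0, y0, x1, y1 = box
    center_x, center_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w, half_h = scale * (x1 - x0) / 2.0, scale * (y1 - y0) / 2.0
    new_x0 = max(0, int(np.floor(center_x - half_w)))
    new_y0 = max(0, int(np.floor(center_y - half_h)))
    new_x1 = min(width, int(np.ceil(center_x + half_w)))
    new_y1 = min(height, int(np.ceil(center_y + half_h)))
    return new_x0, new_y0, new_x1, new_y1

def restrict_mask_to_box(mask, box, scale=1.25):
    """Keep the mask inside the enlarged box and zero it elsewhere."""
    height, width = mask.shape
    x0, y0, x1, y1 = enlarge_box(box, scale, height, width)
    restricted = np.zeros_like(mask)
    restricted[y0:y1, x0:x1] = mask[y0:y1, x0:x1]
    return restricted

segmentation = np.ones((10, 10), dtype=np.uint8)   # placeholder mask
predicted_box = (4, 4, 8, 8)                       # placeholder box
filtered = restrict_mask_to_box(segmentation, predicted_box)
```

With a 4x4 box centered at (6, 6), the enlarged box spans pixels 3 to 8 in each direction, so only a 6x6 block of the mask survives. This is how the localization network removes outlier predictions far from the object.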
<br />
=== Training and Finetuning ===<br />
For training the network depicted in Figure 1, back propagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.<br />
<br />
== MaskRNN: Implementation Details ==<br />
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame (t-1). Two different networks are trained offline separately for the DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation on both datasets. After both the object localization network and the binary segmentation network have been trained, the temporal information in the network is used to further improve the segmentation results. Because of GPU memory constraints, gradients are backpropagated through only 7 frames when training the RNN to learn long-term temporal information. <br />
<br />
For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames. <br />
<br />
The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to <math>10^{-5}</math> for online training for 200 iterations.<br />
<br />
== MaskRNN: Experimental Results ==<br />
=== Evaluation Metrics ===<br />
There are 3 different metrics for evaluating the performance of video object segmentation techniques:<br />
<br />
1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask.<br />
<br />
[[File:IoU.jpg | 200px]]<br />
<br />
2. Contour Accuracy (F-score): This metric measures how accurately the boundary of the predicted segment mask matches the boundary of the ground truth segment mask, using bipartite matching between the boundary pixels of the two masks. <br />
<br />
[[File:Fscore.jpg | 200px]]<br />
<br />
3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.<br />
<br />
Region Similarity measures how well the pixels of the two masks match, while Contour Accuracy measures the accuracy of the contours.<br />
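The region similarity metric from the list above can be sketched directly from its definition as intersection over union. This is an illustration, not the official DAVIS evaluation code: the function name <code>region_similarity</code> is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here.

```python
import numpy as np

def region_similarity(predicted_mask, ground_truth_mask):
    """Jaccard index (intersection over union) of two binary masks."""
    predicted = np.asarray(predicted_mask).astype(bool)
    ground_truth = np.asarray(ground_truth_mask).astype(bool)
    union = np.logical_or(predicted, ground_truth).sum()
    if union == 0:
        # Both masks are empty; treating this as a perfect match is an assumption.
        return 1.0
    intersection = np.logical_and(predicted, ground_truth).sum()
    return float(intersection / union)

predicted = np.array([[1, 1], [0, 0]])
ground_truth = np.array([[1, 0], [1, 0]])
score = region_similarity(predicted, ground_truth)   # 1 shared pixel out of 3
```

The score ranges from 0 (no overlap) to 1 (identical masks). Contour accuracy would additionally require extracting and matching boundary pixels, which is not shown here.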
<br />
=== Ablation Study ===<br />
<br />
The ablation study summarizes how the different components contribute to the algorithm, evaluated on the DAVIS-2016 and DAVIS-2017 datasets.<br />
<br />
[[File:MaskRNNTable2.jpg | 700px]]<br />
<br />
The above table presents the contribution of each component of the network to the final prediction score. We observe that online fine-tuning improves the performance by a large margin. Addition of RNN/Localization Net and FStream all seem to positively affect the performance of the deep net.<br />
<br />
=== Quantitative Evaluation ===<br />
<br />
The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.<br />
<br />
[[File:MaskRNNTable3.jpg | 700px]]<br />
<br />
The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. Employing a recurrent neural network to learn the recurrence relationship and an object localization network to improve the predictions yields a significant performance gain.<br />
<br />
The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.<br />
<br />
[[File:MaskRNNTable4.jpg | 700px]]<br />
<br />
=== Qualitative Evaluation ===<br />
Example qualitative results from the DAVIS 2016 and 2017 datasets are shown in the image below.<br />
[[File:maskrnn_example.png | 700px]]<br />
<br />
Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).<br />
<br />
[[File:maskrnn_example_fail.png | 700px]]<br />
<br />
== Conclusion ==<br />
In this paper a novel approach to instance level video object segmentation task is presented which performs better than current state of the art. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve accuracy of the system. Using online fine-tuning the network is adjusted to predict better for the current video sequence.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18&diff=34085stat946w182018-03-14T20:29:26Z<p>H5tahir: /* Paper presentation */</p>
<hr />
<div>=[https://piazza.com/uwaterloo.ca/fall2017/stat946/resources List of Papers]=<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1fU746Cld_mSqQBCD5qadvkXZW1g-j-kHvmHQ6AMeuqU/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
[https://docs.google.com/forms/d/e/1FAIpQLSdcfYZu5cvpsbzf0Nlxh9TFk8k1m5vUgU1vCLHQNmJog4xSHw/viewform?usp=sf_link Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Feb 27 || || 1|| || || <br />
|-<br />
|Feb 27 || || 2|| || || <br />
|-<br />
|Feb 27 || || 3|| || || <br />
|-<br />
|Mar 1 || Peter Forsyth || 4|| Unsupervised Machine Translation Using Monolingual Corpora Only || [https://arxiv.org/pdf/1711.00043.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]<br />
|-<br />
|Mar 1 || wenqing liu || 5|| Spectral Normalization for Generative Adversarial Networks || [https://openreview.net/pdf?id=B1QRgziT- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Spectral_normalization_for_generative_adversial_network Summary]<br />
|-<br />
|Mar 1 || Ilia Sucholutsky || 6|| One-Shot Imitation Learning || [https://papers.nips.cc/paper/6709-one-shot-imitation-learning.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning Summary]<br />
|-<br />
|Mar 6 || George (Shiyang) Wen || 7|| AmbientGAN: Generative models from lossy measurements || [https://openreview.net/pdf?id=Hy7fDog0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/AmbientGAN:_Generative_Models_from_Lossy_Measurements Summary]<br />
|-<br />
|Mar 6 || Raphael Tang || 8|| Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers || [https://arxiv.org/pdf/1802.00124.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers Summary]<br />
|-<br />
|Mar 6 ||Fan Xia || 9|| Word translation without parallel data ||[https://arxiv.org/pdf/1710.04087.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Word_translation_without_parallel_data Summary]<br />
|-<br />
|Mar 8 || Alex (Xian) Wang || 10 || Self-Normalizing Neural Networks || [http://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks Summary] <br />
|-<br />
|Mar 8 || Michael Broughton || 11|| Convergence of Adam and beyond || [https://openreview.net/pdf?id=ryQu7f-RZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=On_The_Convergence_Of_ADAM_And_Beyond Summary] <br />
|-<br />
|Mar 8 || Wei Tao Chen || 12|| Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data || [https://openreview.net/forum?id=ryBnUWb0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Predicting_Floor-Level_for_911_Calls_with_Neural_Networks_and_Smartphone_Sensor_Data Summary]<br />
|-<br />
|Mar 13 || Chunshang Li || 13 || Understanding Image Motion with Group Representations || [https://openreview.net/pdf?id=SJLlmG-AZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_Image_Motion_with_Group_Representations Summary] <br />
|-<br />
|Mar 13 || Saifuddin Hitawala || 14 || Robust Imitation of Diverse Behaviors || [https://papers.nips.cc/paper/7116-robust-imitation-of-diverse-behaviors.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robust_Imitation_of_Diverse_Behaviors Summary] <br />
|-<br />
|Mar 13 || Taylor Denouden || 15|| A neural representation of sketch drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Neural_Representation_of_Sketch_Drawings Summary]<br />
|-<br />
|Mar 15 || Zehao Xu || 16|| Synthetic and natural noise both break neural machine translation || [https://openreview.net/pdf?id=BJ8vJebC- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Synthetic_and_natural_noise_both_break_neural_machine_translation Summary]<br />
|-<br />
|Mar 15 || Prarthana Bhattacharyya || 17|| Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-Encoders Summary] <br />
|-<br />
|Mar 15 || Changjian Li || 18|| Label-Free Supervision of Neural Networks with Physics and Domain Knowledge || [https://arxiv.org/pdf/1609.05566.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge Summary]<br />
|-<br />
|Mar 20 || Travis Dunn || 19|| Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments || [https://openreview.net/pdf?id=Sk2u1g-0- Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments Summary]<br />
|-<br />
|Mar 20 || Sushrut Bhalla || 20|| MaskRNN: Instance Level Video Object Segmentation || [https://papers.nips.cc/paper/6636-maskrnn-instance-level-video-object-segmentation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation Summary]<br />
|-<br />
|Mar 20 || Hamid Tahir || 21|| Wavelet Pooling for Convolution Neural Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN Summary]<br />
|-<br />
|Mar 22 || Dongyang Yang|| 22|| Implicit Causal Models for Genome-wide Association Studies || [https://openreview.net/pdf?id=SyELrEeAb Paper] ||<br />
|-<br />
|Mar 22 || Yao Li || 23||Improving GANs Using Optimal Transport || [https://openreview.net/pdf?id=rkQkBnJAb Paper] || <br />
|-<br />
|Mar 22 || Sahil Pereira || 24||End-to-End Differentiable Adversarial Imitation Learning|| [http://proceedings.mlr.press/v70/baram17a/baram17a.pdf Paper] || [http://proceedings.mlr.press/v70/baram17a/baram17a.pdf Summary]<br />
|-<br />
|Mar 27 || Jaspreet Singh Sambee || 25|| PixelNN: Example-based Image Synthesis || [https://openreview.net/pdf?id=Syhr6pxCW Paper] || <br />
|-<br />
|Mar 27 || Braden Hurl || 26|| Spherical CNNs || [https://openreview.net/pdf?id=Hkbd5xZRb Paper] || <br />
|-<br />
|Mar 27 || Marko Ilievski || 27|| Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders || [http://proceedings.mlr.press/v70/engel17a/engel17a.pdf Paper] || <br />
|-<br />
|Mar 29 || Alex Pon || 28||PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space || [https://arxiv.org/abs/1706.02413 Paper] ||<br />
|-<br />
|Mar 29 || Sean Walsh || 29||Multi-scale Dense Networks for Resource Efficient Image Classification || [https://arxiv.org/pdf/1703.09844.pdf Paper] ||<br />
|-<br />
|Mar 29 || Jason Ku || 30||MarrNet: 3D Shape Reconstruction via 2.5D Sketches ||[https://arxiv.org/pdf/1711.03129.pdf Paper] ||<br />
|-<br />
|Apr 3 || Tong Yang || 31|| Dynamic Routing Between Capsules. || [http://papers.nips.cc/paper/6975-dynamic-routing-between-capsules.pdf Paper] || <br />
|-<br />
|Apr 3 || Benjamin Skikos || 32|| Training and Inference with Integers in Deep Neural Networks || [https://openreview.net/pdf?id=HJGXzmspb Paper] || <br />
|-<br />
|Apr 3 || Weishi Chen || 33|| Tensorized LSTMs for Sequence Learning || [https://arxiv.org/pdf/1711.01577.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&action=edit&redlink=1 Summary]<br />
|-</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34084Wavelet Pooling CNN2018-03-14T20:28:05Z<p>H5tahir: /* References */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Various pooling methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling). However, all of these methods employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
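<br />
The two pooling equations can be sketched in a few lines of Python. This is only an illustration: the helper names <code>pool2x2</code> and <code>mean</code> are hypothetical, the 2x2 non-overlapping regions are an assumption, and the input values are made up rather than taken from Figure 1.<br />

```python
def pool2x2(feature_map, reduce_fn):
    """Apply reduce_fn to each non-overlapping 2x2 region R_ij of a 2D feature map."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        pooled_row = []
        for j in range(0, len(feature_map[0]), 2):
            # Collect the four activations a_kpq with (p, q) in R_ij.
            region = [feature_map[p][q] for p in (i, i + 1) for q in (j, j + 1)]
            pooled_row.append(reduce_fn(region))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    """Arithmetic mean, i.e. the 1/|R_ij| * sum term of mean pooling."""
    return sum(values) / len(values)


# Illustrative 4x4 activations (assumed values, not taken from Figure 1).
feature_map = [
    [100, 50, 60, 150],
    [20, 60, 40, 30],
    [50, 90, 70, 82],
    [74, 66, 90, 58],
]

max_pooled = pool2x2(feature_map, max)    # [[100, 150], [90, 90]]
mean_pooled = pool2x2(feature_map, mean)  # [[57.5, 70.0], [70.0, 75.0]]
```

Both methods reduce each spatial dimension by a factor of two; they differ only in how the four values of a region are combined.<br />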
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in both time and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever (and so are not localized in time or space). This localization is what makes wavelets well suited to detecting the abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform, and more specifically a faster variation called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General ==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
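<br />
The column pass can be checked with a short Python sketch. It assumes the same averaging convention as the row formulas above (sums and differences divided by 2); the function names <code>haar_1d</code>, <code>transpose</code> and <code>haar_2d</code> are hypothetical helpers introduced only for this illustration.<br />

```python
def haar_1d(values):
    """One Haar level: pairwise averages followed by pairwise half-differences."""
    half = len(values) // 2
    approx = [(values[2 * i] + values[2 * i + 1]) / 2 for i in range(half)]
    detail = [(values[2 * i] - values[2 * i + 1]) / 2 for i in range(half)]
    return approx + detail


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def haar_2d(image):
    """One 2D Haar level: transform every row, then every column."""
    rows_done = [haar_1d(row) for row in image]
    cols_done = [haar_1d(column) for column in transpose(rows_done)]
    return transpose(cols_done)


image = [
    [100, 50, 60, 150],
    [20, 60, 40, 30],
    [50, 90, 70, 82],
    [74, 66, 90, 58],
]

row_pass = [haar_1d(row) for row in image]  # matches the matrix shown above
level1 = haar_2d(image)
# level1 == [[57.5, 70.0,   2.5, -20.0],
#            [70.0, 75.0,  -8.0,   5.0],
#            [17.5, 35.0,  22.5, -25.0],
#            [ 0.0,  1.0, -12.0, -11.0]]
```

The top-left 2x2 quadrant of <code>level1</code> is the LL sub-band. Under this averaging convention each of its entries is the mean of the corresponding 2x2 block of the original image.<br />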
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math> where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> indexes the samples in the vector, and <math>j</math> denotes the resolution level. Note that both sets of coefficients at level <math>j + 1</math> are computed from the approximation coefficients at level <math>j</math>. To apply this to images, FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to give the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The equation for IFWT is <math>W_\varphi[j, k] = h_\varphi[-n]*W_\varphi[j + 1,n] + h_\psi[-n]*W_\psi[j + 1,n]|_{n = \frac{k}{2}, k \leq 0}</math> where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
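<br />
The forward pass described above can be sketched in plain Python as follows. This is a minimal sketch and not the authors' implementation: it assumes the Haar wavelet with the averaging normalization used in the worked example, a square input whose side length is divisible by 4, and hypothetical helper names (<code>haar_fwd</code>, <code>haar_inv</code>, <code>dwt2</code>, <code>idwt2</code>, <code>wavelet_pool</code>).<br />

```python
def haar_fwd(values):
    """One Haar level on a 1D list: averages first, then half-differences."""
    half = len(values) // 2
    return ([(values[2 * i] + values[2 * i + 1]) / 2 for i in range(half)]
            + [(values[2 * i] - values[2 * i + 1]) / 2 for i in range(half)])


def haar_inv(coeffs):
    """Invert haar_fwd: each (average, half-difference) pair yields two samples."""
    half = len(coeffs) // 2
    samples = []
    for approx, detail in zip(coeffs[:half], coeffs[half:]):
        samples += [approx + detail, approx - detail]
    return samples


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dwt2(image):
    """One 2D level: rows first, then columns."""
    rows_done = [haar_fwd(row) for row in image]
    return transpose([haar_fwd(col) for col in transpose(rows_done)])


def idwt2(coeffs):
    """Inverse of dwt2: undo the column pass, then the row pass."""
    cols_undone = transpose([haar_inv(col) for col in transpose(coeffs)])
    return [haar_inv(row) for row in cols_undone]


def wavelet_pool(image):
    """Decompose to level 2, drop the level-1 detail sub-bands, reconstruct."""
    half = len(image) // 2
    level1 = dwt2(image)
    ll1 = [row[:half] for row in level1[:half]]  # keep only the LL band of level 1
    level2 = dwt2(ll1)                           # LL2 plus the level-2 detail bands
    return idwt2(level2)                         # output is half the input size


image = [
    [100, 50, 60, 150],
    [20, 60, 40, 30],
    [50, 90, 70, 82],
    [74, 66, 90, 58],
]
pooled = wavelet_pool(image)  # [[57.5, 70.0], [70.0, 75.0]]
```

Because the inverse transform reconstructs exactly, the output of this sketch coincides with the level-1 LL band of the input. A wavelet library with decomposition and reconstruction routines could replace the hand-written helpers.<br />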
<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the incoming feature is taken, and its sub-bands are upsampled so that they can be used as a level 2 decomposition. The IFWT is then performed, producing an output that is upsampled by a factor of two using wavelet methods and so matches the size of the original input to the pooling layer. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF and the paper provides comprehensive results for each. Stochastic gradient descent was used for training, and the Haar wavelet was chosen due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). For each dataset, the authors keep the network fixed and vary only the pooling method. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen according to the type of data at hand. In some cases wavelet pooling may perform the best, and in other cases other methods may perform better if the data is more suited to those types of pooling.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy of all the pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network and Tables 2 and 3 show the accuracy without and with dropout, respectively. Average pooling achieves the best accuracy but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling requires significantly more operations than the other methods. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (Haar wavelet)<br />
<br />
== References ==<br />
Travis Williams and Robert Li. Wavelet Pooling for Convolutional Neural Networks. ICLR 2018.</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34083Wavelet Pooling CNN2018-03-14T20:24:49Z<p>H5tahir: /* Forward Propagation */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Various pooling methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling). However, all of these methods employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in both time and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever (and so are not localized in time or space). This localization is what makes wavelets well suited to detecting the abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform, and more specifically a faster variation called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General ==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math> where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> indexes the samples in the vector, and <math>j</math> denotes the resolution level. Note that both sets of coefficients at level <math>j + 1</math> are computed from the approximation coefficients at level <math>j</math>. To apply this to images, FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to give the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The equation for IFWT is <math>W_\varphi[j, k] = h_\varphi[-n]*W_\varphi[j + 1,n] + h_\psi[-n]*W_\psi[j + 1,n]|_{n = \frac{k}{2}, k \leq 0}</math> where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the incoming feature is taken, and its sub-bands are upsampled so that they can be used as a level 2 decomposition. The IFWT is then performed, producing an output that is upsampled by a factor of two using wavelet methods and so matches the size of the original input to the pooling layer. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF and the paper provides comprehensive results for each. Stochastic gradient descent was used for training, and the Haar wavelet was chosen due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). For each dataset, the authors keep the network fixed and vary only the pooling method. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen according to the type of data at hand. In some cases wavelet pooling may perform the best, and in other cases other methods may perform better if the data is more suited to those types of pooling.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy of all the pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network and Tables 2 and 3 show the accuracy without and with dropout, respectively. Average pooling achieves the best accuracy but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling requires significantly more operations than the other methods. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34082Wavelet Pooling CNN2018-03-14T20:20:01Z<p>H5tahir: /* Forward Propagation */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Various pooling methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling). However, all of these methods employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
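<br />
The two pooling equations can also be checked with a few lines of code. The following is a minimal sketch, assuming a single feature map stored as a nested list and non-overlapping 2x2 regions; the names <code>pool2x2</code>, <code>mean</code>, and <code>feature_map</code>, as well as the input values, are illustrative and do not come from the paper.<br />

```python
def pool2x2(feature_map, reduce):
    """Apply `reduce` to each non-overlapping 2x2 region R_ij of a 2-D list."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            # Collect the four activations a_kpq with (p, q) in R_ij.
            region = [feature_map[p][q] for p in (i, i + 1) for q in (j, j + 1)]
            row.append(reduce(region))
        pooled.append(row)
    return pooled


def mean(values):
    """Average of the region, i.e. the sum divided by |R_ij|."""
    return sum(values) / len(values)


# Illustrative 4x4 feature map (assumed values, even height and width).
feature_map = [
    [100, 50, 60, 150],
    [20, 60, 40, 30],
    [50, 90, 70, 82],
    [74, 66, 90, 58],
]

max_pooled = pool2x2(feature_map, max)    # max pooling
mean_pooled = pool2x2(feature_map, mean)  # mean (average) pooling
print(max_pooled)
print(mean_pooled)
```

Passing the reduction as a function keeps the region-extraction logic shared between the two methods, which mirrors how the two equations differ only in the operator applied over <math>R_{ij}</math>.<br />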
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in both time (or space) and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever (localized in frequency, but not in time or space). The ability of wavelets to be localized in time and space is what makes them suitable for detecting abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform, and more specifically a faster variation called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transform, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same transform to each column of this intermediate image.<br />
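<br />
The row pass above and the matching column pass can be sketched in code as follows. This is a minimal illustration, assuming the averaging convention used in the example (sums and differences divided by 2, rather than the orthonormal scaling by <math>\sqrt{2}</math>) and an image with even height and width; the function names <code>haar_step</code> and <code>haar_dwt2</code> are hypothetical.<br />

```python
def haar_step(v):
    """One Haar step on a 1-D list: [a1, a2, ..., d1, d2, ...]."""
    half = len(v) // 2
    approx = [(v[2 * i] + v[2 * i + 1]) / 2 for i in range(half)]
    detail = [(v[2 * i] - v[2 * i + 1]) / 2 for i in range(half)]
    return approx + detail


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def haar_dwt2(image):
    """One DWT level: transform every row, then every column."""
    row_pass = [haar_step(row) for row in image]
    col_pass = transpose([haar_step(col) for col in transpose(row_pass)])
    return row_pass, col_pass


image = [
    [100, 50, 60, 150],
    [20, 60, 40, 30],
    [50, 90, 70, 82],
    [74, 66, 90, 58],
]

row_pass, level1 = haar_dwt2(image)
for row in level1:
    print(row)
```

After the column pass, the top-left 2x2 block of <code>level1</code> is the LL sub-band and the remaining three blocks hold the detail sub-bands. With this averaging convention, each LL entry is the mean of the corresponding 2x2 block of the input.<br />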
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math>, where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply this to images, the FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to obtain the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
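<br />
The forward pass described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses the averaging Haar convention from the worked example, requires the height and width to be divisible by 4, and the names <code>dwt2</code>, <code>idwt2</code>, and <code>wavelet_pool</code> are hypothetical.<br />

```python
def haar_step(v):
    """Forward Haar step: approximations first, then details."""
    half = len(v) // 2
    approx = [(v[2 * i] + v[2 * i + 1]) / 2 for i in range(half)]
    detail = [(v[2 * i] - v[2 * i + 1]) / 2 for i in range(half)]
    return approx + detail


def inverse_haar_step(v):
    """Invert haar_step: each (a, d) pair gives back (a + d, a - d)."""
    half = len(v) // 2
    out = []
    for a, d in zip(v[:half], v[half:]):
        out.extend([a + d, a - d])
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def dwt2(image):
    """One decomposition level: rows first, then columns."""
    rows = [haar_step(r) for r in image]
    return transpose([haar_step(c) for c in transpose(rows)])


def idwt2(coeffs):
    """One reconstruction level: undo the column pass, then the row pass."""
    cols = transpose([inverse_haar_step(c) for c in transpose(coeffs)])
    return [inverse_haar_step(r) for r in cols]


def wavelet_pool(image):
    """Halve each spatial dimension using only the level-2 sub-bands."""
    level1 = dwt2(image)
    h, w = len(image) // 2, len(image[0]) // 2
    ll1 = [row[:w] for row in level1[:h]]  # keep LL; level-1 details are discarded
    level2 = dwt2(ll1)                     # LL2, LH2, HL2, HH2 packed in one array
    return idwt2(level2)                   # reconstruct from level-2 sub-bands only


image = [
    [100, 50, 60, 150],
    [20, 60, 40, 30],
    [50, 90, 70, 82],
    [74, 66, 90, 58],
]
pooled = wavelet_pool(image)
print(pooled)
```

In this sketch the inverse step exactly undoes the level-2 decomposition, so the output equals the level-1 LL sub-band. Results with a different filter normalization, boundary handling, or wavelet may differ from what is shown here.<br />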
<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen according to the type of data at hand: in some cases wavelet pooling performs best, and in others a different method may perform better if the data is more suited to it.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy of all pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network, and Tables 2 and 3 show the accuracy without and with dropout, respectively. Average pooling achieves the best accuracy, but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34081Wavelet Pooling CNN2018-03-14T20:16:11Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Although various methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in both time (or space) and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever (localized in frequency, but not in time or space). The ability of wavelets to be localized in time and space is what makes them suitable for detecting abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform, and more specifically a faster variation called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transform, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same transform to each column of this intermediate image.<br />
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math>, where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply this to images, the FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to obtain the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen according to the type of data at hand: in some cases wavelet pooling performs best, and in others a different method may perform better if the data is more suited to it.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy of all pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network, and Tables 2 and 3 show the accuracy without and with dropout, respectively. Average pooling achieves the best accuracy, but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34080Wavelet Pooling CNN2018-03-14T20:15:19Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Although various methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in both time (or space) and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever (localized in frequency, but not in time or space). The ability of wavelets to be localized in time and space is what makes them suitable for detecting abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform, and more specifically a faster variation called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transform, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same transform to each column of this intermediate image.<br />
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by <math>W_\varphi[j + 1, k] = h_\varphi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math> and <math>W_\psi[j + 1, k] = h_\psi[-n]*W_\varphi[j,n]|_{n = 2k, k \leq 0}</math>, where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply this to images, the FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to obtain the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen according to the type of data at hand: in some cases wavelet pooling performs best, and in others a different method may perform better if the data is more suited to it.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy of all pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network, and Tables 2 and 3 show the accuracy without and with dropout, respectively. Average pooling achieves the best accuracy, but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34079Wavelet Pooling CNN2018-03-14T20:13:18Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Although various methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} (a_{kpq})</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in both time (or space) and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever (localized in frequency, but not in time or space). The ability of wavelets to be localized in time and space is what makes them suitable for detecting abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform, and more specifically a faster variant called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General ==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
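
The worked example can be checked with a short NumPy sketch. The function names `haar_rows` and `haar_level` are hypothetical, and the code uses the same unnormalized average/difference convention as the bullet list above (dividing by 2 rather than by the square root of 2).

```python
import numpy as np

def haar_rows(x):
    """One Haar pass along each row: pairwise averages, then pairwise differences."""
    approx = (x[:, 0::2] + x[:, 1::2]) / 2   # a1, a2, ...
    detail = (x[:, 0::2] - x[:, 1::2]) / 2   # d1, d2, ...
    return np.hstack([approx, detail])

def haar_level(x):
    """One DWT level: transform the rows, then the columns (via a transpose)."""
    return haar_rows(haar_rows(x).T).T

image = np.array([[100, 50, 60, 150],
                  [20, 60, 40, 30],
                  [50, 90, 70, 82],
                  [74, 66, 90, 58]], dtype=float)

after_rows = haar_rows(image)   # matches the row-transformed matrix above
after_cols = haar_level(image)  # rows, then columns
```

After the column pass the top-left 2x2 block of `after_cols` holds the LL sub-band, which under this convention is simply the mean of each 2x2 block of the original image.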
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by [Insert equation] and [Insert Equation] where [] is the approximation function, [] is the detail function, W, W, are approximation and detail coefficients, h and h are time reversed scaling and wavelet vectors, (n) represents the sample in the vector, and j denotes the resolution level. To apply the FWT to images, it is first applied to the rows and then to the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then each level has four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to produce the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
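
A minimal sketch of the forward pass described in this section, under stated assumptions: it is not the authors' implementation, the function names (`haar_dwt2`, `haar_idwt2`, `wavelet_pool`) are hypothetical, the Haar filters use the average/difference convention of the earlier worked example, the LH/HL labelling is one of several conventions, and the input height and width are assumed to be divisible by 4.

```python
import numpy as np

def haar_dwt2(x):
    """One 2-D Haar level (rows, then columns); returns LL, LH, HL, HH."""
    lo = (x[:, 0::2] + x[:, 1::2]) / 2   # row low-pass
    hi = (x[:, 0::2] - x[:, 1::2]) / 2   # row high-pass
    ll = (lo[0::2] + lo[1::2]) / 2       # column low-pass of lo
    lh = (lo[0::2] - lo[1::2]) / 2       # column high-pass of lo
    hl = (hi[0::2] + hi[1::2]) / 2       # column low-pass of hi
    hh = (hi[0::2] - hi[1::2]) / 2       # column high-pass of hi
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: undo the column pass, then the row pass."""
    h, w = ll.shape
    lo = np.empty((2 * h, w))
    hi = np.empty((2 * h, w))
    lo[0::2], lo[1::2] = ll + lh, ll - lh
    hi[0::2], hi[1::2] = hl + hh, hl - hh
    x = np.empty((2 * h, 2 * w))
    x[:, 0::2], x[:, 1::2] = lo + hi, lo - hi
    return x

def wavelet_pool(x):
    """Second-level decomposition, drop level-1 details, invert one level."""
    ll1, _, _, _ = haar_dwt2(x)              # level-1 detail sub-bands are discarded
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)      # level-2 sub-bands
    return haar_idwt2(ll2, lh2, hl2, hh2)    # output is half the input size

image = np.array([[100, 50, 60, 150],
                  [20, 60, 40, 30],
                  [50, 90, 70, 82],
                  [74, 66, 90, 58]], dtype=float)
pooled = wavelet_pool(image)   # 4x4 input -> 2x2 output
```

In this particular sketch the Haar transform is exactly invertible, so reconstructing from all four level-2 sub-bands returns the level-1 LL band. Other wavelets, normalizations, or boundary treatments would behave differently.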
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen for its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen according to the type of data at hand. In some cases wavelet pooling performs best; in others, a different method may be better suited to the data.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy among all pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network and Tables 2 and 3 show the accuracy without and with dropout. Average pooling achieves the best accuracy but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below compares the number of mathematical operations required by each method for each dataset. It can be seen that wavelet pooling requires significantly more operations than the other methods. The authors argue that, with good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet Pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34078Wavelet Pooling CNN2018-03-14T20:13:00Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve that performance. The pooling layer is one of these fundamentals. Various methods exist, ranging from the simple and deterministic (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), but all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq}</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in time and frequency. Compare this to the Fourier transform which represents signals as the sum of sine waves which oscillate forever (not localized in time and space). The ability of wavelets to be localized in time and space is what makes it suitable for detecting the abrupt changes in an image well. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses discrete wavelet transform and more specifically a faster variation called Fast Wavelet Transform (FWT) using the Haar wavelet. There also exists a continuous wavelet transform. The main difference in these is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by [Insert equation] and [Insert Equation] where [] is the approximation function, [] is the detail function, W, W, are approximation and detail coefficients, h and h are time reversed scaling and wavelet vectors, (n) represents the sample in the vector, and j denotes the resolution level. To apply the FWT to images, it is first applied to the rows and then to the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then each level has four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to produce the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen for its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen according to the type of data at hand. In some cases wavelet pooling performs best; in others, a different method may be better suited to the data.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy among all pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network and Tables 2 and 3 show the accuracy without and with dropout. Average pooling achieves the best accuracy but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet Pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34077Wavelet Pooling CNN2018-03-14T20:12:25Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve that performance. The pooling layer is one of these fundamentals. Various methods exist, ranging from the simple and deterministic (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), but all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq}</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in time and frequency. Compare this to the Fourier transform which represents signals as the sum of sine waves which oscillate forever (not localized in time and space). The ability of wavelets to be localized in time and space is what makes it suitable for detecting the abrupt changes in an image well. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses discrete wavelet transform and more specifically a faster variation called Fast Wavelet Transform (FWT) using the Haar wavelet. There also exists a continuous wavelet transform. The main difference in these is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by [Insert equation] and [Insert Equation] where [] is the approximation function, [] is the detail function, W, W, are approximation and detail coefficients, h and h are time reversed scaling and wavelet vectors, (n) represents the sample in the vector, and j denotes the resolution level. To apply the FWT to images, it is first applied to the rows and then to the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then each level has four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to produce the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen for its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent, but change the pooling method for each dataset. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen according to the type of data at hand. In some cases wavelet pooling performs best; in others, a different method may be better suited to the data.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy among all pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network and Tables 2 and 3 show the accuracy without and with dropout. Average pooling achieves the best accuracy but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below displays a comparison of the number of mathematical operations for each method according to the dataset. It can be seen that wavelet pooling is significantly worse. The authors explain that through good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet Pooling achieves competitive performance with standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet used (Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34076Wavelet Pooling CNN2018-03-14T20:12:12Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve that performance. The pooling layer is one of these fundamentals. Various methods exist, ranging from the simple and deterministic (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), but all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq}</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in time and frequency. Compare this to the Fourier transform which represents signals as the sum of sine waves which oscillate forever (not localized in time and space). The ability of wavelets to be localized in time and space is what makes it suitable for detecting the abrupt changes in an image well. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses discrete wavelet transform and more specifically a faster variation called Fast Wavelet Transform (FWT) using the Haar wavelet. There also exists a continuous wavelet transform. The main difference in these is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
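<br />
The row and column passes above can be checked numerically. The following is a minimal sketch using NumPy; the helper name <code>haar_rows</code> is chosen here for illustration and does not come from the paper. It uses the same averaging convention as the bullet points (sums and differences divided by 2).<br />

```python
import numpy as np

def haar_rows(x):
    """One averaging-Haar pass over each row: [a1, a2, ..., d1, d2, ...]."""
    approx = (x[:, 0::2] + x[:, 1::2]) / 2   # pairwise averages
    detail = (x[:, 0::2] - x[:, 1::2]) / 2   # pairwise half-differences
    return np.hstack([approx, detail])

image = np.array([[100, 50, 60, 150],
                  [20, 60, 40, 30],
                  [50, 90, 70, 82],
                  [74, 66, 90, 58]], dtype=float)

after_rows = haar_rows(image)            # matches the matrix shown above
after_cols = haar_rows(after_rows.T).T   # same pass applied to the columns

print(after_rows)
print(after_cols)
```

The top-left 2x2 block of <code>after_cols</code> is the LL sub-band, which for this image works out to [[57.5, 70], [70, 75]].<br />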
<br />
== Proposed Method ==<br />
The proposed method uses sub-bands from the second-level FWT and discards the first-level sub-bands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by [Insert equation] and [Insert Equation] where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply it to images, FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to give the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
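<br />
The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names are hypothetical, the averaging Haar convention from the DWT example is assumed (the paper's filters may be normalized differently), and the LH/HL labels follow one of several naming conventions.<br />

```python
import numpy as np

def haar2d(x):
    """One level of the averaging Haar transform: returns (LL, LH, HL, HH)."""
    lo = (x[:, 0::2] + x[:, 1::2]) / 2   # row pass, approximation
    hi = (x[:, 0::2] - x[:, 1::2]) / 2   # row pass, detail
    ll = (lo[0::2] + lo[1::2]) / 2       # column pass on each half
    lh = (lo[0::2] - lo[1::2]) / 2
    hl = (hi[0::2] + hi[1::2]) / 2
    hh = (hi[0::2] - hi[1::2]) / 2
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Invert haar2d: undo the column pass, then the row pass."""
    rows, cols = ll.shape
    lo = np.empty((2 * rows, cols))
    hi = np.empty((2 * rows, cols))
    lo[0::2], lo[1::2] = ll + lh, ll - lh
    hi[0::2], hi[1::2] = hl + hh, hl - hh
    x = np.empty((2 * rows, 2 * cols))
    x[:, 0::2], x[:, 1::2] = lo + hi, lo - hi
    return x

def wavelet_pool(x):
    """Sub-sample by two; height and width must be divisible by 4."""
    ll1, _, _, _ = haar2d(x)     # level 1: detail sub-bands are discarded
    level2 = haar2d(ll1)         # level 2 decomposition of the LL band
    return ihaar2d(*level2)      # IFWT from the level 2 sub-bands only

image = np.array([[100, 50, 60, 150],
                  [20, 60, 40, 30],
                  [50, 90, 70, 82],
                  [74, 66, 90, 58]], dtype=float)
pooled = wavelet_pool(image)
print(pooled)
```

In this sketch the inverse transform reconstructs the level 1 LL band exactly, so under the averaging convention assumed here the output coincides with 2x2 average pooling of the input. Whether that holds for the paper's implementation depends on its filter normalization and boundary handling.<br />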
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen for its even, square sub-bands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent for each dataset and change only the pooling method. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen to suit the type of data at hand: in some cases wavelet pooling performs best, and in others a different method performs better.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. Wavelet pooling achieves the best accuracy of all the pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network, and Tables 2 and 3 show the accuracy without and with dropout. Average pooling achieves the best accuracy, but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below compares the number of mathematical operations each method requires on each dataset. Wavelet pooling requires significantly more operations than the other methods. The authors argue that with good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with the standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only a 2x2 pooling window is used for comparison<br />
* Computationally intensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet is used (the Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34075Wavelet Pooling CNN2018-03-14T20:10:51Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Various methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), but all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k</math>th feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region <math>R_{ij}</math>. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq}</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
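<br />
As a rough illustration of the two pooling rules above, the sketch below applies max and mean pooling over non-overlapping 2x2 regions using NumPy. The function name <code>pool2x2</code> and the input values are illustrative assumptions, not taken from the paper or from Figure 1.<br />

```python
import numpy as np

def pool2x2(a, mode="max"):
    """Pool one feature map over non-overlapping 2x2 regions R_ij."""
    h, w = a.shape
    # After reshaping, axes 1 and 3 index the positions inside each region.
    regions = a.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return regions.max(axis=(1, 3))
    return regions.mean(axis=(1, 3))

# Illustrative 4x4 feature map (height and width must be even).
a = np.array([[100, 50, 60, 150],
              [20, 60, 40, 30],
              [50, 90, 70, 82],
              [74, 66, 90, 58]], dtype=float)

max_pooled = pool2x2(a, "max")
mean_pooled = pool2x2(a, "mean")
print(max_pooled)
print(mean_pooled)
```

For this input, max pooling keeps only the largest value in each region, while mean pooling averages all four values, which shows how each rule can respectively hide or dilute weaker features.<br />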
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. These abrupt changes can represent features that are of great importance when we perform deep learning. Wavelets are a class of functions that are well localized in both time (or space, for images) and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever and are therefore not localized in time or space. This localization is what makes wavelets suitable for detecting abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform (DWT), and more specifically a faster variant called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
<br />
== Proposed Method ==<br />
The proposed method uses sub-bands from the second-level FWT and discards the first-level sub-bands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by [Insert equation] and [Insert Equation] where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply it to images, FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to give the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen for its even, square sub-bands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent for each dataset and change only the pooling method. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen to suit the type of data at hand: in some cases wavelet pooling performs best, and in others a different method performs better.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. Wavelet pooling achieves the best accuracy of all the pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network, and Tables 2 and 3 show the accuracy without and with dropout. Average pooling achieves the best accuracy, but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below compares the number of mathematical operations each method requires on each dataset. Wavelet pooling requires significantly more operations than the other methods. The authors argue that with good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with the standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only a 2x2 pooling window is used for comparison<br />
* Computationally intensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet is used (the Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34074Wavelet Pooling CNN2018-03-14T20:10:41Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Various methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), but all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k</math>th feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region <math>R_{ij}</math>. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq}</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. These abrupt changes can represent features that are of great importance when we perform deep learning. Wavelets are a class of functions that are well localized in both time (or space, for images) and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever and are therefore not localized in time or space. This localization is what makes wavelets suitable for detecting abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform (DWT), and more specifically a faster variant called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
<br />
== Proposed Method ==<br />
The proposed method uses sub-bands from the second-level FWT and discards the first-level sub-bands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
FWT can be expressed by [Insert equation] and [Insert Equation] where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply it to images, FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to give the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The Equation for IFWT is [] where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then IFWT is performed to obtain the original image which is upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used, and the Haar wavelet was chosen for its even, square sub-bands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). The authors keep the network consistent for each dataset and change only the pooling method. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen to suit the type of data at hand: in some cases wavelet pooling performs best, and in others a different method performs better.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. Wavelet pooling achieves the best accuracy of all the pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network, and Tables 2 and 3 show the accuracy without and with dropout. Average pooling achieves the best accuracy, but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below compares the number of mathematical operations each method requires on each dataset. Wavelet pooling requires significantly more operations than the other methods. The authors argue that with good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with the standard go-to pooling methods<br />
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only a 2x2 pooling window is used for comparison<br />
* Computationally intensive<br />
* Not as simple as other pooling methods<br />
* Only one wavelet is used (the Haar wavelet)<br />
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34073Wavelet Pooling CNN2018-03-14T20:09:43Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Various methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), but all of them employ a neighborhood approach to sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. Instead of nearest-neighbor interpolation, this method uses a sub-band approach that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by combining a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} (a_{kpq})</math> where <math>a_{kij}</math> is the output activation of the <math>k</math>th feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region <math>R_{ij}</math>. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq}</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. These abrupt changes can represent features that are of great importance when we perform deep learning. Wavelets are a class of functions that are well localized in both time (or space, for images) and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever and are therefore not localized in time or space. This localization is what makes wavelets suitable for detecting abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform (DWT), and more specifically a faster variant called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:<br />
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
Now we apply the same method to the columns in the exact same way.<br />
<br />
== Proposed Method ==<br />
The proposed method uses sub-bands from the second-level FWT and discards the first-level sub-bands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
The FWT can be expressed by <math>W_\varphi[j+1,k] = h_\varphi[-n] * W_\varphi[j,n]|_{n=2k}</math> and <math>W_\psi[j+1,k] = h_\psi[-n] * W_\varphi[j,n]|_{n=2k}</math>, where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply it to images, the FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to give the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The equation for the IFWT is <math>W_\varphi[j,k] = h_\varphi[-n] * W_\varphi[j+1,n] + h_\psi[-n] * W_\psi[j+1,n]|_{n=k/2}</math>, where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then the IFWT is performed to obtain an image of the original size, that is, one upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used for training, and the Haar wavelet was chosen due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). For each dataset, the authors keep the network fixed and change only the pooling method. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen to suit the type of data at hand: in some cases wavelet pooling performs best, and in others a different method performs better because the data is more suited to it.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy among all pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network and Tables 2 and 3 show the accuracy without and with dropout. Average pooling achieves the best accuracy, but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below compares the number of mathematical operations required by each method for each dataset. It can be seen that wavelet pooling requires significantly more operations than the other methods. The authors argue that with good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with standard go-to pooling methods
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive
* Not as simple as other pooling methods
* Only one wavelet used (Haar wavelet)
<br />
== References ==</div>H5tahirhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wavelet_Pooling_CNN&diff=34072Wavelet Pooling CNN2018-03-14T20:09:05Z<p>H5tahir: /* Pooling Background */</p>
<hr />
<div>== Introduction ==<br />
It is generally the case that Convolutional Neural Networks (CNNs) outperform vector-based deep learning techniques. As such, the fundamentals of CNNs are good candidates for innovation in order to improve said performance. The pooling layer is one of these fundamentals. Various methods exist, ranging from the deterministic and simple (max pooling and average pooling) to the probabilistic (mixed pooling and stochastic pooling), but all of them employ a neighborhood approach to the sub-sampling which, albeit fast and simple, can produce artifacts such as blurring, aliasing, and edge halos (Parker et al., 1983).<br />
<br />
This paper introduces a novel pooling method based on the discrete wavelet transform. Specifically, it uses a second-level wavelet decomposition for the sub-sampling. This method, instead of nearest neighbor interpolation, uses a sub-band method that the authors claim produces fewer artifacts and represents the underlying features more accurately. Therefore, if pooling is viewed as a lossy process, the reason for employing a wavelet approach is to try to minimize this loss.<br />
<br />
== Pooling Background ==<br />
Pooling essentially means sub-sampling. After the pooling layer, the spatial dimensions of the data are reduced to some degree, with the goal being to compress the data rather than discard some of it. Typical approaches to pooling reduce the dimensionality by using some method to combine a region of values into one value. For max pooling, this can be represented by the equation <math>a_{kij} = \max_{(p,q) \in R_{ij}} a_{kpq}</math>, where <math>a_{kij}</math> is the output activation of the <math>k</math>th feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at <math>(p,q)</math> within the pooling region <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of that region. Mean pooling can be represented by the equation <math>a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq}</math> with everything defined as before. Figure 1 provides a numerical example that can be followed.<br />
<br />
[[File:WT_Fig1.PNG|650px|center|]]<br />
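<br />
The two pooling rules above can also be sketched in a few lines of NumPy. This is only an illustration, not the paper's code: the helper name <code>pool2x2</code> and the sample feature map are made up for this example, and the sketch assumes a single 2-D feature map with even height and width and non-overlapping 2x2 regions.<br />

```python
import numpy as np

def pool2x2(a, reduce_fn):
    """Apply reduce_fn over non-overlapping 2x2 regions of a 2-D map.

    Hypothetical helper for illustration; assumes even height and width.
    """
    h, w = a.shape
    # blocks[i, j] is the 2x2 region R_ij of the input.
    blocks = a.reshape(h // 2, 2, w // 2, 2).swapaxes(1, 2)
    return reduce_fn(blocks, axis=(2, 3))

# Made-up 4x4 feature map (not the one shown in Figure 1).
feature_map = np.array([[1.0, 3.0, 2.0, 0.0],
                        [5.0, 7.0, 4.0, 6.0],
                        [8.0, 2.0, 1.0, 1.0],
                        [0.0, 2.0, 3.0, 7.0]])

max_pooled = pool2x2(feature_map, np.max)    # largest value in each region
mean_pooled = pool2x2(feature_map, np.mean)  # average of each region

print(max_pooled)   # [[7. 6.] [8. 7.]]
print(mean_pooled)  # [[4. 3.] [3. 3.]]
```

Both calls halve each spatial dimension; only the rule used to summarize each region differs.<br />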
<br />
The paper mentions that these pooling methods, although simple and effective, have shortcomings. Max pooling can omit details from an image if the important features have less intensity than the insignificant ones, and also commonly overfits. On the other hand, average pooling can dilute important features if the data is averaged with values of significantly lower intensities. Figure 2 displays an image of this.<br />
<br />
[[File:WT_Fig2.PNG|650px|center|]]<br />
<br />
== Wavelet Background ==<br />
Data or signals tend to be composed of slowly changing trends (low frequency) as well as fast changing transients (high frequency). Similarly, images have smooth regions of intensity which are perturbed by edges or abrupt changes. We know that these abrupt changes can represent features that are of great importance to us when we perform deep learning. Wavelets are a class of functions that are well localized in both time (or space) and frequency. Compare this to the Fourier transform, which represents signals as a sum of sine waves that oscillate forever and are therefore not localized in time or space. This localization is what makes wavelets suitable for detecting the abrupt changes in an image. <br />
<br />
Essentially, a wavelet is a fast decaying, oscillating signal with zero mean that only exists for a fixed duration and can be scaled and shifted in time. There are some well defined types of wavelets as shown in Figure 3. The key characteristic of wavelets for us is that they have a band-pass characteristic, and the band can be adjusted based on the scaling and shifting. <br />
<br />
[[File:WT_Fig3.jpg|650px|center|]]<br />
<br />
The paper uses the discrete wavelet transform, and more specifically a faster variant called the Fast Wavelet Transform (FWT), with the Haar wavelet. There also exists a continuous wavelet transform. The main difference between the two is how the scale and shift parameters are selected.<br />
<br />
== Discrete Wavelet Transform General==<br />
The discrete wavelet transform for images is essentially applying a low pass and high pass filter to your image where the transfer functions of the filters are related and defined by the type of wavelet used (Haar in this paper). This is shown in the figures below, which also show the recursive nature of the transform. For an image, the per row transform is taken first. This results in a new image where the first half is a low frequency sub-band and the second half is the high frequency sub-band. Then this new image is transformed again per column, resulting in four sub-bands. Generally, the low frequency content approximates the image and the high frequency content represents abrupt changes. Therefore, one can simply take the LL band and perform the transformation again to sub-sample even more.<br />
<br />
[[File:WT_Fig8.png|650px|center|]]<br />
<br />
[[File:WT_Fig9.png|650px|center|]]<br />
<br />
== DWT example using Haar Wavelet ==<br />
Suppose we have an image represented by the following pixels:<br />
<math> \begin{bmatrix} <br />
100 & 50 & 60 & 150 \\<br />
20 & 60 & 40 & 30 \\<br />
50 & 90 & 70 & 82 \\<br />
74 & 66 & 90 & 58 \\<br />
\end{bmatrix} </math><br />
<br />
For each level of the DWT using the Haar wavelet, we will perform the transform on the rows first and then the columns. For the row pass, we transform each row as follows:<br />
* Take row i = [ i1, i2, i3, i4], and let i_t = [a1, a2, d1, d2] represent the transformed row<br />
* a1 = (i1 + i2)/2<br />
* a2 = (i3 + i4)/2<br />
* d1 = (i1 - i2)/2<br />
* d2 = (i3 - i4)/2<br />
<br />
After the row transforms, the image looks as follows:
<math> \begin{bmatrix} <br />
75 & 105 & 25 & -45 \\<br />
40 & 35 & -20 & 5 \\<br />
70 & 76 & -20 & -6 \\<br />
70 & 74 & 4 & 16 \\<br />
\end{bmatrix} </math><br />
<br />
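Now we apply the same method to each column of this matrix, which gives the first-level decomposition:
<math> \begin{bmatrix} <br />
57.5 & 70 & 2.5 & -20 \\<br />
70 & 75 & -8 & 5 \\<br />
17.5 & 35 & 22.5 & -25 \\<br />
0 & 1 & -12 & -11 \\<br />
\end{bmatrix} </math><br />
<br />
The top-left 2x2 block is the LL sub-band; each of its entries is the mean of the corresponding 2x2 block of the original image.<br />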
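<br />
The worked example can be checked with a short NumPy sketch. The function names <code>haar_rows</code> and <code>haar_dwt2_level</code> are illustrative only. The sketch uses the same averaging form of the Haar transform as the bullet points above (sums and differences divided by 2), not the orthonormal scaling that wavelet libraries typically use.<br />

```python
import numpy as np

def haar_rows(x):
    """One Haar pass over every row: [pairwise averages | pairwise differences]."""
    approx = (x[:, 0::2] + x[:, 1::2]) / 2.0  # a1, a2, ...
    detail = (x[:, 0::2] - x[:, 1::2]) / 2.0  # d1, d2, ...
    return np.hstack([approx, detail])

def haar_dwt2_level(x):
    """One DWT level: row pass first, then the same pass on the columns."""
    rows_done = haar_rows(x)
    # Transposing lets the row routine act on the columns.
    return haar_rows(rows_done.T).T

image = np.array([[100, 50, 60, 150],
                  [20, 60, 40, 30],
                  [50, 90, 70, 82],
                  [74, 66, 90, 58]], dtype=float)

after_rows = haar_rows(image)        # matches the matrix after the row pass
after_cols = haar_dwt2_level(image)  # full first-level decomposition
ll_band = after_cols[:2, :2]         # top-left quadrant is the LL sub-band

print(after_rows)
print(after_cols)
```

Calling <code>haar_dwt2_level</code> again on <code>ll_band</code> would give the second-level decomposition.<br />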
<br />
== Proposed Method ==<br />
The proposed method uses subbands from the second level FWT and discards the first level subbands. The authors postulate that this method is more 'organic' in capturing the data compression and will create fewer artifacts that may affect the image classification.<br />
=== Forward Propagation ===<br />
The FWT can be expressed by <math>W_\varphi[j+1,k] = h_\varphi[-n] * W_\varphi[j,n]|_{n=2k}</math> and <math>W_\psi[j+1,k] = h_\psi[-n] * W_\varphi[j,n]|_{n=2k}</math>, where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, <math>W_\varphi</math> and <math>W_\psi</math> are the approximation and detail coefficients, <math>h_\varphi[-n]</math> and <math>h_\psi[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level. To apply it to images, the FWT is first applied on the rows and then on the columns. If a low (L) and a high (H) sub-band are extracted from the rows, and similarly from the columns, then at each level there are four sub-bands (LH, HL, HH, and LL), where LL is further decomposed to give the level 2 decomposition. <br />
<br />
Using the level 2 decomposition sub-bands, the Inverse Fast Wavelet Transform (IFWT) is used to obtain the resulting sub-sampled image, which is sub-sampled by a factor of two. The equation for the IFWT is <math>W_\varphi[j,k] = h_\varphi[-n] * W_\varphi[j+1,n] + h_\psi[-n] * W_\psi[j+1,n]|_{n=k/2}</math>, where the parameters are the same as previously explained. Figure 4 displays the algorithm for the forward propagation.<br />
<br />
[[File:WT_Fig6.PNG|650px|center|]]<br />
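<br />
The forward pass described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the averaging form of the Haar transform from the worked example, it assumes a single 2-D feature map whose sides are divisible by 4, and the names <code>haar_dwt2</code>, <code>haar_idwt2</code> and <code>wavelet_pool</code> (as well as the detail-band labels inside them) are this sketch's own.<br />

```python
import numpy as np

def haar_dwt2(x):
    """One Haar level (averaging form): returns LL and three detail bands."""
    lo = (x[:, 0::2] + x[:, 1::2]) / 2.0    # row low-pass
    hi = (x[:, 0::2] - x[:, 1::2]) / 2.0    # row high-pass
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0  # column low-pass of lo
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0  # column high-pass of lo
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0  # column low-pass of hi
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0  # column high-pass of hi
    return ll, (lh, hl, hh)

def haar_idwt2(ll, details):
    """Exact inverse of haar_dwt2: undo the column pass, then the row pass."""
    lh, hl, hh = details
    rows, cols = ll.shape
    lo = np.empty((2 * rows, cols))
    hi = np.empty((2 * rows, cols))
    lo[0::2, :], lo[1::2, :] = ll + lh, ll - lh
    hi[0::2, :], hi[1::2, :] = hl + hh, hl - hh
    x = np.empty((2 * rows, 2 * cols))
    x[:, 0::2], x[:, 1::2] = lo + hi, lo - hi
    return x

def wavelet_pool(x):
    """Keep only the level 2 sub-bands and reconstruct from them."""
    ll1, _level1_details = haar_dwt2(x)     # first-level details are discarded
    ll2, level2_details = haar_dwt2(ll1)    # second-level decomposition
    return haar_idwt2(ll2, level2_details)  # half the height and width of x

image = np.array([[100, 50, 60, 150],
                  [20, 60, 40, 30],
                  [50, 90, 70, 82],
                  [74, 66, 90, 58]], dtype=float)

pooled = wavelet_pool(image)
print(pooled.shape)  # (2, 2)
print(pooled)
```

Because the decomposition and reconstruction in this sketch are exact, the output reproduces the first-level LL band of the input, which for the 4x4 example image is the top-left quadrant of its first-level decomposition.<br />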
=== Back Propagation ===<br />
This is simply the reverse of the forward propagation. The FWT of the image is upsampled to be used as the level 2 decomposition. Then the IFWT is performed to obtain an image of the original size, that is, one upsampled by a factor of two using wavelet methods. Figure 5 displays the algorithm.<br />
<br />
[[File:WT_Fig7.PNG|650px|center|]]<br />
<br />
== Results ==<br />
The authors tested on MNIST, CIFAR-10, SVHN, and KDEF, and the paper provides comprehensive results for each. Stochastic gradient descent was used for training, and the Haar wavelet was chosen due to its even, square subbands. The network for all datasets except MNIST is loosely based on (Zeiler & Fergus, 2013). For each dataset, the authors keep the network fixed and change only the pooling method. They also experiment with dropout and Batch Normalization to examine the effects of regularization on their method. All pooling methods compared use a 2x2 window. The overall results suggest that the pooling method should be chosen to suit the type of data at hand: in some cases wavelet pooling performs best, and in others a different method performs better because the data is more suited to it.<br />
<br />
=== MNIST ===<br />
Figure 6 shows the network and Table 1 shows the accuracy. It can be seen that wavelet pooling achieves the best accuracy among all pooling methods compared.<br />
<br />
[[File:WT_Fig4.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab1.PNG|650px|center|]]<br />
<br />
=== CIFAR-10 ===<br />
Figure 7 shows the network and Tables 2 and 3 show the accuracy without and with dropout. Average pooling achieves the best accuracy, but wavelet pooling is still competitive.<br />
<br />
[[File:WT_Fig5.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab2.PNG|650px|center|]]<br />
<br />
[[File:WT_Tab3.PNG|650px|center|]]<br />
<br />
<br />
== Computational Complexity ==<br />
The authors explain that their paper is a proof of concept and is not meant to implement wavelet pooling in the most efficient way. The table below compares the number of mathematical operations required by each method for each dataset. It can be seen that wavelet pooling requires significantly more operations than the other methods. The authors argue that with good implementation and coding practices, the method can prove to be viable.<br />
<br />
[[File:WT_Tab4.PNG|650px|center|]]<br />
<br />
== Criticism ==<br />
=== Positive ===<br />
* Wavelet pooling achieves competitive performance with standard go-to pooling methods
* Leads to comparison of discrete transformation techniques for pooling (DCT, DFT)<br />
=== Negative ===<br />
* Only 2x2 pooling window used for comparison<br />
* Highly computationally expensive
* Not as simple as other pooling methods
* Only one wavelet used (Haar wavelet)
<br />
== References ==</div>H5tahir