statwiki - User contributions [US]

Pixels to Graphs by Associative Embedding

2018-12-01T04:39:10Z

J385chen: Added problem motivation and previous work summary (extracted from paper)

== Introduction ==

Extracting semantics from images is one of the main goals of computer vision. Recent years have seen rapid progress in the classification and localization of objects [7, 24, 10]. But a bag of labeled
and localized objects is an impoverished representation of image semantics: it tells us what and where the objects are (“person” and “car”), but does not tell us about their relations and interactions (“person next to car”). A necessary step is thus to not only detect objects but to identify the relations between them. An explicit representation of these semantics is referred to as a scene graph where we represent objects grounded in the scene as vertices and the relationships between them as edges. [1]

End-to-end training of convolutional networks has proven to be a highly effective strategy for image understanding tasks. It is therefore natural to ask whether the same strategy would be viable for predicting graphs from pixels. Existing approaches, however, tend to break the problem down into more manageable steps. For example, one might run an object detection system to propose all of the objects in the scene, then isolate individual pairs of objects to identify the relationships between them. This breakdown often restricts the visual features used in later steps and limits reasoning over the full graph and over the full contents of the image. [1]

The paper presents a novel approach to generating a scene graph. A scene graph, as it relates to an image, is a graph with one vertex for each object identified in the image and edges that represent the relationships between those objects.

An example of a scene graph:

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Scene Graph.PNG]]</div>

Current state-of-the-art techniques break down the construction of scene graphs by first identifying objects and then predicting the edges for any given pair of identified objects. By using this technique, reasoning over
the full graph would be limited. On the other hand, this paper introduces an architecture that defines the entire graph directly from the image, enabling the network to reason across the entirety of the image to understand relationships, as opposed to only predicting relationships using object labels.

A key concern, given that the new architecture produces both vertices (objects) and edges (relationships), is connecting the two. Specifically, the output of the network is some set of relationships E, and some set of vertices V. The network needs to also output the “source” and “destination” of each relationship, so that the final graph can be formed. In the image above, for example, the network would also need to tell us that “holding” comes from “person” and goes to “Frisbee”. To do this, the paper uses associative embeddings. Specifically, the network outputs a particular “embedding vector” for each vertex, as well as a “source embedding” and “destination embedding” for each relationship. A final post-processing step finds the vertex embedding closest to each of the source/destination embeddings of each relationship and in this way assigns the edges to pairs of vertices.
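The post-processing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 2-D toy embeddings, and the use of Euclidean distance are assumptions made for the example.

```python
import math


def nearest_vertex(embedding, vertex_embeddings):
    """Return the index of the vertex embedding closest to `embedding`."""
    return min(range(len(vertex_embeddings)),
               key=lambda i: math.dist(embedding, vertex_embeddings[i]))


def assemble_edges(vertex_embeddings, relationships):
    """relationships: list of (label, source_embedding, destination_embedding).

    Returns (source_vertex_index, label, destination_vertex_index) triples.
    """
    edges = []
    for label, src_emb, dst_emb in relationships:
        s = nearest_vertex(src_emb, vertex_embeddings)
        t = nearest_vertex(dst_emb, vertex_embeddings)
        edges.append((s, label, t))
    return edges


# Toy data (hypothetical values): two detected vertices and one relationship.
vertices = ["person", "frisbee"]
vertex_embeddings = [[0.0, 0.0], [10.0, 10.0]]
relationships = [("holding", [0.4, -0.3], [9.1, 10.5])]

edges = assemble_edges(vertex_embeddings, relationships)
print([(vertices[s], label, vertices[t]) for s, label, t in edges])
```

Here the source embedding of "holding" lies nearest to the "person" vertex and its destination embedding lies nearest to "frisbee", so the edge is attached to that pair.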

== Previous Work ==

In the field of relationship detection, existing state-of-the-art work includes the following:

1) Framing the task of identifying objects using localization from referential expressions, detection of human-object interactions, or the more general tasks of Visual Relationship Detection (VRD) and scene graph generation.

2) Visual relationship detection methods like message passing RNNs and predicting over triplets of bounding boxes.

In the field of associative embedding, the following are some interesting applications:

1) Vector embeddings to group together body joints for multi-person pose estimation.

2) Vector embeddings to detect body joints of the various people in an image.

Reference Figure from the paper "Associative embedding: End-to-end learning for joint detection and grouping."

[[File:Oct30_associative_embedding_appendix_fig2.jpg | center]]

== Pixels To Graphs ==
The goal of the paper is to construct a graph from a set of pixels and, in particular, a graph grounded in the space of these pixels: in addition to identifying the vertices of the graph, we want to know their precise locations. A vertex, in this case, can refer to any object of interest in the scene, including people, cars, clothing, and buildings. The relationships between these objects are then captured by the edges of the graph. These relationships may include verbs (eating, riding), spatial relations (on the left of, behind), and comparisons (smaller than, same color as).

Formally, we consider a directed graph <math>G = (V, E)</math>. A given vertex <math>v_i \in V</math> is grounded at a location <math>(x_i, y_i)</math> and defined by its class and bounding box. Each edge <math>e_i \in E</math> takes the form <math>e_i = (v_s, v_t, r_i)</math>, defining a relationship of type <math>r_i</math> from <math>v_s</math> to <math>v_t</math>. We train a network to explicitly define <math>V</math> and <math>E</math>. This training is done end-to-end on a single network, allowing the network to reason fully over the image and all possible components of the graph when making its predictions.

== The Architecture: ==
: '''1. Detecting Graph Elements'''

Given an image of dimensions h x w, a stacked hourglass network (Appendix 2) is used to generate an h x w x f representation of the image. Note that the resolution of this output is a fixed design choice rather than a trained parameter, and it needs to fulfill certain criteria. Specifically, the resolution must be large enough to minimize the number of pixels with multiple detections while also being small enough to ensure that each 1 x 1 x f vector still contains the information needed for subsequent inference.

A 1x1 convolution followed by a sigmoid activation is applied to this result to generate heat maps (one for objects and one for relationships, each produced by its own convolution). The value at a given pixel can be interpreted as the likelihood of a detection at that pixel in the original image.

In order to claim that there is an element at some pixel, we need a likelihood threshold, denoted p-hat: if a given pixel in the heat map has a value >= p-hat, we claim that there is an element at that pixel. The heat maps themselves are supervised with a binary cross-entropy loss on their final values.
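As a rough sketch of the detection step, the snippet below applies a sigmoid to a small grid of raw scores, keeps the pixels at or above a threshold, and shows the per-pixel binary cross-entropy used for supervision. The grid values, the default threshold of 0.5, and the function names are assumptions chosen for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy for one pixel: p is the predicted likelihood, y is 0 or 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def detect_elements(logits, threshold=0.5):
    """Return (row, col, likelihood) for every pixel whose likelihood is >= threshold."""
    detections = []
    for r, row in enumerate(logits):
        for c, z in enumerate(row):
            p = sigmoid(z)
            if p >= threshold:
                detections.append((r, c, p))
    return detections


# Hypothetical 2 x 2 map of raw scores produced by the 1x1 convolution.
logits = [[-4.0, 2.0],
          [0.1, -1.0]]
detections = detect_elements(logits, threshold=0.5)
print([(r, c) for r, c, _ in detections])
```

Each detected location would then be used to index the h x w x f feature tensor and extract the corresponding 1 x 1 x f vector.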

Finally, for each element that we detected, we extract the 1 x 1 x f feature vector. This is then used as an input to a set of Feed Forward Neural Networks (FFNNs), where we have a separate network for each characteristic of interest, and each network has one hidden layer with f nodes. The object class and relationship (edge) predictions can be supervised with a softmax loss. Furthermore, in order to predict the bounding box of the object, we can use the approach proposed by the Faster R-CNN model [3]. The following image summarizes the process.

[[File:Extraction Process.PNG|center]]

:'''2. Connecting Elements with Associative Embeddings'''
As explained earlier, to construct the scene graph, we need to know the source and destination of each edge. This is done through associative embeddings.

First, let <math>h_i \in \mathbb{R}^d</math> be the embedding produced for vertex <math>i</math>, and assume that we have <math>n</math> object detections in a particular image. Now define <math>h_{ik}</math>, for <math>k = 1, \dots, K_i</math> (where <math>K_i</math> is the number of edges in the graph that have vertex <math>i</math> as an endpoint), as the embedding associated with an edge that touches vertex <math>i</math>. We define two loss functions on these sets.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 1.PNG]]</div>

The goal of Lpull is to minimize the squared differences between the embedding of a given vertex and the embedding of an edge that references said vertex.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 2.PNG]]</div>

Minimizing Lpush, on the other hand, pushes the embeddings of different vertices as far apart as possible. The further apart two embeddings are, the lower the output of the max becomes, until it eventually reaches 0. Here, m is a constant margin. In the paper, the values used were m = 8 and d = 8 (that is, 8-dimensional embeddings). Combining these two loss functions (and weighting them equally) accomplishes the task of predicting embeddings such that vertices are differentiated from one another, while the embedding of each edge endpoint is most similar to that of the vertex it references.
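The two losses can be sketched in a few lines. The exact formulas are the ones shown in the figures above; this sketch assumes a mean of squared differences for the pull term and a hinge on the pairwise Euclidean distance for the push term, and all names and toy values are illustrative.

```python
import math


def pull_loss(vertex_embs, edge_embs):
    """Mean squared distance between each vertex embedding and the edge embeddings that reference it.

    edge_embs[i] is the list of edge-endpoint embeddings referencing vertex i.
    """
    total, count = 0.0, 0
    for h_i, refs in zip(vertex_embs, edge_embs):
        for h_ik in refs:
            total += sum((a - b) ** 2 for a, b in zip(h_i, h_ik))
            count += 1
    return total / count if count else 0.0


def push_loss(vertex_embs, margin=8.0):
    """Hinge penalty on every pair of vertex embeddings that are closer than `margin`."""
    total = 0.0
    n = len(vertex_embs)
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += max(0.0, margin - math.dist(vertex_embs[i], vertex_embs[j]))
    return total


# Toy 2-D embeddings (the paper uses d = 8): two vertices, three edge endpoints.
vertex_embs = [[0.0, 0.0], [3.0, 4.0]]
edge_embs = [[[1.0, 0.0]], [[3.0, 4.0], [3.0, 2.0]]]

print(pull_loss(vertex_embs, edge_embs))       # averages squared distances 1, 0 and 4
print(push_loss(vertex_embs, margin=8.0))      # the vertices are 5 apart, so the hinge gives 8 - 5
```

Once two vertex embeddings are at least m apart, their pair contributes nothing to the push term, which matches the description of the max reaching 0.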

:'''3. Support for Overlapping Detections'''
An obvious concern is how the network would operate if there were more than one detection (be it object or relationship) in a given pixel. For example, detections of “shirt” and “person” may be centered at the exact same pixel. To account for this, the architecture is modified to allow for “slots” at each pixel. Specifically, <math>s_o</math> object detections are allowed at a particular pixel, while <math>s_r</math> relationship detections are allowed at a given pixel.

In order to allow for this, some changes are required after the feature extraction step. Specifically, we now use the 1x1xf vector as the input for <math>s_o</math> (or <math>s_r</math>) different sets of 4 FFNNs, where the outputs of the first three are as shown in Figure 2, and the final FFNN outputs the probability of a detection existing in that particular slot at that particular pixel. This additional network is trained exclusively on whether or not a detection has been made in that slot and, at prediction time, is used to determine the number of slots to output at a given pixel. It is critical to note that these <math>s_o</math> (or <math>s_r</math>) sets of FFNNs share no weights, and each is trained for detection in its assigned slot.

It is important to note that this implies a change in the training procedure. We now have <math>s_o</math> (or <math>s_r</math>) different predictions (be it class, or class + bounding box) that we need to match with our set of ground-truth detections at a given pixel. Without this step, we would not be able to assign a value to the error for that sample. To do this, we build a reference vector for each ground-truth detection by one-hot encoding its class and bounding box anchor, and then match these reference vectors with the <math>s_o</math> (or <math>s_r</math>) outputs provided at a given pixel. The Hungarian method is used to find the best matching between the outputs and the reference vectors while ensuring that we do not assign the same detection to multiple slots.
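To make the matching step concrete, here is a small sketch that assigns ground-truth detections to slots so that the total cost is minimal. Note that it uses a brute-force search over permutations as a stand-in for the Hungarian method (it returns the same optimal assignment but is only practical for a handful of slots), and the cost matrix values are made up for the example.

```python
from itertools import permutations


def match_slots(cost):
    """Assign each ground-truth detection to a distinct slot with minimal total cost.

    cost[g][s] is the cost of matching ground-truth detection g to slot s
    (for example, a distance between the reference vector and the slot output).
    Assumes at least one ground-truth detection and no more detections than slots.
    Returns (assignment, total_cost), where assignment[g] is the slot index.
    """
    n_gt = len(cost)
    n_slots = len(cost[0])
    best, best_cost = None, float("inf")
    for slots in permutations(range(n_slots), n_gt):
        total = sum(cost[g][s] for g, s in enumerate(slots))
        if total < best_cost:
            best, best_cost = slots, total
    return list(best), best_cost


# Two ground-truth detections at one pixel, three available slots (hypothetical costs).
cost = [[0.9, 0.1, 0.5],
        [0.2, 0.3, 0.8]]
assignment, total = match_slots(cost)
print(assignment)
```

Because each permutation uses a slot at most once, no two ground-truth detections can end up in the same slot, which is the constraint described above.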

==Results==
A quick note on notation: R@50 indicates what percentage of ground-truth subject-predicate-object tuples appeared in a proposal of 50 such tuples. Since R@100 allows more proposals, it will always be at least as high as R@50. A value of 6.7, for example, indicates that 6.7% of the ground-truth tuples appeared in the proposals of the network.
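The R@K metric described above can be computed as in the following sketch, which assumes the proposals are already ranked by confidence and that tuples are compared by exact equality (the actual benchmark also checks bounding-box overlap, which is omitted here). The tuples are invented for illustration.

```python
def recall_at_k(ranked_proposals, ground_truth, k):
    """Fraction of ground-truth tuples that appear among the top-k proposals."""
    top_k = set(ranked_proposals[:k])
    hits = sum(1 for t in ground_truth if t in top_k)
    return hits / len(ground_truth)


# Hypothetical ranked proposals and ground truth for one image.
proposals = [("person", "holding", "frisbee"),
             ("person", "wearing", "shirt"),
             ("dog", "on", "grass")]
ground_truth = [("person", "holding", "frisbee"),
                ("dog", "on", "grass")]

print(recall_at_k(proposals, ground_truth, 2))
print(recall_at_k(proposals, ground_truth, 3))
```

Increasing k can only add tuples to the top-k set, so the recall never decreases as k grows.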

The authors tested the network against two other architectures designed to develop a semantic understanding of images. For this, they used the Visual Genome dataset, with <math>s_o</math> = 3 and <math>s_r</math> = 6. Overall, the new architecture vastly outperformed past models. The results were as follows:

[[File:Results Table.PNG|center|600px]]

The table can be interpreted as follows:

::'''SGGen (no RPN)''': Given only an image, without the use of a Region Proposal Network (RPN), the network must produce the full scene graph, and the accuracy of that graph is measured. No class predictions are provided.
::'''SGGen (with RPN)''': Same as above, except that the output of a Region Proposal Network is used to enhance the input for a given image. No class predictions are provided.
::'''SGCls''': Ground-truth object bounding boxes are provided. The network is asked to classify them and determine relationships.
::'''PredCls''': As above, except the classes are also provided. The only goal is to predict relationships.

Further analysis into the accuracy, when looking at predicates individually, shows that the architecture is very sensitive to over-represented relationship predicates.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Results - Part 2.PNG]]</div>

As shown in Figure 5, for many ground-truth predicates (those that do not appear often in the ground truth), the network does poorly. Even when allowed to propose 100 tuples, the network does not offer the predicate. Figure 4 simply observes the fact that certain sets of relationship predicates appear predominantly in a subset of slots. No general explanation has been offered for this behavior.

== Conclusion ==
In conclusion, the paper offers a novel approach that enables the extraction of image semantics while reasoning over the entire context of the image. Associative embeddings are used to connect objects and relationships, and parallel “slots” allow for multiple detections in one pixel. While this approach offers noticeable improvements in accuracy, it is clear that work needs to be done to account for the non-uniform distribution of relationships in the dataset.

== Critiques ==

The paper's contributions towards patterning unordered network outputs and using associative embeddings for connecting vertices and edges are commendable. However, it should be noted that this paper is only an incremental improvement over existing well-studied architectures such as the hourglass architecture. The modifications are not sufficiently supported by mathematical reasoning. The authors say that they make a slight modification to the hourglass design, double the number of features, and weight all the losses equally, but no scientific justification is given for why this is needed. The choice of 3 and 6 as the constants for <math display = "inline"> s_o</math> and <math display = "inline"> s_r</math> is also not clear, as the authors leave out a fraction of the cases. I am not sure if the changes made are truly a critical advance, as the experiments are conducted on only a single dataset and the authors make no generalizability arguments. The methods might therefore work well only for this dataset. The theoretical analysis in the paper comes directly from the hourglass literature and cannot be counted as novel.
The paper could have identified the effect of each modification by analyzing the structure of the network being presented. However, a detailed mathematical and structural analysis of each modification is lacking.

== Appendices ==

'''Appendix 1: Sample Outputs'''
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Sample Pixel Graph Outputs.PNG]]</div>

'''Appendix 2: Stacked Hourglass Architecture'''
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Hourglass.PNG]]</div>

Although this goes beyond the focus of the paper, I would like to add a brief overview of the stacked hourglass architecture used to generate the heat map. This architecture is unique in that it allows cyclical top-down, bottom-up inference and recombination of features. While most architectures focus on optimizing the bottom-up portion (reducing dimensionality), the stacked-hourglass gives the network more flexibility in how it generates a representation by allowing it to learn a series of down-sampling / up-sampling steps.

When you downsample and then upsample, a large amount of information is potentially lost in the upsampled reconstruction. With a naive approach, this often results in poor reconstruction. This problem is accentuated when we stack multiple layers of downsampling and upsampling in the stacked hourglass architecture. To alleviate this issue, we add skip layers. Skip layers essentially allow earlier layers to send outputs into multiple later layers. The added information from the earlier layers ensures that the reconstructed representation does not lose too much of the detail that was discarded during downsampling.

[[File:skip+layers+Max+fusion+made+learning+difficult+due+to+gradient+switching..jpg|center|900px]]

== References ==
1. Alejandro Newell and Jia Deng. Pixels to Graphs by Associative Embedding. NIPS, 2017.

2. Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation. ECCV, 2016.

3. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS, pages 91–99, 2015.

Learning to Teach

2018-12-01T04:35:24Z

J385chen:

=Introduction=

This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.

In modern human society, teaching plays a central role in our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data that corresponds to the appropriate teaching materials (e.g., textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.

Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the machine learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the reinforcement learning process to update its parameters. In this paper, a policy gradient algorithm is used. This process is end-to-end trainable, and the authors argue that, once converged, the teacher model could be applied to new learning scenarios and even new students without extra effort on re-training.

To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), for different applications including image classification and text understanding.
Furthermore, the teacher model obtained from one task can be smoothly transferred to other tasks. As an example, with the teacher model trained on MNIST with the MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.

=Related Work=
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)

The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes an ordering of the data from easy instances to hard ones, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data.

The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.

=Learning to Teach=
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.

In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.

==Problem Definition==
In supervised learning, the goal is to choose a function <math display="inline">f_\omega(x)</math>, with <math display="inline">\omega</math> as the parameter vector, that predicts the supervisor's label as well as possible. The goodness of a function <math display="inline">f_\omega</math> is evaluated by the risk function:

\begin{align*}R(\omega) = \int M(y, f_\omega(x))dP(x,y)\end{align*}

where <math display="inline">M(\cdot,\cdot)</math> is the metric that evaluates the gap between the label and the prediction.

The student model, denoted <math>\mu()</math>, takes the set of training data <math> D </math>, the function class <math> \Omega </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{\omega^*} </math>, whose parameter <math>\omega^*</math> minimizes the empirical loss on <math> D </math> as a proxy for the risk <math>R(\omega)</math>:

\begin{align*}
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)
\end{align*}
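As a toy illustration of the student model as a function of the data, the loss, and the hypothesis space, the sketch below performs the minimization over a small finite set of candidate parameters. The finite search, the one-parameter linear model, and all names are assumptions made to keep the example self-contained; a real student would use gradient-based optimization.

```python
def student(data, loss, hypothesis_space, predict):
    """Return the parameter in `hypothesis_space` with the smallest total loss on `data`."""
    return min(hypothesis_space,
               key=lambda w: sum(loss(y, predict(w, x)) for x, y in data))


def squared_loss(y, y_hat):
    return (y - y_hat) ** 2


def linear(w, x):
    """A one-parameter model f_w(x) = w * x."""
    return w * x


# Hypothetical training set D and hypothesis space Omega.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.3)]
omega = [0.5, 1.0, 1.5, 2.0, 2.5]

w_star = student(data, squared_loss, omega, linear)
print(w_star)
```

Changing any of the three inputs (the data, the loss, or the set of candidate parameters) can change the returned parameter, which is exactly the lever the teacher model is given.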

The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.
In contrast to traditional machine learning, which is only concerned with the student model, the learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk functional as efficiently as possible.

::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g., middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of Ω lead to different errors and optimization problems (Mohri et al., 2012).

==Framework==
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:

[[File: L2T_process.png | 500px|center]]

* <math> s_t ∈ S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.
* <math> a_t ∈ A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.
* Student model takes <math> a_t </math> as input and outputs function <math> f_t </math>, by using the conventional ML techniques.

Mathematically, taking data teaching as an example, where <math>L</math> and <math>\Omega</math> are fixed, the objective of the teacher in the L2T framework is

<center> <math>\max\limits_{\theta}{\sum\limits_{t}{r_t}} = \max\limits_{\theta}{\sum\limits_{t}{r(f_t)}} = \max\limits_{\theta}{\sum\limits_{t}{r(\mu(\phi_{\theta}(s_t), L, \Omega))}}</math> </center>

Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.

=Application=

There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure is classification accuracy. Its learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of the mini-batch of data will be fed to the student. In order to reach convergence faster, the reward is set to relate to the speed at which the student model learns.

The authors also designed a state feature vector <math> g(s) </math> in order to efficiently represent the current state, which includes the arrived training data and the student model. The state features fall into three categories: data features, student model features, and features combining both the data and the learner model. The state feature vector is computed when each mini-batch of data arrives.

Data features contain information about a data instance, such as its label category, (for texts) the length of the sentence and linguistic features for text segments (Tsvetkov et al., 2016), or (for images) gradient histogram features (Dalal & Triggs, 2005).

Student model features include signals reflecting how well the current neural network is trained. The authors collect several simple features, such as the number of mini-batches passed (i.e., the iteration), the average historical training loss, and the historical validation accuracy.

Some additional features are collected to represent the combination of both the data and the learner model. By using these features, the authors aim to represent how important the arrived training data is for the current learner. The authors mainly use three such signals in their classification tasks: 1) the predicted probabilities of each class; 2) the loss value on that data, which appears frequently in self-paced learning (Kumar et al., 2010; Jiang et al., 2014a; Sachan & Xing, 2016); 3) the margin value.
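The three combined signals listed above could be computed for a single training instance roughly as follows. This sketch assumes the loss is the cross-entropy of the true class and that the margin is the true-class probability minus the largest probability among the other classes; both definitions, the function name, and the sample values are assumptions rather than details taken from this summary.

```python
import math


def combined_features(probs, label):
    """Build the data-and-model feature part of g(s) for one instance.

    probs: predicted class probabilities from the current student model.
    label: index of the true class.
    Returns the probabilities followed by the loss value and the margin value.
    """
    loss = -math.log(probs[label])                                   # assumed: cross-entropy loss
    best_other = max(p for i, p in enumerate(probs) if i != label)
    margin = probs[label] - best_other                               # assumed margin definition
    return list(probs) + [loss, margin]


features = combined_features([0.7, 0.2, 0.1], label=0)
print(features)
```

A confidently correct instance has a low loss and a large positive margin, while a misclassified instance has a negative margin, which gives the teacher a compact view of how hard each instance currently is for the student.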

The objective for training the teacher model is to maximize the expected reward:

\begin{align}
J(\theta) = E_{\phi_\theta(a|s)}[R(s,a)]
\end{align}

This objective is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4]. The estimation is based on the gradient <math>\nabla_{\theta} J(\theta) = \sum_{t=1}^{T}E_{\phi_{\theta}(a_t|s_t)}[\nabla_{\theta}\log(\phi_{\theta}(a_t|s_t))R(s_t, a_t)]</math>, which is empirically estimated as <math>\sum_{t=1}^{T} \nabla_{\theta}\log(\phi_{\theta}(a_t|s_t))v_t</math>, where <math>v_t</math> is the sampled estimate of the reward <math>R(s_t, a_t)</math> from one execution of the policy. Given that the only reward is the terminal reward <math>r_T</math>, we have <math>\nabla_{\theta} J(\theta) = \sum_{t=1}^{T} \nabla_{\theta}\log(\phi_{\theta}(a_t|s_t))r_T</math>.
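A single policy gradient update with a terminal reward can be sketched as below. The sketch assumes a Bernoulli keep/filter policy whose keep probability is a logistic function of the state features, which is an assumption made for the example and may differ from the paper's teacher network; for such a policy the gradient of the log-probability of action a is (a − p) · g(s).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def policy_gradient_step(theta, episode, terminal_reward, lr=0.1):
    """One REINFORCE-style update using only the terminal reward r_T.

    theta: list of policy parameters.
    episode: list of (state_features, action) pairs, with action 1 = keep, 0 = filter.
    """
    grad = [0.0] * len(theta)
    for features, action in episode:
        p_keep = sigmoid(sum(t * f for t, f in zip(theta, features)))
        for j, f in enumerate(features):
            # grad of log phi_theta(a|s) for a Bernoulli-logistic policy, scaled by r_T
            grad[j] += (action - p_keep) * f * terminal_reward
    # Gradient ascent, since the goal is to maximize the expected reward.
    return [t + lr * g for t, g in zip(theta, grad)]


# Hypothetical two-step episode with 2-D state features.
theta = [0.0, 0.0]
episode = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
new_theta = policy_gradient_step(theta, episode, terminal_reward=1.0, lr=0.1)
print(new_theta)
```

With a positive terminal reward, the update raises the probability of the actions that were taken in the episode; with a negative reward it lowers them.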

==Experiments==

The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long Short-Term Memory (LSTM) network (RNN).

The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset.

The strategy will be benchmarked against the following teaching strategies:

::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss values to train the student with "easy" data and gradually increases the hardness. Mathematically speaking, training data <math>d</math> satisfying <math>l(d) > \eta</math> will be filtered out, where the threshold <math>\eta</math> grows from smaller to larger during the training process. To improve the robustness of SPL, following a widely used trick in common SPL implementations (Jiang et al., 2014b), the authors filter training data using its loss rank in one mini-batch rather than the absolute loss value: they filter the data instances with the top <math>K</math> largest training loss values within an <math>M</math>-sized mini-batch, where <math>K</math> linearly drops from <math>M − 1</math> to 0 during training.

::'''L2T''': The Learning to Teach framework.
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).
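The rank-based SPL baseline described in the list above can be sketched as follows. The `progress` argument (a number from 0 at the start of training to 1 at the end) and the rounding of K are assumptions introduced for this example.

```python
def spl_filter(losses, progress):
    """Return the indices of the instances kept by rank-based self-paced learning.

    losses: per-instance training losses in one M-sized mini-batch.
    progress: fraction of training completed, in [0, 1].
    The K instances with the largest losses are dropped, where K falls
    linearly from M - 1 (start of training) to 0 (end of training).
    """
    m = len(losses)
    k = round((m - 1) * (1.0 - progress))
    easiest_first = sorted(range(m), key=lambda i: losses[i])
    return sorted(easiest_first[: m - k])


losses = [0.9, 0.1, 0.5, 0.3]
print(spl_filter(losses, progress=0.0))   # only the easiest instance is kept
print(spl_filter(losses, progress=1.0))   # nothing is filtered
```

Unlike this fixed schedule, the L2T teacher decides what to filter from the state features, so its filtering pattern can differ from task to task.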

For all teaching strategies, the authors make sure that the base neural network model is not updated until <math>M</math> selected but not-yet-trained data instances have accumulated. This guarantees that the convergence speed is determined only by the quality of the taught data, not by different model updating frequencies. The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.
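The accumulation rule above can be illustrated with a small buffer that triggers one student update for every M selected instances. The class name and the `update_fn` callback are hypothetical and stand in for one mini-batch SGD step of the student.

```python
class BufferedUpdater:
    """Accumulates teacher-selected instances and updates the student once per M of them."""

    def __init__(self, batch_size, update_fn):
        self.batch_size = batch_size   # M in the text
        self.update_fn = update_fn     # hypothetical callback performing one student update
        self.buffer = []

    def offer(self, instances, keep_flags):
        """Buffer the instances the teacher kept; update whenever M have accumulated."""
        self.buffer.extend(x for x, keep in zip(instances, keep_flags) if keep)
        while len(self.buffer) >= self.batch_size:
            batch = self.buffer[:self.batch_size]
            self.buffer = self.buffer[self.batch_size:]
            self.update_fn(batch)


updates = []
updater = BufferedUpdater(batch_size=4, update_fn=updates.append)
updater.offer(["a", "b", "c", "d"], [True, False, True, False])   # 2 kept, no update yet
updater.offer(["e", "f", "g", "h"], [True, True, True, False])    # 5 buffered, one update
print(updates, updater.buffer)
```

Because every strategy performs an update only after M selected instances, a strategy that filters more aggressively does not get more frequent updates, only different data.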
===Training a New Student===

In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher; this is done by having the teacher train a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:

[[File: L2T_speed.png | 1100px|center]]

===Filtration Number===

When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for the two image classification tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.

[[File: L2T_fig3.png | 1100px|center]]

===Teaching New Student with Different Model Architecture===

In this part, a teacher model is first trained by interacting with a student model. Then, using the teacher model, another student model with a different architecture is taught. The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.

[[File: L2T_fig4.png | 1100px|center]]

===Training Time Analysis===

The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.

[[File: L2T_fig5.png | 600px|center]]

===Accuracy Improvement===

When comparing training accuracy on the IMDB sentiment classification task, L2T improves on teaching policy over NoTeach and SPL.

[[File: L2T_t1.png | 500px|center]]

Table 1 shows that, in addition to boosting convergence speed, the teacher model improves final accuracy. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, the teacher model is trained on half of the training data, with the terminal reward defined as the held-out (dev) set accuracy after the teacher model trains the student for 15 epochs. The teacher model is then applied to train the student model on the full dataset until convergence. The state features are kept the same as those in previous experiments. L2T achieves better classification accuracy for training the LSTM network, surpassing the SPL baseline by more than 0.6 points (with p-value < 0.001).

=Future Work=

There is some useful future work that can be extended from this work:

1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper.

2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework.

3) It would be interesting to try out the framework suggested in this paper (L2T) in imperfect-information and partially observable settings.

4) As the authors have focused on data teaching, exploring loss function teaching would be interesting.

=Critique=

While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Also, the paper does not provide enough mathematical foundation to prove that this model generalizes to other datasets and other general problems. The method presented here, where the teacher model filters data, does not seem to provide enough action space for the teacher model. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR100 in their experiments, which would have provided more insight.

Also, teaching should not be limited to data, loss function and hypothesis space. In a human teacher-student model, the teaching contents are concepts and logical rules, similar to weights of hidden layers in neural networks. How to transfer such knowledge is interesting to investigate.

The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. In the CIFAR10 training scenario, the L2T model achieves 85% accuracy after 2 million training instances, which is only about 3% more accurate than a no-teacher model. Perhaps in the future, the L2T framework can improve and produce better performance.

Learning to Teach

2018-12-01T04:30:03Z

J385chen:

=Introduction=

This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.

In modern human society, the role of teaching is heavily implicated in our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data that corresponds to the appropriate teaching materials (e.g. textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.

Thus, the training phase of L2T consists of several episodes of interactions between the teacher and the student model. Based on the state information at each step, the teacher model updates its teaching actions so that the student model can perform better on the machine learning problem. The student model then provides reward signals back to the teacher model. These reward signals are used by the teacher model as part of the reinforcement learning process to update its parameters; the paper uses a policy gradient algorithm for this. The process is end-to-end trainable, and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students without extra re-training effort.

To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), for different applications including image classification and text understanding. Furthermore, the teacher model obtained from one task can be smoothly transferred to other tasks. For example, with the teacher model trained on MNIST with the MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.

=Related Work=
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)

The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes an order of data from easy instances to hard ones, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data.

The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.

=Learning to Teach=
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.

In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.

==Problem Definition==
In supervised learning, the goal is to choose a function <math display="inline">f_\omega(x)</math>, with <math display="inline">\omega</math> as the parameter vector, that predicts the supervisor's label as well as possible. The quality of a function <math display="inline">f_\omega</math> is evaluated by the risk function:

\begin{align*}R(\omega) = \int M(y, f_\omega(x))dP(x,y)\end{align*}

where <math display="inline">M(\cdot,\cdot)</math> is the metric that evaluates the gap between the label and the prediction.

The student model, denoted <math>\mu(\cdot)</math>, takes the set of training data <math> D </math>, the function class <math> Ω </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{ω^*} </math> whose parameter <math>ω^*</math> is chosen to minimize the risk <math>R(ω)</math>, as in:

\begin{align*}
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)
\end{align*}
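
As a rough illustration of the student model <math>\mu(D, L, \Omega)</math>, the sketch below picks the parameter with the smallest summed loss from a finite candidate set. The toy hypothesis class <math>f_\omega(x) = \omega x</math>, the squared loss, and the data are assumptions made only for this example and do not come from the paper.

```python
from typing import Callable, Iterable, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def student(D: Sequence[Sample],
            L: Callable[[float, float], float],
            Omega: Iterable[float]) -> float:
    """mu(D, L, Omega): return the parameter in Omega with the smallest summed loss on D."""
    def empirical_loss(omega: float) -> float:
        # Assumed toy hypothesis class: f_omega(x) = omega * x
        return sum(L(y, omega * x) for x, y in D)
    return min(Omega, key=empirical_loss)


def squared_loss(y: float, prediction: float) -> float:
    return (y - prediction) ** 2


# Hypothetical training data and a finite candidate set standing in for Omega
D = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
Omega = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
omega_star = student(D, squared_loss, Omega)
```

In the L2T setting, the teacher would be the component that decides which <code>D</code>, <code>L</code>, or <code>Omega</code> is passed to this function.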

The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible. In contrast to traditional machine learning, which is concerned only with the student model, the learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk functional as efficiently as possible. The teacher can act on the following inputs:

::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g., middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of Ω lead to different errors and optimization problems (Mohri et al., 2012).

==Framework==
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such a generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:

[[File: L2T_process.png | 500px|center]]

* <math> s_t ∈ S </math> represents the information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.
* <math> a_t ∈ A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.
* The student model takes <math> a_t </math> as input and outputs function <math> f_t </math> using conventional ML techniques.

Mathematically, taking data teaching as an example, where <math>L</math> and <math>\Omega</math> are fixed, the objective of the teacher in the L2T framework is

<center> <math>\max\limits_{\theta}{\sum\limits_{t}{r_t}} = \max\limits_{\theta}{\sum\limits_{t}{r(f_t)}} = \max\limits_{\theta}{\sum\limits_{t}{r(\mu(\phi_{\theta}(s_t), L, \Omega))}}</math> </center>

Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.

=Application=

There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks, so the student feedback measure is classification accuracy. Its learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of the mini-batch will be fed to the student. To reach convergence faster, the reward is tied to how quickly the student model learns.

The authors also designed a state feature vector <math> g(s) </math> in order to efficiently represent the current states which include arrived training data and the student model. Within the State Features, there are three categories including Data features, student model features and the combination of both data and learner model. This state feature will be computed when each mini-batch of data arrives.

Data features contain information for data instance, such as its label category, (for texts) the length of sentence, linguistic features for text segments (Tsvetkov et al., 2016), or (for images) gradients histogram features (Dalal & Triggs, 2005).

Student model features include the signals reflecting how well current neural network is trained. The authors collect several simple features, such as passed mini-batch number (i.e., iteration), the average historical training loss and historical validation accuracy.

Some additional features are collected to represent the combination of both data and learner model. By using these features, the authors aim to represent how important the arrived training data is for the current learner. The authors mainly use three such signals in their classification tasks: 1) the predicted probabilities of each class; 2) the loss value on that data, which appears frequently in self-paced learning (Kumar et al., 2010; Jiang et al., 2014a; Sachan & Xing, 2016); 3) the margin value.
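
The combined data-and-learner signals can be sketched for a single instance as follows. This summary does not define the loss or the margin, so the sketch assumes cross-entropy loss and takes the margin to be the predicted probability of the true class minus the largest probability among the other classes; the function name <code>combined_features</code> is hypothetical.

```python
import numpy as np


def combined_features(probs, label):
    """Return [class probabilities..., loss, margin] for one training instance.

    Assumptions for illustration only: the loss is cross-entropy, and the
    margin is P(true class) minus the largest probability of any other class.
    """
    probs = np.asarray(probs, dtype=float)
    loss = -np.log(probs[label])
    margin = probs[label] - np.delete(probs, label).max()
    return np.concatenate([probs, [loss, margin]])


# Hypothetical 3-class prediction for an instance whose true label is class 0
feats = combined_features([0.7, 0.2, 0.1], label=0)
```

A confidently correct instance yields a low loss and a large positive margin, while a misclassified instance yields a negative margin.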

The objective for training the teacher model is to maximize the expected reward:

\begin{align}
J(θ) = E_{φ_θ(a|s)}[R(s,a)]
\end{align}

This objective is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4]. The estimation is based on the gradient <math>\nabla_{\theta} J(\theta) = \sum_{t=1}^{T}E_{\phi_{\theta}(a_t|s_t)}[\nabla_{\theta}\log(\phi_{\theta}(a_t|s_t))R(s_t, a_t)]</math>
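
A minimal sketch of a single-episode Monte Carlo estimate of this gradient is given below. It assumes a Bernoulli (keep or filter) policy parameterized by logistic regression on the state features and a single terminal reward shared by all steps; both are simplifying assumptions for illustration, and <code>reinforce_gradient</code> is a hypothetical name.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def reinforce_gradient(theta, states, actions, reward):
    """Estimate grad J(theta) as reward * sum_t grad log phi_theta(a_t | s_t).

    theta:   (d,) weights of the assumed policy phi(a=1 | s) = sigmoid(theta . s)
    states:  (T, d) state feature vectors g(s_t)
    actions: (T,) sampled actions in {0, 1} (1 = keep the instance)
    reward:  scalar terminal reward, used for every step (an assumption)
    """
    p_keep = sigmoid(states @ theta)
    # For a logistic Bernoulli policy, grad log phi(a | s) = (a - p) * s
    grad_log = (actions - p_keep)[:, None] * states
    return reward * grad_log.sum(axis=0)


rng = np.random.default_rng(0)
theta = np.zeros(3)
states = rng.normal(size=(5, 3))            # hypothetical state features
actions = (rng.random(5) < sigmoid(states @ theta)).astype(float)
grad = reinforce_gradient(theta, states, actions, reward=1.0)
theta_new = theta + 0.1 * grad              # one gradient-ascent step
```

Because the objective is maximized, the parameters move along the estimated gradient rather than against it.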

==Experiments==

The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN).

The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset.

The strategy will be benchmarked against the following teaching strategies:

::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss values to train the student with "easy" data and gradually increases the hardness. Mathematically speaking, training data <math>d </math> with loss value <math>l(d) > \eta </math> are filtered out, where the threshold <math> \eta </math> grows during the training process. To improve the robustness of SPL, following a widely used trick in common SPL implementations (Jiang et al., 2014b), the authors filter training data using its loss rank within a mini-batch rather than the absolute loss value: they filter the data instances with the top <math>K </math> largest training loss values within an <math>M</math>-sized mini-batch, where <math>K</math> drops linearly from <math>M − 1 </math> to 0 during training.

::'''L2T''': The Learning to Teach framework.
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).
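
The rank-based SPL filter described above can be sketched as follows. The <code>progress</code> argument (fraction of training completed) and the rounding of <math>K</math> are assumptions for illustration; the summary only states that <math>K</math> drops linearly from <math>M - 1</math> to 0.

```python
import numpy as np


def spl_keep_mask(losses, progress):
    """Return a boolean mask of the instances kept in one mini-batch.

    losses:   per-instance training losses for an M-sized mini-batch
    progress: fraction of training completed, in [0, 1] (hypothetical schedule input)
    The K instances with the largest losses are filtered out, with K going
    linearly from M - 1 at progress 0 to 0 at progress 1.
    """
    losses = np.asarray(losses, dtype=float)
    M = losses.size
    K = int(round((M - 1) * (1.0 - progress)))
    keep = np.ones(M, dtype=bool)
    if K > 0:
        hardest = np.argsort(losses)[-K:]   # indices of the K largest losses
        keep[hardest] = False
    return keep


losses = [0.9, 0.1, 0.5, 0.3]
start = spl_keep_mask(losses, progress=0.0)  # only the easiest instance is kept
end = spl_keep_mask(losses, progress=1.0)    # nothing is filtered
```

Ranking within the mini-batch avoids having to tune an absolute threshold <math>\eta</math> whose useful range changes as the loss scale shrinks during training.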

For all teaching strategies, they make sure that the base neural network model will not be updated until <math>M </math> un-trained, yet selected data instances are accumulated. That is to guarantee that the convergence speed is only determined by the quality of taught data, not by different model updating frequencies. The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.
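
The update rule above, where the student is updated only once <math>M</math> selected instances have accumulated, can be sketched with a simple buffer. The generator name and the <code>keep</code> predicate are hypothetical stand-ins for whichever teaching strategy is in use.

```python
def batches_of_selected(stream, keep, M):
    """Yield lists of exactly M selected instances from a data stream.

    stream: iterable of data instances arriving in order
    keep:   predicate standing in for the teaching strategy's filter decision
    M:      number of selected instances required before one model update
    """
    buffer = []
    for item in stream:
        if keep(item):
            buffer.append(item)
        if len(buffer) == M:
            yield buffer        # the student would be updated once per yielded batch
            buffer = []


# Toy stream of ten instances where only even-numbered ones are selected
updates = list(batches_of_selected(range(10), lambda i: i % 2 == 0, M=2))
```

With this scheme every strategy performs one update per <math>M</math> selected instances, so a strategy that filters more data consumes more of the stream per update but does not update more or less often per selected instance.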
===Training a New Student===

In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher: the teacher trains a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:

[[File: L2T_speed.png | 1100px|center]]

===Filtration Number===

When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for the two image classification tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.

[[File: L2T_fig3.png | 1100px|center]]

===Teaching New Student with Different Model Architecture===

In this part, a teacher model is first trained by interacting with a student model. Then, using the teacher model, another student model with a different architecture is taught. The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.

[[File: L2T_fig4.png | 1100px|center]]

===Training Time Analysis===

The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.

[[File: L2T_fig5.png | 600px|center]]

===Accuracy Improvement===

When comparing training accuracy on the IMDB sentiment classification task, L2T improves on teaching policy over NoTeach and SPL.

[[File: L2T_t1.png | 500px|center]]

Table 1 shows that, in addition to boosting convergence speed, the teacher model improves final accuracy. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, the teacher model is trained on half of the training data, with the terminal reward defined as the held-out (dev) set accuracy after the teacher model trains the student for 15 epochs. The teacher model is then applied to train the student model on the full dataset until convergence. The state features are kept the same as those in previous experiments. L2T achieves better classification accuracy for training the LSTM network, surpassing the SPL baseline by more than 0.6 points (with p-value < 0.001).

=Future Work=

There is some useful future work that can be extended from this work:

1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper.

2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework.

3) It would be interesting to try out the framework suggested in this paper (L2T) in imperfect-information and partially observable settings.

4) As the authors have focused on data teaching, exploring loss function teaching would be interesting.

=Critique=

While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Also, the paper does not provide enough mathematical foundation to prove that this model generalizes to other datasets and other general problems. The method presented here, where the teacher model filters data, does not seem to provide enough action space for the teacher model. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR100 in their experiments, which would have provided more insight.

Also, teaching should not be limited to data, loss function and hypothesis space. In a human teacher-student model, the teaching contents are concepts and logical rules, similar to weights of hidden layers in neural networks. How to transfer such knowledge is interesting to investigate.

The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. In the CIFAR10 training scenario, the L2T model achieves 85% accuracy after 2 million training instances, which is only about 3% more accurate than a no-teacher model. Perhaps in the future, the L2T framework can improve and produce better performance.

Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data

2018-12-01T04:22:52Z

J385chen:

=Introduction=

In highly populated cities with many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.

The motivation for this problem is in the context of 911 calls: victims trapped in a tall building who seek immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult.

In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.

In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]
[[File:19floor.png|250px]]</div>

In this work, there are two major contributions. The first is that they trained an LSTM to classify whether a smartphone was inside or outside a building using GPS, RSSI, and magnetometer sensor readings. The model is compared with baseline models like feed-forward neural networks, logistic regression, SVM, HMM, and Random Forests. The second contribution is an algorithm that uses the output of the trained LSTM to predict the change in barometric pressure of the smartphone between when it first entered the building and its current location within the building. In the final part of their algorithm, they predict the floor level by clustering the measurements of height.

The model does not rely on the external sensors placed inside the building, prior knowledge of the building, nor user movement behaviour. The only input it looks at is the GPS and the barometric signal from the phone. Finally, they also talk about the application of this algorithm in a variety of other real-world situations.

All the code and data related to this article are available at [https://github.com/williamFalcon/Predicting-floor-level-for-911-Calls-with-Neural-Networks-and-Smartphone-Sensor-Data].

=Related Work=

In general, previous work falls under two categories. The first category consists of classification methods based on the user's activity.
These methods leverage the user's activity to predict the floor level based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.
The second set of methods focuses more on the use of a barometer, which measures atmospheric pressure and can therefore capture changes in altitude.

Avinash Parnandi and his coauthors used multiple classifiers in predicting the floor level [2]. The steps in their algorithmic process are:
<ol>
<li> Classifier to predict whether the user is indoors or outdoors</li>
<li> Classifier to identify the activity of the user, e.g. walking, standing still, etc. </li>
<li> Classifier to measure the displacement</li>
</ol>

One downside of this work is that achieving high accuracy requires the user's step size, so the method relies heavily on pre-training for specific users. In a real-world application, this would not be practical.

Song and his colleagues model the way or cause of ascent, that is, whether the ascent was a result of taking the elevator, stairs, or escalator [3]. Then, by using infrastructure support of the buildings as well as additional tuning, they are able to predict the floor level.
This method also suffers from relying on data specific to the building.

Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.

=Method=

In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].

To collect data, the authors developed an iOS application (called Sensory) that runs on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength, GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second, and each datum contained the different sensor measurements mentioned earlier along with environment contexts like building floors, environment activity, city name, country name, and magnetic strength.

The data collection procedure for indoor-outdoor classifier was described as follows:
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost door) set indoors to 1. 6) As soon as we exit, set indoors to 0. 7) Stop recording. 8) Save data as CSV for analysis. This procedure can start either outside or inside a building without loss of generality.

The following procedure generates data used to predict a floor change from the entrance floor to the end floor:
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost door) set indoors to 1. 6) Finally, enter a building and ascend/descend to any story. 7) Ascend through any method desired, stairs, elevator, escalator, etc. 8) Once at the floor, stop recording. 9) Save data as CSV for analysis.

The algorithm used to predict floor level is a three-part process:

<ol>
<li> Classifying whether smartphone is indoor or outdoor </li>
<li> Indoor/Outdoor Transition detector</li>
<li> Estimating vertical height and resolving to absolute floor level </li>
</ol>

==1) Classifying Indoor/Outdoor ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div>

From [5], they use 6 features, which were found through forests-of-trees feature reduction. The features are the smartphone's barometric pressure (<math>P</math>), GPS vertical accuracy (<math>GV</math>), GPS horizontal accuracy (<math>GH</math>), GPS speed (<math>S</math>), device RSSI level (<math>rssi</math>), and magnetometer total reading (<math>M</math>).

The magnetometer total reading was calculated from the 3-dimensional reading <math>x, y, z </math>:

<div style="text-align: center;">Total Magnetic field strength <math>= M = \sqrt{x^{2} + y^{2} + z^{2}}</math></div>
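
For concreteness, the total magnetic field strength can be computed as in the short sketch below; the function name and the sample reading are hypothetical.

```python
import math


def total_magnetic_field(x, y, z):
    """Euclidean norm of the 3-axis magnetometer reading."""
    return math.sqrt(x * x + y * y + z * z)


# Hypothetical reading along the three axes
m = total_magnetic_field(3.0, 4.0, 12.0)
```

Using the magnitude rather than the individual axes makes the feature independent of how the phone is oriented.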

They used a 3-layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if the smartphone is indoors and <math> y = 0 </math> if the smartphone is outdoors.

In their design they set <math> d = 3</math> by random search [6]. The idea is that the network should learn the relationship given a little bit of information from both the past and the future.

For the overall signal sequence <math> \{x_1, x_2, ..., x_j, ... , x_n\}</math>, the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.
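
One way to form the <math>d</math>-step inputs from the raw sensor stream is a sliding window, sketched below under the assumption that consecutive windows overlap with a stride of one. How each window is aligned with its label is not specified in this summary, so the sketch only builds the inputs.

```python
import numpy as np


def sliding_windows(readings, d=3):
    """Stack every run of d consecutive readings.

    readings: array of shape (n, num_features), one row per time step
    returns:  array of shape (n - d + 1, d, num_features)
    """
    readings = np.asarray(readings)
    n = readings.shape[0]
    return np.stack([readings[i:i + d] for i in range(n - d + 1)])


# Hypothetical stream: 5 time steps of the 6 features (P, GV, GH, S, rssi, M)
stream = np.arange(30, dtype=float).reshape(5, 6)
windows = sliding_windows(stream, d=3)
```

Each element of <code>windows</code> then has the (time steps, features) shape that a sequence model such as an LSTM expects for a single example.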

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|750px]] </div>

This is a critical part of their system and they only focus on the predictions in the subspace of being indoors.

They trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math> and the LSTM's predicted indoor state <math> \hat{y} </math> for the same example.

The cost function is shown below:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|450px]] </div>

The final output of the LSTM is a time series <math> T = \{t_1, t_2, \ldots, t_i, \ldots, t_n\} </math> where <math> t_i = 0 </math> if the point is outside and <math> t_i = 1 </math> if it is inside.
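
The binary cross entropy objective used to train the classifier can be illustrated in a few lines of Python. This is only a sketch of the standard averaged form of the loss, not the authors' training code; the function name and the clipping constant <code>eps</code> are assumptions added to keep the logarithm finite.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Hypothetical labels (1 = indoor, 0 = outdoor) and predicted probabilities.
loss = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
```

Confident correct predictions drive the loss toward zero, while confident wrong predictions are penalized heavily.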

==2) Transition Detector ==

Given the predictions from the previous step, the next part is to find when a transition into or out of a building has occurred.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div>
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions <math> T </math> and pick each subset <math>s_i </math> for which the Jaccard distance (defined below) is <math> \geq 0.4 </math>.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|250px]] </div>
Jaccard Distance:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|450px]]</div>

After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.
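
The transition search can be sketched in Python as follows. Several details here are assumptions rather than the authors' implementation: the filter shapes <code>V1</code> and <code>V2</code> are illustrative stand-ins for the ones in the figure, the score is computed as intersection over union on (position, label) pairs, and neighbouring windows that all pass the 0.4 threshold are collapsed to the single best match.

```python
def jaccard(window, pattern):
    """Jaccard score of two equal-length label sequences.

    Each sequence is treated as a set of (position, label) pairs, so the
    intersection counts the positions where the two sequences agree.
    """
    matches = sum(1 for a, b in zip(window, pattern) if a == b)
    return matches / (2 * len(pattern) - matches)


def find_transitions(T, pattern, threshold=0.4):
    """Slide pattern over T and return one transition index per match region."""
    k = len(pattern)
    scores = [jaccard(T[i:i + k], pattern) for i in range(len(T) - k + 1)]
    transitions = []
    i = 0
    while i < len(scores):
        if scores[i] < threshold:
            i += 1
            continue
        # Extend over the run of consecutive windows that pass the threshold.
        j = i
        while j + 1 < len(scores) and scores[j + 1] >= threshold:
            j += 1
        best = max(range(i, j + 1), key=lambda n: scores[n])
        transitions.append(best + k // 2)  # middle of the best-matching window
        i = j + 1
    return transitions


# Assumed filter shapes; the actual V_1 and V_2 are given in the figure.
V1 = [0, 0, 0, 1, 1, 1]  # outdoor -> indoor
V2 = [1, 1, 1, 0, 0, 0]  # indoor -> outdoor

T = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # hypothetical LSTM output
entries = find_transitions(T, V1)
exits = find_transitions(T, V2)
```

With these assumed filters, a sequence that never changes state scores below the threshold everywhere, so no transition is reported for it.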

==3) Vertical height and floor level ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div>

Previous work by Song et al. [3] suggested the use of a reference barometer or beacons as a way to determine the entrances to a building.

However, the authors' approach eliminates this need. Their second key contribution is to use the LSTM indoor/outdoor (IO) predictions to help identify these transitions into the building. The LSTM provides a self-contained estimator of a building’s entrance without relying on external sensor information on a user’s body or beacons placed inside a building’s lobby [4].

In the final part of the system, the vertical offset is computed relative to the smartphone's last known indoor/outdoor transition, which is readily available from the set of transitions found in the previous step. It suffices to take the index of the most recent transition and set <math> p_0</math> to the lowest pressure within a roughly 15-second window around that index.

The second parameter is <math> p_1 </math>, the current pressure reading. The relative change in height <math> m_\Delta</math> is then obtained from the international pressure equation [7]:

<div style="text-align: center;"><math> m_\Delta = 44330 \left(1 - \left(\frac{p_1}{p_0}\right)^{\frac{1}{5.255}}\right) </math></div>

Plugging <math> p_0 </math> and <math> p_1 </math> into this formula leaves a scalar value that represents the height displacement between the building's entrance and the smartphone's current location [7].

In order to resolve to an absolute floor level, they use the index number of the clusters of <math> m_\Delta</math> values. As seen in the figure above, <math> 5.1 </math> falls in the third cluster, implying floor number 3.
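
The height and floor resolution steps can be sketched in Python as below. The height conversion uses the standard international barometric formula, <math> m_\Delta = 44330 \left(1 - (p_1 / p_0)^{1/5.255}\right) </math>, with both pressures in the same unit. The clustering is a deliberately simple gap-based grouping chosen for illustration; the <code>gap</code> value, the function names, and the sample heights are all assumptions, not the authors' clustering method or data.

```python
def height_change(p0, p1):
    """Relative height in metres from entrance pressure p0 to current pressure p1."""
    return 44330.0 * (1.0 - (p1 / p0) ** (1.0 / 5.255))


def cluster_heights(heights, gap=1.5):
    """Group sorted heights, starting a new cluster when the jump exceeds gap."""
    clusters = []
    for h in sorted(heights):
        if clusters and h - clusters[-1][-1] <= gap:
            clusters[-1].append(h)
        else:
            clusters.append([h])
    return clusters


def resolve_floor(m_delta, clusters):
    """Return the 1-based index of the cluster whose mean is closest to m_delta."""
    centers = [sum(c) / len(c) for c in clusters]
    nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - m_delta))
    return nearest + 1


# Hypothetical height measurements (metres) collected in one building.
observed = [0.0, 0.2, 2.6, 2.8, 5.0, 5.1, 5.3]
clusters = cluster_heights(observed)
floor = resolve_floor(5.1, clusters)
```

In this made-up example a measurement of 5.1 m lands in the third cluster, matching the reading of the figure described above.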

=Experiments and Results=

==Dataset==

In this paper, an iOS app called Sensory is developed which is used to collect data on an iPhone 6. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.

As soon as the user enters or exits a building, the indoor-outdoor data has to be manually entered. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning was being used, the actual floor level was recorded manually for the validation purposes only.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|450px]] </div>

All of these classifiers were trained and validated on a total of 5082 data points. The split was 80% training and 20% validation.
For the LSTM, the network was trained for a total of 24 epochs with a batch size of 128, using an Adam optimizer with a learning rate of 0.006.
Although the baselines performed considerably well, the objective here was to show that an LSTM could be used in the future to model the entire system.
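
For concreteness, an 80%/20% split of 5082 data points can be sketched as follows. Whether the authors shuffled before splitting is not stated here, so the shuffle, the seed, and the function name are assumptions made purely for illustration.

```python
import random


def split_train_validation(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of samples and cut it into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer placeholders stand in for the 5082 recorded data points.
train, validation = split_train_validation(range(5082))
```

Under these assumptions the split yields 4065 training and 1017 validation points.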

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|650px]] </div>

The above chart shows the success that their system is able to achieve in the floor level prediction.

The performance was measured in terms of how many floors were travelled rather than the absolute floor number, because different buildings might number their floors differently. They used different <math> m </math> values in 2 tests: one applied the same <math> m </math> value across all buildings, and the other applied building-specific <math> m </math> values. The results showed that using building-specific <math> m </math> values greatly increased the accuracy.

=Future Work=
The first part of the system used an LSTM for indoor/outdoor classification. Because this module is self-contained, it can be reused in many other location problems, and the authors appear likely to keep working on it as a separate problem. They would also like to model the whole problem within the LSTM in order to generate floor level predictions solely from sensor readings.

=Critique=

In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which had not been done before. Previous work relied heavily on pre-training and on prior information about the building or its users. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit of their system is that it does not need any additional infrastructure support in advance, making it a practical solution for deployment.

A weakness is that they claim they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%. Also, the article's ideas are sometimes out of order and are repeated in cycles.

It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.

They also contradict the claim made at the beginning of the paper that their system "..does not require the use of beacons, prior knowledge of the building infrastructure..." [4], since their clustering step in a way uses prior knowledge from previous visits.

The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not explored in the paper, but a person on a balcony (e.g. attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm is deployed; otherwise, the algorithm may provide misleading data to rescue crews.

Overall, this paper is not very novel, as the authors don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out-of-the-box solutions. There is no clear intuition for why the proposed method works well. This application could be solved using simpler methods like having an emergency push button on each floor. Moreover, the authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.

The proposed model could introduce privacy risks such as illegal surveillance of mobile phone users and private facilities.

=References=

[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.

[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.

[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018

[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).

[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.

[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India

Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data

2018-12-01T04:21:48Z

J385chen:

=Introduction=

In highly populated cities with many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.

The motivation for this problem is in the context of 911 calls: victims trapped in a tall building who seek immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult.

In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.

In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]
[[File:19floor.png|250px]]</div>

In this work, there are two major contributions. The first is that they trained an LSTM to classify whether a smartphone was inside or outside a building using GPS, RSSI, and magnetometer sensor readings. The model is compared with baseline models like feed-forward neural networks, logistic regression, SVM, HMM, and Random Forests. The second contribution is an algorithm which uses the output of the trained LSTM to predict the change in barometric pressure between the point where the smartphone first entered the building and its current location within the building. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.

The model does not rely on external sensors placed inside the building, prior knowledge of the building, or user movement behaviour. The only input it looks at is the GPS and the barometric signal from the phone. Finally, they also talk about the application of this algorithm in a variety of other real-world situations.

All the code and data related to this article are available [https://github.com/williamFalcon/Predicting-floor-level-for-911-Calls-with-Neural-Networks-and-Smartphone-Sensor-Data here].

=Related Work=

In general, previous work falls under two categories. The first category consists of classification methods based on the user's activity.
These methods leverage the user's activity, such as running, walking, or moving through an elevator, to predict the floor level from the offset in their movement [2].
The second category focuses on the use of a barometer, which measures atmospheric pressure and can therefore provide changes in altitude.

Avinash Parnandi and his coauthors used multiple classifiers in predicting the floor level [2]. The steps in their algorithmic process are:
<ol>
<li> Classifier to predict whether the user is indoors or outdoors</li>
<li> Classifier to identify the activity of the user, i.e. walking, standing still etc. </li>
<li> Classifier to measure the displacement</li>
</ol>

One of the downsides of this work is that achieving high accuracy requires the user's step size, so the method relies heavily on pre-training for specific users. In a real-world application, this would not be practical.

Song and his colleagues model the way or cause of ascent, that is, whether the ascent was a result of taking the elevator, stairs, or escalator [3]. Then, by using infrastructure support from the buildings as well as additional tuning, they are able to predict floor level.
This method also suffers from relying on data specific to the building.

Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.

=Method=

In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].

To collect data, the authors developed an iOS application (called Sensory) that runs on an iPhone 6s. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength, GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second, and each datum contained the sensor measurements mentioned earlier along with environment context such as building floors, environment activity, city name, country name, and magnetic strength.

The data collection procedure for the indoor-outdoor classifier was described as follows:
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost door) set indoors to 1. 6) As soon as we exit, set indoors to 0. 7) Stop recording. 8) Save data as CSV for analysis. This procedure can start either outside or inside a building without loss of generality.

The following procedure generates data used to predict a floor change from the entrance floor to the end floor:
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost door) set indoors to 1. 6) Finally, enter a building and ascend/descend to any story. 7) Ascend through any method desired, stairs, elevator, escalator, etc. 8) Once at the floor, stop recording. 9) Save data as CSV for analysis.

Their algorithm for predicting the floor level is a 3-part process:

<ol>
<li> Classifying whether smartphone is indoor or outdoor </li>
<li> Indoor/Outdoor Transition detector</li>
<li> Estimating vertical height and resolving to absolute floor level </li>
</ol>

==1) Classifying Indoor/Outdoor ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div>

Following [5], they use 6 features that were selected through forests-of-trees feature reduction. The features are the smartphone's barometric pressure (<math>P</math>), GPS vertical accuracy (<math>GV</math>), GPS horizontal accuracy (<math>GH</math>), GPS speed (<math>S</math>), device RSSI level (<math>rssi</math>), and magnetometer total reading (<math>M</math>).

The magnetometer total reading was calculated from the 3-dimensional reading <math>x, y, z </math>:

<div style="text-align: center;">Total Magnetic field strength <math>= M = \sqrt{x^{2} + y^{2} + z^{2}}</math></div>

They used a 3-layer LSTM where the inputs are <math> d </math> consecutive time steps. The output is <math> y = 1 </math> if the smartphone is indoors and <math> y = 0 </math> if the smartphone is outdoors.

In their design, they set <math> d = 3</math> by random search [6]. The intent was for the network to learn the relationship given a little bit of information from both the past and the future.

For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|750px]] </div>

This is a critical part of their system and they only focus on the predictions in the subspace of being indoors.

They trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math> and the LSTM's predicted indoor state <math> \hat{y} </math> for the same example.

The cost function is shown below:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|450px]] </div>

The final output of the LSTM is a time series <math> T = \{t_1, t_2, \ldots, t_i, \ldots, t_n\} </math> where <math> t_i = 0 </math> if the point is outside and <math> t_i = 1 </math> if it is inside.

==2) Transition Detector ==

Given the predictions from the previous step, the next part is to find when a transition into or out of a building has occurred.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div>
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions <math> T </math> and pick each subset <math>s_i </math> for which the Jaccard distance (defined below) is <math> \geq 0.4 </math>.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|250px]] </div>
Jaccard Distance:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|450px]]</div>

After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.

==3) Vertical height and floor level ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div>

Previous work by Song et al. [3] suggested the use of a reference barometer or beacons as a way to determine the entrances to a building.

However, the authors' approach eliminates this need. Their second key contribution is to use the LSTM indoor/outdoor (IO) predictions to help identify these transitions into the building. The LSTM provides a self-contained estimator of a building’s entrance without relying on external sensor information on a user’s body or beacons placed inside a building’s lobby [4].

In the final part of the system, the vertical offset is computed relative to the smartphone's last known indoor/outdoor transition, which is readily available from the set of transitions found in the previous step. It suffices to take the index of the most recent transition and set <math> p_0</math> to the lowest pressure within a roughly 15-second window around that index.

The second parameter is <math> p_1 </math>, the current pressure reading. The relative change in height <math> m_\Delta</math> is then obtained from the international pressure equation [7]:

<div style="text-align: center;"><math> m_\Delta = 44330 \left(1 - \left(\frac{p_1}{p_0}\right)^{\frac{1}{5.255}}\right) </math></div>

Plugging <math> p_0 </math> and <math> p_1 </math> into this formula leaves a scalar value that represents the height displacement between the building's entrance and the smartphone's current location [7].

In order to resolve to an absolute floor level, they use the index number of the clusters of <math> m_\Delta</math> values. As seen in the figure above, <math> 5.1 </math> falls in the third cluster, implying floor number 3.

=Experiments and Results=

==Dataset==

In this paper, an iOS app called Sensory is developed which is used to collect data on an iPhone 6. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.

As soon as the user enters or exits a building, the indoor-outdoor data has to be manually entered. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning was being used, the actual floor level was recorded manually for the validation purposes only.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|450px]] </div>

All of these classifiers were trained and validated on a total of 5082 data points. The split was 80% training and 20% validation.
For the LSTM, the network was trained for a total of 24 epochs with a batch size of 128, using an Adam optimizer with a learning rate of 0.006.
Although the baselines performed considerably well, the objective here was to show that an LSTM could be used in the future to model the entire system.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|650px]] </div>

The above chart shows the success that their system is able to achieve in the floor level prediction.

The performance was measured in terms of how many floors were travelled rather than the absolute floor number, because different buildings might number their floors differently. They used different <math> m </math> values in 2 tests: one applied the same <math> m </math> value across all buildings, and the other applied building-specific <math> m </math> values. The results showed that using building-specific <math> m </math> values greatly increased the accuracy.

=Future Work=
The first part of the system used an LSTM for indoor/outdoor classification. Because this module is self-contained, it can be reused in many other location problems, and the authors appear likely to keep working on it as a separate problem. They would also like to model the whole problem within the LSTM in order to generate floor level predictions solely from sensor readings.

=Critique=

In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which had not been done before. Previous work relied heavily on pre-training and on prior information about the building or its users. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit of their system is that it does not need any additional infrastructure support in advance, making it a practical solution for deployment.

A weakness is that they claim they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%. Also, the article's ideas are sometimes out of order and are repeated in cycles.

It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.

They also contradict the claim made at the beginning of the paper that their system "..does not require the use of beacons, prior knowledge of the building infrastructure..." [4], since their clustering step in a way uses prior knowledge from previous visits.

The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not explored in the paper, but a person on a balcony (e.g. attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm is deployed; otherwise, the algorithm may provide misleading data to rescue crews.

Overall, this paper is not very novel, as the authors don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out-of-the-box solutions. There is no clear intuition for why the proposed method works well. This application could be solved using simpler methods like having an emergency push button on each floor. Moreover, the authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.

The proposed model could introduce privacy risks such as illegal surveillance of mobile phone users and private facilities.

=References=

[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.

[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.

[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018

[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).

[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.

[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India

Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data

2018-12-01T04:20:16Z

J385chen:

=Introduction=

In highly populated cities with many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.

The motivation for this problem is in the context of 911 calls: victims trapped in a tall building who seek immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult.

In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.

In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]
[[File:19floor.png|250px]]</div>

In this work, there are two major contributions. The first is that they trained an LSTM to classify whether a smartphone was inside or outside a building using GPS, RSSI, and magnetometer sensor readings. The model is compared with baseline models like feed-forward neural networks, logistic regression, SVM, HMM, and Random Forests. The second contribution is an algorithm which uses the output of the trained LSTM to predict the change in barometric pressure between the point where the smartphone first entered the building and its current location within the building. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.

The model does not rely on external sensors placed inside the building, prior knowledge of the building, or user movement behaviour. The only input it looks at is the GPS and the barometric signal from the phone. Finally, they also talk about the application of this algorithm in a variety of other real-world situations.

All the code and data related to this article are available [https://github.com/williamFalcon/Predicting-floor-level-for-911-Calls-with-Neural-Networks-and-Smartphone-Sensor-Data here].

=Related Work=

In general, previous work falls under two categories. The first category consists of classification methods based on the user's activity.
These methods leverage the user's activity, such as running, walking, or moving through an elevator, to predict the floor level from the offset in their movement [2].
The second category focuses on the use of a barometer, which measures atmospheric pressure and can therefore provide changes in altitude.

Avinash Parnandi and his coauthors used multiple classifiers in predicting the floor level [2]. The steps in their algorithmic process are:
<ol>
<li> Classifier to predict whether the user is indoors or outdoors</li>
<li> Classifier to identify the activity of the user, i.e. walking, standing still etc. </li>
<li> Classifier to measure the displacement</li>
</ol>

One of the downsides of this work is that achieving high accuracy requires the user's step size, so the method relies heavily on pre-training for specific users. In a real-world application, this would not be practical.

Song and his colleagues model the way or cause of ascent, that is, whether the ascent was a result of taking the elevator, stairs, or escalator [3]. Then, by using infrastructure support from the buildings as well as additional tuning, they are able to predict floor level.
This method also suffers from relying on data specific to the building.

Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.

=Method=

In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].

To collect data, the authors developed an iOS application (called Sensory) that runs on an iPhone 6s. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength, GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second, and each datum contained the sensor measurements mentioned earlier along with environment context such as building floors, environment activity, city name, country name, and magnetic strength.

The data collection procedure for the indoor-outdoor classifier was described as follows:
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost door) set indoors to 1. 6) As soon as we exit, set indoors to 0. 7) Stop recording. 8) Save data as CSV for analysis. This procedure can start either outside or inside a building without loss of generality.

The following procedure generates data used to predict a floor change from the entrance floor to the end floor:
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost door) set indoors to 1. 6) Finally, enter a building and ascend/descend to any story. 7) Ascend through any method desired, stairs, elevator, escalator, etc. 8) Once at the floor, stop recording. 9) Save data as CSV for analysis.

Their algorithm for predicting floor level is a three-part process:

<ol>
<li> Classifying whether the smartphone is indoors or outdoors </li>
<li> Indoor/Outdoor Transition detector</li>
<li> Estimating vertical height and resolving to absolute floor level </li>
</ol>

==1) Classifying Indoor/Outdoor ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div>

The classifier uses 6 features, which were selected through forests-of-trees feature reduction [5]. The features are the smartphone's barometric pressure (<math>P</math>), GPS vertical accuracy (<math>GV</math>), GPS horizontal accuracy (<math>GH</math>), GPS speed (<math>S</math>), device RSSI level (<math>rssi</math>), and magnetometer total reading (<math>M</math>).

The magnetometer total reading was calculated from the 3-dimensional reading <math>x, y, z </math>:

<div style="text-align: center;">Total Magnetic field strength <math>= M = \sqrt{x^{2} + y^{2} + z^{2}}</math></div>

They used a 3-layer LSTM whose inputs are <math> d </math> consecutive time steps. The output is <math> y = 1 </math> if the smartphone is indoors and <math> y = 0 </math> if it is outdoors.

In their design they set <math> d = 3</math> by random search [6]. The idea is that the network should learn the relationship given a little bit of information from both the past and the future.

For the overall signal sequence <math> \{x_1, x_2, \ldots, x_j, \ldots, x_n\}</math>, the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, \ldots, x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math>, as noted above.
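As a minimal sketch of how such windows might be assembled from the recorded stream: the helper names <code>magnet_total</code> and <code>make_windows</code>, the toy readings, and the choice to label each window by its centre reading are illustrative assumptions, not details taken from the paper.

```python
import math

def magnet_total(x, y, z):
    """Total magnetic field strength M = sqrt(x^2 + y^2 + z^2)."""
    return math.sqrt(x * x + y * y + z * z)

def make_windows(readings, labels, d=3):
    """Slice a sequence into overlapping windows of d consecutive readings.

    Each window is labelled with the indoor flag of its centre reading, so
    the classifier sees a little of the past and a little of the future.
    (Assumption: the summary does not say which reading carries the label.)
    """
    half = d // 2
    windows, targets = [], []
    for start in range(len(readings) - d + 1):
        windows.append(readings[start:start + d])
        targets.append(labels[start + half])
    return windows, targets

# Toy feature vectors: (P, GV, GH, S, rssi, M); the values are made up.
readings = [
    (1013.2, 4.0, 5.0, 1.2, -60, magnet_total(3.0, 4.0, 12.0)),
    (1013.1, 6.0, 8.0, 1.1, -65, magnet_total(3.0, 4.0, 12.0)),
    (1013.1, 12.0, 30.0, 0.9, -80, magnet_total(6.0, 8.0, 0.0)),
    (1013.0, 15.0, 40.0, 0.8, -85, magnet_total(6.0, 8.0, 0.0)),
]
labels = [0, 0, 1, 1]  # indoors flag per reading
windows, targets = make_windows(readings, labels, d=3)
```

Each element of <code>windows</code> would then be one input sequence for the LSTM, with the matching entry of <code>targets</code> as its label.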

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|750px]] </div>

This is a critical part of their system and they only focus on the predictions in the subspace of being indoors.

They trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> and the predicted indoor state <math> \hat{y} </math> of example <math> i </math>.

The cost function is shown below:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|450px]] </div>
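For readers who cannot see the image, the standard mean binary cross entropy can be sketched as follows. This is the textbook definition rather than a transcription of the paper's expression, and the clamping constant <code>eps</code> is an assumption added for numerical safety.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy between labels in {0, 1} and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident, correct predictions give a lower loss than uninformative ones.
confident = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
uncertain = binary_cross_entropy([1, 0, 1], [0.5, 0.5, 0.5])
```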

The final output of the LSTM is a time series <math> T = \{t_1, t_2, \ldots, t_i, \ldots, t_n\} </math>, where <math> t_i = 0 </math> if point <math> i </math> is outside and <math> t_i = 1 </math> if it is inside.

==2) Transition Detector ==

Given the predictions from the previous step, the next part is to find when a transition into or out of a building has occurred.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div>
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions <math>T</math> and pick a subset <math>s_i </math> such that the Jaccard distance (defined below) is <math> \geq 0.4 </math>.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|250px]] </div>
Jaccard Distance:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|450px]]</div>

After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.
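The detector can be sketched roughly as below. Several details are assumptions made for illustration, because the filters and the Jaccard formula appear only in the figures: the filter values <code>V1</code> and <code>V2</code>, the treatment of each window as a set of (position, value) pairs when computing the Jaccard score, and the rule that merges a run of adjacent hits into a single transition index.

```python
V1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # assumed filter: indoor -> outdoor
V2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # assumed filter: outdoor -> indoor

def jaccard(a, b):
    """Jaccard score of two equal-length binary sequences.

    Elements are treated as (position, value) pairs, so the intersection
    counts the positions at which both sequences agree.
    """
    matches = sum(1 for u, v in zip(a, b) if u == v)
    return matches / (len(a) + len(b) - matches)

def find_transitions(T, threshold=0.4):
    """Return one index per detected indoor/outdoor transition in T."""
    width = len(V1)
    hits = []  # (centre index, score) for every window above the threshold
    for start in range(len(T) - width + 1):
        window = T[start:start + width]
        score = max(jaccard(window, V1), jaccard(window, V2))
        if score >= threshold:
            hits.append((start + width // 2, score))
    # Merge runs of adjacent hits, keeping the best-scoring index of each run.
    transitions, run = [], []
    for hit in hits:
        if run and hit[0] != run[-1][0] + 1:
            transitions.append(max(run, key=lambda h: h[1])[0])
            run = []
        run.append(hit)
    if run:
        transitions.append(max(run, key=lambda h: h[1])[0])
    return transitions

# Outdoors for 12 samples, indoors for 15, then outdoors again.
T = [0] * 12 + [1] * 15 + [0] * 12
b = find_transitions(T)
```

With these assumed filters, a window that is entirely indoors or entirely outdoors scores 5/15, below the 0.4 threshold, so only windows that straddle a change are kept.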

==3) Vertical height and floor level ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div>

Previous work (Song et al., 2014) suggested the use of a reference barometer or beacons as a way to determine the entrances to a building.

The authors' second key contribution is to use the LSTM IO predictions to help the system identify these indoor transitions into the building. The LSTM provides a self-contained estimator of a building’s entrance without relying on external sensor information on a user’s body or beacons placed inside a building’s lobby.

In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.

The second parameter is <math> p_1 </math>, which is the current pressure reading. The relative change in height <math> m_\Delta</math> is generated by combining <math> p_0 </math> and <math> p_1 </math> through a barometric formula that converts pressure to altitude [7].

The result is a scalar value which represents the height displacement between the entrance of the building and the smartphone's current location.
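The summary does not reproduce the formula itself. As a hedged sketch, the international barometric formula (the one used, for example, by Android's <code>SensorManager.getAltitude</code>) gives a height relative to a reference pressure; whether the paper uses exactly this expression is an assumption, as are the function names and the sample pressures.

```python
def altitude_from_pressure(p, p_ref=1013.25):
    """International barometric formula: altitude in metres at pressure p (hPa)
    relative to the level where the pressure is p_ref (hPa)."""
    return 44330.0 * (1.0 - (p / p_ref) ** (1.0 / 5.255))

def height_change(p0, p1):
    """Relative height m_delta between the entrance pressure p0 and the
    current pressure p1 (both in hPa); positive when the phone has gone up."""
    return altitude_from_pressure(p1, p_ref=p0)

# A drop of 0.6 hPa near sea level corresponds to roughly 5 metres of ascent.
m_delta = height_change(p0=1013.25, p1=1012.65)
```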

In order to resolve to an absolute floor level, they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.
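One simple way to turn a height change into a floor number, assuming the cluster centres of earlier <math> m_\Delta</math> values are already known, is to take the rank of the nearest centre. The function name, the nearest-centre rule, and the example centres below are hypothetical and only illustrate the idea of using the cluster index.

```python
def resolve_floor(m_delta, cluster_centres):
    """Map a height change (metres) to a floor number via the nearest cluster.

    cluster_centres holds one representative height per floor; the 1-based
    rank of the nearest centre, in ascending order, is taken as the floor.
    """
    centres = sorted(cluster_centres)
    nearest = min(range(len(centres)), key=lambda i: abs(centres[i] - m_delta))
    return nearest + 1

centres = [0.0, 2.7, 5.1, 8.0]  # hypothetical cluster centres in metres
floor = resolve_floor(5.3, centres)
```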

=Experiments and Results=

==Dataset==

In this paper, an iOS app called Sensory is developed which is used to collect data on an iPhone 6. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.

As soon as the user enters or exits a building, the indoor-outdoor data has to be manually entered. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning was being used, the actual floor level was recorded manually for the validation purposes only.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|450px]] </div>

All of these classifiers were trained and validated on a total of 5082 data points. The split was 80% training and 20% validation.
The LSTM was trained for a total of 24 epochs with a batch size of 128, using an Adam optimizer with a learning rate of 0.006.
Although the baselines performed considerably well, the objective here was to show that an LSTM could be used in the future to model the entire system.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|650px]] </div>

The above chart shows the success that their system is able to achieve in the floor level prediction.

The performance was measured in terms of how many floors were travelled rather than the absolute floor number, because different buildings might number their floors differently. They used different m values in 2 tests: one applied the same m value across all buildings, and the other applied specific m values to different buildings. The results showed that using building-specific m values greatly increased the accuracy.

=Future Work=
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate the floor level predictions solely from sensor reading data.

=Critique=

In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which had not been done before. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit of their system is that it doesn't need any additional infrastructure support in advance, making it a practical solution for deployment.

A weakness is that they claim they can get 100% accuracy, but this is only if they know the floor-to-ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building, their accuracy drops by 35 percentage points, to 65%. Also, the article's ideas are sometimes out of order and are repeated in cycles.

It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.

They also go against the claim stated at the beginning of the paper that their system "..does not require the use of beacons, prior knowledge of the building infrastructure...", as in their clustering step they are in a way using prior knowledge from previous visits [4].

The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned or explored in the paper, but a person on a balcony (e.g., one attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm is implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.

Overall this paper is not too novel, as the authors don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out-of-the-box solutions. There is no clear intuition for why the proposed method works well. This application could be solved using simpler methods like having an emergency push button on each floor. Moreover, the authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.

The proposed model could introduce privacy risks such as illegal surveillance of mobile phone users and private facilities.

=References=

[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.

[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.

[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018

[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).

[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.

[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India

Countering Adversarial Images Using Input Transformations

2018-12-01T04:14:56Z

J385chen:

The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]

==Motivation ==
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al.) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.

[[File:Panda.png|center]]

==Introduction==
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier.
Generally, defenses against adversarial examples fall into two main categories:

# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm.
# Model-Agnostic – They try to remove adversarial perturbations from the input.

Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate Kerckhoffs's principle, under which a defense should remain effective even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigated the following image transformations as a means for protecting against adversarial images:

# Image Cropping and Re-scaling (Graese et al, 2016).
# Bit Depth Reduction (Xu et al, 2017)
# JPEG Compression (Dziugaite et al, 2016)
# Total Variance Minimization (Rudin et al, 1992)
# Image Quilting (Efros & Freeman, 2001).

These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack.

The authors in this paper try to focus on increasing the effectiveness of model-agnostic defense strategies through approaches that:
# remove the adversarial perturbations from input images,
# maintain sufficient information in input images to correctly classify them,
# and are still effective in situations where the adversary has information about the defense strategy being used.

From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random which makes it difficult for an adversary to get around them.

==Previous Work==
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramer et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPEG compression on adversarial images. Chen et al. [7] introduced an advanced denoising algorithm with GAN-based noise modeling in order to improve blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the noisy input images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples, where the unknown noise-generating algorithm can be leveraged.

==Terminology==

'''Gray Box Attack''' : Model Architecture and parameters are public.

'''Black Box Attack''': Adversary does not have access to the model.

An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]

'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.

The following is an example of a non-targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:
[[File:non-targeted O.JPG| 600px|center]]

'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified as a ''target'' class by the network.

The following is an example of a targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:
[[File:Targeted O.JPG| 600px|center]]

'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).

== Problem Definition ==
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math> such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. Ideally, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but in practice the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is usually used.

From a set of N clean images <math>\{x_{1}, \ldots, x_{N}\}</math>, an adversarial attack aims to generate images <math>\{x'_{1}, \ldots, x'_{N}\}</math> such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.

The success rate of an attack is given as:

<center><math>
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h(x'_n)],
</math></center>

which is the proportion of predictions that were altered by an attack.

The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math>

A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, is low.
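Both quantities are easy to compute once the predictions and images are available. The sketch below works on flattened images stored as plain Python lists; the function names and the toy data are illustrative assumptions.

```python
import math

def success_rate(clean_preds, adv_preds):
    """Fraction of examples whose predicted label was changed by the attack."""
    changed = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return changed / len(clean_preds)

def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))

def normalized_l2_dissimilarity(clean, adversarial):
    """Mean of ||x - x'||_2 / ||x||_2 over pairs of flattened images."""
    ratios = []
    for x, x_adv in zip(clean, adversarial):
        diff = [a - b for a, b in zip(x, x_adv)]
        ratios.append(l2_norm(diff) / l2_norm(x))
    return sum(ratios) / len(ratios)

# Toy data: two "images" with two pixels each, and four predicted labels.
clean = [[0.6, 0.8], [1.0, 0.0]]
adversarial = [[0.6, 0.9], [0.8, 0.0]]
rate = success_rate([3, 7, 7, 1], [3, 2, 7, 5])
dissim = normalized_l2_dissimilarity(clean, adversarial)
```

Here two of the four predictions change, and the two per-image ratios are 0.1 and 0.2, so the average dissimilarity is 0.15.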

In most practical settings, an adversary does not have direct access to the model <math>h(·)</math> and has to do a black-box attack.

However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), thus allowing efficient black-box attack.

As a result, the authors investigate both the black-box setting and a more difficult gray-box setting, in which the adversary has access to the model architecture and the model parameters but is unaware of the defense strategy that is being used.

A defense is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. In this study, the authors focus on image transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.

==Adversarial Attacks==

Although the exact effect that adversarial examples have on the network is unknown, the Deep Learning book by Goodfellow et al. states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution, and thus have thousands of pixels (millions for HD images). An epsilon-ball perturbation, when the dimensionality is in the thousands or millions, greatly affects the cost function (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and Iterative FGSM are very straightforward, they greatly influence the network under a white-box attack.

The paper studies the following four attacks:

1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Given a source input <math>x</math> and true label <math>y</math>, let <math>l(\cdot,\cdot)</math> be the differentiable loss function used to train the classifier <math>h(\cdot)</math>. Then the corresponding adversarial example is given by:

<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math>

for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.

2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:

<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math>

where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.

Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.

3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by the binary classifier h(.) for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:

[[File:DeepFool.PNG|400px |]]

4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to the following unconstrained optimization problem:

[[File:Carlini.PNG|500px |]]

As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.

All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping.
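The FGSM and I-FGSM updates described above, together with value clipping, can be sketched on a toy model. The logistic-regression "classifier", its weights <code>W</code> and <code>B</code>, and the iteration cap <code>max_iters</code> are hypothetical stand-ins chosen so that the input gradient has a closed form; the paper attacks a ResNet-50 instead.

```python
import math

W = [2.0, -3.0, 1.0]  # weights of a toy logistic-regression classifier
B = 0.2               # bias of the toy classifier

def predict_proba(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(x, y):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = predict_proba(x)
    return [(p - y) * w for w in W]

def sign(v):
    return (v > 0) - (v < 0)

def clip(x, lo=0.0, hi=1.0):
    """Value clipping keeps the adversarial example inside [0, 1]."""
    return [min(max(xi, lo), hi) for xi in x]

def fgsm(x, y, eps):
    """One step of size eps along the sign of the loss gradient."""
    g = loss_gradient(x, y)
    return clip([xi + eps * sign(gi) for xi, gi in zip(x, g)])

def i_fgsm(x, y, eps, max_iters=10):
    """Repeat the FGSM step until the predicted label flips (or max_iters)."""
    original = predict_proba(x) >= 0.5
    x_adv = list(x)
    for _ in range(max_iters):
        x_adv = fgsm(x_adv, y, eps)
        if (predict_proba(x_adv) >= 0.5) != original:
            break
    return x_adv

x, y = [0.5, 0.2, 0.4], 1
x_fgsm = fgsm(x, y, eps=0.05)
x_iter = i_fgsm(x, y, eps=0.05)
```

In this toy case a single FGSM step of size 0.05 does not change the prediction, while the iterative variant keeps stepping until the label flips, at the cost of a larger Chebyshev distance from the original input.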

The figure below shows adversarial images and the corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.

[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]

==Defenses==
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.
Five image transformations that alter the structure of these perturbations have been studied:
# Image Cropping and Re-scaling,
# Bit Depth Reduction,
# JPEG Compression,
# Total Variance Minimization,
# Image Quilting.

'''Image cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled during training time as part of data augmentation. At test time, the predictions over random image crops are averaged.

'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.

'''JPEG Compression and Decompression (Dziugaite et al., 2016)''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.
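A minimal sketch of bit depth reduction, assuming pixel values scaled to [0, 1] and uniform quantization to <math>2^3 = 8</math> levels; the function name and the toy pixel values are illustrative.

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]

clean = [0.55, 0.20, 0.90]
adversarial = [0.56, 0.19, 0.91]  # small perturbation of the clean pixels
q_clean = reduce_bit_depth(clean)
q_adv = reduce_bit_depth(adversarial)
```

In this example the perturbation is smaller than the quantization step, so the clean and adversarial pixels collapse to the same quantized values.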

'''Total Variance Minimization (Rudin et al.) [9]''':
This combines pixel dropout with total variance minimization. This approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, a random set of pixels is first selected by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is maintained when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image z that is similar to the (perturbed) input image x for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:

[[File:TV!.png|300px|]] ,

where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :

[[File:TV2.png|500px|]]

The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for z straightforward. In the paper, p = 2, and a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) is employed to perform total variance minimization efficiently.
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed, the negative effect of these changes is expected to be limited.
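To make the ingredients concrete, the sketch below evaluates a total variation term, a Bernoulli pixel mask, and a combined objective for a single-channel image stored as a list of rows. It only evaluates the objective and does not implement the split Bregman solver. The exact form of the objective (data term on kept pixels plus <math>\lambda_{TV}</math> times the total variation) is an assumption, since the formulas appear only as images, and the function names are illustrative.

```python
import math
import random

def total_variation(z, p=2):
    """Sum of L_p norms of the differences between adjacent rows and
    between adjacent columns of a 2-D image given as a list of rows."""
    height, width = len(z), len(z[0])
    tv = 0.0
    for i in range(1, height):
        tv += sum(abs(z[i][j] - z[i - 1][j]) ** p for j in range(width)) ** (1.0 / p)
    for j in range(1, width):
        tv += sum(abs(z[i][j] - z[i][j - 1]) ** p for i in range(height)) ** (1.0 / p)
    return tv

def dropout_mask(height, width, keep_prob=0.5, seed=0):
    """Bernoulli mask X: X[i][j] = 1 marks a pixel that is kept."""
    rng = random.Random(seed)
    return [[1 if rng.random() < keep_prob else 0 for _ in range(width)]
            for _ in range(height)]

def tvm_objective(z, x, mask, lam=0.03, p=2):
    """Assumed objective: L2 data term on kept pixels plus lam * TV_p(z)."""
    data = math.sqrt(sum(mask[i][j] * (z[i][j] - x[i][j]) ** 2
                         for i in range(len(x)) for j in range(len(x[0]))))
    return data + lam * total_variation(z, p)

flat = [[0.5] * 4 for _ in range(4)]
# Checkerboard of +/- 0.05 around 0.5, standing in for a small perturbation.
noisy = [[0.5 + (0.05 if (i + j) % 2 else -0.05) for j in range(4)]
         for i in range(4)]
mask = dropout_mask(4, 4)
```

A constant image has zero total variation, whereas the checkerboard perturbation is penalized, which is why a minimizer of such an objective tends toward smooth reconstructions.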

[[File:tvx.png]]

The figure above represents an illustration of total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2 - dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.

'''Image Quilting (Efros & Freeman, 2001) [8]'''
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary, since the database of real patches is unlikely to contain the structures that appear in adversarial images.
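The patch-selection step can be sketched as follows for small grayscale images stored as lists of rows. This simplified version uses non-overlapping patches and omits the minimum graph cut, so it illustrates only the nearest-neighbour replacement; the function names, patch size, and toy images are assumptions.

```python
import random

def extract_patches(image, size):
    """Cut an image (list of rows) into non-overlapping size x size patches."""
    patches = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            patches.append([row[c:c + size] for row in image[r:r + size]])
    return patches

def patch_distance(a, b):
    """Squared Euclidean distance between two patches in pixel space."""
    return sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb))

def quilt(image, database, size=2, k=2, seed=0):
    """Rebuild the image from clean patches: for every grid cell, pick one
    of the k nearest database patches uniformly at random."""
    rng = random.Random(seed)
    out = [list(row) for row in image]
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            target = [row[c:c + size] for row in image[r:r + size]]
            nearest = sorted(database, key=lambda p: patch_distance(p, target))[:k]
            chosen = rng.choice(nearest)
            for dr in range(size):
                out[r + dr][c:c + size] = chosen[dr]
    return out

clean_images = [
    [[0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0],
     [0.5, 0.5, 0.2, 0.2],
     [0.5, 0.5, 0.2, 0.2]],
]
database = [p for img in clean_images for p in extract_patches(img, 2)]
adversarial = [[0.03, 0.0, 0.97, 1.0],
               [0.0, 0.02, 1.0, 0.98],
               [0.5, 0.52, 0.2, 0.2],
               [0.49, 0.5, 0.22, 0.2]]
quilted = quilt(adversarial, database, size=2, k=1)
```

With <code>k=1</code> the choice is deterministic and the small perturbations vanish because every cell is replaced by its closest clean patch; a larger <code>k</code> adds the randomness described above.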

If we take a look at the effect of image quilting in the above figure, although interpretation of these images is more complicated due to the quantization errors that image quilting introduces, we can still observe that the absolute differences between quilted original and the quilted adversarial image appear to be smaller in non-homogeneous regions of the image. Based on this observation the authors suggest that TV minimization and image quilting lead to inherently different defenses.

=Experiments=

Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, the defenses are applied to adversarial input images before they reach the convolutional network; the adversary is able to read the model architecture and parameters but does not know the defense strategy. In the black-box setting, the convolutional network is replaced by a network trained on transformed images. The final experiment compares the authors' defenses with prior work.

'''Set up:'''
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity for each attack was controlled as follows:

- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.

- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.

- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.

- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation by a scaling factor to increase the normalized L2-dissimilarity.

The hyperparameters of the defenses were fixed in all the experiments. Specifically, the pixel dropout probability was set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.

The figure below shows how the setup differs between the experiments that follow. The network is trained either on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3.
[[File:models3.png |center]]

==GrayBox - Image Transformation at Test Time==
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent, with an ensemble size of 30; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases, although it does reduce accuracy on non-adversarial examples.

[[File:sFig4.png|center|600px |]]

==BlackBox - Image Transformation at Training and Test Time==
A ResNet-50 model was trained on transformed ImageNet training images. Before feeding the images to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to the images. The classification accuracy on the same adversarial images as in the previous case is shown in the figure below. (The adversary cannot use this trained model to generate new images, hence this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.

[[File:sFig5.png|center|600px |]]

==Blackbox - Ensembling==
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) were studied along with an ensemble of defenses, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table indicate that Inception-v4 performs best. This could be because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy were found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.

[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]

==GrayBox - Image Transformation at Training and Test Time ==
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the earlier experiment (BlackBox - Image Transformation at Training and Test Time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variation minimization, and image quilting are more robust against adversarial images in this setting.
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.

[[File:sFig6.png|center| 600px |]]

==Comparison With Ensemble Adversarial Training==
The results of the experiments are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model released by Tramèr et al. [2] is an Inception-Resnet-v2 trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using the image cropping, total variance minimization, and image quilting defenses. Two differences in assumptions should be noted: the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble adversarial training and the pre-processing techniques studied in this paper are shown in Table 2. They show that ensemble adversarial training works better on FGSM attacks (which it uses at training time) but is outperformed by each of the transformation-based defenses on all other attacks.

[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]

=Discussion/Conclusions=
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variation Minimization and Image Quilting, possess both properties.

Image quilting involves a discrete variable that selects a patch from the database, which is a non-differentiable operation. Additionally, total variation minimization randomly selects the pixels it uses to measure the reconstruction error when creating the denoised image, and image quilting selects one of the K nearest neighbors uniformly at random. This inherent randomness makes it difficult to attack the model.

As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.

=Critiques=
1. The terms Black Box, White Box, and Gray Box attack are not clearly defined.

2. White Box attacks, where the adversary has full access to the model as well as the pre-processing techniques, could have been considered.

3. Though the authors did considerable work in showing the effect of four attacks on the ImageNet dataset, much stronger attacks (Madry et al.) [7] could have been evaluated.

4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.

5. In the new draft of the paper ([https://openreview.net/forum?id=SyJ7ClWCb]), the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".

This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.

=References=

1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.

2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.

3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016.

4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.

5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017.

6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.

7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.

8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.

9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.

10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.

11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.

12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017

13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? Adversarial examples against the iCub humanoid. CoRR, abs/1708.06939, 2017.

14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.

15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.

16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.

17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.

Countering Adversarial Images Using Input Transformations

2018-12-01T04:08:54Z

J385chen:

The code for this paper is available [https://github.com/facebookresearch/adversarial_image_defenses here].

==Motivation ==
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations of the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al.) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.

[[File:Panda.png|center]]

==Introduction==
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier.
Generally, defenses against adversarial examples fall into two main categories:

# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm.
# Model-Agnostic – They try to remove adversarial perturbations from the input.

Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they do not satisfy Kerckhoffs' principle, which requires a system to remain secure even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigate the following image transformations as a means of protecting against adversarial images:

# Image Cropping and Re-scaling (Graese et al., 2016)
# Bit Depth Reduction (Xu et al., 2017)
# JPEG Compression (Dziugaite et al., 2016)
# Total Variance Minimization (Rudin et al., 1992)
# Image Quilting (Efros & Freeman, 2001)

These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack.

The authors in this paper try to focus on increasing the effectiveness of model-agnostic defense strategies through approaches that:
# remove the adversarial perturbations from input images,
# maintain sufficient information in input images to correctly classify them,
# and are still effective in situations where the adversary has information about the defense strategy being used.

From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random which makes it difficult for an adversary to get around them.

==Previous Work==
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramèr et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training data not only with adversarial images constructed from the model being trained but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images. Chen et al. [7] introduced an advanced denoising algorithm with GAN-based noise modeling to improve blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the noisy input images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples, where the unknown noise-generating algorithm can be leveraged.

==Terminology==

'''Gray Box Attack''': The adversary has access to the model architecture and parameters but is unaware of the defense strategy being used.

'''Black Box Attack''': Adversary does not have access to the model.

An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]

'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.

The following is an example of a non-targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:
[[File:non-targeted O.JPG| 600px|center]]

'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified as a ''target'' class by the network.

The following is an example of a targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:
[[File:Targeted O.JPG| 600px|center]]

'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).

== Problem Definition ==
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math> such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. Ideally, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but in practice the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is usually used.

From a set of N clean images <math>[x_{1}, \ldots, x_{N}]</math>, an adversarial attack aims to generate images <math>[x'_{1}, \ldots, x'_{N}]</math> such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.

The success rate of an attack is given as:

<center><math>
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h(x'_n)],
</math></center>

which is the proportion of predictions that were altered by an attack.

The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math>

A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, is low.
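
The two quantities above can be computed directly from model predictions and image pairs. The following is a minimal sketch in plain Python, assuming images are given as flat lists of pixel values; the function names and the toy data are illustrative and do not come from the paper.

```python
import math

def success_rate(clean_preds, adv_preds):
    """Fraction of examples whose predicted label changed under attack."""
    changed = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return changed / len(clean_preds)

def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))

def normalized_l2_dissimilarity(clean, adv):
    """Mean over images of ||x - x'||_2 / ||x||_2 (images as flat lists)."""
    ratios = []
    for x, x_adv in zip(clean, adv):
        diff = [a - b for a, b in zip(x, x_adv)]
        ratios.append(l2_norm(diff) / l2_norm(x))
    return sum(ratios) / len(ratios)

# Toy data (assumed for illustration): two 2-pixel "images".
clean_images = [[0.6, 0.8], [1.0, 0.0]]
adv_images = [[0.6, 0.5], [1.0, 0.0]]

print(success_rate([3, 7], [5, 7]))  # one of two predictions changed: 0.5
print(normalized_l2_dissimilarity(clean_images, adv_images))  # close to 0.15
```

An attack is stronger the higher the first number is for a given value of the second.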

In most practical settings, an adversary does not have direct access to the model <math>h(·)</math> and has to do a black-box attack.

However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), thus allowing efficient black-box attack.

As a result, the authors investigate both the black-box setting and a more difficult gray-box setting, in which the adversary has access to the model architecture and the model parameters but is unaware of the defense strategy that is being used.

A defense is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. In this study, the authors focus on image transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.

==Adversarial Attacks==

Although the exact effect that adversarial examples have on the network is unknown, Goodfellow et al.'s Deep Learning book states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution and thus have thousands of pixels (millions for HD images). An epsilon-ball perturbation, when the dimensionality is in the thousands or millions, greatly affects the cost function (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and Iterative FGSM are very straightforward, they greatly influence the network under a white-box attack.

The following four attacks are studied in the paper:

1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let <math>x</math> be a source input with true label <math>y</math>, and let <math>l(.,.)</math> be the differentiable loss function used to train the classifier <math>h(.)</math>. Then the corresponding adversarial example is given by:

<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math>

for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.
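
As a rough illustration of the FGSM update, the sketch below applies it to a toy logistic-regression classifier, for which the gradient of the cross-entropy loss with respect to the input has a closed form. The model, its weights, and the function names are assumptions made for this example; the paper attacks a ResNet-50, where the gradient would come from backpropagation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss_gradient(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. the input x for a logistic model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(t):
    return (t > 0) - (t < 0)

def fgsm(x, y, w, b, eps):
    """One FGSM step, clipped so that x' stays in the image space [0, 1]."""
    grad = loss_gradient(x, y, w, b)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

w, b = [2.0, -3.0, 1.0], 0.0   # hypothetical toy model parameters
x, y = [0.5, 0.2, 0.9], 1      # toy "image" with true label 1
x_adv = fgsm(x, y, w, b, eps=0.1)
print(x_adv)
```

Every pixel moves by exactly <math>\epsilon</math> in the direction that increases the loss, which is why the perturbation is bounded in Chebyshev distance.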

2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update for M iterations. It is given as:

<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math>

where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.

Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.
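
The iterative update can be sketched in the same toy setting. The code below is a minimal illustration, assuming a logistic-regression classifier and stopping at the first iteration where the predicted label flips; the stopping rule, model, and names are assumptions for this example rather than the authors' implementation.

```python
import math

def predict_proba(x, w, b):
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def predict(x, w, b):
    return 1 if predict_proba(x, w, b) >= 0.5 else 0

def sign(t):
    return (t > 0) - (t < 0)

def i_fgsm(x, y, w, b, eps, max_iters):
    """Repeat the FGSM update until the label flips or max_iters is reached."""
    x_m = list(x)
    for m in range(1, max_iters + 1):
        p = predict_proba(x_m, w, b)
        grad = [(p - y) * wi for wi in w]  # input gradient of the cross-entropy loss
        x_m = [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x_m, grad)]
        if predict(x_m, w, b) != y:
            return x_m, m
    return x_m, max_iters

w, b = [2.0, -3.0, 1.0], 0.0   # hypothetical toy model parameters
x, y = [0.5, 0.2, 0.9], 1      # toy "image" with true label 1
x_adv, steps = i_fgsm(x, y, w, b, eps=0.05, max_iters=20)
print(steps, predict(x_adv, w, b))
```

With a smaller step size than single-step FGSM, several updates are needed before the prediction changes, and the total change per pixel is at most the number of steps times <math>\epsilon</math>.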

3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by the binary classifier h(.) for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:

[[File:DeepFool.PNG|400px |]]

4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term that encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to an unconstrained optimization problem. It is given as:

[[File:Carlini.PNG|500px |]]

As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.

All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping.

The figure below shows adversarial images and the corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.

[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]

==Defenses==
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.
Five image transformations that alter the structure of these perturbations have been studied:
# Image Cropping and Re-scaling,
# Bit Depth Reduction,
# JPEG Compression,
# Total Variance Minimization,
# Image Quilting.

'''Image cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled at training time as part of data augmentation. At test time, the predictions over random image crops are averaged.
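
The test-time averaging over random crops can be sketched as follows. This is only an illustration: the image is a small 2-D list, the re-scaling step is omitted, and <code>toy_classifier</code> is a hypothetical stand-in for the network that returns two class probabilities.

```python
import random

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position of a 2-D image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def toy_classifier(image):
    """Hypothetical stand-in for the network: two-class scores from mean brightness."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [1.0 - mean, mean]

def crop_ensemble_predict(image, classifier, n_crops, crop_h, crop_w, seed=0):
    """Average the classifier's class probabilities over n_crops random crops."""
    rng = random.Random(seed)
    totals = None
    for _ in range(n_crops):
        probs = classifier(random_crop(image, crop_h, crop_w, rng))
        totals = probs if totals is None else [t + p for t, p in zip(totals, probs)]
    return [t / n_crops for t in totals]

# 8x8 toy image with pixel values in [0, 1].
image = [[(i + j) / 14 for j in range(8)] for i in range(8)]
avg = crop_ensemble_predict(image, toy_classifier, n_crops=30, crop_h=6, crop_w=6)
label = max(range(len(avg)), key=avg.__getitem__)
print(avg, label)
```

Because each crop shifts the perturbation relative to the network's receptive fields, the averaged prediction depends less on any single spatial alignment of the attack.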

'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiments.
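
A minimal sketch of this quantization, assuming pixel values in [0, 1] that are rounded to the nearest of <math>2^3 = 8</math> evenly spaced levels (the exact rounding scheme used in the paper is not stated here, so this is one plausible choice):

```python
def reduce_bit_depth(image, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [[round(p * levels) / levels for p in row] for row in image]

clean = [[0.40, 0.90]]
perturbed = [[0.42, 0.88]]  # small perturbation of the clean pixels

# Both images map to the same quantized values, so the perturbation is removed.
print(reduce_bit_depth(clean) == reduce_bit_depth(perturbed))  # True
```

Perturbations that are smaller than half a quantization step and do not cross a rounding boundary disappear entirely, while larger ones can survive.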

'''JPEG Compression and Decompression (Dziugaite et al., 2016) [6]''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.

'''Total Variance Minimization (Rudin et al.) [9]''':
This defense combines pixel dropout with total variation minimization. The approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, a random set of pixels is first selected by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is kept when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image z that is similar to the (perturbed) input image x on the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:

[[File:TV!.png|300px|]] ,

where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :

[[File:TV2.png|500px|]]

The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for z straightforward. In the paper, p = 2, and a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) is employed to perform total variance minimization efficiently.
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed, the negative effect of these changes is expected to be limited.

[[File:tvx.png]]

The figure above represents an illustration of total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2 - dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.
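
To give a feel for why penalizing total variation suppresses adversarial noise, the sketch below computes a simple total variation and samples a Bernoulli pixel mask. It is an illustration under stated assumptions: it uses the sum of absolute differences between neighboring pixels of a single-channel image, which is a common simplified definition and not necessarily the exact <math>TV_p</math> from the paper, and it does not implement the split Bregman solver.

```python
import random

def total_variation(z):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    rows, cols = len(z), len(z[0])
    vertical = sum(abs(z[i][j] - z[i - 1][j]) for i in range(1, rows) for j in range(cols))
    horizontal = sum(abs(z[i][j] - z[i][j - 1]) for i in range(rows) for j in range(1, cols))
    return vertical + horizontal

def bernoulli_mask(rows, cols, p, rng):
    """Sample X(i, j) ~ Bernoulli(p); a pixel is kept where the mask is 1."""
    return [[1 if rng.random() < p else 0 for _ in range(cols)] for _ in range(rows)]

# Flat gray image and a copy with a small checkerboard perturbation (toy data).
clean = [[0.5] * 6 for _ in range(6)]
noisy = [[0.5 + (0.03 if (i + j) % 2 == 0 else -0.03) for j in range(6)] for i in range(6)]
mask = bernoulli_mask(6, 6, p=0.5, rng=random.Random(0))

print(total_variation(clean))   # 0.0 for a constant image
print(total_variation(noisy))   # about 3.6: 60 neighbor pairs, each differing by 0.06
```

A perturbation that is barely visible per pixel still adds a large amount of total variation, so a reconstruction that is penalized for total variation tends to drop it.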

'''Image Quilting (Efros & Freeman, 2001) [8]'''
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary: the database of real patches is unlikely to contain the structures that appear in adversarial images.
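
The patch-selection step described above can be sketched as follows. This is a simplified illustration: patches are flat lists of pixel values, the database and the value of K are toy choices, and the minimum graph cut used to blend overlapping patch boundaries is omitted.

```python
import random

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def quilt_patch(patch, clean_patches, k, rng):
    """Replace a patch by one of its k nearest clean patches, chosen uniformly at random."""
    nearest = sorted(clean_patches, key=lambda c: squared_distance(patch, c))[:k]
    return rng.choice(nearest)

def quilt(patches, clean_patches, k=2, seed=0):
    """Apply the nearest-neighbor replacement to every patch of an image."""
    rng = random.Random(seed)
    return [quilt_patch(p, clean_patches, k, rng) for p in patches]

# Toy database of "clean" 2x2 patches, flattened to 4 values each.
clean_db = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.1, 0.1, 0.1, 0.1],
]
# Patches of a hypothetical adversarial image: clean patches plus small noise.
adv_patches = [[0.03, 0.0, 0.02, 0.01], [0.97, 1.0, 0.99, 0.98]]

out = quilt(adv_patches, clean_db, k=2)
print(out)
```

Every output patch comes from the clean database, so no adversarial pixel value survives, and the random choice among the K neighbors makes the transformation stochastic.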

=Experiments=

Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, the adversary is able to read the model architecture and parameters but not the defense strategy, and the defenses are applied to the adversarial images before they are fed to the convolutional network. In the black-box setting, the convolutional network is replaced by a network trained on transformed images. The final experiment compares the authors' defenses with prior work.

'''Set up:'''
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity of each attack was controlled as follows:

- FGSM. Increasing the step size <math>\epsilon</math> increases the normalized L2-dissimilarity.

- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.

- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.

- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation

The hyperparameters of the defenses were fixed in all experiments. Specifically, the pixel dropout probability was set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.

The figure below shows how the setup differs between the experiments that follow. The network is trained either on a) regular images or b) transformed images. The different settings are marked 8.1, 8.2 and 8.3.
[[File:models3.png |center]]

==GrayBox - Image Transformation at Test Time==
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for each of the five transformations. A few interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble, with an ensemble size of 30, gives the best accuracy, around 40-60 percent; and the accuracy of the Image Quilting defense hardly deteriorates as the strength of the adversary increases. However, image quilting does reduce accuracy on non-adversarial examples.

[[File:sFig4.png|center|600px |]]

==BlackBox - Image Transformation at Training and Test Time==
A ResNet-50 model was trained on transformed ImageNet training images. Before the images were fed to the network for training, standard data augmentation (from He et al.) was applied along with bit depth reduction, JPEG compression, TV minimization, or image quilting. The classification accuracy on the same adversarial images as in the previous experiment is shown in the figure below. Because the adversary cannot use this trained model to generate new adversarial images, this is treated as a black-box setting. The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.

[[File:sFig5.png|center|600px |]]

==Blackbox - Ensembling==
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) were studied along with an ensemble of defenses, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table indicate that Inception-v4 performs best. This could be because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy were found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.

[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]

==GrayBox - Image Transformation at Training and Test Time ==
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the earlier experiment (BlackBox - Image Transformation at Training and Test Time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variation minimization, and image quilting are more robust against adversarial images in this setting.
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.

[[File:sFig6.png|center| 600px |]]

==Comparison With Ensemble Adversarial Training==
The results of the experiment are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model released by Tramèr et al. [2] is an Inception-Resnet-v2 trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using the image cropping, total variation minimization, and image quilting defenses. Two differences in assumptions should be noted: the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble training and the pre-processing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.

[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]

=Discussion/Conclusions=
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variation Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit-Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variation Minimization and Image Quilting, possess both properties.

Image quilting involves a discrete variable that selects a patch from the database, which is a non-differentiable operation.
Additionally, total variation minimization randomly selects the pixels it uses to measure the reconstruction error when creating the de-noised image, and image quilting selects one of the K nearest neighbours uniformly at random. This inherent randomness makes it difficult to attack the model.

The authors suggest, as future work, applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variation minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.

=Critiques=
1. The terms Black Box, White Box, and Grey Box attack are not clearly defined.

2. White Box attacks, where the adversary has full access to the model as well as the pre-processing techniques, could have been considered.

3. Though the authors did considerable work in showing the effect of four attacks on the ImageNet database, much stronger attacks (Madry et al. [7]) could have been evaluated.

4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.

5. ([https://openreview.net/forum?id=SyJ7ClWCb]) In the new draft of the paper, the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".

This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.

=References=

1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.

2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.

3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016.

4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.

5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017.

6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.

7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.

8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.

9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.

10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.

11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.

12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017

13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? Adversarial examples against the iCub humanoid. CoRR, abs/1708.06939, 2017.

14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.

15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.

16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.

17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.

stat946w18/Wavelet Pooling For Convolutional Neural Networks

2018-12-01T04:05:40Z

J385chen:

=Wavelet Pooling For Convolutional Neural Networks=


== Introduction, Important Terms and Brief Summary==

This paper focuses on the following important techniques:

1) Convolutional Neural Nets (CNNs): These are networks with layered structures that conform to the shape of inputs rather than vector-based features and consistently obtain high accuracies in the classification of images and objects. Researchers continue to work on improving the performance of CNNs.

2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting.

Some of the pooling methods, including max pooling and average pooling, are deterministic. Deterministic pooling methods are efficient and simple, but can hinder the potential for optimal network learning. In contrast, mixed pooling and stochastic pooling use a probabilistic approach, which can address some problems of deterministic methods. The neighborhood approach is used in all the mentioned pooling methods due to its simplicity and efficiency. Nevertheless, the approach can cause edge halos, blurring, and aliasing, which need to be minimized. This paper introduces wavelet pooling, which uses a second-level wavelet decomposition to subsample features. The nearest neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second-level decomposition and discards the first-level subbands to reduce feature dimensions. This method is compared to other state-of-the-art pooling methods on benchmark classification datasets: MNIST, CIFAR-10, SVHN and KDEF.

For further information on wavelets, follow this link to MathWorks' [https://www.mathworks.com/videos/understanding-wavelets-part-1-what-are-wavelets-121279.html Understanding Wavelets] video series.

== Intuition ==

Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling that provides a more structural method of feature dimension reduction. Max pooling is noted to be prone to over-fitting, and average pooling to smooth out or 'dilute' details in features.

Pooling is often introduced within networks to ensure local invariance and to prevent overfitting to small translational shifts within an image. Despite their effectiveness, traditional pooling methods such as max pooling introduce this translational invariance by discarding information, using methods analogous to nearest neighbour interpolation. Hoping to provide a more organic way of pooling, the authors leverage all the information within the cells fed into a pooling operation, so that the resulting dimension-reduced features contain information from all of those cells, combined through various dot products.

== History ==

Several pooling methods, introduced at different times, are referenced in this study:
* Manual subsampling in 1979
* Max pooling in 1992
* Mixed pooling in 2014
* Pooling methods with probabilistic approaches in 2014 and 2015

== Background ==
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. These pooling methods reduce input data dimensionality by taking the maximum value or the average value of specific areas and condense them into one single value. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:

'''Limitations of Max Pooling and Average Pooling'''

'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (happens if the main details have less intensity than the insignificant details) and also commonly '''over-fits''' the training data. The max-pooling is defined as:

\begin{align}
a_{kij} = \max_{(p,q)\in R_{ij}}(a_{kpq})
\end{align}

'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (this happens when much of the data has values far lower than the significant details). Average pooling is defined as:

\begin{align}
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}
\end{align}

Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 2 provides an example of the weaknesses of these two methods using toy images:

[[File: fig0001.PNG| 700px|center]]
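
The two definitions above can be illustrated with a minimal NumPy sketch. It assumes non-overlapping square regions <math>R_{ij}</math> and a feature map whose height and width are divisible by the region size; the helper names (<code>pool_regions</code>, <code>max_pool</code>, <code>avg_pool</code>) are hypothetical and not taken from the paper.

```python
import numpy as np

def pool_regions(a, size=2):
    """Split a 2-D feature map into non-overlapping size x size regions R_ij.

    Returns an array of shape (H//size, W//size, size*size) holding the
    activations of each region. Assumes H and W are divisible by size.
    """
    h, w = a.shape
    blocks = a.reshape(h // size, size, w // size, size)
    return blocks.transpose(0, 2, 1, 3).reshape(h // size, w // size, size * size)

def max_pool(a, size=2):
    # a_kij = max over (p, q) in R_ij of a_kpq
    return pool_regions(a, size).max(axis=-1)

def avg_pool(a, size=2):
    # a_kij = (1 / |R_ij|) * sum over (p, q) in R_ij of a_kpq
    return pool_regions(a, size).mean(axis=-1)

feature_map = np.array([[1., 2., 0., 0.],
                        [3., 4., 0., 8.],
                        [5., 5., 1., 1.],
                        [5., 5., 1., 1.]])

pooled_max = max_pool(feature_map)
pooled_avg = avg_pool(feature_map)
```

In the top-right region (0, 0, 0, 8), max pooling keeps only the value 8 while average pooling dilutes it to 2, which is the kind of behaviour the limitations above refer to.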

'''How do researchers try to combat these issues?'''
By using '''probabilistic pooling methods''' such as:

1. '''Mixed pooling''': When facing a new problem for which one wants to use a CNN, it is not obvious whether average or max pooling is preferable, and both techniques have significant drawbacks. Average pooling forces the network to consider low-magnitude (and possibly irrelevant) information when constructing representations, while max pooling can force the network to ignore fundamental differences between neighbouring groups of pixels. To counteract this, mixed pooling probabilistically decides which of the two to use during training and testing. Note that, for training, the choice is only probabilistic in the forward pass; during back-propagation the network uses the method chosen in the forward pass. Mixed pooling can be applied in three different ways:

* For all features within a layer
* Mixed between features within a layer
* Mixed between regions for different features within a layer

Mixed Pooling is defined as:

\begin{align}
a_{kij} = \lambda \cdot \max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda) \cdot \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}
\end{align}

Where <math>\lambda</math> is a random value 0 or 1, indicating max or average pooling.
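
A minimal sketch of the mixed pooling formula, under stated assumptions: a single <math>\lambda</math> is drawn for the whole feature map (the first of the three variants listed above), it is drawn with equal probability for 0 and 1, and regions are non-overlapping. The function name <code>mixed_pool</code> and the sampling probability are illustrative choices, not details taken from the paper.

```python
import numpy as np

def mixed_pool(a, lam, size=2):
    """Mixed pooling of a 2-D feature map with a given lambda in {0, 1}.

    lam = 1 reduces to max pooling, lam = 0 to average pooling.
    Assumes the height and width of `a` are divisible by `size`.
    """
    h, w = a.shape
    regions = a.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    regions = regions.reshape(h // size, w // size, size * size)
    return lam * regions.max(axis=-1) + (1 - lam) * regions.mean(axis=-1)

rng = np.random.default_rng(0)
lam = int(rng.integers(0, 2))  # assumption: 0 or 1 with equal probability

feature_map = np.array([[1., 2., 0., 0.],
                        [3., 4., 0., 8.],
                        [5., 5., 1., 1.],
                        [5., 5., 1., 1.]])
pooled = mixed_pool(feature_map, lam)
```

Remembering the sampled <code>lam</code> from the forward pass is what allows back-propagation to follow the same pooling choice.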

2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:

\begin{align}
a_{kij} = a_l ~ \text{where } ~ l\sim P(p_1,p_2,...,p_{|R_{ij}|})
\end{align}

with probability of activations within each region defined as follows:

\begin{align}
p_{pq} = \dfrac{a_{pq}}{\sum_{(p,q) \in R_{ij}} a_{pq}}
\end{align}

The figure below describes the process of stochastic pooling. The panel on the left shows the activations of a given region, and the corresponding probabilities are shown in the center. The activation with the highest probability is the most likely to be selected by the pooling method, but any activation can be selected. In this case, the midrange activation with a probability of 13% is selected.

[[File: stochastic pooling.jpeg| 700px|center]]

As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.
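
The sampling step can be sketched for a single pooling region as follows. This is an illustration under assumptions: activations are non-negative (for example, taken after a ReLU), and an all-zero region returns 0. The function name <code>stochastic_pool_region</code> is hypothetical.

```python
import numpy as np

def stochastic_pool_region(region, rng):
    """Sample one activation from a pooling region.

    Each activation a_pq is picked with probability a_pq / sum(a),
    so larger activations are more likely but not guaranteed to win.
    Assumes all activations are non-negative.
    """
    acts = np.asarray(region, dtype=float).ravel()
    total = acts.sum()
    if total == 0:
        # Assumption: an all-zero region simply pools to zero.
        return 0.0
    probs = acts / total
    idx = rng.choice(acts.size, p=probs)
    return float(acts[idx])

rng = np.random.default_rng(0)
region = np.array([[1., 2.],
                   [3., 4.]])
sample = stochastic_pool_region(region, rng)
```

Here the activation 4 is chosen with probability 0.4, yet 1, 2 and 3 can also be returned, which is the property that distinguishes stochastic pooling from max pooling.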

3. '''Top-k activation pooling''': picks the top-k activations in every pooling region, so that as much information as possible passes through the subsampling gates. It is used together with max pooling: after max pooling, to further improve the representation capability, the top-k activations are picked and summed, and the sum is constrained by a constant.
Details are given in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/

'''Wavelets and Wavelet Transform'''
A wavelet is a representation of some signal. For use in wavelet transforms, they are generally represented as combinations of basis signal functions.

The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.
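
As a small illustration of "inner products with basis functions", the sketch below uses the orthonormal Haar basis for a signal of length 4. The choice of the Haar basis and of the example signal is an assumption made for illustration only.

```python
import numpy as np

signal = np.array([4., 6., 10., 12.])

s = 1 / np.sqrt(2)
# Rows are the orthonormal Haar basis functions for length-4 signals.
basis = np.array([
    [0.5, 0.5,  0.5,  0.5],   # scaling function: overall average
    [0.5, 0.5, -0.5, -0.5],   # coarse wavelet: first half vs second half
    [s,   -s,   0.,   0.],    # fine wavelet on the first half
    [0.,  0.,   s,   -s],     # fine wavelet on the second half
])

# Each coefficient is the inner product of the signal with one basis function.
coeffs = basis @ signal

# Because the basis is orthonormal, the transpose inverts the transform.
reconstructed = basis.T @ coeffs
```

Most of the signal's energy ends up in the first two coefficients (16 and -6), which is why quantizing or dropping the small fine-scale coefficients can compress a signal with little loss.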

One issue of note is that wavelets offer a tradeoff between resolution in frequency and resolution in time (or, for images, location). For example, a sine wave is useful for detecting signals with its own frequency, but cannot detect where along the sine wave this alignment of signals is occurring. Thus, basis functions must be chosen with this tradeoff in mind.

Source: Compressing still and moving images with wavelets

== Proposed Method ==

The previously highlighted pooling methods use neighborhoods to subsample, almost identical to nearest neighbor interpolation.

The proposed pooling method uses wavelets (i.e. small waves - generally used in signal processing) to reduce the dimensions of the feature maps. They use wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression. The authors say that this organic reduction therefore lessens the creation of jagged edges and other artifacts that may impede correct image classification.

* '''Forward Propagation'''

The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:

\begin{align}
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}
\end{align}

\begin{align}
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}
\end{align}

where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, and <math>W_{\varphi},W_{\psi}</math> are called the approximation and detail coefficients. <math>h_{\varphi}[-n]</math> and <math>h_{\psi}[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level.

When the FWT is used on images, it is applied twice (once on the rows, then again on the columns). Doing this in combination yields the detail sub-bands (LH, HL, HH) at each decomposition level and the approximation sub-band (LL) for the highest decomposition level.
After performing the 2nd order decomposition, the image features are reconstructed, but using only the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT), which is based on the inverse DWT (IDWT).

\begin{align}
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}
\end{align}

[[File: wavelet pooling forward.PNG| 700px|center]]
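
The forward pass described above can be sketched in NumPy with the Haar wavelet. This is a minimal illustration and not the authors' MatConvNet implementation: it assumes an orthonormal Haar filter, a single 2-D feature map whose sides are divisible by 4, and a particular naming of the detail sub-bands (naming conventions for LH and HL vary). The function names are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D orthonormal Haar transform.

    Returns the approximation sub-band LL and the detail sub-bands
    (LH, HL, HH), each half the size of x along both axes.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

def haar_idwt2(ll, details):
    """Inverse of haar_dwt2: rebuilds an array twice the size of ll."""
    lh, hl, hh = details
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def wavelet_pool(x):
    """Pool by a factor of 2 using only the 2nd-level sub-bands."""
    ll1, _discarded = haar_dwt2(x)      # level 1: detail sub-bands are dropped
    ll2, details2 = haar_dwt2(ll1)      # level 2 of the decomposition
    return haar_idwt2(ll2, details2)    # reconstruct from level 2 only

features = np.arange(64, dtype=float).reshape(8, 8)
pooled = wavelet_pool(features)         # shape (4, 4)
```

With the normalisation assumed in this sketch, perfect reconstruction means the pooled output coincides with the level-1 approximation sub-band, that is, twice the mean of each 2x2 block. Other wavelet bases or normalisations would not reduce to this simple form.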

* '''Backpropagation'''

The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes a 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands are up-sampled by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5 illustrates the wavelet pooling backpropagation algorithm in detail:

[[File:wavelet pooling backpropagation.PNG| 700px|center]]
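
The backpropagation steps above can be sketched as follows. Several details are assumptions made only for illustration: the Haar wavelet with orthonormal scaling, nearest-neighbour repetition as the up-sampling of the detail sub-bands (the summary does not state which up-sampling is used), and a 2-D input whose sides are even. The function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D orthonormal Haar transform: LL and (LH, HL, HH)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    return ll, ((a - b + c - d) / 2, (a + b - c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, details):
    """Inverse of haar_dwt2: rebuilds an array twice the size of ll."""
    lh, hl, hh = details
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def upsample2(x):
    # Assumption: nearest-neighbour up-sampling by a factor of 2.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def wavelet_pool_backward(grad):
    """Map a pooled-size feature back to the pre-pooling size."""
    # 1st order decomposition; it plays the role of the 2nd level.
    ll2, details2 = haar_dwt2(grad)
    # Up-sampled detail sub-bands form the new 1st level.
    details1 = tuple(upsample2(d) for d in details2)
    # Reconstruct level 2, then level 1, with the inverse transform.
    ll1 = haar_idwt2(ll2, details2)
    return haar_idwt2(ll1, details1)

grad_in = np.ones((4, 4))
grad_out = wavelet_pool_backward(grad_in)   # shape (8, 8)
```

A different up-sampling rule (for example zero insertion) would change the values written into the new 1st level detail sub-bands but not the overall structure of the procedure.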

== Results and Discussion ==

All experiments have been performed using the MatConvNet (Vedaldi & Lenc, 2015) framework. Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except the one for MNIST use a network loosely based on Zeiler's network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013), and Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SVHN (Dropout only), to examine how these regularization techniques change the pooling results. The authors have tested the proposed method on four different datasets, as shown in the figure:

[[File: selection of image datasets.PNG| 700px|center]]

Max, average, mixed, stochastic, and wavelet pooling have each been used in the pooling layers of every architecture. Accuracy and model energy have been used as the metrics to evaluate the methods, and their performance has been compared on the different datasets.

* MNIST:

The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.

[[File: CNN MNIST.PNG| 700px|center]]

The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows that the proposed method outperforms all other methods. Given the small number of epochs, max pooling is the only method that starts to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.

[[File: MNIST pooling method energy.PNG| 700px|center]]

The accuracies for both paradigms are shown below:

[[File: MNIST perf.PNG| 700px|center]]

* CIFAR-10:

The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes.

[[File: CNN CIFAR.PNG| 700px|center]]

The input training and test data come from the CIFAR-10 dataset.
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. For both cases, without and with dropout, the tables below show that the proposed method has the second highest accuracy.

[[File: fig0000.jpg| 700px|center]]

Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:

[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]

* SVHN:

Two sets of experiments are performed with the pooling methods. The first uses a regular network structure with no dropout layers, in order to observe each pooling method without extra regularization, as in the CIFAR-10 experiments.
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SVHN experiments:

[[File: CNN SHVN.PNG| 700px|center]]

The input training and test data come from the SVHN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images extracted from the extra training set of 531,131 images, and the full testing set of 26,032 images. For both cases, without and with dropout, the tables below show that the proposed method has the second lowest accuracy.

[[File: SHVN perf.PNG| 700px|center]]

Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.

[[File: SHVN pooling method energy.PNG| 700px|center]]

* KDEF:

They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:

[[File:CNN KDEF.PNG| 700px|center]]

The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).

This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that need to match the dimensions set by the creators (762 x 562).
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.

The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.
The figure below shows the energy of each method per epoch.

[[File: KDEF pooling method energy.PNG| 700px|center]]

The accuracies are shown below:

[[File: KDEF perf.PNG| 700px|center]]

* Computational Complexity:
The above experiments and implementations of wavelet pooling were more of a proof of concept than an optimized method. In terms of mathematical operations, wavelet pooling is the least computationally efficient of all the pooling methods mentioned above. Average pooling is the most efficient, max pooling and mixed pooling are at a similar level, and wavelet pooling is far more expensive to compute.

== Conclusion ==

The authors show that wavelet pooling has the potential to equal or eclipse some of the traditional methods currently utilized in CNNs. Their proposed method outperforms all others on the MNIST dataset, outperforms all but one on the CIFAR-10 and KDEF datasets, and performs within respectable ranges of the pooling methods that outdo it on the SVHN dataset. The addition of dropout and batch normalization shows the proposed method's response to network regularization. As in the non-dropout cases, it outperforms all but one method on both the CIFAR-10 and KDEF datasets and performs within respectable ranges of the pooling methods that outdo it on the SVHN dataset.

The authors' results confirm previous studies showing that no one pooling method is superior, but that some perform better than others depending on the dataset and network structure (Boureau et al., 2010; Lee et al., 2016). Furthermore, many networks alternate between different pooling methods to maximize the effectiveness of each method. [1]

Future work and improvements in this area could vary the wavelet basis to explore which basis performs best for pooling. Altering the upsampling and downsampling factors in the decomposition and reconstruction could lead to better image feature reductions beyond the 2x2 scale. Retaining the discarded subbands for backpropagation could lead to higher accuracies and fewer errors. Improving the FWT implementation could greatly increase computational efficiency. Finally, analyzing the structural similarity (SSIM) of wavelet pooling versus other methods could further demonstrate the vitality of the authors' approach. [1]

== Suggested Future work ==

The upsampling and downsampling factors in the decomposition and reconstruction could be changed to achieve more feature reduction.
The subbands that are currently discarded could be retained to achieve higher accuracy. Achieving higher computational efficiency requires improving the FWT implementation.

== Critiques and Suggestions ==
* The backpropagation process, which could be a strong point of the study, is not described in enough detail compared to the existing methods.
* The study centres on wavelet decomposition, yet the reasons for choosing Haar as the mother wavelet and for the number of decomposition levels are not described and are only mentioned as future work.
* At the beginning, the study mentions that pooling methods do not receive the attention they deserve. In the end, the results show that the choice of pooling method depends on the dataset, and the authors mention trial and error as a reasonable approach to choosing it. In my view, the authors have not really focused on providing a pooling method that effectively improves on the current situation. At the least, trying to extract a clearer pattern relating the results to the dataset structure would have been helpful.
* The origins of average pooling, which is mentioned as the main pooling algorithm to compare with, are not even referenced in the introduction.
* A combination of wavelet, max, and average pooling would be an interesting option to investigate further, both in sequence (max/average after wavelet pooling) and combined as in mixed pooling.
* While the current datasets express the performance of the proposed method in an appropriate way, it would be a good idea to evaluate the method on some larger datasets. This might help to understand whether the size of a dataset affects the overfitting behavior of max pooling that is mentioned in the paper.
* Adding asymptotic notation for the computational complexity of the proposed algorithm would be meaningful, particularly since the given results are for a single, fixed input size (one image in forward propagation) and consequently are not generalizable.
* The authors could have considered comparing against the Fast Fourier Transform (FFT). A non-wavelet transform seems to be an obvious candidate for comparison.
* Going beyond the 2x2 pooling window would have further supported their method.
* ([https://openreview.net/forum?id=rkhlb8lCZ]) The experiments are largely conducted with very small scale datasets. As a result, I am not sure if they are representative enough to show the performance difference between different pooling methods.
* ([https://openreview.net/forum?id=rkhlb8lCZ]) No comparison to non-wavelet methods. For example, one obvious comparison would have been to look at using a DCT or FFT transform where the output would discard high-frequency components (this can get very close to the wavelet idea!). This critique might also provide some interesting research directions, since DCT or FFT transforms as pooling have not been thoroughly studied yet.
* Convolutional neural networks are not only used in image-related tasks. Evaluating the efficiency of wavelet pooling in convolutional neural networks applied to natural language or other applicable areas would be interesting. Such experiments would also show whether the approach can be generalized.

== References ==

Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).

Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.

== Revisions ==

*Two reviewers really liked the paper, and one of them placed it in the top 15% of papers at the conference, which supports the novelty and potential of the idea. One other reviewer, however, believed that it was not good enough to be accepted; the main reason given for rejection was the linear nature of wavelets (which was not convincingly described).

*The main concern of two of the reviewers was the size of the datasets used to test the method; the authors mentioned future work on testing the method with bigger datasets.

*The computational cost section was not included in the paper initially and was added in response to one reviewer's concern. The other reviewers did not raise this point, so there are no comments from them on it. However, the description of the non-optimized implementation seemed to satisfy the reviewer, and the paper was accepted.

[https://openreview.net/forum?id=rkhlb8lCZ Revisions]

Finally, for those interested in implementing the method, the authors are willing to share their code, but only after making it efficient. So there may be a follow-up paper addressing lower computational cost on larger datasets, with publishable code.

Fairness Without Demographics in Repeated Loss Minimization

2018-12-01T03:59:03Z

J385chen:

This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P., which was published at the International Conference on Machine Learning (ICML) in 2018.

=Introduction=

Machine learning models are usually trained to minimize average loss in order to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute less data to the model. For example, non-native speakers may contribute less data to a speech recognition model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces or identify languages. The disparity widens further because minority users who suffer higher error rates use the system less in the future. As a result, minority groups provide even less data for future optimization of the model. With fewer data points to work with, the disparity becomes even worse, a phenomenon referred to as '''''disparity amplification'''''.

[[File:fairness_example.JPG|700px|center]]

In this paper, Hashimoto et al. provide a strategy for controlling the worst-case risk amongst all groups. They first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity. This representation disparity is further amplified over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Hashimoto et al. show that DRO can bound the loss for minority groups at every time step and, by applying it to an Amazon Mechanical Turk task, that it stays fair for models that ERM turns unfair.

===Note on Fairness===

Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group, rather than the whole group (cf. utilitarianism).

===Related Works===
Recent advances on the topic of fairness in machine learning can be classified into the following approaches:

1. Rawls' difference principle (Rawls, 2001, p. 155): maximizing the welfare of the worst-off group is fair and stable over time, which increases the chance that minorities will consent to the status quo. The current work builds on this principle, treating predictive accuracy as a resource to be allocated.

2. Labels of minorities present in the data:
* Chouldechova, 2017: Use of race (a protected label) in recidivism prediction. This study evaluated the likelihood of a criminal defendant reoffending at a later time, which assists with criminal justice decision-making. A risk assessment instrument called COMPAS was studied and found to be biased against black defendants. As the consequences of misclassification can be dire, fairness with respect to using race as a label was studied.
* Barocas & Selbst, 2016: Guaranteeing fairness for a protected label through constraints such as equalized odds, disparate impact, and calibration.
In the case specific to this paper, this information is not present.

3. Fairness when minority groupings are not explicitly present:
* Dwork et al., 2012 used individual notions of fairness based on a fixed similarity function.
* Kearns et al., 2018 and Hebert-Johnson et al., 2017 consider subgroups of a set of protected features.
* Joseph et al., 2016: Rawlsian fairness for machine learning.
Again, for the specific case in this paper, this is not possible.

4. Online settings
* Joseph et al., 2016; Jabbari et al., 2017 looked at fairness in bandit learning using algorithms compatible with Rawls’ principle on equality of opportunity.
* Liu et al. (2018) analyzed fairness temporally in the context of constraint-based fairness criteria. It showed that fairness is not ensured over time when static fairness constraints are enforced.

=Representation Disparity=

If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>.

The expected loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>.

If input queries are made by users from <math display="inline">K</math> groups, then the distribution over all queries can be rewritten as <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, where <math display="inline">\alpha_k</math> is the population proportion of group <math display="inline">k</math> and <math display="inline">P_k</math> is its individual distribution. Both quantities are assumed to be unknown.

The risk associated with group <math>k</math> can be written as, <math>\mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]</math>.

The worst-case risk over all groups can then be defined as,
\begin{align}
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{\max} \mathcal{R}_k(\theta).
\end{align}

Minimizing this function is equivalent to minimizing the risk for the worst-off group.

There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. A model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high).
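The gap between average risk and worst-case risk can be illustrated with a small sketch. The group names and loss values below are hypothetical and chosen only for illustration; in the setting of the paper the group memberships are latent, so this computation would not be available in practice.

```python
# Hypothetical per-query losses, split by a group label that is normally hidden.
group_losses = {
    "majority": [0.10, 0.05, 0.20, 0.15, 0.10, 0.05, 0.10, 0.05],
    "minority": [0.90, 0.70],
}


def mean(values):
    return sum(values) / len(values)


def average_risk(losses_by_group):
    """Empirical version of R(theta): mean loss over all queries."""
    all_losses = [l for losses in losses_by_group.values() for l in losses]
    return mean(all_losses)


def group_risks(losses_by_group):
    """Empirical version of R_k(theta) for every group k."""
    return {k: mean(losses) for k, losses in losses_by_group.items()}


def worst_case_risk(losses_by_group):
    """Empirical version of R_max(theta): the largest group risk."""
    return max(group_risks(losses_by_group).values())


overall = average_risk(group_losses)
worst = worst_case_risk(group_losses)
print(f"average risk: {overall:.2f}, worst-case risk: {worst:.2f}")
```

Here the average risk looks acceptable while the worst-case risk is several times larger, which is the situation described as high representation disparity.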

Often, groups are latent: <math display="inline">k, P_k</math> are unknown, so the worst-case risk cannot be computed directly. The technique proposed by Hashimoto et al. does not require direct access to these.

=Disparity Amplification=

Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> are not constant, but vary depending on past losses.

At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by
\begin{align}
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k
\end{align}

where <math>\nu(x)</math> is the retention function (the fraction of users retained), which decreases as <math>x</math> increases, <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> is therefore the expected number of users retained from the previous optimization round, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>.

Furthermore, the group proportions <math display="inline">\alpha_k^{(t)}</math>, which depend on past losses, are defined as:
\begin{align}
\alpha_k^{(t+1)} := \dfrac{\lambda_k^{(t+1)}}{\sum_{k'\in[K]} \lambda_{k'}^{(t+1)}}
\end{align}

Put simply, the expected number of users of a group depends on the number of new users of that group and on the fraction of users that continue to use the system from the previous optimization step. If fewer users from minority groups return to the model (i.e. the model has a low retention rate of minority group users), Hashimoto et al. argue that the representation disparity amplifies. The decrease in user retention for the minority group is exacerbated over time: once a group shrinks sufficiently, it receives higher losses relative to others, leading to even fewer samples from the group.
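The update rules for <math display="inline">\lambda_k</math> and <math display="inline">\alpha_k</math> can be simulated in a few lines. This is only a sketch under loud assumptions: the retention function <code>retention</code> (here <math>\nu(x) = 1 - x</math>, clipped to [0, 1]) and the rule <code>toy_group_risks</code> (a group's risk is one minus its proportion, standing in for a model that fits larger groups better) are hypothetical and are not taken from the paper.

```python
def retention(risk):
    """Hypothetical retention function nu: decreasing in the risk, clipped to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - risk))


def update_user_counts(lam, risks, new_users):
    """lambda_k^(t+1) = lambda_k^(t) * nu(R_k(theta^(t))) + b_k for every group k."""
    return [l * retention(r) + b for l, r, b in zip(lam, risks, new_users)]


def proportions(lam):
    """alpha_k = lambda_k / sum over k' of lambda_k'."""
    total = sum(lam)
    return [l / total for l in lam]


def toy_group_risks(alpha):
    """Hypothetical stand-in for a trained model: smaller groups get higher risk."""
    return [1.0 - a for a in alpha]


def simulate(lam, new_users, rounds):
    """Return the group proportions after each round, starting with round 0."""
    history = [proportions(lam)]
    for _ in range(rounds):
        risks = toy_group_risks(history[-1])
        lam = update_user_counts(lam, risks, new_users)
        history.append(proportions(lam))
    return history


# Two groups that start at 60% / 40% and receive the same number of new users.
history = simulate(lam=[60.0, 40.0], new_users=[10.0, 10.0], rounds=20)
for t in (0, 1, 5, 20):
    print(t, [round(a, 3) for a in history[t]])
```

Even though both groups gain the same number of new users each round, the minority share drifts downward from its starting value in this toy setting, because its higher risk lowers its retention.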

==Empirical Risk Minimization (ERM)==

Without knowledge of the population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math>, it is hard to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math>. The standard approach is therefore to fit a sequence of models <math display="inline">\theta^{(t)}</math> empirically on the data observed in each round. With ERM, for instance, each model is obtained by minimizing the total loss over that round's samples:

\begin{align}
\theta^{(t)} = \underset{\theta \in \Theta}{\arg\min} \sum_i \ell(\theta; Z_i^{(t)})
\end{align}

However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.
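A toy example makes the effect of ERM on a minority group concrete. The sketch below assumes a scalar model <math>\theta</math> with squared loss <math>\ell(\theta; z) = (\theta - z)^2</math>, for which the ERM solution is the sample mean; the data values are hypothetical and the setup is not the one used in the paper.

```python
def erm_scalar(samples):
    """ERM for squared loss over a scalar theta: the minimizer is the sample mean."""
    return sum(samples) / len(samples)


def squared_risk(theta, samples):
    """Average squared loss of theta on the given samples."""
    return sum((theta - z) ** 2 for z in samples) / len(samples)


# Hypothetical data: 80% of the queries come from the majority group.
majority = [0.0] * 8
minority = [1.0] * 2

theta = erm_scalar(majority + minority)
risk_all = squared_risk(theta, majority + minority)
risk_majority = squared_risk(theta, majority)
risk_minority = squared_risk(theta, minority)

print(f"theta = {theta:.2f}")
print(f"overall risk  = {risk_all:.2f}")
print(f"majority risk = {risk_majority:.2f}")
print(f"minority risk = {risk_minority:.2f}")
```

The fitted parameter sits close to the majority data, so the minority risk is much larger than the overall risk that ERM reports.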

=Distributionally Robust Optimization (DRO)=

To overcome the unfairness of ERM, Hashimoto et al. propose distributionally robust optimization (DRO). The goal is still to minimize the worst-case group risk over a single time step, <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this section's formulas). As previously mentioned, this is difficult because neither the population proportions <math display="inline">\alpha_k </math> nor the group distributions <math display="inline">P_k </math> are known; the data is sampled from different unknown groups. Therefore, in order to improve performance across groups, Hashimoto et al. use an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". That is, DRO guards against high loss on any group distribution <math display="inline">P_k </math>, not only on the overall distribution that techniques such as ERM optimize. To achieve this distributional robustness, the risk function <math display="inline">\mathcal{R}_{dro} </math> has to up-weight data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta; Z)</math>. In other words, for groups that suffer high loss, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) relative to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>).

To do this, Hashimoto et al. consider the worst-case loss (i.e. the highest risk) over all perturbations of <math display="inline">P</math> within a certain limit (because not every outlier should be up-weighted). This limit is expressed through the <math display="inline">\chi^2</math>-divergence (roughly speaking, a distance) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. If <math display="inline">P</math> is not absolutely continuous w.r.t. <math display="inline">Q</math>, then <math display="inline">D_{\chi^2} (P || Q):= \infty</math>. Using the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. define the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math> as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With this ball, one can consider the worst-case loss (i.e. the highest risk) over all perturbations that lie inside the ball (i.e. within a reasonable range) around <math display="inline">P</math>. This loss is given by

\begin{align}
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]
\end{align}

For a mixture <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and all models <math display="inline">\theta \in \Theta</math>, this quantity bounds the risk of each group: <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, where <math display="inline">r_k := (1/\alpha_k -1)^2</math>. Furthermore, if a lower bound <math display="inline">\alpha_{min} \leq min_{k \in [K]} \alpha_k</math> on the group proportions is specified and the radius is set to <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> that can be minimized.
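To make the radius concrete, the following sketch computes the <math display="inline">\chi^2</math>-divergence between a minority group distribution and the mixture for discrete distributions, and compares it with the radius derived from the minority proportion. The three-outcome distributions and the 80/20 split are hypothetical values chosen for illustration, and only the minority group is checked.

```python
import math


def chi2_divergence(q, p):
    """D(Q || P) = sum_i p_i * (q_i / p_i - 1)^2 for discrete distributions.

    Returns infinity when Q is not absolutely continuous w.r.t. P.
    """
    total = 0.0
    for q_i, p_i in zip(q, p):
        if p_i == 0.0:
            if q_i > 0.0:
                return math.inf
            continue
        total += p_i * (q_i / p_i - 1.0) ** 2
    return total


def radius(alpha):
    """Radius (1/alpha - 1)^2 associated with a group of proportion alpha."""
    return (1.0 / alpha - 1.0) ** 2


# Hypothetical mixture of two groups over three outcomes.
alphas = [0.8, 0.2]
p_major = [0.7, 0.3, 0.0]
p_minor = [0.0, 0.5, 0.5]
mixture = [alphas[0] * a + alphas[1] * b for a, b in zip(p_major, p_minor)]

r_minor = radius(alphas[1])
d_minor = chi2_divergence(p_minor, mixture)
print(f"radius for the minority group: {r_minor:.2f}")
print(f"divergence of the minority distribution from the mixture: {d_minor:.3f}")
```

In this example the minority distribution lies well inside the ball of radius <math display="inline">r_{max}</math>, so the supremum over the ball is at least as large as the minority risk.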

==Optimization of DRO==

To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this maximization problem (the supremum over distributions <math display="inline">Q</math> is rewritten as a minimization over a single scalar variable). This dual is given by the minimization problem

\begin{align}
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}
\end{align}

with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. Here <math display="inline">\eta</math> is the dual variable (i.e. the variable that appears when forming the dual). Since <math display="inline">F(\theta; \eta)</math> involves only an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, it can be minimized directly. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>.

In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That is, if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized at every time step (so that <math display="inline">\mathcal{R}_{max} (\theta) </math> is kept small), then <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at every time step is enough to avoid disparity amplification.
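The dual form can be evaluated on a sample of losses for a fixed model. The sketch below is an illustration under assumptions rather than the authors' implementation: the expectation is replaced by an empirical mean, the loss values are hypothetical, and the one-dimensional minimization over <math display="inline">\eta</math> uses a ternary search on the convex objective (swapped in here for the binary search mentioned above). The function name <code>dro_risk</code> and its parameters are made up for this example.

```python
import math


def dro_risk(losses, alpha_min, iters=200):
    """Empirical dual value inf over eta of C * sqrt(mean([loss - eta]_+^2)) + eta.

    Assumes 0 < alpha_min < 1 so that the radius is positive.
    Returns the pair (risk value, minimizing eta).
    """
    n = len(losses)
    r_max = (1.0 / alpha_min - 1.0) ** 2
    c = math.sqrt(2.0 * r_max + 1.0)

    def dual_objective(eta):
        second_moment = sum(max(l - eta, 0.0) ** 2 for l in losses) / n
        return c * math.sqrt(second_moment) + eta

    # Bracket the minimizer: the objective equals eta above max(losses), and
    # below min(losses) its stationary point is mean - std / sqrt(2 * r_max).
    mean = sum(losses) / n
    std = math.sqrt(sum((l - mean) ** 2 for l in losses) / n)
    lo = min(min(losses), mean - std / math.sqrt(2.0 * r_max))
    hi = max(losses)

    # Ternary search is valid because the objective is convex in eta.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_objective(m1) <= dual_objective(m2):
            hi = m2
        else:
            lo = m1
    eta = (lo + hi) / 2.0
    return dual_objective(eta), eta


# Hypothetical losses: 40 majority queries and 10 minority queries (20%).
majority_losses = [0.1] * 40
minority_losses = [0.7 + 0.02 * i for i in range(10)]
losses = majority_losses + minority_losses

average_risk = sum(losses) / len(losses)
minority_risk = sum(minority_losses) / len(minority_losses)
robust_risk, eta_star = dro_risk(losses, alpha_min=0.2)

print(f"average risk:  {average_risk:.3f}")
print(f"minority risk: {minority_risk:.3f}")
print(f"DRO risk:      {robust_risk:.3f} (eta = {eta_star:.3f})")
```

The robust value lies above the (normally unobservable) minority risk without using any group labels, which is the property that makes it usable as a training objective when demographics are unavailable.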

* Pros of DRO: In many cases, the expected value is a good measure of performance.
* Cons of DRO: One has to know the exact underlying distribution to perform the stochastic optimization. Deviation from the assumed distribution may result in sub-optimal solutions. The paper makes strong assumptions on <math>\mathcal{P}</math> with respect to group allocations, and thus requires a large amount of data to optimize; when the assumptions are violated, the algorithm fails to perform as intended.

=Experiments=

The paper demonstrates the effectiveness of DRO in simulation and in a human evaluation of a text autocomplete system on Amazon Mechanical Turk. In both cases, DRO controls the worst-case risk over time steps and improves minority retention.

The figure below shows the inferred dynamics from a Mechanical Turk based evaluation of autocomplete systems. DRO increases minority (a) user satisfaction and (b) retention, leading to a corresponding increase in (c) user count. Error bars indicate bootstrap quartiles.
[[File:fig4999.png|thumb|center|600px|]]

The figure below shows how disparity amplification is corrected by DRO. Error bars indicate quartiles over 10 replicates.
[[File:fig5999.png|thumb|center|400px|]]

The figure below shows classifier accuracy as a function of group imbalance. Dotted lines show accuracy on the majority group.
[[File:fig6999.png|thumb|center|400px|]]

It is a surprising result that the minority group has higher satisfaction and retention under DRO. Analysis of long-form comments from Turkers attributes this phenomenon to users valuing the model’s ability to complete slang more highly than completion of common words, and indicates a slight mismatch between the authors' training loss and human satisfaction with an autocomplete system.

=Critiques=

This paper works on representation disparity, which is a critical problem to contribute to. The methods are well developed and the paper reads coherently. However, the authors make several limiting assumptions that are not very intuitive or well justified. The first assumption is that the retention function <math display="inline">\nu</math>, denoting the fraction of users retained, is a differentiable and strictly decreasing function. This assumption does not seem practical. The second assumption is that the learned parameters follow a Poisson distribution. No explanation is given for this assumption, and the reasons hinted at are hand-wavy at best. Though the authors are building a case against the empirical risk minimization method, this method is exactly solvable when the data is linearly separable. The DRO method is computationally more complex than ERM, and it is not entirely clear whether it will always have an advantage for a different class of problems.

Note: The first assumption about <math>\nu</math> can be weakened by introducing a discrete yet sufficiently smooth function for computational purposes only. Such a function would be enough to mimic differentiability.

=Other Sources=
# [https://blog.acolyer.org/2018/08/17/fairness-without-demographics-in-repeated-loss-minimization/] is an easy-to-read description of the paper.
# [https://vimeo.com/295743125] is a video of the authors explaining the paper at ICML 2018.

=References=
Rawls, J. Justice as fairness: a restatement. Harvard University Press, 2001.

Barocas, S. and Selbst, A. D. Big data’s disparate impact. California Law Review, 104(3):671–732, 2016.

Chouldechova, A. A study of bias in recidivism prediction instruments. Big Data, pp. 153–163, 2017.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pp. 214–226, 2012.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2018.

Hébert-Johnson, Ú., Kim, M. P., Reingold, O., and Rothblum, G. N. Calibration for the (computationally identifiable) masses. arXiv preprint arXiv:1711.08513, 2017.

Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Rawlsian fairness for machine learning. In FATML, 2016.

Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1617–1626, 2017.

Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.

Fairness Without Demographics in Repeated Loss Minimization

2018-12-01T03:51:45Z

J385chen:

This page contains the summary of the paper "[http://proceedings.mlr.press/v80/hashimoto18a.html Fairness Without Demographics in Repeated Loss Minimization]" by Hashimoto, T. B., Srivastava, M., Namkoong, H., & Liang, P. which was published at the International Conference of Machine Learning (ICML) in 2018.

=Introduction=

Usually, machine learning models are minimized in their average loss to achieve high overall accuracy. While this works well for the majority, minority groups that use the system suffer high error rates because they contribute fewer data to the model. For example, non-native speakers may contribute less to the speech recognizer machine learning model. This phenomenon is known as '''''representation disparity''''' and has been observed in many models that, for instance, recognize faces, or identify the language. This disparity further widens for minority users who suffer higher error rates, as they will lower usage of the system in the future. As a result, minority groups provide even less data for future optimization of the model. With less data points to work with, the disparity becomes even worse - a phenomenon referred to as '''''disparity amplification'''''.

[[File:fairness_example.JPG|700px|center]]

In this paper, Hashimoto et al. provide a strategy for controlling the worst case risk amongst all groups. They first show that standard '''''empirical risk minimization (ERM)''''' does not control the loss of minority groups, and thus causes representation disparity . This representation disparity is further amplified over time (even if the model is fair in the beginning). Second, the researchers try to mitigate this unfairness by proposing the use of '''''distributionally robust optimization (DRO)'''''. Indeed, Hashimoto et al. are able to show that DRO can bound the loss for minority groups at every step of time, and is fair for models that ERM turns unfair by applying it to Amazon Mechanical Turk task.

===Note on Fairness===

Hashimoto et al. follow the ''difference principle'' to achieve and measure fairness. It is defined as the maximization of the welfare of the worst-off group, rather than the whole group (cf. utilitarianism).

===Related Works===
The recent advancements on the topic of fairness in Machine learning can be classified into the following approaches:

1. Rawls Difference principle (Rawls, 2001, p155) - Defines that maximizing the welfare of the worst-off group is fair and stable over time, which increases the chance that minorities will consent to status-quo. The current work builds on this as it sees predictive accuracy as a resource to be allocated.

2. Labels of minorities present in the data:
* Chouldechova, 2017: Use of race (a protected label) in recidivism protection. This study evaluated the likelihood for a criminal defendant to reoffend at a later time, which assisted with criminal justice decision-making. However, a risk assessment instrument called COMPAS was studied and discovered to be biased against black defendants. As the consequences for misclassification can be dire, fairness regarding using race as a label was studied.
* Barocas & Selbst, 2016: Guaranteeing fairness for a protected label through constraints such as equalized odds, disparate impact, and calibration.
In the case specific to this paper, this information is not present.

3. Fairness when minority grouping are not present explicitly
* Dwork et al., 2012 used Individual notions of fairness using fixed similarity function whereas Kearns et al., 2018; Hebert-Johnson et al., 2017 used subgroups of a set of protected labels.
* Rawlsian Fairness for Machine Learning, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel †Aaron Roth November 1, 2016
* Kearns et al. (2018); Hebert-Johnson et al. (2017) consider subgroups of a set of protected features.
Again for the specific case in this paper, this is not possible.

4. Online settings
* Joseph et al., 2016; Jabbari et al., 2017 looked at fairness in bandit learning using algorithms compatible with Rawls’ principle on equality of opportunity.
* Liu et al. (2018) analyzed fairness temporally in the context of constraint-based fairness criteria. It showed that fairness is not ensured over time when static fairness constraints are enforced.

=Representation Disparity=

If a user makes a query <math display="inline">Z \sim P</math>, the model <math display="inline">\theta \in \Theta</math> makes a prediction, and the user experiences loss <math display="inline">\ell (\theta; Z)</math>.

The expected loss of a model <math display="inline">\theta</math> is denoted as the risk <math display="inline">\mathcal{R}(\theta) = \mathbb{E}_{Z \sim P} [\ell (\theta; Z)] </math>.

If input queries are made by users from <math display="inline">K</math> groups, then the distribution over all queries can be re-written as <math display="inline">Z \sim P := \sum_{k \in [K]} \alpha_kP_k</math>, where <math display="inline">\alpha_k</math> is the population portion of group <math display="inline">k</math> and <math display="inline">P_k</math> is its individual distribution, and we assume these two variables are unknown.

The risk associated with group <math>k</math> can be written as, <math>\mathcal{R}_k(\theta) := \mathbb{E}_{P_k} [\ell (\theta; Z)]</math>.

The worst-case risk over all groups can then be defined as,
\begin{align}
\mathcal{R}_{max}(\theta) := \underset{k \in [K]}{max} \mathcal{R}_k(\theta).
\end{align}

Minimizing this function is equivalent to minimizing the risk for the worst-off group.

There is high representation disparity if the expected loss of the model <math display="inline">\mathcal{R}(\theta)</math> is low, but the worst-case risk <math display="inline">\mathcal{R}_{max}(\theta)</math> is high. A model with high representation disparity performs well on average (i.e. has low overall loss), but fails to represent some groups <math display="inline">k</math> (i.e. the risk for the worst-off group is high).

Often, groups are latent and <math display="inline">k, P_k</math> are unknown and the worst-case risks are inaccessible. The technique proposed by Hashimoto et al does not require direct access to these.

=Disparity Amplification=

Representation disparity can amplify as time passes and loss is minimized. Over <math display="inline">t = 1, 2, ..., T</math> minimization rounds, the group proportions <math display="inline">\alpha_k^{(t)}</math> are not constant, but vary depending on past losses.

At each round the expected number of users <math display="inline">\lambda_k^{(t+1)}</math> from group <math display="inline">k</math> is determined by
\begin{align}
\lambda_k^{(t+1)} := \lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)})) + b_k
\end{align}

where <math display="inline">\lambda_k^{(t)} \nu(\mathcal{R}_k(\theta^{(t)}))</math> describes the fraction of retained users from the previous optimization, <math>\nu(x)</math> is a function that decreases as <math>x</math> increases, and <math display="inline">b_k</math> is the number of new users of group <math display="inline">k</math>.

Furthermore, the group proportions <math display="inline">\alpha_k^{(t)}</math>, dependent on past losses is defined as:
\begin{align}
\alpha_k^{(t+1)} := \dfrac{\lambda_k^{(t+1)}}{\sum_{k'\in[K]} \lambda_{k'}^{(t+1)}}
\end{align}

To put simply, the number of expected users of a group depends on the number of new users of that group and the fraction of users that continue to use the system from the previous optimization step. If fewer users from minority groups return to the model (i.e. the model has a low retention rate of minority group users), Hashimoto et al. argue that the representation disparity amplifies. The decrease in user retention for the minority group is exacerbated over time since once a group shrinks sufficiently, it receives higher losses relative to others, leading to even fewer samples from the group.

==Empirical Risk Minimization (ERM)==

Without the knowledge of population proportions <math display="inline">\alpha_k^{(t)}</math>, the new user rate <math display="inline">b_k</math>, and the retention function <math display="inline">\nu</math> it is hard to control the worst-case risk over all time periods <math display="inline">\mathcal{R}_{max}^T</math>. That is why it is the standard approach to fit a sequence of models <math display="inline">\theta^{(t)}</math> by empirically approximating them. Using ERM, for instance, the optimal model is approached by minimizing the loss of the model:

\begin{align}
\theta^{(t)} = arg min_{\theta \in \Theta} \sum_i \ell(\theta; Z_i^{(t)})
\end{align}

However, ERM fails to prevent disparity amplification. By minimizing the expected loss of the model, minority groups experience higher loss (because the loss of the majority group is minimized), and do not return to use the system. In doing so, the population proportions <math display="inline">\alpha_k^{(t)}</math> shift, and certain minority groups contribute even less to the system. This is mirrored in the expected user count <math display="inline">\lambda^{(t)}</math> at each optimization point. In their paper Hashimoto et al. show that, if using ERM, <math display="inline">\lambda^{(t)}</math> is unstable because it loses its fair fixed point (i.e. the population fraction where risk minimization maintains the same population fraction over time). Therefore, ERM fails to control minority risk over time and is considered unfair.

=Distributionally Robust Optimization (DRO)=

To overcome the unfairness of ERM, Hashimoto et al. developed a distributionally robust optimization (DRO). At this point the goal is still to minimize the worst-case group risk over a single time-step <math display="inline">\mathcal{R}_{max} (\theta^{(t)}) </math> (time steps are omitted in this section's formulas). As previously mentioned, this is difficult to do because neither the population proportions <math display="inline">\alpha_k </math> nor group distributions <math display="inline">P_k </math> are known, which means the data was sampled from different unknown groups. Therefore, in order to improve the performance across different groups, Hashimoto et al. developed an optimization technique that is robust "against '''''all''''' directions around the data generating distribution". This refers to the notion that DRO is robust to any group distribution <math display="inline">P_k </math> whose loss other optimization techniques such as ERM might try to optimize. To create this distributionally robustness, the optimizations risk function <math display="inline">\mathcal{R}_{dro} </math> has to "up-weigh" data <math display="inline">Z</math> that cause high loss <math display="inline">\ell(\theta, Z)</math>. In other words, the risk function has to over-represent mixture components (i.e. group distributions <math display="inline">P_k </math>) in relation to their original mixture weights (i.e. the population proportions <math display="inline">\alpha_k </math>) for groups that suffer high loss.

To do this, Hashimoto et al. considered the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> around <math display="inline">P</math> within a certain limit (because not every outlier should be up-weighted). This limit is described by the <math display="inline">\chi^2</math>-divergence (i.e. the distance, roughly speaking) between probability distributions. For two distributions <math display="inline">P</math> and <math display="inline">Q</math> the divergence is defined as <math display="inline">D_{\chi^2} (P || Q):= \int (\frac{dP}{dQ} - 1)^2 dQ</math>. If <math display="inline">P</math> is not absolutely continuous with respect to <math display="inline">Q</math>, then <math display="inline">D_{\chi^2} (P || Q):= \infty</math>. With the help of the <math display="inline">\chi^2</math>-divergence, Hashimoto et al. defined the chi-squared ball <math display="inline">\mathcal{B}(P,r)</math> around the probability distribution <math display="inline">P</math>. This ball is defined as <math display="inline">\mathcal{B}(P,r) := \{Q \ll P : D_{\chi^2} (Q || P) \leq r \}</math>. With the help of this ball, the worst-case loss (i.e. the highest risk) over all perturbations <math display="inline">P_k </math> that lie inside the ball (i.e. within reasonable range) around the probability distribution <math display="inline">P</math> can be considered. This loss is given by

\begin{align}
\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]
\end{align}

For <math display="inline">P:= \sum_{k \in [K]} \alpha_k P_k</math> and all models <math display="inline">\theta \in \Theta</math>, this loss bounds the risk of each group, <math display="inline">\mathcal{R}_k(\theta) \leq \mathcal{R}_{dro} (\theta; r_k)</math>, where <math display="inline">r_k := (1/\alpha_k -1)^2</math>. Furthermore, if a lower bound on the group proportions <math display="inline">\alpha_{min} \leq \min_{k \in [K]} \alpha_k</math> is specified and the radius is defined as <math display="inline">r_{max} := (1/\alpha_{min} -1)^2</math>, then <math display="inline">\mathcal{R}_{dro} (\theta; r_{max}) </math> is an upper bound on the worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math> that can be minimized.

==Optimization of DRO==

To minimize <math display="inline">\mathcal{R}_{dro}(\theta, r) := \underset{Q \in \mathcal{B}(P,r)}{\sup} \mathbb{E}_Q [\ell(\theta;Z)]</math>, Hashimoto et al. look at the dual of this maximization problem (i.e. an equivalent minimization problem over a dual variable). This dual is given by the minimization problem

\begin{align}
\mathcal{R}_{dro}(\theta, r) = \underset{\eta \in \mathbb{R}}{\inf} \left\{ F(\theta; \eta):= C\left(\mathbb{E}_P \left[ [\ell(\theta;Z) - \eta]_+^2 \right] \right)^\frac{1}{2} + \eta \right\}
\end{align}

with <math display="inline">C = (2(1/\alpha_{min} - 1)^2 + 1)^{1/2}</math>. <math display="inline">\eta</math> describes the dual variable (i.e. the variable that appears in creating the dual). Since <math display="inline">F(\theta; \eta)</math> involves an expectation <math display="inline">\mathbb{E}_P</math> over the data generating distribution <math display="inline">P</math>, <math display="inline">F(\theta; \eta)</math> can be directly minimized. For convex losses <math display="inline">\ell(\theta;Z)</math>, <math display="inline">F(\theta; \eta)</math> is convex, and can be minimized by performing a binary search over <math display="inline">\eta</math>. In their paper, Hashimoto et al. further show that optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> at each time step controls the ''future'' worst-case risk <math display="inline">\mathcal{R}_{max} (\theta) </math>, and therefore retention rates. That means if the initial group proportions satisfy <math display="inline">\alpha_k^{(0)} \geq \alpha_{min}</math>, and <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> is optimized for every time step (and therefore <math display="inline">\mathcal{R}_{max} (\theta) </math> is minimized), <math display="inline">\mathcal{R}_{max}^T (\theta) </math> over all time steps is controlled. In other words, optimizing <math display="inline">\mathcal{R}_{dro}(\theta, r_{max})</math> every time step is enough to avoid disparity amplification.
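As an illustration of the dual form, the following minimal sketch evaluates <math display="inline">F(\theta; \eta)</math> on a fixed sample of losses and minimizes it over <math display="inline">\eta</math>. It is not the authors' implementation: the function names are hypothetical, the empirical mean stands in for <math display="inline">\mathbb{E}_P</math>, and a ternary search over a bracket is used in place of the binary search mentioned above (both rely on convexity in <math display="inline">\eta</math>).

```python
import numpy as np


def dro_dual_objective(losses, eta, alpha_min):
    """Empirical F(theta; eta) = C * sqrt(mean([loss - eta]_+^2)) + eta."""
    c = np.sqrt(2.0 * (1.0 / alpha_min - 1.0) ** 2 + 1.0)
    excess = np.maximum(losses - eta, 0.0)  # [loss - eta]_+
    return c * np.sqrt(np.mean(excess ** 2)) + eta


def dro_risk(losses, alpha_min, iters=200):
    """Minimize the dual objective over eta; returns (risk estimate, eta)."""
    losses = np.asarray(losses, dtype=float)
    assert 0.0 < alpha_min < 1.0, "sketch assumes a proper minority bound"
    c_sq = 2.0 * (1.0 / alpha_min - 1.0) ** 2 + 1.0
    # Bracket for the minimizer (an assumption of this sketch):
    # - above max(loss) the objective equals eta, so it is increasing there;
    # - below min(loss) every sample is active and the stationary point is
    #   mean - std / sqrt(C^2 - 1).
    lo = min(losses.min(), losses.mean() - losses.std() / np.sqrt(c_sq - 1.0))
    hi = losses.max()
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        f1 = dro_dual_objective(losses, m1, alpha_min)
        f2 = dro_dual_objective(losses, m2, alpha_min)
        if f1 <= f2:
            hi = m2  # the minimizer lies in [lo, m2]
        else:
            lo = m1  # the minimizer lies in [m1, hi]
    eta = 0.5 * (lo + hi)
    return dro_dual_objective(losses, eta, alpha_min), eta
```

Calling <code>dro_risk(losses, alpha_min)</code> on the per-example losses of a model gives an estimate that is never below the average loss and never above the largest loss, which matches the intuition that DRO interpolates between ERM and a pure worst-case objective.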

Pros of DRO: In many cases, the expected value is a good measure of performance.

Cons of DRO: One has to know the exact form of the underlying distribution to perform the stochastic optimization. Deviation from the assumed distribution may result in sub-optimal solutions. The paper makes strong assumptions on <math>\mathcal{P}</math> with respect to group allocations, and thus requires a large amount of data to optimize; when the assumptions are violated, the algorithm fails to perform as intended.

=Experiments=

The paper demonstrates the effectiveness of DRO both on examples and in a human evaluation of a text autocomplete system on Amazon Mechanical Turk. In both cases, DRO controls the worst-case risk over time steps and improves minority retention.
The figure below shows the inferred dynamics from a Mechanical Turk based evaluation of autocomplete systems. DRO increases minority (a) user satisfaction and (b) retention, leading to a corresponding increase in (c) user count. Error bars indicate bootstrap quartiles.
[[File:fig4999.png|thumb|center|600px|]]

The figure below shows how disparity amplification is corrected by DRO. Error bars indicate quartiles over 10 replicates.
[[File:fig5999.png|thumb|center|400px|]]

The figure below shows classifier accuracy as a function of group imbalance. Dotted lines show accuracy on the majority group.
[[File:fig6999.png|thumb|center|400px|]]

=Critiques=

This paper works on representational disparity, which is a critical problem to contribute to. The methods are well developed and the paper reads coherently. However, the authors make several limiting assumptions that are not very intuitive or scientifically motivated. The first assumption is that the <math display="inline">\eta</math> function denoting the fraction of users retained is a differentiable and strictly decreasing function. This assumption does not seem practical. The second assumption is that the learned parameters follow a Poisson distribution. No explanation is given for this assumption, and the reasons hinted at are hand-wavy at best. Though the authors are building a case against the empirical risk minimization method, this method is exactly solvable when the data is linearly separable. The DRO method is computationally more complex than ERM, and it is not entirely clear whether it will always have an advantage for a different class of problems.

Note: The first assumption about <math>\eta</math> can be weakened by introducing a discrete yet sufficiently smooth function for computational purposes only. Such a function would be enough to mimic differentiability.

=Other Sources=
# [https://blog.acolyer.org/2018/08/17/fairness-without-demographics-in-repeated-loss-minimization/] is an easy-to-read description of the paper.
# [https://vimeo.com/295743125] is a video of the authors explaining the paper at ICML 2018.

=References=
Rawls, J. Justice as fairness: a restatement. Harvard University Press, 2001.

Barocas, S. and Selbst, A. D. Big data’s disparate impact. 104 California Law Review, 3:671–732, 2016.

Chouldechova, A. A study of bias in recidivism prediction instruments. Big Data, pp. 153–163, 2017

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pp. 214–226, 2012.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2018.

Hébert-Johnson, Ú., Kim, M. P., Reingold, O., and Rothblum, G. N. Calibration for the (computationally identifiable) masses. arXiv preprint arXiv:1711.08513, 2017.

Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Rawlsian fairness for machine learning. In FATML, 2016.

Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1617–1626, 2017.

Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.

Visual Reinforcement Learning with Imagined Goals

2018-12-01T03:33:16Z

J385chen: /* Variational Autoencoder */

Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus are able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent could also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract (self-generated) goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

The algorithm proposed by the authors is summarized below. A variational autoencoder (VAE) (left) learns a latent representation of images gathered during training time (center). These latent variables are used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).

[[File: WF_Sec_11Nov25_01.png |center| 800px]]

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviors such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some previous works such as Levine et al. [11] proposed time-varying models which require episodic setups and thus are hard to generalize to non-episodic and continuous learning scenarios. Other works such as Pinto et al. [12] proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training, but do not learn goal-conditioned skills. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment. Appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state [https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly].) The authors' experiments indicate that this technique is difficult to extend to the goal-conditioned setting with image inputs. There are currently no examples of model-free reinforcement learning being used to train policies on real-world robotic systems without ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting them to states seen along the sampled trajectory as was done in earlier works. Specifically, the authors use a model-free Q-learning method that operates on raw state observations and actions. This approach allows a single transition tuple to be converted into potentially infinitely many valid training examples.

Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state for the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math> that, when given a state <math>s_t</math> and goal <math>g</math> (desired state), can dictate the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is defined as an action which maximizes the expected return denoted by <math>R_t</math> and defined as <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> is the reward for performing action <math>a_i</math> when the current state is <math>s_i</math> and the next state is <math>s_{i+1}</math>, and <math>\gamma</math> is a discount factor which determines the relative importance given to rewards at different times.
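To make the return concrete, here is a small sketch that computes the discounted sum for one sampled trajectory; the expectation in <math>R_t</math> would be an average of this quantity over many trajectories. The function name is hypothetical, and the list is assumed to hold <math>r_t, \dots, r_T</math> in order.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**(i - t) * r_i over a single trajectory r_t, ..., r_T."""
    total = 0.0
    # Work backwards using the recursion R_i = r_i + gamma * R_{i+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With <math>\gamma</math> close to 0 only the first reward matters, while <math>\gamma = 1</math> weighs all rewards equally.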

In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Suppose we let an autonomous agent explore an environment with a random policy. After executing each action, start and stop state observations are collected and stored. All state observations are images. For training, the agent can randomly select starting states and goal images from the set of state observations.

Moreover, if we aim to accomplish a variety of tasks, we can construct a goal-conditioned policy and reward, and optimize the expected return with respect to a goal distribution

<center><math>E_{g \sim G}[E_{r_i,s_i \sim E, a_i \sim \pi}[R_0]]</math></center>

where <math>G</math> is the set of goals and the reward is also a function of <math>g</math>.

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that a chosen value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to the goal state.

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization

<div align="center">
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math>
</div>

which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
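For a small discrete problem, "choose the best action according to this Q-function" can be sketched as an argmax over a lookup table. The tabular layout <code>q_table[state, action, goal]</code> is an assumption made for illustration; the paper works with image inputs and function approximation instead.

```python
import numpy as np


def greedy_action(q_table, state, goal):
    """Return the action index with the highest Q(s, a, g).

    q_table is assumed to be an array indexed as [state, action, goal].
    """
    return int(np.argmax(q_table[state, :, goal]))
```

The same table yields a different action for each goal, which is what makes the resulting policy goal-conditioned.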

The reason why Q-learning is popular is that it can be trained in an off-policy manner. Therefore, the only things a Q-function needs are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:

<div align="center">
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math>
</div>

where <math>\overline{w}</math> is treated as some constant.
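The Bellman error for a single transition can be sketched as follows. As a simplifying assumption, both <math>Q_w</math> and the frozen copy <math>Q_{\overline{w}}</math> are represented as lookup tables indexed as <code>[state, action, goal]</code>; the function name is hypothetical.

```python
import numpy as np


def bellman_error(q, q_target, s, a, s_next, g, r, gamma):
    """0.5 * (Q_w(s, a, g) - (r + gamma * max_a' Q_wbar(s', a', g)))**2.

    q        -- table being trained, indexed as [state, action, goal]
    q_target -- frozen copy of the table, treated as a constant
    """
    target = r + gamma * np.max(q_target[s_next, :, g])
    return 0.5 * (q[s, a, g] - target) ** 2
```

Keeping <code>q_target</code> fixed means that a gradient step only moves <math>Q_w(s,a,g)</math> toward the target, not the target itself.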

The main drawback of this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data were available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data, <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated given a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified as follows:

[[File:qlr.png|center|600px]]

This goal resampling makes it possible to learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns, as the task of generating goal images is fairly intensive.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. But images are noisy and a large amount of information in an image may not be related to the object we analyze. Thus, the distance between the two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

Retroactively generating goals is also explored in tabular domains in [15] and in continuous domains in [14] using hindsight experience replay (HER). However, HER is limited to sampling goals seen along a trajectory, which greatly limits the number and diversity of goals with which one can relabel a given transition.
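The relabeling step described in this section can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: <code>sample_goal</code> stands for any goal distribution <math>p(g)</math>, <code>reward_fn</code> for an accessible reward function, and the reward is evaluated on the next state of each transition.

```python
import random


def relabel(transitions, sample_goal, reward_fn, rng):
    """Replace the goal of every (s, a, s_next, g, r) tuple and recompute r.

    sample_goal(rng) -- hypothetical sampler for the goal distribution p(g)
    reward_fn(s, g)  -- hypothetical reward, evaluated here on s_next
    """
    relabeled = []
    for s, a, s_next, _old_goal, _old_reward in transitions:
        g = sample_goal(rng)
        relabeled.append((s, a, s_next, g, reward_fn(s_next, g)))
    return relabeled
```

Because the environment transition <math>(s,a,s')</math> is unchanged, calling <code>relabel</code> repeatedly with different sampled goals produces many training tuples from one real interaction.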

=Variational Autoencoder=
Variational autoencoders can learn structured latent representations of high dimensional data. A VAE contains an encoder <math>q_\phi</math> and a decoder <math>p_\psi</math>. The former maps states to latent distributions, while the latter maps latents to distributions over states. These two are jointly trained to maximize:

<math>\mathcal{L}(\psi,\phi;s^{(i)})=-\beta D_{KL}(q_\phi(z|s^{(i)}) \,||\, p(z))+\mathbb{E}_{q_\phi(z|s^{(i)})}[\log p_\psi(s^{(i)}|z)]</math>

where p(z) is a prior distribution, which is chosen to be unit Gaussian, <math>D_{KL}</math> is the Kullback-Leibler divergence, and <math>\beta</math> is a hyper-parameter that balances the two terms.
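A minimal numerical sketch of this objective for one data point is given below. Beyond the unit Gaussian prior stated above, it assumes a diagonal Gaussian encoder with mean <code>mu</code> and log-variance <code>log_var</code>, a unit-variance Gaussian decoder (so the log-likelihood is a squared error up to an additive constant), and a single-sample estimate of the expectation. These choices and all names are illustrative.

```python
import numpy as np


def gaussian_kl_to_unit(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)


def beta_vae_objective(s, s_recon, mu, log_var, beta):
    """-beta * KL + log-likelihood term, to be maximized.

    s_recon is the decoder mean for one latent sample z ~ q(z | s).
    The log-likelihood omits its additive constant.
    """
    log_lik = -0.5 * np.sum((s - s_recon) ** 2)
    return -beta * gaussian_kl_to_unit(mu, log_var) + log_lik
```

A larger <math>\beta</math> pulls the encoder output toward the prior, while a smaller one favors reconstruction quality.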

This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and uses distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from the generative model and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE prior is trained by real images, meaningful latent goals can be sampled from the latent variable prior. This will help the agent set its own goals and practice towards them if no goal is provided at test time.

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]

The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process starts with collecting data through a simple exploration policy. Possible alternative explorations could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows for alternate exploration policies to be used which include off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and finetune it over the course of training. VAE latent space modeling is used to allow the conditioning of policy on the goal which is sampled from the latent model. The VAE model is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample prior goals and compute rewards in the latent space using the equation

<center><math display="inline"> r(s, g) = - || z - z_g ||_A \propto \sqrt{\log(e_{\Phi}(z_g | s))} </math></center>

This equation is derived from the equation below. This is based on the choice to use the negative Mahalanobis distance in the latent space for the reward:

<center><math display="inline"> r(s, g) = - || e(s) - e(g) ||_A = - || z - z_g ||_A </math></center>
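The latent-space reward can be sketched in a few lines, using <math>||x||_A = \sqrt{x^\top A x}</math>. Here <code>z</code> and <code>z_g</code> are assumed to be the already-encoded state and goal, and <code>A</code> is assumed to be a symmetric positive semi-definite matrix supplied by the caller; the function name is hypothetical.

```python
import numpy as np


def latent_reward(z, z_g, A):
    """Negative Mahalanobis distance -||z - z_g||_A between latent codes."""
    diff = z - z_g
    return -float(np.sqrt(diff @ A @ diff))
```

With <code>A</code> set to the identity matrix this reduces to the negative Euclidean distance in the latent space.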

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

The figure below shows the performance of the different algorithms on these tasks. The experiments use a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocities. The plots show the distance to the goal state as a function of simulation steps. The Oracle, as a baseline, was given true object location information, as opposed to visual pixel information.

[[File:WF_Sec_11Nov_25_02.png|1000px]]

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that the latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies were also varied while fixing the other components of the algorithm to study the effect of the relabeling strategy. In this experiment, RIG, which uses an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.

Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true object position during training, they report test episode returns measured with the VAE latent distance reward. As learning proceeds, RIG makes steady progress at optimizing the latent distance.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. with a procedure that is not only goal-oriented but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to create general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A more recent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them, abbreviated ARC (actionable representations for control).

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the robotic arm's position is not stable during training and test time. It is likely that the encoder reduces the image features too much, so that the images in the latent space are too blurry to be used as goal images. It would be better if this could be investigated in the future. It would also be worth investigating a method with multiple data sources, in which the agent is trained to choose the source that has more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with the relative positions of the two pucks in the goal images. The same situation is also observed in the Variable-object experiment. We may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

14. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob Mcgrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In
Advances in Neural Information Processing Systems (NIPS) 2017.

15. L P Kaelbling. Learning to achieve goals. In IJCAI-93. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, volume vol.2, pages 1094 – 8, 1993.

learn what not to learn

2018-12-01T03:06:11Z

J385chen: /* Reference */

=Introduction=

In reinforcement learning, it is often difficult for an agent to learn when the action space is large, mainly because of difficulties in function approximation and exploration. Some previous work has used Monte Carlo Tree Search (MCTS) to help address this problem. MCTS is a heuristic search algorithm that gives an indication of how good an action is, and it works relatively well in problems where the action space is large (like the one in this paper). A famous example is Google's AlphaGo, which defeated the world champion in 2016 and uses MCTS in its reinforcement learning algorithm for the board game Go. When the action space is large, many actions are often irrelevant, and it is sometimes easier for the algorithm to learn which actions not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces based on action elimination, i.e., restricting the available actions in each state to a subset of the most likely ones. The core assumption of the proposed method is that it is easier to predict which actions in each state are invalid or inferior, and to use that information for control. More specifically, it proposes a system that learns an approximation of the Q-function and concurrently learns to eliminate actions. The method utilizes an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). A machine learning model can then be trained on this feedback to generalize to unseen states.

The paper focuses on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a Deep Q-Network (DQN) and an Action Elimination Network (AEN), both using the Convolutional Neural Networks (CNN) for Natural Language Processing (NLP) tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. The proposed method uses the final layer activations of AEN to build a linear contextual bandit model which allows the elimination of sub-optimal actions with high probability. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''

The elimination framework is tested on the text-based game "Zork", which lets players interact with a virtual world through a text-based interface.
In this game, the player explores an environment by imagining it from the text he/she reads. For more info, you can watch this video: [https://www.youtube.com/watch?v=xzUagi41Wo0 Zork].

By eliminating irrelevant actions, the agent equipped with the AEN learns faster than the baseline agents.

An example of the Zork interface is shown below:

[[File:lnottol_fig1.png|500px|center]]

All states and actions are given in natural language. Input for the game contains more than a thousand possible actions in each state since the player can type anything.

=Related Work=

'''Text-Based Games (TBG):''' The state of the environment in a TBG is described in simple language. The player interacts with the environment through text commands that respect a pre-defined grammar. A popular example is Zork, which is tested in the paper. TBGs sit at a useful intersection of RL and NLP: they require language understanding, long-term memory, planning, exploration, affordance extraction, and common sense. They also often introduce stochastic dynamics to increase randomness. They highlight several open problems for RL, mainly the combinatorial and compositional properties of the action space, and game states that are only partially observable.

'''Representations for TBG:''' Good word representations are necessary in order to learn control policies from high-dimensional complex data such as text. Some previous work on TBG used pre-trained embeddings directly for control, while other work combined pre-trained embeddings with neural networks. For example, He et al. (2015) proposed to use Bag of Words features as input to a neural network, learned separate embeddings for states and actions, and then computed the Q-function from autocorrelations between these embeddings.

'''DRL with linear function approximation:''' DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly because neural networks can learn rich domain representations for the value function and the policy. On the other hand, batch reinforcement learning methods with linear representations are more stable and accurate, but they require feature engineering. A natural attempt at getting the best of both worlds is to learn a linear control policy on top of the representation of the last layer of a DNN. This approach was shown to refine the performance of DQNs and improve exploration. Similarly, for contextual linear bandits, Riquelme et al. showed that a neuro-linear Thompson sampling approach outperformed deep and linear bandit algorithms in practice.

'''RL in Large Action Spaces:''' Prior work concentrated on factorizing the action space into binary subspaces (Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003). Other works proposed to embed the discrete actions into a continuous space and then choose the nearest discrete action according to the optimal action in the continuous space (Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et al. (2015) extended DQN to unbounded (natural language) action spaces.
Learning to eliminate actions was first mentioned by Even-Dar, Mannor, and Mansour (2003), who proposed to learn confidence intervals around the value function in each state. Lipton et al. (2016a) proposed to learn a classifier that detects hazardous states and then use it to shape the reward. Fulda et al. (2017) presented a method for affordance extraction via inner products of pre-trained word embeddings.

=Action Elimination=

The approach in the paper builds on the standard Reinforcement Learning formulation. At each time step <math>t</math>, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then, after action execution, the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and observes next state <math display="inline">s_{t+1} </math> according to a transition kernel <math>P(s_{t+1}|s_t,a_t)</math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted cumulative return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]</math>, where <math> 0< \gamma <1 </math>. The Q-function is <math display="inline">Q^\pi(s,a)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s,a_0=a]</math>, and it can be optimized by Q-learning algorithm.

After executing an action, the agent observes a binary elimination signal <math>e(s, a)</math> that indicates which actions not to take. It equals 1 if action <math>a</math> may be eliminated in state <math>s</math> (and 0 otherwise). The signal helps mitigate the problem of large discrete action spaces. We start with the following definitions:

'''Definition 1:'''

Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.

The set of valid state-action pairs contains all of the state-action pairs that are a part of some optimal policy, i.e., only strictly suboptimal state-actions can be invalid.

'''Definition 2:'''

Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.

'''Definition 3:'''

Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.
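
Definition 3 can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function name <code>ae_q_update</code>, the boolean <code>admissible</code> mask, and the step size <code>alpha</code> are assumptions made for illustration. It only shows the two restrictions in the definition, namely that eliminated pairs are not updated and that the bootstrap maximum is taken over admissible actions in the next state.

```python
import numpy as np

def ae_q_update(Q, admissible, s, a, r, s_next, alpha=0.1, gamma=0.8):
    """One tabular Q-learning step restricted to admissible actions.

    Q          : array of shape (n_states, n_actions)
    admissible : boolean array of the same shape; True means "not eliminated"
    """
    if not admissible[s, a]:
        return Q  # eliminated state-action pairs are never updated
    next_actions = np.flatnonzero(admissible[s_next])
    # Bootstrap only from admissible actions in the next state.
    best_next = Q[s_next, next_actions].max() if next_actions.size else 0.0
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
    return Q

Q = np.zeros((2, 3))
Q[1] = [1.0, 5.0, 2.0]
admissible = np.ones((2, 3), dtype=bool)
admissible[1, 1] = False  # action 1 is eliminated in state 1
ae_q_update(Q, admissible, s=0, a=0, r=-1.0, s_next=1)
```

In this toy call the largest Q-value in state 1 belongs to the eliminated action, so the update bootstraps from the best admissible value instead.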

==Advantages of Action Elimination==

The main advantage of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity.

Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, and this phenomenon becomes more noticeable when the action space is large. Action elimination mitigates this effect by taking the max operator only over valid actions, thus reducing potential overestimation errors. Besides, by ignoring the invalid actions, the function approximator can learn a simpler mapping (i.e., only the Q-values of the valid state-action pairs), leading to faster convergence and a better solution.

Sample complexity: The sample complexity measures the number of steps during learning in which the policy is not <math display="inline">\epsilon</math>-optimal. Assume that there are <math>A'</math> actions that should be eliminated and are <math>\epsilon</math>-optimal, i.e., their value is at least <math>V^*(s)-\epsilon</math>. According to the lower bound of Lattimore and Hutter (2012), at least <math display="inline">\epsilon^{-2}(1-\gamma)^{-3}\log(1/\delta)</math> samples per state-action pair are needed to converge with probability <math display="inline">1-\delta</math>. An invalid action often returns no reward and does not change the state, resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>; substituting this gap into the bound gives <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}\log(1/\delta)</math> wasted samples for learning each invalid state-action pair. In practice, an elimination algorithm can remove these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.

Because it is difficult to embed the elimination signal into the MDP, the authors use contextual multi-armed bandits to decouple the elimination signal from the MDP. This allows actions to be eliminated correctly while standard Q-learning is used for the learning process.

==Action elimination with contextual bandits==

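The contextual bandit problem is a well-known problem in probability and a natural extension of the multi-armed bandit problem: before choosing an arm, the learner observes a context (here, the state features) that it can use to predict the outcome of each arm.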

Let <math display="inline">x(s_t)\in \mathbb{R}^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in \mathbb{R}^d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^{*T}x(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math> and <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise in the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action, with <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> the matrix whose rows are the representation vectors of the states in which action a was chosen, up to time t, and by <math display="inline">E_{t,a}</math> the vector whose elements are the elimination signals observed in those states. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.

According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), with probability of at least <math display="inline">1-\delta</math>, <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}\ \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2\ \text{log}(\text{det}(\bar{V}_{t,a})^{1/2}\text{det}(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>. If <math display="inline">\forall s\ ,\Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{d\ \text{log}\left(\frac{1+tL^2/\lambda}{\delta}\right)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math>, where <math display="inline">k</math> is the number of actions, and bound this probability for all the actions, i.e., <math display="inline">\forall a,t>0</math>

<math display="inline">Pr\left(|\hat{\theta}_{t-1,a}^{T}x(s_t)-\theta_{a}^{*T}x(s_t)|\leq\sqrt{\beta_{t-1}(\tilde\delta)x(s_t)^T\bar{V}_{t-1,a}^{-1}x(s_t)}\right) \geq 1-\delta</math>

Recall that <math display="inline">E[e_t(s_t,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:

<math display="inline">\hat{\theta}_{t-1,a}^{T}x(s_t)-\sqrt{\beta_{t-1}(\tilde\delta)x(s_t)^T\bar{V}_{t-1,a}^{-1}x(s_t)}>l</math>

Under this rule, with probability at least <math display="inline">1-\delta</math> we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.
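
The elimination rule above can be sketched for a single action as follows. This is a hedged illustration under stated assumptions, not the paper's code: the function name <code>eliminate</code> and its default parameter values are hypothetical, and the argument <code>delta</code> plays the role of <math display="inline">\tilde\delta</math>. The sketch computes the ridge-regression estimate, the confidence width from the determinant form of <math display="inline">\beta_t</math>, and then checks whether the lower confidence bound exceeds <math display="inline">l</math>.

```python
import numpy as np

def eliminate(X, E, x, l=0.5, lam=1.0, R=1.0, S=1.0, delta=0.05):
    """Return True if the action should be eliminated in a state with features x.

    X : (n, d) features of the states in which this action was tried
    E : (n,)   elimination signals observed for those tries
    """
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X              # regularized design matrix V-bar
    theta_hat = np.linalg.solve(V, X.T @ E)    # ridge-regression estimate
    # sqrt(beta) = R*sqrt(2*log(det(V)^(1/2) * det(lam*I)^(-1/2) / delta)) + sqrt(lam)*S
    _, logdet_V = np.linalg.slogdet(V)
    log_term = 0.5 * logdet_V - 0.5 * d * np.log(lam) - np.log(delta)
    sqrt_beta = R * np.sqrt(2.0 * log_term) + np.sqrt(lam) * S
    width = sqrt_beta * np.sqrt(x @ np.linalg.solve(V, x))
    # Eliminate only when even the lower confidence bound is above l.
    return bool(theta_hat @ x - width > l)

x = np.array([1.0, 0.0])
X_many = np.tile(x, (99, 1))                   # the action was tried 99 times in this state
decision = eliminate(X_many, np.ones(99), x, R=0.0)  # noise-free signal, always 1
```

With few observations the confidence width is large and the action is kept, which is what makes the rule conservative towards valid actions.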

==Concurrent Learning==
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space.

If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.

'''Proposition 1:'''

Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s)}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2+1</math> times.

Notice that when there is no noise in the elimination signal (R=0), we correctly eliminate actions with probability 1, so invalid actions will be sampled only a finite number of times.

=Method=

The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> generally does not hold when using raw features like word2vec. The paper therefore proposes to use the last layer of a neural network as the feature representation of states. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. For this reason a batch-update framework (Levine et al., 2017; Riquelme, Tucker, and Snoek, 2018) is used, in which a new contextual bandit model is learned every few steps using the last-layer activations of the AEN as features.

==Architecture of action elimination framework==

[[File:lnottol_fig1b.png|300px|center]]

After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent uses these observations to train two deep neural network function approximators: a DQN and an AEN. The AEN provides an admissible action set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and the DQN is an NLP CNN (100 convolutional filters for the AEN and 500 for the DQN, with three different 1D kernels of length (1,2,3)), based on (Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero-padded to a length of 50 descriptor + 15 inventory words, and each word is embedded into a continuous vector in <math display="inline">R^{300}</math> using word2vec. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]
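
The state dimension quoted above follows from (50 + 15) words × 300 embedding dimensions × 4 states = 78000. The sketch below reproduces that arithmetic; the helper names <code>pad_or_truncate</code> and <code>encode_state</code> are hypothetical, and random vectors stand in for word2vec embeddings, so it only illustrates the padding and concatenation, not the authors' preprocessing code.

```python
import numpy as np

EMB_DIM, N_DESC, N_INV, N_HISTORY = 300, 50, 15, 4

def pad_or_truncate(vectors, length):
    """Zero-pad or truncate a (n_words, EMB_DIM) array to exactly `length` rows."""
    out = np.zeros((length, EMB_DIM))
    n = min(len(vectors), length)
    out[:n] = vectors[:n]
    return out

def encode_state(descriptor_vecs, inventory_vecs):
    """One game state: 50 descriptor words followed by 15 inventory words."""
    return np.concatenate([pad_or_truncate(descriptor_vecs, N_DESC),
                           pad_or_truncate(inventory_vecs, N_INV)])

rng = np.random.default_rng(0)
# Random vectors stand in for word2vec embeddings (an assumption for illustration).
history = [encode_state(rng.normal(size=(int(rng.integers(1, 80)), EMB_DIM)),
                        rng.normal(size=(int(rng.integers(1, 20)), EMB_DIM)))
           for _ in range(N_HISTORY)]
state = np.concatenate(history).ravel()
print(state.shape)  # (78000,)
```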

==Pseudocode of the Algorithm==

[[File:lnottol_fig2.png|750px|center]]

AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. Every L iterations, the procedure AENUpdate() creates a linear contextual bandit model from E. This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model. AENUpdate() then solves this model and plugs it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Targets() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible action set. For exploitation, it selects the action with the highest Q-value by taking an argmax over the Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The Targets() procedure estimates the value function by taking the max over Q-values only among admissible actions, hence reducing function approximation errors.
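
The two procedures described above can be sketched with plain arrays. This is an illustrative sketch rather than the pseudocode from the figure: the function names <code>act</code> and <code>targets</code>, the boolean admissibility masks, and the <code>dones</code> flag are assumptions, and the sketch assumes every state has at least one admissible action.

```python
import numpy as np

def act(q_values, admissible, epsilon, rng):
    """Epsilon-greedy choice restricted to the admissible action set A'."""
    candidates = np.flatnonzero(admissible)
    if rng.random() < epsilon:
        return int(rng.choice(candidates))                    # explore uniformly over A'
    return int(candidates[np.argmax(q_values[candidates])])   # exploit within A'

def targets(rewards, next_q, next_admissible, dones, gamma):
    """Q-learning targets with the max taken only over admissible next actions."""
    masked = np.where(next_admissible, next_q, -np.inf)
    return rewards + gamma * (1.0 - dones) * masked.max(axis=1)

rng = np.random.default_rng(0)
q = np.array([1.0, 9.0, 3.0])
mask = np.array([True, False, True])   # action 1 has been eliminated
greedy_action = act(q, mask, epsilon=0.0, rng=rng)
batch_targets = targets(np.array([-1.0]), q[None, :], mask[None, :],
                        np.array([0.0]), gamma=0.8)
```

Although action 1 has the highest Q-value, it never enters the argmax or the target because it is outside the admissible set.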

=Experiments=
==Grid Domain==
The authors start by evaluating their algorithm on a small grid world domain with 9 rooms, where they can analyze the effect of action elimination (a visualization can be found in the appendix). In this domain, the agent starts at the center of the grid and needs to navigate to its upper-left corner. On every step, the agent suffers a penalty of (−1), with a terminal reward of 0. Prior to the game, the states are randomly divided into K categories. The environment has 4K navigation actions, 4 for each category, each with a probability of moving in a random direction. If the chosen action belongs to the same category as the state, the action is performed correctly with probability <math>p_c^T = 0.75</math>. Otherwise, it is performed correctly with probability <math>p_c^F = 0.5</math>. If the action does not fit the state category, the elimination signal equals 1, and if the action and state belong to the same category, then e = 0. The optimal policy only uses the navigation actions of the same category as the state, and all of the other actions are strictly suboptimal. A basic comparison between vanilla Q-learning without action elimination (green) and a tabular version of action elimination Q-learning (blue) can be found in the figure below. In all of the figures, the results are compared to the case with one category (red), i.e., only 4 basic navigation actions, which forms an upper bound on performance with multiple categories. In Figures (a) and (c), the episode length is T = 150, and in Figure (b) it is T = 300, to allow sufficient exploration for vanilla Q-learning. It is clear from the simulations that action elimination dramatically improves the results in large action spaces. Also, note that the gain from action elimination increases with the grid size, since elimination allows the agent to reach the goal earlier.
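
The action and elimination-signal mechanics of this domain can be sketched as below. Several details are assumptions made for illustration and are not specified in the text: actions are indexed so that <code>action // 4</code> gives the category and <code>action % 4</code> the direction, and a "random direction" is drawn uniformly from all four moves (so it may coincide with the intended one).

```python
import random

P_CORRECT_SAME, P_CORRECT_OTHER = 0.75, 0.5
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def elimination_signal(state_category, action):
    """e = 0 if the action's category matches the state's category, else 1."""
    action_category = action // 4   # assumed layout: 4 navigation actions per category
    return 0 if action_category == state_category else 1

def executed_move(state_category, action, rng=random):
    """Intended direction with probability 0.75 or 0.5, otherwise a random direction."""
    matches = elimination_signal(state_category, action) == 0
    p = P_CORRECT_SAME if matches else P_CORRECT_OTHER
    if rng.random() < p:
        return MOVES[action % 4]
    return rng.choice(MOVES)
```

Under these assumptions, an action from the wrong category is both flagged by the elimination signal and less reliable, which is why it is strictly suboptimal.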

[[File:griddomain.png|1200px|thumb|center|Performance of agents in grid world]]
==Zork domain==

The world of Zork presents a rich environment with a large state and action space.
Zork players describe their actions using natural language instructions, for example, "open the mailbox". These actions are then processed by a sophisticated natural language parser, and based on the result, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points generated by the game's scoring system are given to the agent as the reward. For example, the player gets points when solving puzzles. Placing all treasures in the trophy case yields 350 points. The elimination signal is given in two forms: a "wrong parse" flag and text feedback such as "you cannot take that". These two signals are grouped together into a single binary signal, which is then provided to the algorithm.
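
Grouping the two forms of feedback into one binary signal could look like the following minimal sketch. The function name and the phrase list are hypothetical; only the quoted phrase "you cannot take that" comes from the text, and the actual implementation in the authors' repository may differ.

```python
# Only the first phrase is quoted in the text; a real list would likely be longer.
REJECTION_PHRASES = ("you cannot take that",)

def elimination_signal(wrong_parse: bool, feedback: str) -> int:
    """Merge the parser flag and the text feedback into one binary signal."""
    rejected = any(phrase in feedback.lower() for phrase in REJECTION_PHRASES)
    return int(wrong_parse or rejected)
```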

[[File:zork_domain.png|1200px|thumb|center|Left: the world of Zork. Right: subdomains of Zork.]]

Experiments begin with two subdomains of Zork: the Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and to make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates upon completing the quest or after T steps are taken. The discount factor is <math display="inline">\gamma=0.8</math> during training and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments.

===Egg Quest===

The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. An egg-splorer goes on an adventure to find a mystical ancient relic with his furry companion. You can have a look at the game at [https://scratch.mit.edu/projects/212838126/ EggQuest]

The agent gets 100 points upon completing this task. The action space consists of 9 fixed actions for navigation and a second subset of <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200</math> (set <math display="inline">A_1</math>) and <math display="inline">N_{Take}=300</math> (set <math display="inline">A_2</math>) are tested separately.
AE-DQN (blue) and a vanilla DQN agent (green) are tested in this quest.

[[File:AEF_zork_comparison.png|1200px|thumb|center|Performance of agents in the egg quest.]]

Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents perform well in settings a and c. However, the AE-DQN agent learns much faster than the DQN in setting b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.

===Troll Quest===

The goal of this quest is to find the troll. To do so, the agent needs to find the way to the house and use a lantern to expose the hidden entrance to the underworld. It gets 100 points upon achieving the goal. This quest is a larger problem than the Egg Quest. The action set <math display="inline">A_1</math> consists of 200 take actions and 15 necessary actions, 215 in total.

[[File:AEF_troll_comparison.png|400px|thumb|center|Results in the Troll Quest.]]

The red line above is an "optimal elimination" baseline which consists of only 35 actions (15 essential and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and that its improvement over DQN is more significant in the Troll Quest than in the Egg Quest. Also, it achieves performance comparable to the "optimal elimination" baseline.

===Open Zork===

Lastly, the "Open Zork" domain is tested, in which only the environment reward is used. The agents are trained for 1M steps, and each trajectory terminates after T=200 steps. Two action sets are used: <math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) required to solve the game, and <math display="inline">A_4</math>, the "Open Zork" action set (1227), which is composed of {Verb, Object} tuples for all the verbs and objects in the game.


[[File:AEF_open_zork_comparison.png|600px|thumb|center|Results in "Open Zork".]]

The above figure shows the learning curves for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperforms the DQN in terms of speed and cumulative reward.

=Conclusion=
In this paper, the authors proposed a Deep Reinforcement Learning model that eliminates sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.

For future work the authors aim to investigate more sophisticated architectures and tackle learning shared representations for elimination and control which may boost performance on both tasks.

They also hope to investigate other mechanisms for action elimination, such as eliminating actions that result from low Q-values, as in Even-Dar, Mannor, and Mansour (2003).

The authors also hope to generate elimination signals in real-world domains and achieve the purpose of eliminating the signal implicitly.

=Critique=
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the very famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO.

Even with the critique above, the paper presents mathematical/theoretical justifications of the methodology. Moreover, since the methodology is built on the standard RL framework, other RL algorithm variants can apply the idea to decrease complexity and increase performance. There is also room for technical variations on the algorithm.

Also, since the method relies on the system's response to irrelevant actions, an intuitive way to eliminate such actions is to assign them a large negative reward, which would be much easier than the approach suggested by this paper. However, in the experiments the authors only compare AE-DQN to a traditional DQN, not to a traditional DQN with negative rewards assigned to irrelevant actions.

After all, the name that the authors have chosen is a good and attractive choice and matches our brain's structure which in so many real-world scenarios detects what not to learn.

=Reference=
1. Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

2. Côté, M.-A.; Kádár, Á.; Yuan, X.; Kybartas, B.; Barnes, T.; Fine, E.; Moore, J.; Hausknecht, M.; Asri, L. E.; Adada, M.; et al. 2018. Textworld: A learning environment for text-based games. arXiv.

3. Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; and Coppin, B. 2015. Deep reinforcement learning in large discrete action spaces. arXiv.

4. He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; and Ostendorf, M. 2015. Deep reinforcement learning with an unbounded action space. CoRR abs/1511.04636.

5. Kim, Y. 2014. Convolutional neural networks for sentence classification. [https://arxiv.org/abs/1408.5882 arXiv preprint].

6. Van Hasselt, H., and Wiering, M. A. 2009. Using continuous action spaces to solve discrete problems. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, 1149–1156. IEEE.

7. Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine learning 8(3-4):279–292.

8. Su, P.-H.; Gasic, M.; Mrksic, N.; Rojas-Barahona, L.; Ultes, S.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. Continuously learning neural dialogue management. arXiv preprint.

9. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint.

10. Yuan, X.; Côté, M.-A.; Sordoni, A.; Laroche, R.; Combes, R. T. d.; Hausknecht, M.; and Trischler, A. 2018. Counting to explore and generalize in text-based games. arXiv preprint arXiv:1806.1152

11. Zahavy, T.; Haroush, M.; Merlis, N.; Mankowitz, D. J.; 2018. Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning.

DeepVO Towards end to end visual odometry with deep RNN

2018-12-01T02:32:47Z

J385chen: /* Training and Optimization */

== Introduction ==
Visual Odometry (VO) is a computer vision technique for estimating an object’s position and orientation from camera images. It is an important technique commonly used for “pose estimation and robot localization” with notable applications in Mars Exploration Rovers and Autonomous Vehicles [x1] [x2]. While the research field of VO is broad, this paper focuses on the topic of monocular visual odometry. In particular, the authors examine prominent VO methods and argue that mainstream geometry-based monocular VO methods should be amended with deep learning approaches. Deep Learning (DL) has recently achieved promising results in computer vision tasks, but these successes have not yet extended to the VO field. The paper therefore proposes a novel deep-learning-based end-to-end VO algorithm and empirically demonstrates its viability.

== Related Work ==

Visual odometry algorithms can be grouped into two main categories. The first is known as the conventional methods, and they are based on established principles of geometry. Specifically, an object’s position and orientation (pose) are obtained by identifying reference points and calculating how those points change over the image sequence. Algorithms in this category can be further divided into two: sparse feature based methods and direct methods, which differ in the method employed to select reference points. Sparse feature based methods establish reference points using image salient features such as corners and edges [8]. Direct methods, on the other hand, make use of the whole image and consider every pixel as a reference point [11]. Recently, semi-direct methods that combine the benefits of both approaches are gaining popularity [16].

Today, most state-of-the-art VO algorithms belong to the geometry family. However, they suffer from significant limitations. For example, direct methods assume “photometric consistency” [11], and sparse feature based methods are prone to “drifting” because of outliers and noise. As a result, the paper argues that geometry-based methods are difficult to engineer and calibrate, which limits their practicality. Figure 1 illustrates the general architecture of geometry-based algorithms and outlines the components they require, such as Camera Calibration, Feature Detection, Feature Matching (tracking), Outlier Rejection, Motion Estimation, Scale Estimation, and Local Optimization (bundle adjustment).

[[File:DeepVO_Figure_1.png | center]]

<div align="center">Figure 1. Architectures of the conventional geometry-based monocular VO method.</div>

The second category of VO algorithms is based on learning. Namely, these algorithms try to learn an object’s motion model from labeled optical flows. Initially, such models were trained using classic Machine Learning techniques such as k-nearest neighbors (KNN) [15], Gaussian Processes [16], and Support Vector Machines [17]. However, these models handled highly non-linear and high-dimensional inputs poorly, leading to worse performance than geometry-based methods. For this reason, Deep Learning-based approaches now dominate research in this field and are producing many promising results. For example, CNN-based models can now recognize places based on appearance [18] and detect direction and velocity from stereo inputs [20]. Moreover, a deep learning model has even achieved robust VO with blurred and under-exposed images [21]. While these successes are encouraging, the authors observe that a CNN-based architecture is “incapable of modeling sequential information.” Instead, they propose using an RNN to tackle this problem.

== End-to-End Visual Odometry through RCNN ==

=== Architecture Overview ===
An end-to-end monocular VO model is proposed that uses a deep Recurrent Convolutional Neural Network (RCNN). Figure 2 depicts the end-to-end model, which consists of three main stages. First, the model takes a monocular video as input and pre-processes the image sequence by “subtracting the mean RGB values of all frames” from each frame. Then, consecutive images are stacked to form tensors, which become the inputs to the CNN stage. The purpose of the CNN stage is to extract salient features from the image tensors. The structure of the CNN is inspired by FlowNet [24], a model designed to extract optical flow. Details of the CNN structure are shown in Table 1. Using the CNN features as input, the RNN stage estimates the temporal and sequential relations among the features. It does this with two stacked Long Short-Term Memory (LSTM) layers, which estimate the pose at each time step using both long-term and short-term dependencies. Figure 3 illustrates the RNN architecture.
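The pre-processing step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name <code>preprocess</code> is hypothetical, the mean is taken per colour channel over all frames and pixels, and each frame is stacked with its immediate successor along the channel axis. The exact tensor layout used by the authors may differ.

```python
import numpy as np

def preprocess(frames):
    """Mean-subtract a video and stack consecutive frames.

    frames: array of shape (T, H, W, 3) holding T RGB images.
    Returns an array of shape (T - 1, H, W, 6).
    """
    frames = frames.astype(np.float32)
    # One mean value per colour channel, computed over all frames and pixels.
    mean_rgb = frames.mean(axis=(0, 1, 2))
    centered = frames - mean_rgb
    # Pair frame k with frame k + 1 along the channel axis (assumed layout).
    return np.concatenate([centered[:-1], centered[1:]], axis=-1)

# Toy video: 5 random frames of size 4 x 6.
video = np.random.default_rng(0).integers(0, 256, size=(5, 4, 6, 3))
pairs = preprocess(video)
print(pairs.shape)
```

Each of the resulting 6-channel tensors would then be passed to the CNN stage.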

Without the LSTM framework, RNNs often suffer from vanishing or exploding gradients. When a small gradient is propagated backward through many time steps, it often becomes too small to have any effect on the weights. This forces standard RNN architectures to remain relatively shallow in time. In other words, recent events have a much larger effect on the weight updates than events that happened long ago. Visual odometry is a complex problem that requires learning highly complex functions, so LSTM units are used to circumvent the vanishing gradient issue. An LSTM defines three gates (forget gate, input gate, and output gate) that help it capture long-term dependencies. An LSTM can therefore handle long-term dependencies and has a deep temporal structure, but it still needs depth across network layers to learn complex high-level representations. Deep RNNs have been shown to perform well on complex dynamic representations (e.g. speech recognition), so the model stacks multiple LSTM layers, mitigating vanishing gradients without losing the network's ability to represent complex dynamics.
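To make the gating mechanism concrete, the following is a minimal sketch of a single time step of a standard LSTM cell in NumPy. It uses the textbook formulation with forget, input, and output gates; the weight layout (one matrix <code>W</code> holding all four blocks) and the function names are assumptions made for illustration, and the variant used in the paper may differ in detail.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of a standard LSTM cell.

    x: input of size D; h_prev, c_prev: previous hidden and cell state of size H.
    W: weights of shape (4 * H, D + H); b: bias of shape (4 * H,).
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0:H])          # forget gate: how much of c_prev to keep
    i = sigmoid(z[H:2 * H])      # input gate: how much new content to write
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate cell content
    c = f * c_prev + i * g       # additive update eases gradient flow over time
    h = o * np.tanh(c)
    return h, c
```

Because the cell state is updated additively and scaled only by the forget gate, gradients can flow across many time steps without being repeatedly squashed, which is the property the paragraph above relies on.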

[[File:DeepVO_Figure_2.png | center]]
<div align="center">Figure 2. Architectures of the proposed RCNN based monocular VO system.</div>

[[File:DeepVO_Table_1.png | center]]
<div align="center">Table 1. CNN structure</div>

[[File:DeepVO_Figure_3.png | center]]
<div align="center">Figure 3. Folded and unfolded LSTMs and their internal structure.</div>

=== Training and Optimization ===
The proposed RCNN model can be represented as a conditional probability of poses given an image sequence:

<math>
p(Y_{t}|X_{t}) = p(y_{1},...,y_{t}|x_{1},...,x_{t})
</math>

This probability function is modeled by a deep RCNN.
To find the optimal parameters <math>\theta^{*}</math>, the DNN maximizes:

<math>
\theta^{*}=\underset{\theta}{\operatorname{argmax}}\; p(Y_{t}|X_{t};\theta)
</math>

To learn the parameters <math>\theta</math> of the DNN, the Euclidean distance between the ground truth pose <math>(p_k,\varphi_k)</math> at time k and its estimate <math>(\hat{p}_k,\hat{\varphi}_k)</math> is minimized. The loss function, composed of the Mean Squared Error (MSE) of all positions p and orientations <math>\varphi</math>, is:

<math>
\theta^{*}=\underset{\theta}{\operatorname{argmin}}\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{t}\|\hat{p}_{k}-p_{k}\|_{2}^{2}+\kappa\|\hat{\varphi}_{k}-\varphi_{k}\|_{2}^{2}
</math>

where <math>\|\cdot\|_{2}</math> is the <math>L_{2}</math> norm, <math>\kappa</math> (100 in the experiments) is a scale factor that balances the weights of positions and orientations, N is the number of samples, and the orientation <math>\varphi</math> is represented by Euler angles.
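As a rough illustration of this loss, the sketch below evaluates it with NumPy for a batch of pose sequences. The function name <code>pose_loss</code> and the array layout (N samples, t time steps, 3 components each for position and Euler angles) are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def pose_loss(p_hat, p, phi_hat, phi, kappa=100.0):
    """Position error plus kappa-weighted orientation error, averaged over samples.

    All arrays have shape (N, t, 3): N samples, t time steps,
    3 position components or 3 Euler angles.
    """
    pos_err = np.sum((p_hat - p) ** 2, axis=-1)      # squared L2 norm per step
    ori_err = np.sum((phi_hat - phi) ** 2, axis=-1)  # squared L2 norm per step
    return np.sum(pos_err + kappa * ori_err) / p.shape[0]

# Toy example: every position is off by 1 in each coordinate.
p = np.zeros((2, 3, 3))
phi = np.zeros((2, 3, 3))
print(pose_loss(p + 1.0, p, phi, phi))
```

With <math>\kappa = 100</math>, an orientation error of 0.1 rad in one angle costs as much as a position error of 1 in one coordinate, which shows how the scale factor balances the two terms.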

== Experiments and Results ==
The paper evaluates the proposed RCNN VO model by comparing it empirically with the open-source VO library of LIBVISO2 [7], which is a well-known geometry based model. The comparison is done using the KITTI VO/SLAM benchmark [3], which contains 22 image sequences, 11 of which are labeled with ground truths. Two separate experiments are performed.

1. Quantitative analysis is performed using only the labeled image sequences. Namely, 4 of the 11 image sequences were used for training and the others were reserved for testing. Table 2 and Figure 6 outline the results, showing that the proposed RCNN model performs consistently better than the monocular VISO2_M model. However, it performs worse than the stereo VISO2_S model.

[[File:DeepVO_Table_2.png |500px| center]]

[[File:DeepVO_Figure_6.png |500px| center]]

2. The generalizability of the proposed RCNN model is evaluated using the unlabeled image sequences. Figure 8 outlines the test result, showing that the proposed model is able to generalize better than the monocular VISO2_M model and performs roughly the same as the stereo VISO2_S model.

[[File:DeepVO_Figure_8.png |600px| center]]

== Conclusions ==
The paper presents a new RCNN VO model that combines CNNs with RNNs, so that it performs representation learning and sequential modelling of monocular VO at the same time. Although the approach is viable, it is not expected to replace the classic geometry-based approach. However, the experimental results suggest that it can be a viable complement: combining geometry with the representations, knowledge, and models learnt by DNNs could further improve the accuracy and robustness of VO. The main contribution of the paper is threefold:

# The authors demonstrate that the monocular VO problem can be addressed in an end-to-end fashion based on DL, i.e., by directly estimating poses from raw RGB images. Neither prior knowledge nor extra parameters are needed to recover the absolute scale.
# The authors propose an RCNN architecture that enables the DL-based VO algorithm to generalise to totally new environments by using the geometric feature representation learnt by the CNN.
# Sequential dependence and complex motion dynamics of an image sequence, which are important to VO but cannot be explicitly or easily modelled by humans, are implicitly encapsulated and automatically learnt by the RCNN.

== Critiques ==

This paper cannot be considered a critical advance in the state of the art, as the authors merely suggest a method that combines CNNs and RNNs for the visual odometry problem. The authors themselves state that deep learning, in the form of simple feed-forward neural networks and CNNs, has already been applied to this problem; only an RNN approach seems not to have been tried. The authors propose a combined RCNN and geometry-based approach towards the end of the paper, but it is not intuitive how these two potentially very different methods could be combined, and the authors do not describe any method for combining them. The authors do not build a compelling case against the state-of-the-art methods, nor do they convincingly prove the superiority of the RCNN or of a combined method. For example, as the authors mention, both the RCNN and state-of-the-art geometry-based methods lose accuracy when the images show a large open area. The authors put forth some techniques to solve this problem for the geometry-based approaches, but they state that they have no similar method for the deep-learning-based approach. Thus, in such scenarios, the proposed method does not seem to work at all.

The paper advances the field of deep-learning-based VO by creating a pioneering end-to-end model that is capable of extracting features and learning sequential dynamics from monocular videos. While the new model clearly outperforms the LIBVISO2_M algorithm, it fails to demonstrate any advantage over the LIBVISO2_S algorithm. Hence, one may question whether the complexity of deep-learning-based monocular VO methods is justified, and whether designers of robots and autonomous vehicles should opt for stereo vision wherever possible. Nonetheless, this end-to-end model is beneficial in situations where monocular VO is the only viable option. Furthermore, the paper would have benefited from a qualitative comparison of the algorithms’ computational requirements, such as hardware specification, engineering time, and training time. The justification for pre-processing the input sequence is not fully explained, but it can be attributed to the use of standard pre-processing techniques such as mean subtraction and normalization, which make the cost function easier to optimize. Future work could involve adapting the model for real-time visual odometry.

== References ==
[1] S. Wang, R. Clark, H. Wen and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 2043-2050.

[2] M. Maimone, Y. Cheng, and L. Matthies, "Two years of Visual Odometry on the Mars Exploration Rovers," Journal of Field Robotics. 24 (3): 169–186, 2007.

[3] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[7] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3D reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011.

[8] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.

[11] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2320–2327.

[15] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch, “Memory-based learning for visual odometry,” in Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 47–52.

[16] V. Guizilini and F. Ramos, “Semi-parametric learning for visual odometry,” The International Journal of Robotics Research, vol. 32, no. 5, pp. 526–546, 2013.

[17] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci, “Evaluation of non-geometric methods for visual odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014.

[18] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” in Proceedings of Robotics: Science and Systems (RSS), 2015.

[20] A. Kendall, M. Grimes, and R. Cipolla, “Convolutional networks for real-time 6-DoF camera relocalization,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.

[21] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring representation learning with CNNs for frame-to-frame ego-motion estimation,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp.18–25, 2016.

[24] A. Dosovitskiy, P. Fischer, E. Ilg, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox et al., “Flownet: Learning optical flow with convolutional networks,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, pp. 2758–2766.

[25] http://cs231n.github.io/neural-networks-2/

Synthesizing Programs for Images using Reinforced Adversarial Learning

2018-12-01T02:15:43Z

J385chen: /* Environments */

'''Synthesizing Programs for Images using Reinforced Adversarial Learning: ''' Summary of the ICML 2018 paper

Paper: [[http://proceedings.mlr.press/v80/ganin18a.html]]
Video: [[https://www.youtube.com/watch?v=iSyvwAwa7vk&feature=youtu.be]]

== Presented by ==

1. Nekoei, Hadi [Quest ID: 20727088]

= Motivation =

Conventional neural generative models have major problems.

* It is not clear how to inject knowledge about the data into the model.

* The latent space is not easily interpretable.

The solution proposed in this paper is to generate programs that drive existing tools (e.g. graphics editors, illustration software, CAD), '''creating a more meaningful API (sequences of complex actions rather than raw pixels)'''.

= Introduction =

Humans frequently rely on the ability to recover structured representations from raw sensations in order to understand their environment, for example when decomposing a picture of a hand-written character into strokes or understanding the layout of a building. Studying this ability can also help us learn how the brain actually works.

In the visual domain, inversion of a renderer for the purposes of scene understanding is typically referred to as inverse graphics. However, training vision systems using the inverse graphics approach has remained a challenge. Renderers typically expect as input programs that have sequential semantics, are composed of discrete symbols (e.g., keystrokes in a CAD program), and are long (tens or hundreds of symbols). Additionally, matching rendered images to real data poses an optimization problem as black-box graphics simulators are not differentiable in general.

To address these problems, a new approach is presented for interpreting and generating images using Deep Reinforced Adversarial Learning, with the aim of reducing the need for large amounts of supervision and of scaling to larger real-world datasets. In this approach, an adversarially trained agent '''(SPIRAL)''' generates a program which is executed by a graphics engine to generate images, either conditioned on data or unconditionally. The agent is rewarded for fooling a discriminator network and is trained with distributed reinforcement learning without any extra supervision. The discriminator network itself is trained to distinguish between generated and real images.

[[File:Fig1 SPIRAL.PNG | 400px|center]]

== Related Work ==
Related work in this field is summarized as follows:
* There has been a large number of studies on inverting simulators to interpret images (Nair et al., 2008; Paysan et al., 2009; Mansinghka et al., 2013; Loper & Black, 2014; Kulkarni et al., 2015a; Jampani et al., 2015)

* Inferring motor programs for reconstruction of MNIST digits (Nair & Hinton, 2006)

* Visual program induction in the context of hand-written characters on the OMNIGLOT dataset (Lake et al., 2015)

* Inferring and learning feed-forward or recurrent procedures for image generation (LeCun et al., 2015; Hinton & Salakhutdinov, 2006; Goodfellow et al., 2014; Ackley et al., 1987; Kingma & Welling, 2013; Oord et al., 2016; Kulkarni et al., 2015b; Eslami et al., 2016; Reed et al., 2017; Gregor et al., 2015).

'''However, all of these methods have limitations such as:'''

* Difficulty scaling to larger real-world datasets

* Requiring hand-crafted parses and supervision in the form of sketches and corresponding images

* Lacking the ability to infer structured representations of images

= The SPIRAL Agent =
== Overview ==
The paper aims to construct a generative model <math>\mathbf{G}</math> that generates samples from a target distribution <math>p_{d}</math>. The generative model consists of a recurrent network <math>\pi</math> (called the policy network or agent) and an external rendering simulator R that accepts a sequence of commands from the agent and maps them into the domain of interest, e.g. R could be a CAD program rendering descriptions of primitives into 3D scenes.
To train the policy network <math>\pi</math>, the paper uses the generative adversarial network (GAN) framework. In this framework, the generator tries to fool a discriminator network, which is trained to distinguish between real and fake samples. As a result, the distribution generated by <math>\mathbf{G}</math> approaches <math>p_d</math>.

== Objectives ==
The authors give the training objectives for <math>\mathbf{G}</math> and <math>\mathbf{D}</math> as follows.

'''Discriminator:''' Following (Gulrajani et al., 2017), the objective for <math>\mathbf{D}</math> is defined as:

\begin{align}
\mathcal{L}_D = -\mathbb{E}_{x\sim p_d}[D(x)] + \mathbb{E}_{x\sim p_g}[D(x)] + R
\end{align}

where <math>R</math> is a regularization term that softly constrains <math>\mathbf{D}</math> to stay in the set of Lipschitz continuous functions (for some fixed Lipschitz constant).
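A Monte-Carlo estimate of this objective can be sketched as follows. The function name <code>critic_loss</code> is hypothetical, and the regularization term is simply passed in as a precomputed number: in Gulrajani et al. (2017) it is a gradient penalty, which requires automatic differentiation and is not reproduced here.

```python
def critic_loss(d_real, d_fake, reg=0.0):
    """Estimate L_D = -E[D(x_real)] + E[D(x_fake)] + R from finite samples.

    d_real: discriminator scores on samples from p_d.
    d_fake: discriminator scores on rendered samples from p_g.
    reg:    value of the regularization term R (assumed precomputed).
    """
    mean_real = sum(d_real) / len(d_real)
    mean_fake = sum(d_fake) / len(d_fake)
    return -mean_real + mean_fake + reg

# The loss decreases as D scores real data higher than generated data.
print(critic_loss([1.0, 3.0], [0.0, 1.0], reg=0.5))
```

Minimizing this quantity pushes the scores of real images up and the scores of generated images down, while the regularizer keeps the discriminator from growing without bound.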

'''Generator:''' To define the objective for <math>\mathbf{G}</math>, a variant of the REINFORCE (Williams, 1992) algorithm, advantage actor-critic (A2C), is employed:

\begin{align}
\mathcal{L}_G = -\sum_{t}\log\pi(a_t|s_t;\theta)[R_t - V^{\pi}(s_t)]
\end{align}

where <math>V^{\pi}</math> is an approximation to the value function, which is treated as independent of <math>\theta</math>, and <math>R_{t} = \sum_{k=t}^{N}r_{k}</math> is a 1-sample Monte-Carlo estimate of the return. Rewards are set to:

<math>
r_t = \begin{cases}
0 & \text{for } t < N \\
D(\mathcal{R}(a_1, a_2, \cdots, a_N)) & \text{for } t = N
\end{cases}
</math>

One interesting aspect of this new formulation is that
the search can be biased by introducing intermediate rewards
which may depend not only on the output of R but also on
commands used to generate that output.
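The reward structure and generator objective above can be illustrated with a small pure-Python sketch. The helper names and the numbers (episode length, discriminator score, value estimates) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def returns_from_rewards(rewards):
    """R_t = sum of rewards from step t to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def generator_loss(log_probs, rewards, values):
    """A2C-style loss: -sum_t log pi(a_t | s_t) * (R_t - V(s_t)).

    values are treated as constants, i.e. no gradient flows through them.
    """
    returns = returns_from_rewards(rewards)
    return -sum(lp * (R - v) for lp, R, v in zip(log_probs, returns, values))

N = 4
final_score = 0.8                           # hypothetical discriminator output
rewards = [0.0] * (N - 1) + [final_score]   # reward only at the last step
print(returns_from_rewards(rewards))        # every step sees the same return
print(generator_loss([math.log(0.5)] * N, rewards, [0.3] * N))
```

Because the only non-zero reward arrives at the final step, every action in the episode is credited with the same return. An intermediate reward would simply be a non-zero entry earlier in <code>rewards</code>.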

== Conditional generation ==
In some cases, such as reproducing a given image <math>x_{target}</math>, it is useful to condition the model on auxiliary inputs. This can be done by feeding <math>x_{target}</math> to both the policy and discriminator networks, so that:
<math>
p_g = R(p_a(a|x_{target}))
</math>

Meanwhile, <math>p_{d}</math> becomes a Dirac-<math>\delta</math> function centered at <math>x_{target}</math>.
The first two terms in the objective function for D then reduce to
<math>
-D(x_{target}|x_{target})+ \mathbb{E}_{x\sim p_g}[D(x|x_{target})]
</math>

It can be proven that for this particular setting of <math>p_{g}</math> and <math>p_{d}</math>, the <math>l^2</math>-distance is an optimal discriminator. However, it is in general not the only solution of the objective function for D, and it may be a poor candidate for the reward signal of the generator.

=== Traditional GAN generation ===
Traditional GANs use the following minimax objective function to quantify optimality for relationships between D and G:

[[File:edit7.png| 400px|center]]

Minimizing the Jensen-Shannon divergence between the two distributions often leads to vanishing gradients as the discriminator saturates. SPIRAL circumvents this issue by using the Wasserstein-style discriminator objective <math>\mathcal{L}_D</math> given above, which is much better behaved.

== Distributed Learning ==
The training pipeline is outlined in Figure 2b. It is an extension of the recently proposed '''IMPALA''' architecture (Espeholt et al., 2018). For training, three kinds of workers are defined:

* Actors are responsible for generating the training trajectories through interaction between the policy network and the rendering simulator. Each trajectory contains a sequence <math>((\pi_{t}; a_{t}) | 1 \leq t \leq N)</math> as well as all intermediate renderings produced by R.

* A policy learner receives trajectories from the actors, combines them into a batch, and updates <math>\pi</math> by performing an '''SGD''' step on the generator objective <math>\mathcal{L}_G</math>. Following common practice (Mnih et al., 2016), <math>\mathcal{L}_G</math> is augmented with an entropy penalty encouraging exploration.

* In contrast to the base '''IMPALA''' setup, an additional discriminator learner is defined. This worker consumes random examples from <math>p_{d}</math>, as well as generated data (final renders) coming from the actor workers, and optimizes the discriminator objective <math>\mathcal{L}_D</math>.

[[File:Fig2 SPIRAL Architecture.png | 700px|center]]

'''Note:''' no trajectories are omitted in the policy learner. Instead, the <math>D</math> updates are decoupled from the <math>\pi</math> updates by introducing a replay buffer that serves as a communication layer between the actors and the discriminator learner. This allows the discriminator learner to optimize <math>D</math> at a higher rate than the policy network is trained, owing to the difference in network sizes (<math>\pi</math> is a multi-step RNN, while <math>D</math> is a plain '''CNN'''). Even though sampling from a replay buffer inevitably results in smoothing of <math>p_{g}</math>, this setup is found to work well in practice.
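The role of the replay buffer can be illustrated with a small sketch. The class below is a hypothetical, single-process stand-in: it keeps the most recent final renders in a bounded queue and lets a discriminator learner sample from them at its own pace. The capacity, the sampling scheme, and all names are assumptions, since the summary does not specify them.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded store of final renders shared by actors and the D learner."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)  # oldest renders are evicted first
        self._rng = random.Random(seed)

    def add(self, render):
        """Called by an actor when an episode's final render is ready."""
        self._items.append(render)

    def sample(self, batch_size):
        """Called by the discriminator learner; samples with replacement."""
        items = list(self._items)
        return [self._rng.choice(items) for _ in range(batch_size)]

    def __len__(self):
        return len(self._items)

buffer = ReplayBuffer(capacity=3, seed=0)
for render_id in range(5):      # integers stand in for rendered images
    buffer.add(render_id)
batch = buffer.sample(10)       # D can draw more samples than actors produced
```

Because the buffer mixes renders produced by slightly different versions of the policy, the discriminator sees a smoothed version of <math>p_g</math>, which is the trade-off mentioned in the note above.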

= Experiments=

== Environments ==
Two rendering environments are introduced. For MNIST, OMNIGLOT and CELEBA generation, the open-source painting library ''libmypaint'' (libmypaint contributors, 2018) is used. The agent controls a brush and produces a sequence of (possibly disjoint) strokes on a canvas <math>C</math>. The state of the environment comprises the contents of <math>C</math> as well as the current brush location <math>l_{t}</math>. Each action <math>a_{t}</math> is a tuple of 8 discrete decisions <math>(a_t^1; a_t^2; ... ; a_t^8)</math> (see Figure 3). The first two components are the control point <math>p_{c}</math> and the endpoint <math>l_{t+1}</math> of the stroke.

[[File:Fig3_agent_action_space.PNG | 450px|center]]

The next 5 components represent the appearance of the stroke: the pressure that the agent applies to the brush (10 levels), the brush size, and the stroke color characterized by a mixture of red, green and blue (20 bins for each color component). The last element of <math>a_t</math> is a binary flag specifying the type of action: the agent can choose either to produce a stroke or to jump right to <math>l_{t+1}</math>.

In the MUJOCO SCENES experiment, we render images
using a MuJoCo-based environment (Todorov et al., 2012).
At each time step, the agent has to decide on the object
type (4 options), its location on a 16 <math>\times</math> 16 grid, its size
(3 options) and the color (3 color components with 4 bins
each). The resulting tuple is sent to the environment, which
adds an object to the scene according to the specification.

== Datasets ==

=== MNIST ===
For the MNIST dataset, two sets of experiments are conducted:

1- In the first experiment, an unconditional agent is trained to model the data distribution. Along with the reward provided by the discriminator, a small negative reward is given to the agent for starting each new continuous sequence of strokes, which encourages the agent to draw a digit in a single continuous motion of the brush. Examples of such generations are shown in Fig 4a.

2- In the second experiment, an agent is trained to reproduce a given digit.
Several examples of conditionally generated digits are shown in Fig 4b.

[[File:Fig4a MNIST.png | 450px|center]]

=== OMNIGLOT ===
Next, the trained agents are tested in a similar but more challenging setting of handwritten characters. As can be seen in Fig 5a, unconditional generation has lower quality than for the digits in the previous dataset. The conditional agents, on the other hand, reach a convincing quality (Fig 5b). Moreover, because OMNIGLOT contains many different symbols, the model learns a general notion of image reproduction rather than memorizing the training data. The authors test this by feeding previously unseen line drawings to the trained agent, which produces excellent results, as shown in Figure 6.

[[File:Fig5 OMNIGLOT.png | 450px|center]]

For the MNIST dataset, two kinds of reward, the discriminator score and the <math>l^{2}</math>-distance, are compared; the results of training agents with both approaches are shown in Fig 8a. Note that the discriminator-based approach has a significantly lower training time and a lower final <math>l^{2}</math> error.
Following (Sharma et al., 2017), a “blind” version of the agent, which is not fed any intermediate canvas states as input to <math>\pi</math>, is also trained. The training curve for this experiment is reported in Fig 8a as well (dotted blue line).

=== CELEBA ===

Since the ''libmypaint'' environment is also capable of producing
complex color paintings, this direction is explored by
training a conditional agent on the CELEBA dataset. In this
experiment, the agent does not receive any intermediate rewards.
In addition to the reconstruction reward (either <math>l^2</math> or
discriminator-based), earth mover’s
distance between the color histograms of the model’s output
and <math>x_{target}</math> is penalized. (Figure 7)
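For one-dimensional histograms with unit spacing between bins, the earth mover's distance reduces to the summed absolute difference of the two cumulative distributions. The sketch below shows that computation for a single color channel; how the paper actually bins colors or combines channels is not stated in this summary, so the function name <code>emd_1d</code> and the per-channel treatment are assumptions.

```python
def emd_1d(hist_a, hist_b):
    """Earth mover's distance between two 1D histograms with unit bin spacing.

    Both histograms are normalized to total mass 1 before comparison.
    """
    total_a, total_b = sum(hist_a), sum(hist_b)
    cum_a = cum_b = 0.0
    distance = 0.0
    for a, b in zip(hist_a, hist_b):
        cum_a += a / total_a
        cum_b += b / total_b
        # Mass that still has to be carried past this bin boundary.
        distance += abs(cum_a - cum_b)
    return distance

# All mass must move two bins to the right.
print(emd_1d([1, 0, 0], [0, 0, 1]))
```

Unlike a bin-by-bin difference, this distance grows with how far the color mass has to move, so an output that is slightly too dark is penalized less than one with entirely wrong colors.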

[[File:Fig6 CELEBA.png | 450px|center]]

Although blurry, the model’s reconstruction closely matches
the high-level structure of each image such as the
background color, the position of the face, and the color of
the person’s hair. In some cases, shadows around eyes and
the nose are visible.

=== MUJOCO SCENES ===

For the MUJOCO SCENES dataset, the trained agent is used to construct simple CAD programs that best explain input images. Only conditional generation is considered here. As before, the reward function for the generator can be either the <math>l^2</math> score or the discriminator output, and no auxiliary reward signals are used. Because the policy is unrolled for 20 time steps, the model can infer and represent up to 20 objects and their attributes.

As shown in Figure 8b, the agent trained to directly minimize
<math>l^2</math> is unable to solve the task and has significantly
higher pixel-wise error. In comparison, the discriminator based
variant solves the task and produces near-perfect reconstructions
on a holdout set (Figure 10).

[[File:Fig8 MUJOCO_SCENES.png | 500px|center]]
For this experiment, the total number of possible execution traces is <math>M^N</math>, where <math>M = 4 \cdot 16^2 \cdot 3 \cdot 4^3 \cdot 3</math> is the total number of attribute settings for a single object and N = 20 is the length of an episode. As a baseline, a general-purpose Metropolis-Hastings inference algorithm that samples an execution trace defining attributes for a maximum of 20 primitives was run on a set of 100 images. These attributes are treated as latent variables. During each time step of inference, the block of attributes (including the presence/absence flag) corresponding to a single object is flipped uniformly within the appropriate ranges. The resulting trace is rendered by the environment into an output sample, which is then accepted or rejected using the Metropolis-Hastings update rule, with a Gaussian likelihood centered on the test image and a fixed diagonal covariance of 0.25. As Figure 9 shows, the MCMC search baseline cannot solve the task even after a large number of evaluations.
[[File:figure9 mcmc.PNG| 500px|center]]
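To get a feel for the size of this search space, the count can be evaluated directly with Python's arbitrary-precision integers. The factors are taken exactly as written in the formula for <math>M</math>; the summary lists object type (4), grid location (<math>16^2</math>), size (3) and color (<math>4^3</math>) but does not explain the final factor of 3, so it is kept as given.

```python
# Attribute settings for a single object, factors as written in the text.
M = 4 * 16**2 * 3 * 4**3 * 3
N = 20                      # episode length (maximum number of objects)

total_traces = M ** N       # exact, thanks to arbitrary-precision integers

print(M)                        # 589824
print(len(str(total_traces)))   # number of decimal digits in M**N
```

Here <math>M = 589{,}824</math>, and <math>M^{20}</math> has 116 decimal digits, i.e. it is on the order of <math>10^{115}</math>, which makes it plausible that uninformed sampling struggles on this task.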

= Discussion =
As in the OMNIGLOT
experiment, the <math>l^2</math>-based agent demonstrates some
improvements over the random policy but gets stuck and as
a result, fails to learn sensible reconstructions (Figure 8b).

[[File:Fig7 Results.png | 500px|center]]

Scaling visual program synthesis to real-world and combinatorial datasets has been a challenge. The paper shows that it is possible to train an adversarial generative agent that employs a black-box rendering simulator. The results indicate that using the Wasserstein discriminator’s output as a reward function with asynchronous reinforcement learning can provide a scaling path for visual program synthesis. The current exploration strategy used in the agent is entropy-based, but future work should address this limitation by employing more sophisticated search algorithms for policy improvement. For instance, Monte Carlo Tree Search can be used, analogous to AlphaGo Zero (Silver et al., 2017). General-purpose inference algorithms could also be used for this purpose.

= Critique and Future Work =
* The architecture isn't new, but it's a nice application, and it's fun to watch the video of the robot painting in real life. SPIRAL's GAN-like idea continues the vein of [https://arxiv.org/abs/1610.01945 connecting actor-critic RL with GANs], like "[https://arxiv.org/abs/1706.03741 Deep reinforcement learning from human preferences]" (Christiano et al., 2017) or GAIL.

* Future work should explore different parameterizations of action spaces. For instance, the use of two arbitrary control points is perhaps not the best way to represent strokes, as it is hard to deal with straight lines. Actions could also directly parametrize 3D surfaces, planes, and learned texture models to invert richer visual scenes.

* On the reward side, using a joint image-action discriminator similar to BiGAN/ALI (Donahue et al., 2016; Dumoulin et al., 2016) (in this case, the policy can be viewed as an encoder, while the renderer becomes a decoder) could result in a more meaningful learning signal, since D will be forced to focus on the semantics of the image.

= Other Resources =
#Code implementation [https://github.com/carpedm20/SPIRAL-tensorflow]

= References =

# Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S.M. Ali Eslami, Oriol Vinyals. Synthesizing Programs for Images using Reinforced Adversarial Learning. ICML 2018. [[https://arxiv.org/abs/1804.01118]].

Synthesizing Programs for Images using Reinforced Adversarial Learning

2018-12-01T02:13:43Z

J385chen: /* Environments */

'''Synthesizing Programs for Images using Reinforced Adversarial Learning: ''' Summary of the ICML 2018 paper

Paper: [[http://proceedings.mlr.press/v80/ganin18a.html]]
Video: [[https://www.youtube.com/watch?v=iSyvwAwa7vk&feature=youtu.be]]

== Presented by ==

1. Nekoei, Hadi [Quest ID: 20727088]

= Motivation =

Conventional neural generative models have major problems.

* It is not clear how to inject knowledge about the data into the model.

* The latent space is not easily interpretable.

The solution proposed in this paper is to generate programs that drive existing tools (e.g. graphics editors, illustration software, CAD), '''creating a more meaningful API (sequences of complex actions rather than raw pixels)'''.

= Introduction =

Humans frequently rely on the ability to recover structured representations from raw sensations in order to understand their environment, for example when decomposing a picture of a hand-written character into strokes or understanding the layout of a building. Studying this ability can also help us learn how the brain actually works.

In the visual domain, inversion of a renderer for the purposes of scene understanding is typically referred to as inverse graphics. However, training vision systems using the inverse graphics approach has remained a challenge. Renderers typically expect as input programs that have sequential semantics, are composed of discrete symbols (e.g., keystrokes in a CAD program), and are long (tens or hundreds of symbols). Additionally, matching rendered images to real data poses an optimization problem as black-box graphics simulators are not differentiable in general.

To address these problems, a new approach is presented for interpreting and generating images using Deep Reinforced Adversarial Learning, in order to remove the need for large amounts of supervision and to scale to larger real-world datasets. In this approach, an adversarially trained agent '''(SPIRAL)''' generates a program which is executed by a graphics engine to generate images, either conditioned on data or unconditionally. The agent is rewarded for fooling a discriminator network and is trained with distributed reinforcement learning without any extra supervision. The discriminator network itself is trained to distinguish between generated and real images.

[[File:Fig1 SPIRAL.PNG | 400px|center]]

== Related Work ==
Related work in this field is summarized as follows:
* There have been many studies on inverting simulators to interpret images (Nair et al., 2008; Paysan et al., 2009; Mansinghka et al., 2013; Loper & Black, 2014; Kulkarni et al., 2015a; Jampani et al., 2015)

* Inferring motor programs for reconstruction of MNIST digits (Nair & Hinton, 2006)

* Visual program induction in the context of hand-written characters on the OMNIGLOT dataset (Lake et al., 2015)

* Inferring and learning feed-forward or recurrent procedures for image generation (LeCun et al., 2015; Hinton & Salakhutdinov, 2006; Goodfellow et al., 2014; Ackley et al., 1987; Kingma & Welling, 2013; Oord et al., 2016; Kulkarni et al., 2015b; Eslami et al., 2016; Reed et al., 2017; Gregor et al., 2015).

'''However, all of these methods have limitations, such as:'''

* Difficulty scaling to larger real-world datasets

* Requiring hand-crafted parses and supervision in the form of sketches and corresponding images

* Lacking the ability to infer structured representations of images

= The SPIRAL Agent =
== Overview ==
The paper aims to construct a generative model <math>\mathbf{G}</math> that can sample from a target data distribution <math>p_{d}</math>. The generative model consists of a recurrent network <math>\pi</math> (called the policy network or agent) and an external rendering simulator <math>R</math> that accepts a sequence of commands from the agent and maps them into the domain of interest; e.g. <math>R</math> could be a CAD program rendering descriptions of primitives into 3D scenes.
To train the policy network <math>\pi</math>, the paper uses the generative adversarial network (GAN) framework. In this framework, the generator tries to fool a discriminator network, which is trained to distinguish between real and fake samples. As a result, the distribution generated by <math>\mathbf{G}</math> approaches <math>p_d</math>.

== Objectives ==
The authors give the training objectives for <math>\mathbf{G}</math> and <math>\mathbf{D}</math> as follows.

'''Discriminator:''' Following (Gulrajani et al., 2017), the objective for <math>\mathbf{D}</math> is defined as:

\begin{align}
\mathcal{L}_D = -\mathbb{E}_{x\sim p_d}[D(x)] + \mathbb{E}_{x\sim p_g}[D(x)] + R
\end{align}

where <math>R</math> is a regularization term (not to be confused with the renderer) that softly constrains <math>\mathbf{D}</math> to stay in the set of Lipschitz continuous functions (for some fixed Lipschitz constant).

'''Generator:''' To define the objective for <math>\mathbf{G}</math>, a variant of the REINFORCE algorithm (Williams, 1992), advantage actor-critic (A2C), is employed:

\begin{align}
\mathcal{L}_G = -\sum_{t}\log\pi(a_t|s_t;\theta)[R_t - V^{\pi}(s_t)]
\end{align}

where <math>V^{\pi}</math> is an approximation to the value function, which is treated as independent of <math>\theta</math>, and <math>R_{t} = \sum_{t'=t}^{N}r_{t'}</math> is a 1-sample Monte-Carlo estimate of the return. Rewards are set to:

<math>
r_t = \begin{cases}
0 & \text{for } t < N \\
D(\mathcal{R}(a_1, a_2, \cdots, a_N)) & \text{for } t = N
\end{cases}
</math>

One interesting aspect of this new formulation is that
the search can be biased by introducing intermediate rewards
which may depend not only on the output of R but also on
commands used to generate that output.
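
As a rough illustration of the generator objective and the sparse reward above, the following NumPy sketch computes the returns <math>R_t</math> and the loss <math>\mathcal{L}_G</math> for a single trajectory. The function names, the example log-probabilities, the value estimates and the discriminator score are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def returns_from_terminal_reward(final_score, n_steps):
    """Return R_t = sum_{t'=t}^{N} r_{t'} when only the last step is rewarded."""
    rewards = np.zeros(n_steps)
    rewards[-1] = final_score          # r_N = D(R(a_1, ..., a_N)); r_t = 0 for t < N
    # A reversed cumulative sum gives the undiscounted return from each step onward.
    return np.cumsum(rewards[::-1])[::-1]

def a2c_generator_loss(log_probs, values, returns):
    """L_G = -sum_t log pi(a_t | s_t) * (R_t - V(s_t)), with V treated as a constant."""
    advantages = returns - values
    return -np.sum(log_probs * advantages)

# Hypothetical 4-step trajectory (all numbers are made up).
log_probs = np.log(np.array([0.5, 0.25, 0.5, 0.125]))
values = np.array([0.2, 0.4, 0.6, 0.8])
returns = returns_from_terminal_reward(final_score=1.0, n_steps=4)
loss = a2c_generator_loss(log_probs, values, returns)
```

Because only the final step carries a reward, every <math>R_t</math> equals the discriminator score of the finished render, so the advantage at each step is simply that score minus the value estimate.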

== Conditional generation: ==
In some cases such as producing a given image <math>x_{target}</math>, conditioning the model on auxiliary inputs is useful. That can be done by feeding <math>x_{target}</math> to both policy and discriminator networks as:
<math>
p_g = R(p_a(a|x_{target}))
</math>

Meanwhile, <math>p_{d}</math> becomes a Dirac-<math>\delta</math> function centered at <math>x_{target}</math>, so the first two terms in the objective function for <math>D</math> reduce to
<math>
-D(x_{target}|x_{target})+ \mathbb{E}_{x\sim p_g}[D(x|x_{target})]
</math>

It can be proven that for this particular setting of <math>p_{g}</math> and <math>p_{d}</math>, the <math>l^2</math>-distance is an optimal discriminator. However, it is not the only solution of the objective function for <math>D</math>, and it may be a poor candidate for the reward signal of the generator.

===Traditional GAN generation: ===
Traditional GANs use the following minimax objective function to quantify optimality for relationships between D and G:

[[File:edit7.png| 400px|center]]

Minimizing the Jensen-Shannon divergence between the two distributions often leads to vanishing gradients as the discriminator saturates. The paper circumvents this issue by using the Wasserstein-based discriminator objective <math>\mathcal{L}_D</math> given above (Gulrajani et al., 2017), which is much better behaved.

== Distributed Learning: ==
The training pipeline is outlined in Figure 2b. It is an extension of the recently proposed '''IMPALA''' architecture (Espeholt et al., 2018). For training, three kinds of workers are defined:

* Actors are responsible for generating the training trajectories through interaction between the policy network and the rendering simulator. Each trajectory contains a sequence <math>((\pi_{t}; a_{t}) | 1 \leq t \leq N)</math> as well as all intermediate renderings produced by R.

* A policy learner receives trajectories from the actors, combines them into a batch and updates <math>\pi</math> by performing an '''SGD''' step on <math>\mathcal{L}_G</math> (2). Following common practice (Mnih et al., 2016), <math>\mathcal{L}_G</math> is augmented with an entropy penalty encouraging exploration.

* In contrast to the base '''IMPALA''' setup, an additional discriminator learner is defined. This worker consumes random examples from <math>p_{d}</math>, as well as generated data (final renders) coming from the actor workers, and optimizes <math>\mathcal{L}_D</math> (1).

[[File:Fig2 SPIRAL Architecture.png | 700px|center]]

'''Note:''' no trajectories are omitted in the policy learner. Instead, the <math>D</math> updates are decoupled from the <math>\pi</math> updates by introducing a replay buffer that serves as a communication layer between the actors and the discriminator learner. This allows the discriminator learner to optimize <math>D</math> at a higher rate than the policy network is trained, which is possible because of the difference in network sizes (<math>\pi</math> is a multi-step RNN, while <math>D</math> is a plain '''CNN'''). Even though sampling from a replay buffer inevitably results in smoothing of <math>p_{g}</math>, this setup is found to work well in practice.
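
The decoupling described in the note can be pictured with a minimal single-process sketch. The class name, its methods and the capacity are hypothetical; the actual system is distributed and its buffer implementation is not described in this summary.

```python
import random
from collections import deque

class RenderReplayBuffer:
    """Fixed-capacity buffer of final renders shared by actors and the discriminator learner."""

    def __init__(self, capacity, seed=0):
        self._items = deque(maxlen=capacity)   # oldest renders are evicted first
        self._rng = random.Random(seed)

    def add(self, render):
        """Called by an actor after finishing an episode."""
        self._items.append(render)

    def sample(self, batch_size):
        """Called by the discriminator learner, independently of policy updates."""
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)

buffer = RenderReplayBuffer(capacity=3)
for episode in range(5):
    buffer.add(f"render_{episode}")
batch = buffer.sample(batch_size=2)
```

Since the learner may draw renders produced by slightly older policies, the samples come from a mixture of recent generator distributions, which is the smoothing of <math>p_{g}</math> mentioned above.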

= Experiments=

== Environments ==
Two rendering environments are introduced. For MNIST, OMNIGLOT and CELEBA generation, the open-source painting library ''libmypaint'' (libmypaint contributors, 2018) is used. The agent controls a brush and produces a sequence of (possibly disjoint) strokes on a canvas <math>C</math>. The state of the environment comprises the contents of <math>C</math> as well as the current brush location <math>l_{t}</math>. Each action <math>a_{t}</math> is a tuple of 8 discrete decisions <math>(a_t^1; a_t^2; ... ; a_t^8)</math> (see Figure 3). The first two components are the control point <math>p_{c}</math> and the endpoint <math>l_{t+1}</math> of the stroke.

[[File:Fig3_agent_action_space.PNG | 450px|center]]

The next 5 components represent the appearance of the stroke: the pressure that the agent applies to the brush (10 levels), the brush size, and the stroke color characterized by a mixture of red, green and blue (20 bins for each color component). The last element of <math>a_{t}</math> is a binary flag specifying the type of action: the agent can choose either to produce a stroke or to jump right to <math>l_{t+1}</math>.

In the MUJOCO SCENES experiment, we render images
using a MuJoCo-based environment (Todorov et al., 2012).
At each time step, the agent has to decide on the object
type (4 options), its location on a 16 <math>\times</math> 16 grid, its size
(3 options) and the color (3 color components with 4 bins
each). The resulting tuple is sent to the environment, which
adds an object to the scene according to the specification.

== Datasets ==

=== MNIST ===
For the MNIST dataset, two sets of experiments are conducted:

1- In the first experiment, an unconditional agent is trained to model the data distribution. Along with the reward provided by the discriminator, a small negative reward is given to the agent for starting each continuous sequence of strokes, which encourages it to draw a digit in a single continuous motion of the brush. An example of such generation is depicted in Fig 4a.

2- In the second experiment, an agent is trained to reproduce a given digit.
Several examples of conditional generated digits are shown in Fig 4b.

[[File:Fig4a MNIST.png | 450px|center]]

=== OMNIGLOT ===
Next, the trained agents are tested in a similar but more challenging setting of handwritten characters. As can be seen in Fig 5a, the unconditional generation has a lower quality compared to digits in the previous dataset. The conditional agents, on the other hand, were able to reach a convincing quality (Fig 5b). Moreover, as OMNIGLOT has many different symbols, the model was able to learn a general notion of image production without memorizing the training data. This was tested by feeding new, unseen line drawings to the trained agent, which produced excellent results, as shown in Figure 6.

[[File:Fig5 OMNIGLOT.png | 450px|center]]

For the MNIST dataset, two kinds of rewards, the discriminator score and the <math>l^{2}</math>-distance, have been compared. The results of training agents with the discriminator-based and <math>l^{2}</math>-distance approaches are shown in Fig 8a. Note that the discriminator-based approach has a significantly lower training time and a lower final <math>l^{2}</math> error.
Following (Sharma et al., 2017), a “blind” version of the agent, which receives no intermediate canvas states as input to <math>\pi</math>, is also trained. The training curve for this experiment is also reported in Fig 8a (dotted blue line).

=== CELEBA ===

Since the ''libmypaint'' environment is also capable of producing
complex color paintings, this direction is explored by
training a conditional agent on the CELEBA dataset. In this
experiment, the agent does not receive any intermediate rewards.
In addition to the reconstruction reward (either <math>l^2</math> or discriminator-based), the earth mover’s distance between the color histograms of the model’s output and <math>x_{target}</math> is penalized (Figure 7).

[[File:Fig6 CELEBA.png | 450px|center]]
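
To make the color-histogram penalty concrete, here is a minimal sketch of one way an earth mover's distance between color histograms could be computed. It assumes per-channel 1-D histograms with 20 equal-width bins and pixel values in [0, 1]; these choices and all function names are assumptions for illustration, since this summary does not say how the paper computes the distance.

```python
import numpy as np

def channel_histogram(channel, bins=20):
    """Normalized histogram of one color channel with values in [0, 1]."""
    counts, _ = np.histogram(channel, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

def emd_1d(p, q):
    """Earth mover's distance between two 1-D histograms on the same unit-spaced bins."""
    # For 1-D distributions the EMD equals the summed absolute difference of the CDFs.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def color_histogram_penalty(output, target, bins=20):
    """Sum of per-channel EMDs between two H x W x 3 images."""
    return sum(
        emd_1d(channel_histogram(output[..., c], bins),
               channel_histogram(target[..., c], bins))
        for c in range(3)
    )

dark = np.zeros((4, 4, 3))     # every pixel falls in the first bin
bright = np.ones((4, 4, 3))    # every pixel falls in the last bin
same = color_histogram_penalty(dark, dark)
far = color_histogram_penalty(dark, bright)
```

In this toy case all the mass must move across 19 bins in each of the three channels, so the penalty grows with how far apart the colors are rather than merely with whether they differ.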

Although blurry, the model’s reconstruction closely matches
the high-level structure of each image such as the
background color, the position of the face, and the color of
the person’s hair. In some cases, shadows around eyes and
the nose are visible.

=== MUJOCO SCENES ===

For the MUJOCO SCENES dataset, the trained agent is used to construct simple CAD programs that best explain input images. Here only conditional generation is considered. As before, the reward function for the generator can be either the <math>l^2</math> score or the discriminator output, and no auxiliary reward signals are used. The model can infer and represent up to 20 objects and their attributes, because it is unrolled for 20 time steps.

As shown in Figure 8b, the agent trained to directly minimize
<math>l^2</math> is unable to solve the task and has significantly
higher pixel-wise error. In comparison, the discriminator based
variant solves the task and produces near-perfect reconstructions
on a holdout set (Figure 10).

[[File:Fig8 MUJOCO_SCENES.png | 500px|center]]
For this experiment, the total number of possible execution traces is <math>M^N</math>, where <math>M = 4 \cdot 16^2 \cdot 3 \cdot 4^3 \cdot 3</math> is the total number of attribute settings for a single object and <math>N = 20</math> is the length of an episode. As a baseline, a general-purpose Metropolis-Hastings inference algorithm that samples an execution trace defining attributes for a maximum of 20 primitives was run on a set of 100 images. These attributes are treated as latent variables. At each step of inference, the block of attributes (including the presence/absence flag) corresponding to a single object is flipped uniformly within the appropriate range. The resulting trace is rendered by the environment into an output sample, which is then accepted or rejected using the Metropolis-Hastings update rule, with a Gaussian likelihood centered on the test image and a fixed diagonal covariance of 0.25. As Figure 9 shows, the MCMC search baseline cannot solve the task even after a large number of evaluations.
[[File:figure9 mcmc.PNG| 500px|center]]
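
To get a feel for the size of this search space, the count <math>M^N</math> can be evaluated directly. The variable names below are illustrative; the final factor of 3 is copied from the formula as written, and the text does not say which attribute it corresponds to.

```python
object_types = 4        # 4 object type options
grid_cells = 16 ** 2    # locations on a 16 x 16 grid
sizes = 3               # 3 size options
colors = 4 ** 3         # 3 color components with 4 bins each
extra_factor = 3        # trailing factor of 3 as given in the formula

M = object_types * grid_cells * sizes * colors * extra_factor
N = 20                  # episode length
num_traces = M ** N     # Python integers are arbitrary precision
num_digits = len(str(num_traces))
```

Here <math>M = 589824</math> and <math>M^{20}</math> has 116 decimal digits, which gives some intuition for why an undirected sampling baseline struggles.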

= Discussion =
As in the OMNIGLOT
experiment, the <math>l^2</math>-based agent demonstrates some
improvements over the random policy but gets stuck and as
a result, fails to learn sensible reconstructions (Figure 8b).

[[File:Fig7 Results.png | 500px|center]]

Scaling visual program synthesis to real-world and combinatorial datasets has been a challenge. The paper shows that it is possible to train an adversarial generative agent that employs a black-box rendering simulator. The results indicate that using the Wasserstein discriminator’s output as a reward function with asynchronous reinforcement learning can provide a scaling path for visual program synthesis. The current exploration strategy used in the agent is entropy-based, and future work should address this limitation by employing more sophisticated search algorithms for policy improvement. For instance, Monte Carlo Tree Search can be used, analogous to AlphaGo Zero (Silver et al., 2017). General-purpose inference algorithms could also be used for this purpose.

= Critique and Future Work =
* The architecture isn't new, but it's a nice application and it's fun to watch the video of the robot painting in real life. SPIRAL's GAN-like idea continues the vein of [https://arxiv.org/abs/1610.01945 connecting actor-critic RL with GANs], like "[https://arxiv.org/abs/1706.03741 Deep reinforcement learning from human preferences]" (Christiano et al., 2017) or GAIL.

* Future work should explore different parameterizations of action spaces. For instance, the use of two arbitrary control points is perhaps not the best way to represent strokes, as it is hard to deal with straight lines. Actions could also directly parametrize 3D surfaces, planes, and learned texture models to invert richer visual scenes.

* On the reward side, using a joint image-action discriminator similar to BiGAN/ALI (Donahue et al., 2016; Dumoulin et al., 2016) (in this case, the policy can be viewed as an encoder, while the renderer becomes a decoder) could result in a more meaningful learning signal, since D will be forced to focus on the semantics of the image.

= Other Resources =
#Code implementation [https://github.com/carpedm20/SPIRAL-tensorflow]

= References =

# Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S.M. Ali Eslami, Oriol Vinyals. Synthesizing Programs for Images using Reinforced Adversarial Learning. ICML 2018. [https://arxiv.org/abs/1804.01118]

Co-Teaching

2018-12-01T02:07:53Z

J385chen:

=Introduction=
==Title of Paper==
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels
==Contributions==
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed approach, named ‘co-teaching’, maintains two networks simultaneously; it focuses on training on selected clean instances and avoids estimating the noise transition matrix. In addition, the networks are trained using stochastic optimization with momentum, and because nonlinear deep networks memorize clean data first, they become robust. The experiments are conducted on noisy versions of the MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness of deep learning models trained by the co-teaching approach is much superior to that of state-of-the-art baselines.

==Terminology==
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data.

Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels. This can result in false positives.

=Intuition=
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually, when learning on a batch of noisy data, only the error from the network itself is transferred back to facilitate learning. In Co-teaching, however, the two networks can filter different types of errors, and the filtered information flows back both to the network itself and to the other network. As a result, the two models learn together, each from itself and from its partner network.

=Motivation=
The paper draws motivation from two key facts:
* Many data collection processes yield noisy labels.
* Deep neural networks have a high capacity to overfit to noisy labels.
Because of these facts, it is challenging to train deep networks to be robust to noisy labels.
=Related Works=

1. Statistical learning methods: Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into three strands: surrogate loss, noise estimation, and probabilistic modelling. In the surrogate loss category, one work proposes an unbiased estimator to provide a noise-corrected loss. Another work presented a robust non-convex loss, which is a special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of the ROC curve. In the probabilistic modelling category, a two-coin model has been proposed to handle noisy labels from multiple annotators.

2. Deep learning methods: There are also deep learning approaches that can be used to handle data with noisy labels. One work proposed a unified framework to distill knowledge from clean labels and knowledge graphs. Another work trained a label cleaning network on a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously.
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper adds a crowd layer after the output layer for noisy labels from multiple annotators.

3. Learning to teach methods: This is another approach to the problem. These methods consist of teacher and student networks. The teacher network selects more informative instances for better training of the student networks. Most works did not account for noisy labels, with the exception of MentorNet, which applied the idea to data with noisy labels.

=Co-Teaching Algorithm=

[[File:Co-Teaching_Algorithm.png|600px|center]]

The idea as shown in the algorithm above is to train two deep networks simultaneously. In each mini-batch using mini-batch gradient descent, each network selects its small-loss instances as useful knowledge and then teaches these useful instances to the peer network. <math>R(T)</math> governs the percentage of small-loss instances to be used in updating the parameters of each network.
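
The cross-update at the heart of the algorithm can be sketched on per-instance losses alone. The following is a minimal illustration, not the authors' implementation: the function names and the loss values are hypothetical, and `keep_rate` plays the role of <math>R(T)</math>.

```python
import numpy as np

def small_loss_indices(losses, keep_rate):
    """Indices of the keep_rate fraction of instances with the smallest loss."""
    num_keep = max(1, int(keep_rate * len(losses)))
    return np.argsort(losses)[:num_keep]

def co_teaching_selection(losses_a, losses_b, keep_rate):
    """Each network picks its small-loss instances; the peer trains on them."""
    picked_by_a = small_loss_indices(losses_a, keep_rate)
    picked_by_b = small_loss_indices(losses_b, keep_rate)
    # Cross update: A is updated on B's picks and B on A's picks.
    return {"update_a_on": picked_by_b, "update_b_on": picked_by_a}

# Hypothetical per-instance losses of the two networks on one mini-batch of 6.
losses_a = np.array([0.1, 2.3, 0.4, 1.9, 0.2, 0.3])
losses_b = np.array([0.2, 0.1, 0.5, 2.5, 0.3, 2.0])
plan = co_teaching_selection(losses_a, losses_b, keep_rate=0.5)
```

In a real training loop each network would then take a gradient step on the loss of the instances chosen by its peer.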

=Summary of Experiment=
==Proposed Method==
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network.
[[File:Co-Teaching Fig 1.png|600px|center]]
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate).
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself.
==Dataset Corruption==
The datasets used in this paper are MNIST, CIFAR-10 and CIFAR-100. A summary of these datasets is shown below.

[[File:co_teaching_data.png|600px|center]]

To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transition matrix <math>Q</math>, where <math>Q_{ij} = Pr(\widetilde{y} = j|y = i)</math> given that noisy <math>\widetilde{y}</math> is flipped from clean <math>y</math>. Two methods are used for generating such noise transition matrices: pair flipping and symmetry.
[[File:Co-Teaching Fig 2.png|600px|center]]
Three noise conditions are simulated for comparing co-teaching with baseline methods.

Note: Corruption of Dataset here means randomly choosing a wrong label instead of the target label by applying noise.
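
The corruption procedure can be sketched as follows. The two matrix constructions use the common definitions of pair flipping (each class is flipped to the next class with probability <math>\epsilon</math>) and symmetric noise (the probability <math>\epsilon</math> is spread evenly over the other classes); these forms are an assumption here, since this summary shows the matrices only as a figure, and all function names are illustrative.

```python
import numpy as np

def pair_flip_matrix(num_classes, noise_rate):
    """Q[i, i] = 1 - noise_rate and Q[i, (i + 1) % n] = noise_rate."""
    Q = (1.0 - noise_rate) * np.eye(num_classes)
    for i in range(num_classes):
        Q[i, (i + 1) % num_classes] = noise_rate
    return Q

def symmetric_matrix(num_classes, noise_rate):
    """Q[i, i] = 1 - noise_rate; the rest of each row is spread evenly."""
    off_diag = noise_rate / (num_classes - 1)
    Q = np.full((num_classes, num_classes), off_diag)
    np.fill_diagonal(Q, 1.0 - noise_rate)
    return Q

def corrupt_labels(labels, Q, seed=0):
    """Sample a noisy label for each clean label y from row Q[y]."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(Q), p=Q[y]) for y in labels])

Q_pair = pair_flip_matrix(num_classes=10, noise_rate=0.45)
Q_sym = symmetric_matrix(num_classes=10, noise_rate=0.5)
clean = np.repeat(np.arange(10), 200)      # 2000 synthetic clean labels
noisy = corrupt_labels(clean, Q_pair)
observed_rate = np.mean(noisy != clean)    # close to 0.45 on average
```

Each row of <math>Q</math> is a probability distribution over noisy labels, so rows must sum to 1.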

{| class="wikitable" border="1" cellpadding="3"
|-
!width="60pt"|Method
!width="100pt"|Noise Rate
!width="700pt"|Rationale
|-
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels.
|-
| Symmetry || 50% || Half of the instances have noisy labels. Labels have a constant probability of being corrupted. Further rationale can be found at [1].
|-
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario.
|}

==Baseline Comparisons==
The co-teaching method is compared with several baseline approaches, which have varying:
* proficiency in dealing with a large number of classes,
* ability to resist heavy noise,
* need to combine with specific network architectures, and
* need to be pretrained.

[[File:Co-Teaching Fig 3.png|600px|center]]
===Bootstrap===
The general idea behind bootstrapping is to dynamically change (correct) noisy labels during training. The training target is derived from both the original label and the class predicted by the model: the final label is some convex combination of the two. The weighting of the prediction is increased over time to account for the model itself improving. This procedure needs to be finely tuned to prevent it from rampantly changing correct labels before the model becomes accurate [2].
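
The convex combination can be written out in a few lines. This is a minimal sketch under the description above; the function name, the `prediction_weight` parameter and the numbers are hypothetical, and the exact weighting scheme of [2] may differ.

```python
import numpy as np

def bootstrap_target(noisy_one_hot, predicted_probs, prediction_weight):
    """Convex combination of the given label and the model's own prediction."""
    if not 0.0 <= prediction_weight <= 1.0:
        raise ValueError("prediction_weight must lie in [0, 1]")
    return (1.0 - prediction_weight) * noisy_one_hot + prediction_weight * predicted_probs

noisy_label = np.array([0.0, 1.0, 0.0])    # given (possibly wrong) label: class 1
prediction = np.array([0.7, 0.2, 0.1])     # model currently favors class 0
early = bootstrap_target(noisy_label, prediction, prediction_weight=0.1)
late = bootstrap_target(noisy_label, prediction, prediction_weight=0.8)
```

Early in training the target stays close to the given label, while later the model's prediction dominates and can override a label it disagrees with.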

===S-Model===
Uses an additional softmax layer to model the noise transition matrix [3].
===F-Correction===
Corrects the prediction by using a noise transition matrix which is estimated by a standard network [4].
===Decoupling===
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].
===MentorNet===
A mentor network weights the probability of data instances being clean/noisy in order to train the student network on cleaner instances [6].

As shown in the table above, the Co-teaching method has several advantages: it does not rely on any specific network architecture, it can deal with a large number of classes, and it is more robust to noise. In addition, it can be trained from scratch. This makes Co-teaching more appealing for practical usage.

==Implementation Details==
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. The networks also utilize dropout and batch normalization.

[[File: Co-Teaching Table 3.png|center]]
=Results and Discussion=
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows.
==MNIST==
The results of testing on the MNIST dataset are shown below. The Symmetry-20% case can be taken as a near-baseline; all methods perform well. However, under the Symmetry-50% case, all methods except MentorNet and Co-Teaching drop below 90% accuracy. Under the Pair-45% case, all methods except MentorNet and Co-Teaching drop below 60%. Under both high-noise conditions, the Co-Teaching method produces the highest accuracy. Similar patterns can be seen in the two additional sets of test results, though the specific accuracy values are different. Co-Teaching performs best under the high-noise situations.

The images labelled 'Figure 3' show test accuracy with respect to epoch of the various algorithms. Many algorithms show evidence of over-fitting or being influenced by noisy data, after reaching initial high accuracy. MentorNet and Co-Teaching experience this less than other methods, and Co-Teaching generally achieves higher accuracy than MentorNet.

Robustness to noise plays an important role in the evaluation, and the plots show that the proposed method is better than or comparable to the other methods in this respect.

[[File:Co-Teaching Table 4.png|550px|center]]

[[File:Co-Teaching Graphs MNIST.PNG|center]]

==CIFAR10==
The observations here are consistent with those for the MNIST dataset.
[[File:Co-Teaching Table 5.png|550px|center]]

[[File:Co-Teaching Graphs CIFAR10.PNG|center]]
==CIFAR100==
[[File:Co-Teaching Table 6.png|550px|center]]

[[File: Co-Teaching Graphs CIFAR100.PNG|center]]
==Choice of R(T) and <math> \tau</math>==
Some principles were followed when choosing R(T) and <math> \tau</math>. At the beginning, R(T)=1, so no instances need to be dropped: the parameters can safely be updated in the early stage using the whole noisy dataset, since deep neural networks do not memorize the noisy data at first. More instances need to be dropped at later stages, however, because the model would eventually try to fit the noisy data.

<math>R(T) = 1 - \tau \cdot \min\{T^{c}/T_{k}, 1\}</math> with <math> \tau=\epsilon </math>, where <math> \epsilon </math> is the noise level.
In this case, the authors consider <math>c \in \{0.5, 1, 2\}</math>. As Table 7 shows, the test accuracy is stable across these choices.

[[File: Co-Teaching Table 7.png|550px|center]]
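
The schedule for <math>R(T)</math> is easy to evaluate directly. The sketch below follows the formula as written above; the function name and the value <math>T_k = 10</math> are hypothetical choices for illustration.

```python
def keep_rate(epoch, tau, t_k, c=1.0):
    """R(T) = 1 - tau * min(T**c / T_k, 1)."""
    return 1.0 - tau * min(epoch ** c / t_k, 1.0)

tau = 0.45   # tau set to the noise level, as in the pair-45% setting
t_k = 10     # hypothetical value of T_k
schedule = [keep_rate(T, tau, t_k) for T in (0, 5, 10, 50)]
```

With <math>c = 1</math>, the fraction of kept instances falls linearly from 1 to <math>1 - \tau</math> over the first <math>T_k</math> epochs and stays there afterwards.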

For <math> \tau</math>, the authors consider <math> \tau \in \{0.5, 0.75, 1, 1.25, 1.5\}\epsilon</math>. As Table 8 shows, performance can be improved by dropping more instances.
[[File: Co-Teaching Table 8.png|550px|center]]

=Conclusions=
The main goal of the paper is to introduce the “Co-teaching” learning paradigm, which uses two deep neural networks learning simultaneously to avoid noisy labels. Experiments are performed on several datasets: MNIST, CIFAR-10, and CIFAR-100. The performance varied depending on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios (pair-45% and symmetry-50%), the co-teaching method outperforms baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).

=Future Work=
For future work, the paper can be extended in the following ways. First, the Co-teaching approach can be adapted to train deep models under weak supervision, e.g. positive and unlabeled data. Second, theoretical guarantees for Co-teaching can be investigated. Further, there is no analysis of the generalization performance of deep learning with noisy labels, which could also be studied in the future.

=Critique=
The paper evaluates performance in light of the computational and implementation complexity of the algorithms. The co-teaching methodology is an interesting idea but could become tricky to implement. Technically, such complexity can damage the performance of the algorithm.
==Lack of Task Diversity==
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality.
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm.
==Lack of Theoretical Development (Mentioned in conclusion)==
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly.

=References=
[1] B. Van Rooyen, A. Menon, and B. Williamson. Learning with symmetric label noise: The
importance of being unhinged. In NIPS, 2015.

[2] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural
networks on noisy labels with bootstrapping. In ICLR, 2015.

[3] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer.
In ICLR, 2017.

[4] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to
label noise: A loss correction approach. In CVPR, 2017.

[5] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In
NIPS, 2017.

[6] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum
for very deep neural networks on corrupted labels. In ICML, 2018.

Co-Teaching

2018-12-01T02:04:05Z

J385chen:

=Introduction=
==Title of Paper==
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels
==Contributions==
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed approach, named ‘co-teaching’, maintains two networks simultaneously; it focuses on training on selected clean instances and avoids estimating the noise transition matrix. In addition, the networks are trained using stochastic optimization with momentum, and because nonlinear deep networks memorize clean data first, they become robust. The experiments are conducted on noisy versions of the MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness of deep learning models trained by the Co-teaching approach is much superior to that of state-of-the-art baselines.

==Terminology==
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data.

Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels. This can result in false positives.

=Intuition=
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually, when learning on a batch of noisy data, only the error from the network itself is transferred back to facilitate learning. In Co-teaching, however, the two networks can filter different types of errors, and the filtered information flows back both to the network itself and to the other network. As a result, the two models learn together, each from itself and from its partner network.

=Motivation=
The paper draws motivation from two key facts:
* Many data collection processes yield noisy labels.
* Deep neural networks have a high capacity to overfit to noisy labels.
Because of these facts, it is challenging to train deep networks to be robust to noisy labels.
=Related Works=

1. Statistical learning methods: Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into 3 strands: surrogate loss, noise rate estimation, and probabilistic modeling. In the surrogate loss category, one work proposes an unbiased estimator to provide a noise-corrected loss. Another work presented a robust non-convex loss, which is a special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of the ROC curve. In the probabilistic modeling category, a two-coin model has been proposed to handle noisy labels from multiple annotators.

2. Deep learning methods: There are also deep learning approaches that can be used to approach data with noisy labels. One work proposed a unified framework to distill knowledge from clean labels and knowledge graphs. Another work trained a label cleaning network by a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously.
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper adds a crowd layer after the output layer for noisy labels from multiple annotators.

3. Learning to teach methods: This is another approach to the problem. These methods consist of a teacher network and a student network. The teacher network selects more informative instances for better training of the student network. Most works did not account for noisy labels, with the exception of MentorNet, which applied the idea to data with noisy labels.

=Co-Teaching Algorithm=

[[File:Co-Teaching_Algorithm.png|600px|center]]

The idea as shown in the algorithm above is to train two deep networks simultaneously. In each mini-batch using mini-batch gradient descent, each network selects its small-loss instances as useful knowledge and then teaches these useful instances to the peer network. <math>R(T)</math> governs the percentage of small-loss instances to be used in updating the parameters of each network.
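The exchange of small-loss instances described above can be illustrated with a minimal sketch. It assumes that the per-instance losses of both networks on the current mini-batch are already available as plain lists; the function names <code>select_small_loss</code> and <code>co_teaching_exchange</code> and the example loss values are hypothetical and do not come from the paper.

```python
def select_small_loss(losses, keep_rate):
    """Return the indices of the keep_rate fraction of instances with the smallest loss."""
    num_keep = int(keep_rate * len(losses))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:num_keep])


def co_teaching_exchange(losses_f, losses_g, keep_rate):
    """Decide which instances each network is updated on in one mini-batch."""
    # Each network picks the instances it considers clean (small loss) ...
    picked_by_f = select_small_loss(losses_f, keep_rate)
    picked_by_g = select_small_loss(losses_g, keep_rate)
    # ... and teaches them to its peer: f is updated on g's picks and vice versa.
    return {"update_f_on": picked_by_g, "update_g_on": picked_by_f}


# Hypothetical per-instance losses for a mini-batch of four instances.
losses_f = [0.1, 2.3, 0.4, 1.9]
losses_g = [0.2, 0.3, 2.5, 1.7]
plan = co_teaching_exchange(losses_f, losses_g, keep_rate=0.5)
```

Here <code>keep_rate</code> plays the role of <math>R(T)</math>: with a value of 0.5, each network hands the two instances it trusts most to its peer.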

=Summary of Experiment=
==Proposed Method==
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network.
[[File:Co-Teaching Fig 1.png|600px|center]]
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate).
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself.
==Dataset Corruption==
The datasets used in this paper are MNIST, CIFAR-10 and CIFAR-100. A summary of these datasets is shown below.

[[File:co_teaching_data.png|600px|center]]

To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transition matrix <math>Q</math>, where <math>Q_{ij} = Pr(\widetilde{y} = j|y = i)</math> given that the noisy label <math>\widetilde{y}</math> is flipped from the clean label <math>y</math>. Two methods are used for generating such noise transition matrices: pair flipping and symmetry.
[[File:Co-Teaching Fig 2.png|600px|center]]
Three noise conditions are simulated for comparing co-teaching with baseline methods.

Note: Corruption of Dataset here means randomly choosing a wrong label instead of the target label by applying noise.

{| class="wikitable" border="1" cellpadding="3"
|-
!width="60pt"|Method
!width="100pt"|Noise Rate
!width="700pt"|Rationale
|-
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels.
|-
| Symmetry || 50% || Half of the instances have noisy labels. Labels have a constant probability of being corrupted. Further rationale can be found at [1].
|-
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario.
|}
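
The two corruption schemes can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: pair flipping is assumed to move a label to the next class (cyclically) with probability equal to the noise rate, and symmetric noise is assumed to spread the noise rate evenly over all other classes. The function names are hypothetical.

```python
import random


def pair_flip_matrix(num_classes, noise_rate):
    """Q[i][j] = Pr(noisy = j | clean = i); class i flips only to class i + 1 (cyclic)."""
    Q = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        Q[i][i] = 1.0 - noise_rate
        Q[i][(i + 1) % num_classes] = noise_rate
    return Q


def symmetric_matrix(num_classes, noise_rate):
    """Q[i][j] = Pr(noisy = j | clean = i); noise is spread evenly over the other classes."""
    off_diagonal = noise_rate / (num_classes - 1)
    return [
        [1.0 - noise_rate if i == j else off_diagonal for j in range(num_classes)]
        for i in range(num_classes)
    ]


def corrupt(labels, Q, rng):
    """Sample a noisy label for each clean label from the matching row of Q."""
    classes = range(len(Q))
    return [rng.choices(classes, weights=Q[y])[0] for y in labels]


# The three noise conditions from the table, for a 10-class dataset.
Q_pair_45 = pair_flip_matrix(10, 0.45)
Q_sym_50 = symmetric_matrix(10, 0.50)
Q_sym_20 = symmetric_matrix(10, 0.20)
noisy_labels = corrupt([0, 1, 2, 3], Q_pair_45, random.Random(0))
```

Each row of <math>Q</math> sums to one, so it can be used directly as the sampling distribution for the noisy label of an instance with that clean label.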

==Baseline Comparisons==
The co-teaching method is compared with several baseline approaches, which have varying:
• proficiency in dealing with a large number of classes,
• ability to resist heavy noise,
• need to combine with specific network architectures, and
• need to be pretrained.

[[File:Co-Teaching Fig 3.png|600px|center]]
===Bootstrap===
The general idea behind bootstrapping is to dynamically change (correct) noisy labels during training. The training target is derived from the original label and the predicted class: the final label is a convex combination of the two. The weighting of the prediction is increased over time to account for the model itself improving. This procedure needs to be finely tuned to prevent it from rampantly changing correct labels before the model becomes accurate [2].
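
A minimal sketch of the convex combination is given below. The name <code>bootstrap_target</code> and the weight <code>beta</code> are hypothetical; the sketch assumes a one-hot noisy label and a vector of predicted class probabilities.

```python
def bootstrap_target(noisy_one_hot, predicted_probs, beta):
    """Blend the given (possibly noisy) label with the model's own prediction.

    beta is the weight on the given label; (1 - beta) is the weight on the
    prediction. Lowering beta over time trusts the model more as it improves.
    """
    return [beta * y + (1.0 - beta) * p for y, p in zip(noisy_one_hot, predicted_probs)]


# Given label says class 1, the model leans towards class 0.
target = bootstrap_target([0, 1, 0], [0.7, 0.2, 0.1], beta=0.8)
```

Because both inputs sum to one, the blended target is again a valid probability vector.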

===S-Model===
Using an additional softmax layer to model the noise transition matrix [3].
===F-Correction===
Correcting the prediction by using a noise transition matrix which is estimated by a standard network [4].
===Decoupling===
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].
===MentorNet===
A mentor network weights the probability of data instances being clean/noisy in order to train the student network on cleaner instances [6].

As shown in the table above, the Co-teaching method has several advantages: it does not rely on any specific network architecture, it can deal with a large number of classes, and it is more robust to noise. Besides, it can be trained from scratch. This makes Co-teaching more appealing for practical usage.

==Implementation Details==
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. The networks also utilize dropout and batch normalization.

[[File: Co-Teaching Table 3.png|center]]
=Results and Discussion=
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows.
==MNIST==
The results of testing on the MNIST dataset are shown below. The Symmetry-20% case can be taken as a near-baseline; all methods perform well. However, under the Symmetry-50% case, all methods except MentorNet and Co-Teaching drop below 90% accuracy. Under the Pair-45% case, all methods except MentorNet and Co-Teaching drop below 60%. Under both high-noise conditions, the Co-Teaching method produces the highest accuracy. Similar patterns can be seen in the two additional sets of test results, though the specific accuracy values are different. Co-Teaching performs best under the high-noise situations.

The images labelled 'Figure 3' show test accuracy with respect to epoch of the various algorithms. Many algorithms show evidence of over-fitting or being influenced by noisy data, after reaching initial high accuracy. MentorNet and Co-Teaching experience this less than other methods, and Co-Teaching generally achieves higher accuracy than MentorNet.

Robustness to noise plays an important role in the evaluation. The plots show that the robustness of the proposed method is better than or comparable to that of the other methods.

[[File:Co-Teaching Table 4.png|550px|center]]

[[File:Co-Teaching Graphs MNIST.PNG|center]]

==CIFAR10==
The observations here are consistent with those for the MNIST dataset.
[[File:Co-Teaching Table 5.png|550px|center]]

[[File:Co-Teaching Graphs CIFAR10.PNG|center]]
==CIFAR100==
[[File:Co-Teaching Table 6.png|550px|center]]

[[File: Co-Teaching Graphs CIFAR100.PNG|center]]
==Choice of R(T) and <math> \tau</math>==
The authors followed some principles when choosing R(T) and <math> \tau</math>. At the beginning, R(T)=1, so no instance needs to be dropped. The parameters can safely be updated in the early stage using the whole noisy data, since deep neural networks do not memorize the noisy data early in training. However, more instances need to be dropped at the later stage, because the model would eventually try to fit the noisy data.

<math>R(T) = 1 - \tau \cdot \min\{T^{c}/T_{k}, 1\}</math> with <math>\tau=\epsilon</math>, where <math>\epsilon</math> is the noise level.
In this case, we consider <math>c \in \{0.5, 1, 2\}</math>. From Table 7, the test accuracy is stable.
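
The schedule can be written as a one-line function. In this sketch <math>T</math> is read as the epoch index, and the names <code>keep_rate</code>, <code>epoch</code> and <code>t_k</code> are hypothetical; the example values <math>\tau = 0.45</math> and <math>T_k = 10</math> are assumptions chosen only for illustration.

```python
def keep_rate(epoch, tau, t_k, c=1.0):
    """R(T) = 1 - tau * min(T^c / T_k, 1): fraction of the mini-batch that is kept."""
    return 1.0 - tau * min(epoch ** c / t_k, 1.0)


# With tau = 0.45 and T_k = 10, the kept fraction decays from 1.0 to 0.55.
schedule = [keep_rate(epoch, tau=0.45, t_k=10) for epoch in range(0, 21, 5)]
```

The kept fraction starts at 1 (all instances are used), decreases while <math>T^{c} < T_k</math>, and then stays at <math>1 - \tau</math> for the rest of training.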

[[File: Co-Teaching Table 7.png|550px|center]]

For <math> \tau</math>, we consider <math> \tau \in \{0.5, 0.75, 1, 1.25, 1.5\}\epsilon</math>. From Table 8, the performance can be improved by dropping more instances.
[[File: Co-Teaching Table 8.png|550px|center]]

=Conclusions=
The main goal of the paper is to introduce the “Co-teaching” learning paradigm, in which two deep neural networks learn simultaneously to avoid fitting noisy labels. Experiments are performed on several datasets: MNIST, CIFAR-10, and CIFAR-100. The performance varied depending on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios (pair-45% and symmetry-50%), the co-teaching method outperforms baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).

=Future Work=
For future work, the paper can be extended in the following ways. First, the Co-teaching approach can be adapted to train deep models under other forms of weak supervision, e.g. positive and unlabeled data. Second, theoretical guarantees for Co-teaching can be investigated. Further, there is no analysis of the generalization performance of deep learning with noisy labels, which can also be studied in the future.

=Critique=
The paper evaluates performance with the complexity of the computations and of the implementation of the algorithms in mind. Co-teaching seems an interesting idea but can possibly become tricky to implement. Technically, such complexity can damage the performance of the algorithm.
==Lack of Task Diversity==
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality.
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm.
==Lack of Theoretical Development (Mentioned in conclusion)==
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly.

=References=
[1] B. Van Rooyen, A. Menon, and B. Williamson. Learning with symmetric label noise: The
importance of being unhinged. In NIPS, 2015.

[2] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural
networks on noisy labels with bootstrapping. In ICLR, 2015.

[3] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer.
In ICLR, 2017.

[4] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to
label noise: A loss correction approach. In CVPR, 2017.

[5] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In
NIPS, 2017.

[6] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum
for very deep neural networks on corrupted labels. In ICML, 2018.

Zero-Shot Visual Imitation

2018-12-01T02:01:06Z

J385chen: /* Goal Recognizer */

This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach the desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time, which the agent will form into observation-action pairs to then distill into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable, especially since new tasks need a set of new demonstrations for the robot to learn from. In this paper, an alternative paradigm is pursued wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss.
Videos, models, and more details are available at [https://pathak22.github.io/zeroshot-imitation/ the project page].

===Paper Overview===
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images, instead of observation-action pairs. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.

[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Multi-step GSP with previous action history; (c) Multi-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Multi-step GSP with forward consistency loss proposed in this work.]]

This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'', and is learned by re-labeling states that the agent visited as goals and the actions the agent took as prediction targets, in a self-supervised way. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.

A major challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'', which essentially says that reaching the goal is more important than how it is reached. First, a forward model that predicts the next observation from the given action and current observation is learned. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss does not inadvertently penalize actions that are ''consistent'' with the ground-truth action, even though the actions are not exactly the same (but lead to the same next state).

As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.

Of course, when introducing something like forward-consistent loss, issues related to the number of steps needed to reach a certain goal become of interest since different goals require different numbers of steps. To address this, the paper pairs the GSP with a goal recognizer (as an optimizer) that determines whether the goal has been satisfied with respect to some metrics. Figure 1 shows various GSPs along with diagram (d) showing the forward-consistent loss proposed in this paper.

The paper refers to this method as zero-shot, as the agent never has access to expert actions regardless of being in the training or task demonstration phase. This is different from one-shot imitation learning, where agents have full knowledge of actions and expert demos during the training phase. The agent learns to imitate instead of learning by imitation. The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation, and a series of navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.

===Related Work===
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.

Imitation Learning: The two main threads are behavioral cloning and inverse reinforcement learning. Recent work in imitation learning requires access to expert actions, whereas the method in this paper does not.

Visual Demonstration: Several papers focused on relaxing this supervision to visual observations alone, and end-to-end learning improved the results.

Forward/Inverse Dynamics and Consistency: Forward dynamics models have been learned for planning actions, but prior work does not optimize for consistency between the forward and inverse dynamics.

Goal Conditioning: In this paper, systems work from high-dimensional visual inputs instead of knowledge of the true states and do not use a task reward during training.

==Learning to Imitate Without Expert Supervision==

In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described.

Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment. This exploration data is used to learn the GSP policy.

<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div>

The learned GSP policy (<math display="inline">π</math>) takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs a sequence of actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> starting from the current observation <math display="inline">x_i</math>. The states (observations) <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and need not be consecutive. Given the start and stop states, the number of actions <math display="inline">K</math> is also known. <math display="inline">π</math> can be thought of as a deep network with parameters <math display="inline">θ_π</math>.

At test time, the expert demonstrates a task from which the agent captures a sequence of observations. This set of images is denoted by <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math>. The sequence needs to have at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses its learned policy to start from initial state <math display="inline">x_0</math> and generate actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to follow the observations in <math display="inline">D</math>.

The agent does not have access to the sequence of actions performed by the expert. Hence, it must use the observations to determine if it has reached the goal. A separate ''goal recognizer'' network is needed to ascertain if the current observation is close to the current goal or not. This is because multiple actions might be required to reach close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and if not, the agent executes
<math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. Then after reaching sufficiently close to <math display="inline">x_1^d</math>, the agent sets <math display="inline">x_2^d</math> as the goal and executes actions. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached.
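
The test-time procedure can be sketched as a loop over the demonstration images. The sketch below uses a toy one-dimensional world with integer states; <code>toy_policy</code>, <code>toy_step</code>, <code>toy_goal_recognizer</code> and the step limit are hypothetical stand-ins for the learned GSP, the real environment and the learned goal recognizer.

```python
def follow_demonstration(x0, demo, policy, step, is_goal_reached, max_steps_per_goal=20):
    """Visit each demonstrated observation in turn, acting until the recognizer fires."""
    x = x0
    visited = [x]
    for goal in demo:
        for _ in range(max_steps_per_goal):
            if is_goal_reached(x, goal):
                break  # move on to the next image in the demonstration
            action = policy(x, goal)  # the policy only sees observations, never expert actions
            x = step(x, action)
            visited.append(x)
    return visited


# Toy stand-ins: states are integers on a line, actions are -1 or +1.
def toy_policy(x, goal):
    return 1 if goal > x else -1


def toy_step(x, action):
    return x + action


def toy_goal_recognizer(x, goal):
    return x == goal


trajectory = follow_demonstration(0, [3, 1], toy_policy, toy_step, toy_goal_recognizer)
```

The inner loop is bounded so that an unreachable sub-goal cannot stall the agent indefinitely.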

===Learning the Goal-Conditioned Skill Policy (GSP)===

In this section, the one-step version of the GSP policy is described first. Next, it is extended to the multi-step version.

A one-step trajectory can be described as <math display="inline">(x_t, a_t, x_{t+1})</math>. Given <math display="inline">(x_t, x_{t+1})</math> the GSP policy estimates an action, <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math>. During training, cross-entropy loss is used to learn GSP parameters <math display="inline">θ_π</math>:

<div style="text-align: center;"><math>L(a_t, \hat{a}_t) = p(a_t|x_t, x_{t+1}) \log( \hat{a}_t)</math></div>

<math display="inline">a_t</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted actions respectively. The conditional distribution <math display="inline">p</math> is not readily available so it needs to be empirically approximated using the data. In a standard deep learning problem it is common to assume <math display="inline">p</math> as a delta function at <math display="inline">a_t</math>; given a specific input, the network outputs a single output. However, in this problem multiple actions can lead to the same output. Multiple outputs given a single input can be modeled using a variational auto-encoder. However, the authors use a different approach, explained in sections 2.2-2.4 of the paper and summarized in the following sections.

===Forward Consistency Loss===

To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (the observation from executing the action predicted by GSP <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients (for actions that result in the same next observation) and aid the learning process. This is what is denoted as ''forward consistency loss''.

To operationalize the forward consistency loss, we need a differentiable "forward dynamics" model that can reliably predict results of an action. The forward dynamics <math display="inline">f</math> are learned from the data by another model. Given an observation and the action performed, <math display="inline">f</math> predicts the next observation, <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math> so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters of <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground truth transitions while the second term ensures consistency with the GSP network. In summary, the loss function is given below:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f}{\min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div>
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div>
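
The three terms of this objective can be illustrated on a toy problem in which two different actions lead to the same next state. In the sketch below the world is a ring with four positions, <code>ring_model</code> is an exact forward model standing in for the learned <math>f</math>, and <code>zero_one_loss</code> is a simple stand-in for the cross-entropy action loss; all of these names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def forward_consistency_terms(x_t, a_t, x_next, a_hat, forward_model, action_loss, lam=1.0):
    """Return the three terms of the objective for a single transition."""
    x_tilde = forward_model(x_t, a_t)  # prediction for the ground-truth action
    x_hat = forward_model(x_t, a_hat)  # prediction for the action chosen by the GSP
    model_term = squared_distance(x_next, x_tilde)
    consistency_term = lam * squared_distance(x_next, x_hat)
    return model_term, consistency_term, action_loss(a_t, a_hat)


def ring_model(x, a):
    """Toy forward model: four positions on a ring, the action is a signed step."""
    return ((x[0] + a) % 4,)


def zero_one_loss(a, a_hat):
    """Stand-in for the cross-entropy action loss."""
    return 0.0 if a == a_hat else 1.0


# Ground truth: from position 0, action +2 leads to position 2.
same_state = forward_consistency_terms((0,), 2, (2,), -2, ring_model, zero_one_loss)
other_state = forward_consistency_terms((0,), 2, (2,), 1, ring_model, zero_one_loss)
```

In <code>same_state</code> the predicted action (-2) differs from the ground truth (+2) but reaches the same position, so the consistency term is zero even though the action loss is not. In <code>other_state</code> the predicted action reaches a different position and the consistency term is positive.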

Past works have shown that learning forward dynamics in the feature space as opposed to raw observation space is more robust. This paper incorporates this by having the forward dynamics operate on feature representations <math>\phi(x_t), \phi(x_{t+1})</math> rather than on the input space.

Learning the two models <math>θ_π,θ_f</math> simultaneously from scratch can cause noisier gradient updates. This is addressed by pre-training the forward model with the first term and GSP separately by blocking gradient flow. Fine-tuning is then done with <math>θ_π,θ_f</math> jointly.

The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{\min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div>

<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div>

The forward consistency loss is computed at each time step, t, and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network with inputs current state, goal states, actions at previous time step and the internal hidden representation denoted <math> h_{t-1}</math>, and outputs the actions to take.

===Goal Recognizer===

The goal recognizer network is introduced to determine whether the current goal has been reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition is treated as a binary classification problem: given an observation <math>x_i</math> and a goal <math>x_g</math>, the network infers whether <math>x_i</math> is close to <math>x_g</math>. Because there is no expert supervision of the goals, goal observations are drawn at random from the agent's own experience; these observations are used because they are known to be feasible. Additionally, a maximum number of iterations is used to prevent the sequence of actions from getting too long.

The goal recognizer was trained on data from the agent's random exploration. Pseudo-goal states were sampled from the visited states, and all observations within a few timesteps of these were considered as positive results (close to the goal). The goal classifier was trained using the standard cross-entropy loss.
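
The labelling of training pairs can be sketched as follows. Observations are represented here only by their time index in an exploration trajectory; the function name <code>label_goal_pairs</code> and the <code>window</code> parameter (the number of timesteps that still counts as close) are hypothetical.

```python
def label_goal_pairs(num_states, goal_index, window):
    """Build (observation index, goal index, label) triples for one pseudo-goal.

    An observation is a positive example (label 1) when it lies within
    `window` timesteps of the pseudo-goal, and a negative example otherwise.
    """
    return [
        (i, goal_index, int(abs(i - goal_index) <= window))
        for i in range(num_states)
    ]


# A trajectory of six observations with the pseudo-goal at timestep 3.
pairs = label_goal_pairs(num_states=6, goal_index=3, window=1)
labels = [label for _, _, label in pairs]
```

A binary classifier trained on such triples with cross-entropy loss then plays the role of the goal recognizer.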

The authors found that training a separate goal recognition network outperformed simply adding a 'stop' action to the action space of the policy network.

===Ablations and Baselines===

To summarize, the GSP formulation is composed of (a) recurrent variable-length skill policy network, (b) explicitly encoding the previous action in the recurrence, (c) goal recognizer, (d) forward consistency loss function, and (e) learning forward dynamics in the feature space instead of raw observation space.

To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:

# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM.
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup.
# '''GSP-NoPrevAction-NoFwdConst''' refers to the paper's recurrent GSP without previous action history and without the forward consistency loss.
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without the forward consistency objective.
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
# '''GSP''' refers to the complete method with all the components.

==Experiments==

The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.

===Rope Manipulation===

Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.

In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration, the robot picked up the rope at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper.

For this experiment, the Baxter robot is set up exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. In testing, the robot was only provided with images of intermediate states of the rope, and not the actions taken by the human trainer. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.

The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, the direction of displacement and the magnitude of displacement. All models were optimized using Adam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen; afterwards they were fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.

The approach of this paper is compared to (Nair et al., 2017) where they did similar experiments using an inverse model. The results in Figure 3 show that for the 'S' shape construction, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.

[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The
robotics system setup. (b) The sequence of human demonstration images provided by the human
during inference for the task of knot-tying (top row), and the sequences of observation states reached
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’
shape. Our agent is able to successfully imitate the demonstration.]]

[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]

===Navigation in Indoor Office Environments===
In this experiment, the robot was shown a single image or multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided in the appendix of the paper). The robot was then placed on an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.

The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with a learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.

Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fares much better than all previous baselines.

[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image
(top-left). Since the initial and goal image has no overlap, the robot first explores the environment
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and
such exploratory behavior naturally emerged from the self-supervised learning.]]

[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image
of goal in an unseen environment. Each column represents a different run of our system for a
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given
a successful run but reaches the goal successfully at a much higher rate.]]

Figure 5 and Table 2 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. This was required when the end goal was far away from the robot, such as in another room. It is worth noting that zero-shot visual imitation is robust to a changing environment where every frame need not match the demonstrated frame. This is achieved by providing sparse landmarks.

[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of
images (top row). The TurtleBot is positioned in a manner such that the first image in the demonstration
has no overlap with its current observation. Even under this condition, the robot is able to move closer
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from
WayPoint-1.]]

[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three
runs of two different demonstrations. Results show that our method outperforms the baselines. Note
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions
and neither model succeeded. Detailed results are available in the supplementary materials.]]

===3D Navigation in VizDoom===

To round off the experiments, a VizDoom simulation environment was used to test the GSP. VizDoom is a popular Doom-based reinforcement learning testbed that allows agents to play Doom using only the screen buffer. It is a 3D simulation environment that is traditionally considered harder than 2D domains such as Atari. The goal was to measure the robustness of each method with proper error bars, the role of initial self-supervised data collection, and the quantitative difference between modeling the forward consistency loss in feature space and in raw visual space.

Data were collected using two methods: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.

Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.

[[File:8-Table3.png | 650px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]

==Discussion==

This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.

A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending it to third-person demonstrations may yield a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen.

The expert demonstrations are also purely imitated; that is, the agent does not learn the demonstrations. Future work could look into learning the demonstration so as to enrich its exploration techniques.

This work used a sequence of images to provide a demonstration but the work, in general, does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work. Future work would also explore how multiple tasks can be combined into a single model, where different tasks might come from different contexts. Finally, it would be exciting to explore explicit handling of domain shift in future work, so as to handle large differences in embodiment and learn skills directly from videos of human demonstrators obtained, for example, from the Internet.

==Critique==
1. The paper is well written and easy to understand. In addition, the experimental evaluations are promising. The proposed method is also novel and interesting, so it could be used as an alternative to pure RL.

2. The authors do not clearly explain why zero-shot imitation should be used instead of a trained reinforcement learning model. More details on this issue are needed.

3. It is notable that the experimental evaluations are performed on real robots. However, the scalability of the method is not demonstrated: it is unclear how to extend it to higher-dimensional action spaces and whether it becomes expensive in such spaces.

==References==

[1] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.

[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning
from demonstration. Robotics and autonomous systems, 2009.

[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood
Cliffs, NJ, 1977.

[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke
by poking: Experiential learning of intuitive physics. NIPS, 2016.

[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination
for robotic grasping with large-scale data collection. In ISER, 2016.

[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and
700 robot hours. ICRA, 2016.

[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.
ICRA, 2017.

[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration
by self-supervised prediction. In ICML, 2017.

File:co teaching data.png

2018-12-01T02:00:36Z

J385chen:

Co-Teaching

2018-12-01T01:59:48Z

J385chen:

=Introduction=
==Title of Paper==
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels
==Contributions==
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed approach, named ‘co-teaching’, maintains two networks simultaneously; each network is trained on instances selected as likely clean, which avoids estimating the noise transition matrix. In addition, the deep networks are trained using stochastic optimization with momentum, and because nonlinear deep networks can memorize clean data first, they become robust. The experiments are conducted on noisy versions of the MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), deep learning models trained by the Co-teaching approach are much more robust than state-of-the-art baselines.

==Terminology==
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data.

Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels. This can result in false positives.

=Intuition=
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually, when learning on a batch of noisy data, only the error from the network itself is transferred back to facilitate learning. In Co-teaching, however, the two networks are able to filter different types of errors, and the filtered error signal flows back both to the network itself and to the other network. As a result, the two models learn together, each from itself and from its partner network.

=Motivation=
The paper draws motivation from two key facts:
* Many data collection processes yield noisy labels.
* Deep neural networks have a high capacity to overfit to noisy labels.
Because of these facts, it is challenging to train deep networks that are robust to noisy labels.
=Related Works=

1. Statistical learning methods: Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into three strands: surrogate loss, noise rate estimation, and probabilistic modeling. In the surrogate loss category, one work proposes an unbiased estimator to provide a noise-corrected loss. Another work presents a robust non-convex loss, which is a special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presents the same estimator using the slope of the ROC curve. In the probabilistic modeling category, a two-coin model is proposed to handle noisy labels from multiple annotators.

2. Deep learning methods: There are also deep learning approaches that can be used on data with noisy labels. One work proposed a unified framework to distill knowledge from clean labels and knowledge graphs. Another work trained a label cleaning network on a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously. Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper adds a crowd layer after the output layer for noisy labels from multiple annotators.

3. Learning to teach methods: Learning to teach is another approach to this problem. These methods consist of a teacher network and a student network. The teacher network selects more informative instances for better training of the student network. Most works did not account for noisy labels, with the exception of MentorNet, which applied the idea to data with noisy labels.

=Co-Teaching Algorithm=

[[File:Co-Teaching_Algorithm.png|600px|center]]

The idea as shown in the algorithm above is to train two deep networks simultaneously. In each mini-batch using mini-batch gradient descent, each network selects its small-loss instances as useful knowledge and then teaches these useful instances to the peer network. <math>R(T)</math> governs the percentage of small-loss instances to be used in updating the parameters of each network.
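The cross-selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names <code>small_loss_indices</code> and <code>co_teaching_select</code> are hypothetical, the per-sample losses are made-up toy values, and <code>keep_rate</code> plays the role of <math>R(T)</math>.

```python
import numpy as np

def small_loss_indices(losses, keep_rate):
    """Return indices of the keep_rate fraction of samples with the smallest loss."""
    losses = np.asarray(losses, dtype=float)
    num_keep = max(1, int(round(keep_rate * losses.size)))
    return np.argsort(losses, kind="stable")[:num_keep]

def co_teaching_select(losses_f, losses_g, keep_rate):
    """Cross-selection: each network picks small-loss samples for its peer.

    Returns (idx_for_f, idx_for_g): network f is updated on the samples
    selected by g, and network g on the samples selected by f.
    """
    picked_by_f = small_loss_indices(losses_f, keep_rate)
    picked_by_g = small_loss_indices(losses_g, keep_rate)
    return picked_by_g, picked_by_f

# Toy mini-batch of 5 samples with per-sample losses from each network
# (values are illustrative assumptions, not results from the paper).
losses_f = [0.1, 2.3, 0.4, 1.9, 0.2]
losses_g = [0.3, 0.2, 2.5, 0.1, 1.7]
idx_for_f, idx_for_g = co_teaching_select(losses_f, losses_g, keep_rate=0.6)
```

In a full training loop, each network would then take a gradient step using only the samples its peer selected.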

=Summary of Experiment=
==Proposed Method==
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network.
[[File:Co-Teaching Fig 1.png|600px|center]]
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate).
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself.
==Dataset Corruption==
The datasets used in this paper are MNIST, CIFAR-10 and CIFAR-100. A summary of these datasets is shown below.

[[File:co_teaching_data.png|600px|center]]

To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transformation matrix <math>Q</math>, where <math>Q_{ij} = Pr(\widetilde{y} = j|y = i)</math> given that noisy <math>\widetilde{y}</math> is flipped from clean <math>y</math>. Two methods are used for generating such noise transformation matrices: pair flipping and symmetry.
[[File:Co-Teaching Fig 2.png|600px|center]]
Three noise conditions are simulated for comparing co-teaching with baseline methods.

Note: Corruption of Dataset here means randomly choosing a wrong label instead of the target label by applying noise.
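As a rough illustration of the two corruption schemes, the sketch below builds a matrix <math>Q</math> for each and samples noisy labels from it. It rests on two assumptions not stated in this summary: pair flipping moves a label to the next class index (cyclically), and symmetric noise spreads the noise rate evenly over the other classes. All function names are hypothetical.

```python
import numpy as np

def pair_flip_matrix(num_classes, noise_rate):
    """Q for pair flipping: class i keeps its label with probability
    1 - noise_rate and flips to class (i + 1) mod num_classes otherwise
    (the choice of the 'next' class is an assumption of this sketch)."""
    Q = (1.0 - noise_rate) * np.eye(num_classes)
    for i in range(num_classes):
        Q[i, (i + 1) % num_classes] = noise_rate
    return Q

def symmetric_matrix(num_classes, noise_rate):
    """Q for symmetric noise: the noise rate is split evenly over all other classes."""
    Q = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(Q, 1.0 - noise_rate)
    return Q

def corrupt_labels(labels, Q, rng):
    """Sample a noisy label for each clean label y from the row Q[y]."""
    return np.array([rng.choice(Q.shape[0], p=Q[y]) for y in labels])

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)          # synthetic clean labels
noisy = corrupt_labels(clean, pair_flip_matrix(10, 0.45), rng)
observed_noise_rate = float(np.mean(noisy != clean))  # close to 0.45 for large samples
```

Each row of <math>Q</math> sums to one, so it can be used directly as a sampling distribution over noisy labels.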

{| class="wikitable" border="1" cellpadding="3"
|-
! width="60pt" | Method
! width="100pt" | Noise Rate
! width="700pt" | Rationale
|-
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels.
|-
| Symmetry || 50% || Half of the instances have noisy labels. Labels have a constant probability of being corrupted. Further rationale can be found at [1].
|-
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario.
|}

==Baseline Comparisons==
The co-teaching method is compared with several baseline approaches, which have varying:
* proficiency in dealing with a large number of classes,
* ability to resist heavy noise,
* need to combine with specific network architectures, and
* need to be pretrained.

[[File:Co-Teaching Fig 3.png|600px|center]]
===Bootstrap===
The general idea behind bootstrapping is to dynamically change (correct) noisy labels during training. The training target is a convex combination of the original label and the class predicted by the model. The weighting of the prediction is increased over time to account for the model itself improving. This procedure needs to be finely tuned to prevent it from rampantly changing correct labels before the model becomes accurate [2].
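The convex combination can be illustrated with a small sketch. It assumes a "soft" variant in which the target mixes the one-hot label with the predicted class probabilities using a weight <code>beta</code>; the function name and the numbers are hypothetical and not taken from [2].

```python
import numpy as np

def bootstrap_target(one_hot_label, predicted_probs, beta):
    """Convex combination of the given (possibly noisy) label and the model prediction.

    beta = 1 trusts the given label entirely; lowering beta over time puts
    more weight on the model's own prediction.
    """
    one_hot_label = np.asarray(one_hot_label, dtype=float)
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    return beta * one_hot_label + (1.0 - beta) * predicted_probs

# Illustrative values: the given label says class 1, the model leans towards class 0.
target = bootstrap_target([0, 1, 0], [0.7, 0.2, 0.1], beta=0.8)
```

Because both inputs are probability vectors and the weights sum to one, the resulting target is again a valid probability vector.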

===S-Model===
An additional softmax layer is used to model the noise transition matrix [3].
===F-Correction===
The prediction is corrected using a noise transition matrix that is estimated by a standard network [4].
===Decoupling===
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].
===MentorNet===
A mentor network weights the probability of data instances being clean/noisy in order to train the student network on cleaner instances [6].

As shown in the table above, the Co-teaching method has several advantages: it does not rely on any specific network architecture, it can deal with a large number of classes, and it is more robust to noise. Besides, it can be trained from scratch. This makes Co-teaching more appealing for practical usage.

==Implementation Details==
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. The networks also utilize dropout and batch normalization.

[[File: Co-Teaching Table 3.png|center]]
=Results and Discussion=
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows.
==MNIST==
The results of testing on the MNIST dataset are shown below. The Symmetry-20% case can be taken as a near-baseline; all methods perform well. However, under the Symmetry-50% case, all methods except MentorNet and Co-Teaching drop below 90% accuracy. Under the Pair-45% case, all methods except MentorNet and Co-Teaching drop below 60%. Under both high-noise conditions, the Co-Teaching method produces the highest accuracy. Similar patterns can be seen in the two additional sets of test results, though the specific accuracy values are different. Co-Teaching performs best under the high-noise situations.

The images labelled 'Figure 3' show test accuracy with respect to epoch of the various algorithms. Many algorithms show evidence of over-fitting or being influenced by noisy data, after reaching initial high accuracy. MentorNet and Co-Teaching experience this less than other methods, and Co-Teaching generally achieves higher accuracy than MentorNet.

Robustness to noise plays an important role in the evaluation. The plots show that the proposed method is more robust than, or comparable to, the other methods.

[[File:Co-Teaching Table 4.png|550px|center]]

[[File:Co-Teaching Graphs MNIST.PNG|center]]

==CIFAR10==
[[File:Co-Teaching Table 5.png|550px|center]]

[[File:Co-Teaching Graphs CIFAR10.PNG|center]]
==CIFAR100==
[[File:Co-Teaching Table 6.png|550px|center]]

[[File: Co-Teaching Graphs CIFAR100.PNG|center]]
==Choice of R(T) and <math> \tau</math>==
The authors followed some principles when choosing R(T) and <math> \tau</math>. At the beginning, R(T)=1, so no instances need to be dropped: parameters can safely be updated in the early stage using the whole noisy data, since the deep neural networks would not yet memorize the noisy data. However, more instances need to be dropped at the later stage, because the model would eventually try to fit the noisy data.

<math>R(T) = 1 - \tau \cdot \min\{T^{c}/T_{k}, 1\}</math> with <math> \tau=\epsilon </math>, where <math> \epsilon </math> is the noise level.
In this case, we consider <math>c \in \{0.5, 1, 2\}</math>. From Table 7, the test accuracy is stable.
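The schedule can be written as a one-line function. The sketch below is only illustrative: the function name <code>keep_rate</code> and the example values (<math>\tau = 0.45</math>, <math>T_k = 10</math>, <math>c = 1</math>) are assumptions chosen for the example.

```python
def keep_rate(epoch, tau, t_k, c=1.0):
    """R(T) = 1 - tau * min(T^c / T_k, 1): fraction of each mini-batch that is kept."""
    return 1.0 - tau * min(epoch ** c / t_k, 1.0)

# Example with an assumed noise level of 0.45 and T_k = 10.
schedule = [keep_rate(t, tau=0.45, t_k=10) for t in (0, 5, 10, 20)]
```

With these values the keep rate starts at 1 (nothing is dropped), decreases linearly over the first <math>T_k</math> epochs, and then stays at <math>1 - \tau</math>.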

[[File: Co-Teaching Table 7.png|550px|center]]

For <math> \tau</math>, we consider <math>\tau \in \{0.5, 0.75, 1, 1.25, 1.5\} \cdot \epsilon</math>. From Table 8, the performance can be improved by dropping more instances.
[[File: Co-Teaching Table 8.png|550px|center]]

=Conclusions=
The main goal of the paper is to introduce the “Co-teaching” learning paradigm, in which two deep neural networks learn simultaneously in order to avoid fitting noisy labels. Experiments are performed on several datasets: MNIST, CIFAR-10, and CIFAR-100. The performance varied depending on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios (pair-45% and symmetry-50%), the co-teaching method outperforms baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).

=Future Work=
For future work, the paper can be extended in the following ways. First, the Co-teaching paradigm can be adapted to train deep models under other weak supervision, e.g., positive and unlabeled data. Second, theoretical guarantees for Co-teaching can be investigated. Further, there is no analysis of generalization performance for deep learning with noisy labels, which can also be studied in the future.

=Critique=
The paper evaluates performance with regard to the computational and implementation complexity of the algorithms. The co-teaching methodology seems an interesting idea but can possibly become tricky to implement. Technically, such complexity can damage the performance of the algorithm.
==Lack of Task Diversity==
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality.
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm.
==Lack of Theoretical Development (Mentioned in conclusion)==
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly.

=References=
[1] B. Van Rooyen, A. Menon, and B. Williamson. Learning with symmetric label noise: The
importance of being unhinged. In NIPS, 2015.

[2] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural
networks on noisy labels with bootstrapping. In ICLR, 2015.

[3] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer.
In ICLR, 2017.

[4] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to
label noise: A loss correction approach. In CVPR, 2017.

[5] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In
NIPS, 2017.

[6] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum
for very deep neural networks on corrupted labels. In ICML, 2018.

Learning to Teach

2018-11-30T22:43:31Z

J385chen: /* Problem Definition */

=Introduction=

This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.

In modern human society, the role of teaching is central to our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data that corresponds to the appropriate teaching materials (e.g. textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.

Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the machine learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the reinforcement learning process to update its parameters. In this paper, a policy gradient algorithm is used. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra effort on re-training.

To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs), for different applications including image classification and text understanding.
Furthermore, the teacher model obtained from one task can be smoothly transferred to other tasks. As an example, with the teacher model trained on MNIST with the MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.

=Related Work=
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)

The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes an order of data from easy instances to hard ones, with hardness determined in different ways: curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data.

The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.

=Learning to Teach=
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.

In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.

==Problem Definition==
In supervised learning, the goal is to choose a function <math display="inline">f_\omega(x)</math> with <math display="inline">\omega</math> as the parameter vector to predict the supervisor's label as well as possible. The goodness of a function <math display="inline">f_\omega</math> is evaluated by the risk function:

\begin{align*}R(\omega) = \int M(y, f_\omega(x))dP(x,y)\end{align*}

where <math display="inline">M(\cdot,\cdot)</math> is the metric which evaluates the gap between the label and the prediction.

The student model, denoted <math>\mu()</math>, takes the set of training data <math> D </math>, the function class <math> \Omega </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{\omega^*} </math> with parameter <math>\omega^*</math> chosen to minimize the risk <math>R(\omega)</math>, as in:

\begin{align*}
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)
\end{align*}
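To make the role of the student model concrete, the sketch below approximates <math>\mu(D, L, \Omega)</math> by gradient descent for one hypothetical choice of inputs: <math>\Omega</math> is taken to be the linear functions <math>f_\omega(x) = \omega^\top x</math>, <math>L</math> is the squared loss, and <math>D</math> is synthetic noise-free data. None of these choices come from the paper; they are assumptions made only for illustration.

```python
import numpy as np

def squared_loss_grad(y, preds):
    """Gradient of L(y, f(x)) = (f(x) - y)^2 with respect to the prediction."""
    return 2.0 * (preds - y)

def student(D, loss_grad, dim, lr=0.1, steps=500):
    """Approximate mu(D, L, Omega) for Omega = {x -> omega . x} by gradient descent.

    The summed loss is divided by the number of samples, which rescales the
    objective but does not change its minimizer.
    """
    X, y = D
    omega = np.zeros(dim)
    for _ in range(steps):
        preds = X @ omega
        omega -= lr * X.T @ loss_grad(y, preds) / len(y)
    return omega

# Synthetic training set D generated from a known parameter vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_omega = np.array([2.0, -1.0])
y = X @ true_omega
omega_star = student((X, y), squared_loss_grad, dim=2)
```

In the L2T setting, a teacher would control what is passed in as <code>D</code>, <code>loss_grad</code>, or the function class, while the student routine itself stays unchanged.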

The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.
In contrast to traditional machine learning, which is only concerned with the student model, the learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk functional as efficiently as possible.

::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g., middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of <math> Ω </math> lead to different errors and optimization problems (Mohri et al., 2012).

==Framework==
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such a generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:

[[File: L2T_process.png | 500px|center]]

* <math> s_t ∈ S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.
* <math> a_t ∈ A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.
* The student model takes <math> a_t </math> as input and outputs the function <math> f_t </math> by using conventional ML techniques.

Mathematically, taking data teaching as an example, where <math>L</math> and <math>\Omega</math> are fixed, the objective of the teacher in the L2T framework is

<center> <math>\max\limits_{\theta}{\sum\limits_{t}{r_t}} = \max\limits_{\theta}{\sum\limits_{t}{r(f_t)}} = \max\limits_{\theta}{\sum\limits_{t}{r(\mu(\phi_{\theta}(s_t), L, \Omega))}}</math> </center>

Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.
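
The interaction described above can be sketched as a simple loop over time steps. This is a minimal sketch, not the authors' implementation: the scalar student, the loss-cap teaching rule, and the held-out reward are all toy assumptions chosen to keep the example self-contained.

```python
def teach(teacher_policy, student_update, make_state, reward_fn, batches, model):
    """One teaching episode: s_t -> a_t -> f_t -> r_t at every time step t."""
    history = []
    for batch in batches:
        state = make_state(model, history, batch)  # s_t
        action = teacher_policy(state)             # a_t = phi_theta(s_t)
        model = student_update(model, action)      # f_t = mu(a_t, L, Omega)
        reward = reward_fn(model)                  # r_t = r(f_t)
        history.append((state, action, reward))
    return model, history


# Toy instantiation (all assumptions): the student is f(x) = model * x.
def make_state(model, history, batch):
    return [(x, y, (y - model * x) ** 2) for x, y in batch]


def teacher_policy(state, cap=100.0):
    # Hypothetical fixed rule: drop instances whose current loss is extreme.
    return [(x, y) for x, y, loss in state if loss < cap]


def student_update(model, data, lr=0.05):
    for x, y in data:
        model += lr * 2.0 * x * (y - model * x)  # one SGD step on squared loss
    return model


def reward_fn(model):
    return -((4.0 - model * 2.0) ** 2)  # negative loss on a held-out point (2, 4)


batches = [[(1.0, 2.0), (1.0, 50.0)], [(2.0, 4.0)], [(1.0, 2.0)]]
final_model, history = teach(teacher_policy, student_update,
                             make_state, reward_fn, batches, 0.0)
```

In L2T the fixed rule in <code>teacher_policy</code> would be replaced by a learned, parameterized policy <math> φ_θ </math> whose parameters are adjusted to maximize the summed rewards.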

=Application=

There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure is classification accuracy. Its learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of each mini-batch will be fed to the student. To reach convergence faster, the reward is set to relate to the speed at which the student model learns.

The authors also designed a state feature vector <math> g(s) </math> to efficiently represent the current state, which includes the arrived training data and the student model. The state features fall into three categories: data features, student model features, and features combining the data and the learner model. The state features are computed when each mini-batch of data arrives.

Data features contain information for data instance, such as its label category, (for texts) the length of sentence, linguistic features for text segments (Tsvetkov et al., 2016), or (for images) gradients histogram features (Dalal & Triggs, 2005).

Student model features include signals reflecting how well the current neural network is trained. The authors collect several simple features, such as the number of mini-batches passed (i.e., iteration), the average historical training loss, and the historical validation accuracy.

Some additional features are collected to represent the combination of both data and learner model. By using these features, the authors aim to represent how important the arrived training data is for the current learner. The authors mainly use three kinds of such signals in their classification tasks: 1) the predicted probabilities of each class; 2) the loss value on that data, which appears frequently in self-paced learning (Kumar et al., 2010; Jiang et al., 2014a; Sachan & Xing, 2016); 3) the margin value.
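
A minimal sketch of how the three feature categories might be concatenated into <math> g(s) </math> for one training instance is given below. The exact feature set, the use of cross-entropy as the loss, and the margin definition (probability of the true class minus the largest other probability) are assumptions made for illustration.

```python
import math


def state_features(label, num_classes, probs, iteration,
                   avg_train_loss, val_accuracy):
    """Hypothetical g(s) for one instance, as a flat list of floats."""
    # Data feature: one-hot encoding of the label category.
    data_feats = [1.0 if k == label else 0.0 for k in range(num_classes)]
    # Student model features: training progress signals.
    model_feats = [float(iteration), avg_train_loss, val_accuracy]
    # Combined features: predicted probabilities, loss, and margin.
    loss = -math.log(probs[label])
    margin = probs[label] - max(p for k, p in enumerate(probs) if k != label)
    return data_feats + model_feats + list(probs) + [loss, margin]


g = state_features(label=1, num_classes=3, probs=[0.2, 0.7, 0.1],
                   iteration=10, avg_train_loss=0.5, val_accuracy=0.8)
```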

The objective for training the teacher model is to maximize the expected reward:

\begin{align}
J(θ) = E_{φ_θ(a|s)}[R(s,a)]
\end{align}

This objective is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4].
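
The likelihood ratio (REINFORCE) update can be illustrated with a small sketch. It assumes a Bernoulli keep/filter action per instance with a logistic policy over the state features and a single terminal reward per episode; this parameterization and the learning rate are illustrative assumptions, not details confirmed by this summary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def keep_probability(theta, features):
    """P(a = 1 | s): probability that the teacher keeps the instance."""
    return sigmoid(sum(t * f for t, f in zip(theta, features)))


def reinforce_update(theta, episode, reward, lr=0.1):
    """One likelihood-ratio step.

    episode: list of (features, action) pairs with action in {0, 1}.
    reward:  scalar reward observed at the end of the episode.
    """
    grad = [0.0] * len(theta)
    for features, action in episode:
        p = keep_probability(theta, features)
        # Gradient of log Bernoulli(action; p) w.r.t. theta is (action - p) * features.
        for i, f in enumerate(features):
            grad[i] += (action - p) * f
    # Ascend the reward-weighted log-likelihood gradient.
    return [t + lr * reward * g for t, g in zip(theta, grad)]


theta = reinforce_update([0.0, 0.0], [([1.0, 0.0], 1)], reward=1.0)
```

A positive reward makes the sampled actions more likely under the updated policy, and a negative reward makes them less likely.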

==Experiments==

The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN).

The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset.

The strategy will be benchmarked against the following teaching strategies:

::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness. Mathematically speaking, those training data <math>d</math> satisfying loss value <math>l(d) > \eta </math> will be filtered out, where the threshold <math> \eta </math> grows from smaller to larger during the training process. To improve the robustness of SPL, following the widely used trick in common SPL implementation (Jiang et al., 2014b), the authors filter training data using its loss rank in one mini-batch rather than the absolute loss value: they filter data instances with the top <math>K</math> largest training loss values within an <math>M</math>-sized mini-batch, where <math>K</math> linearly drops from <math>M − 1</math> to 0 during training.

::'''L2T''': The Learning to Teach framework.
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).
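
The rank-based SPL filter described above can be sketched as follows. The <code>progress</code> argument (fraction of training completed) and the rounding of <math>K</math> are assumptions made for illustration.

```python
def spl_filter(batch_losses, progress):
    """Return the indices kept by a rank-based SPL filter.

    batch_losses: per-instance losses of one mini-batch of size M.
    progress:     fraction of training completed, in [0, 1].
    The K largest-loss instances are dropped, with K falling linearly
    from M - 1 (start of training) to 0 (end of training).
    """
    m = len(batch_losses)
    k = int(round((m - 1) * (1.0 - progress)))
    by_loss = sorted(range(m), key=lambda i: batch_losses[i])  # easiest first
    return sorted(by_loss[: m - k])


losses = [0.3, 2.0, 0.1, 1.0]
kept_early = spl_filter(losses, progress=0.0)  # only the easiest instance
kept_late = spl_filter(losses, progress=1.0)   # the whole mini-batch
```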

For all teaching strategies, they make sure that the base neural network model will not be updated until <math>M </math> un-trained, yet selected data instances are accumulated. That is to guarantee that the convergence speed is only determined by the quality of taught data, not by different model updating frequencies. The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.
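
The update rule above (no model update until <math>M</math> selected instances have accumulated) can be sketched with a simple buffer. The names <code>select</code> and <code>update</code> are hypothetical placeholders for the teaching strategy and the student's SGD step.

```python
def train_with_buffer(stream, select, update, m):
    """Update the student only once m selected instances are available."""
    buffer, num_updates = [], 0
    for batch in stream:
        buffer.extend(select(batch))      # teacher keeps a subset of the batch
        while len(buffer) >= m:
            update(buffer[:m])            # one model update on exactly m instances
            buffer = buffer[m:]
            num_updates += 1
    return num_updates, buffer


seen = []
num_updates, leftover = train_with_buffer(
    stream=[[1, 2, 3], [4, 5, 6], [7, 8]],
    select=lambda batch: [v for v in batch if v % 2 == 0],  # toy filter
    update=seen.append,
    m=2,
)
```

Because every strategy updates on exactly <math>M</math> instances at a time, differences in convergence speed reflect which instances were selected rather than how often the model was updated.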
===Training a New Student===

In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher. This is done by having the teacher train a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:

[[File: L2T_speed.png | 1100px|center]]

===Filtration Number===

When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for the two image classification tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.

[[File: L2T_fig3.png | 1100px|center]]

===Teaching New Student with Different Model Architecture===

In this part, a teacher model is first trained by interacting with a student model. The teacher model is then used to teach another student model with a different architecture.
The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.

[[File: L2T_fig4.png | 1100px|center]]

===Training Time Analysis===

The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.

[[File: L2T_fig5.png | 600px|center]]

===Accuracy Improvement===

When comparing training accuracy on the IMDB sentiment classification task, L2T improves on teaching policy over NoTeach and SPL.

[[File: L2T_t1.png | 500px|center]]

Table 1 shows that we boost the convergence speed, while the teacher model improves final accuracy. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, we train the teacher model on half of the training data, and define the terminal reward as the set accuracy after the teacher model trains the student for 15 epochs. Then the teacher model is applied to train the student model on the full dataset till its convergence. The state features are kept the same as those in previous experiments. We can see that L2T achieves better classification accuracy for training LSTM network, surpassing the SPL baseline by more than 0.6 point (with p value < 0.001).

=Future Work=

There is some useful future work that can be extended from this work:

1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper.

2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework.

3) It would be interesting to try out the framework suggested in this paper (L2T) in Imperfect information and partially observable settings.

4) As the authors have focused on data teaching, exploring loss function teaching would be interesting.

=Critique=

While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Also, this paper does not provide enough mathematical foundation to prove that this model can be generalized to other datasets and other general problems. The method presented here, where the teacher model filters data, does not seem to provide enough action space for the teacher model. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR100 in their experiments, which would have provided more insight.

The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. Consider the CIFAR10 training scenario: the L2T model achieves 85% accuracy after 2 million training data, which is only about 3% more accuracy than a no-teacher model. Perhaps in the future, the L2T framework can improve and produce better performance.

End to end Active Object Tracking via Reinforcement Learning

2018-11-30T22:19:20Z

J385chen: /* A3C Algorithm */

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, meaning that there is no need for camera control during tracking. Although passive tracking is very useful and well-researched with existing works, it is not applicable in situations like tracking performed by a camera-mounted mobile robot or by a drone.
On the other hand, active tracking involves two subtasks: 1) Object Tracking and 2) Camera Control. It is difficult to jointly tune the pipeline between these two separate subtasks. Object Tracking may require human effort for bounding box labeling. In addition, Camera Control is non-trivial and can lead to much expensive trial and error in the real world.

To address these challenges, this paper presents an end-to-end active tracking solution via deep reinforcement learning. More specifically, the ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action. Then, the environment returns the updated state (the next visual frame). A3C, a modern Reinforcement Learning algorithm, is adopted to train the agent, where a customized reward function is designed to encourage the agent to closely follow the object.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to assess the generalizability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

As in the case of the state of the art models, if the action module and the object tracking module are completely different, it is extremely difficult to train one or the other, as it is impossible to know which is causing the error that is observed at the end of the episode. The function of both these modules is the same at a high level, as both are aiming for efficient navigation. So it makes sense to have a joint module that consists of both the observation and the action taking submodules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with the common practice in Deep Reinforcement Learning, where the CNNs used to extract features in the case of Atari games are combined with the Q networks (in the case of DQN). These CNNs are trained concurrently with the Q feedforward networks, where the error function is the difference between the observed Q values and the target Q values.

=Related Work=

In the domain of object tracking, there are both active and passive approaches. The following summarizes advances in passive object tracking approaches:

1) Subspace learning was adopted to update the appearance model of an object.

:Formerly, object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. 2008 introduce a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking [2].

2) Multiple instance learning was employed to track an object.

:Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. So, Babenko et al. 2009 introduce a novel algorithm that updates its appearance model using a “bag” of positive and negative examples. Subsequently, they show that tracking algorithms using weaker classifiers can still obtain superior performance [3].

3) Correlation filter based object tracking has achieved success in real-time object tracking.

:Correlation filter based object tracking algorithms attempt to “model the appearance of an object using filters”. At each frame, a small tracking window representing the target object is produced, and the tracker will correlate the windows over the image sequences, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called Minimum Output Sum of Squared Error (MOSSE) filter [4].

4) Structured output prediction was used to constrain object tracking and avoid converting positions to labels of training samples.

:Hare et al. 2016 argue that the “sliding-window” approach used by popular object tracking algorithms is flawed because “the objective of the classifier (predicting labels for sliding-windows) is decoupled from the objective of the tracker (estimating object position).” Instead, they introduce a novel algorithm that uses “a kernelized structured output support vector machine (SVM) to avoid the need for intermediate classification”. Subsequently, they show the approach outperforms traditional trackers in various benchmarks [5].

5) Tracking, learning, and Detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

:Long-Term Tracking is the task of recognizing and tracking an object as it “moves in and out of a camera’s field of view”. This task is made difficult by problems such as an object reappearing in the scene and changing its appearance, scale, or illumination. Kalal et al. 2012 proposed a unified tracking framework (TLD) that accomplishes long-term tracking by “decomposing the task into tracking, learning, and detection”. Specifically, “the tracker follows an object from frame-to-frame; the detector localizes the object’s appearances; and, the learner improves the detector by learning from errors.” Altogether, the TLD framework outperforms previous state-of-the-art tracking approaches [6].

6) Deep learning models like stacked autoencoder have been used to learn good representations for object tracking.

:In recent years, deep learning approaches have been gaining prominence in the field of object tracking. For example, Wang et al. 2013 obtain outstanding results using a deep-learning based algorithm that combines offline feature extraction and online tracking using stacked denoising autoencoders. Meanwhile, Wang et al. 2016 introduced a sequential training convolutional network that can efficiently transfer offline learned features for online visual tracking applications.

7) Pixel-level image classification.

:Object identification is essentially pixel-level classification, where each pixel in the image is given a label. It is a more general form of image classification. In recent years, CNNs have advanced many benchmarks in this field, and some AutoML methods, such as Neural Architecture Search, have been applied in this field and achieved state-of-the-art results.

For the active approaches, camera control and object tracking were considered as separate components. These approaches are difficult to tune. This paper tackles object tracking and camera control simultaneously in an end to end manner and is easy to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games like Go and Atari games. They have also been used in computer vision tasks like object localization, region proposal, and visual tracking. All of these advancements pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which has never been tried before.

=Approach=
Virtual tracking scenes are generated for both training and testing. An Asynchronous Advantage Actor-Critic (A3C) model was used to train the tracker. For efficient training, data augmentation techniques and a customized reward function were used. An RGB screen frame of the first-person perspective was chosen as the state for the study. The tracker observes a visual state and takes one action <math>a_t</math> from the following set of 6 actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the current reward as well as the updated screen frame <math>(r_t, s_{t+1}) </math>.
==Tracking Scenarios==
It is impossible to train the desired end-to-end active tracker in real-world scenarios. Therefore, the following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016; ViZ) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives from the environment a state and a reward at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), composed of an object (a monster) and background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by the ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to closely follow the object.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine and has a broad influence in the game industry. It provides realistic scenarios which can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study, which provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.

==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step <math>t</math>, <math>s_{t} </math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by <math>A</math>, of size <math>K = |A|</math>. An action <math>a_{t} \in A</math> is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the Actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R} </math>, according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step <math>t+1</math> is subject to a certain but unknown state transition function <math> s_{t+1} = f(s_{t}, a_{t}) </math>, governed by the environment.
A trace consisting of a sequence of triplets can be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R} </math> denotes the expected accumulated reward in the future given state <math>s_{t}</math> (referred to as the Critic). The policy function <math> \pi(\cdot)</math> and the value function <math>V (\cdot)</math> are then jointly modeled by a neural network and rewritten as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math>, with parameters <math>\theta</math> and <math>{\theta}'</math> respectively. The parameters are learned over trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression:

(1) <math>\theta\leftarrow\theta+\alpha(R_t-V(s_t))\nabla_\theta \log\pi(a_t|s_t)+\beta\nabla_\theta H(\pi(\cdot|s_t))</math>

(2) <math>\theta'\leftarrow\theta'-\alpha\nabla_{\theta'}\frac{1}{2}(R_t-V(s_t))^2</math>

where <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H (\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.
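
To make the two updates concrete, the sketch below computes the discounted return over a <math>T</math>-step segment and applies one actor step and one critic step. It is a toy under stated assumptions: the policy is a state-independent softmax over logits, the critic is a single scalar value, and the entropy term is dropped (<math>\beta = 0</math>). None of these simplifications come from the paper.

```python
import math


def discounted_returns(rewards, gamma):
    """Backward accumulation over one segment r_t, ..., r_{t+T-1}.

    The first entry equals R_t as defined above; later entries are the
    corresponding sums truncated at the end of the segment.
    """
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_step(logits, action, advantage, alpha):
    """Policy update with beta = 0: for softmax logits, the gradient of
    log pi(action) is one_hot(action) - pi."""
    pi = softmax(logits)
    return [l + alpha * advantage * ((1.0 if k == action else 0.0) - p)
            for k, (l, p) in enumerate(zip(logits, pi))]


def critic_step(value, ret, alpha):
    """Value update for a scalar V: descending (1/2)(R - V)^2 moves V toward R."""
    return value + alpha * (ret - value)


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
new_logits = actor_step([0.0, 0.0], action=0, advantage=1.0, alpha=0.1)
new_value = critic_step(0.0, returns[0], alpha=0.5)
```

A positive advantage <math>R_t - V(s_t)</math> raises the probability of the action that was taken, while the critic moves its estimate toward the observed return.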

This is a multi-threaded training process where each thread maintains an independent environment-agent interaction. Nevertheless, the network parameters are shared among threads and are updated asynchronously every <math>T</math> time steps using Eqs. (1) and (2) in a lock-free manner in each thread. This multi-thread training is known to be fast and stable, and it also improves generalization (Mnih et al., 2016).

[[File:A3C High Level Diagram.png|thumb|center|High-level diagram that depicts the A3C algorithm across <math>n</math> environments]]

==Network Architecture==
The tracker is a ConvNet-LSTM neural network as shown in Fig. 2, where the architecture specification is given in the following table. The FC6 and FC1 correspond to the 6-action policy <math>\pi (·|s_{t})</math> and the value <math>V (s_{t})</math>, respectively. The screen is resized to 84 × 84 × 3 RGB images as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function utilizes a two-dimensional local coordinate system (S). The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is where the agent is. System S is parallel to the floor. The object’s local coordinate (x, y) and orientation a are expressed with regard to system S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent at distance d and exhibits no rotation.
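
Since the formula itself appears only in the figure, the sketch below assumes the form <math>r = A - \left(\frac{\sqrt{x^2 + (y-d)^2}}{c} + \lambda |a|\right)</math>, which is consistent with the stated property that the maximum reward A is reached at (x, y) = (0, d) with no rotation. The default parameter values are placeholders, not values taken from the paper.

```python
import math


def tracking_reward(x, y, a, A=1.0, c=200.0, d=400.0, lam=0.001):
    """Assumed reward: A minus a distance penalty and a rotation penalty.

    (x, y): object position in the agent's local coordinate system S.
    a:      object orientation relative to S.
    A, c, d, lam: positive tuning parameters (placeholder defaults).
    """
    distance_penalty = math.sqrt(x ** 2 + (y - d) ** 2) / c
    rotation_penalty = lam * abs(a)
    return A - (distance_penalty + rotation_penalty)


best = tracking_reward(0.0, 400.0, 0.0)       # object directly ahead at distance d
off_axis = tracking_reward(100.0, 400.0, 0.0)  # object shifted sideways
rotated = tracking_reward(0.0, 400.0, 90.0)    # object rotated
```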
==Environment Augmentation==
To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments.

For ViZDoom, (x, y, a) define the system state. For augmentation, the initial system state is perturbed N times by editing the map with the ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. Further, flipping the screen frame left-right (and accordingly the left-right action) is allowed. As a result, 2N environments are obtained out of one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode. This improves the generalization ability of the tracker.
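
The left-right flip and the resulting 2N environments can be sketched as below. Frames are represented as nested lists of pixel rows, and the action names follow the 6-action set listed in the Approach section; the helper names are hypothetical.

```python
# Mirror image of each action under a left-right flip of the screen frame.
FLIPPED_ACTION = {
    "turn-left": "turn-right",
    "turn-right": "turn-left",
    "turn-left-and-move-forward": "turn-right-and-move-forward",
    "turn-right-and-move-forward": "turn-left-and-move-forward",
    "move-forward": "move-forward",
    "no-op": "no-op",
}


def flip_frame(frame):
    """Flip a frame (a list of pixel rows) left-right."""
    return [row[::-1] for row in frame]


def flip_action(action):
    return FLIPPED_ACTION[action]


def augment_environments(initial_states):
    """Pair each of the N perturbed initial states with flip off and flip on."""
    return [(state, flipped)
            for state in initial_states
            for flipped in (False, True)]


environments = augment_environments(list(range(21)))  # N = 21 placeholder states
```

With N = 21 perturbed initial states this yields 42 environments, matching the count reported in the experimental settings.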

For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, random background objects are chosen and made invisible. In addition, every episode starts from the position where the agent failed in the last episode. This makes the environment and starting point different from episode to episode, so the variations of the environment during training are augmented.

=Experimental Settings=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, a training map as in Fig. 4, left column, is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in Fig. 4, middle and right columns. In all maps, the path of the target is pre-specified, indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path. Instead, it sometimes randomly moves in a “zig-zag” way during the course, which is a built-in game engine behavior. This poses an additional difficulty to the tracking problem.
For UE, an environment named Square, with random invisible background objects and a target named Stefani walking along a fixed path, is generated for training. For testing, another four environments named Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2) are made. As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set as -450 and the maximum length as 3000, respectively.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is like Precision in the conventional tracking literature. An AR that is too small leads to termination of the episode because it essentially means a failure of tracking. EL roughly measures the duration of good tracking and is analogous to the metric Successfully Tracked Frames in conventional tracking applications. The theoretical maximum for both AR and EL is 3000 when letting A = 1.0 in the reward function (because of the termination criterion).
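
The termination rule and the two metrics can be sketched together. The function below takes a precomputed sequence of per-step rewards, which is a simplification (in practice rewards come from the environment step by step); the threshold and maximum length are the values stated in the environment setup.

```python
def run_episode(step_rewards, reward_threshold=-450.0, max_length=3000):
    """Return (AR, EL) for one episode.

    The episode stops when the accumulated reward drops below the
    threshold or when the episode length reaches the maximum.
    """
    accumulated_reward, episode_length = 0.0, 0
    for r in step_rewards:
        accumulated_reward += r
        episode_length += 1
        if accumulated_reward < reward_threshold or episode_length >= max_length:
            break
    return accumulated_reward, episode_length


perfect = run_episode([1.0] * 5000)   # reward A = 1.0 at every step
failed = run_episode([-100.0] * 10)   # tracker loses the target immediately
```

The <code>perfect</code> case shows why 3000 is the theoretical maximum for both AR and EL when A = 1.0.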

=Results=
Two training protocols were followed, namely RandomizedEnv (with augmentation) and SingleEnv (without the augmentation technique). However, only the results for RandomizedEnv are reported in the paper.
There is only one table specifying the result from SingleEnv training, which shows that it performs worse than RandomizedEnv training. Compared to RandomizedEnv, SingleEnv does not exploit the capacity of the network as well. The variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results for the testing environments are reported in Tab. 2. These are 8 more challenging test environments that present different target appearances, different backgrounds, more varied paths, and distracting targets compared to the training environment.
[[File:msm_table2.PNG|400px|center]]
The testing results lead to the following findings:
<ol>
<li> The tracker generalizes well when the target appearance changes (Zombie, Cacodemon). </li>
<li> The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor). </li>
<li> The tracker does not lose the target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly. </li>
<li> The tracker is insensitive to a distracting object (Noise1), even when the “bait” is very close to the path (Noise2). </li>
</ol>

The proposed tracker is compared against several conventional trackers, each paired with a PID-like module for camera control to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module works as follows. In the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The module compares the two successive bounding boxes and decides on an action.
The results show that the proposed solution outperforms the simulated active trackers, which soon lose their targets. The Meanshift tracker works well only when there is no camera shift between consecutive frames. The KCF and Correlation trackers do not seem capable of handling such large camera shifts, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, but it drifts easily when the object turns suddenly.

Results for the UE environments are tabulated in Table 5. Four different environments are tested, and the proposed tracker is compared with a simulated active tracker based on the long-term TLD tracker.
[[File:table5.PNG|400px|center]]
<ol>
<li> A comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained with the target Stefani, revealing that it does not overfit to a specialized appearance. </li>
<li> The active tracker performs well when the path is changed (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path. </li>
<li> When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track objects robustly (the EL value is comparable to that in the previous environments), which indicates good generalization potential. </li>
<li> In most cases, the proposed tracker outperforms the simulated active tracker, or achieves comparable results when it is not the best. The results of the simulated active tracker also suggest that it is difficult to tune a unified camera-control module for it, even when a long-term tracker is adopted (see the results of TLD). </li>
</ol>

Real-world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained in the UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part. The vertical axis indicates the size of the target, i.e., the area of the ground-truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows, 1) when the target resides on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center; 2) when the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

A video of the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

Action Saliency Map: An input frame is fed into the tracker and forwarded to output the policy function, and an action is then sampled. The gradient of this action is propagated backwards to the input layer to generate the saliency map. The saliency map shows how the input image affects the tracker's action. Fig. 8 shows that the tracker indeed learns how to find the target, which improves the performance of the model.
[[File:fig8.PNG|400px|center]]

=Conclusion=
In the paper, an end-to-end active tracker trained via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, saving the effort of human labeling and of trial and error in the real world. It shows good generalization to unseen environments, and the tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, which is object tracking. The ViZDoom virtual environment, though used for training and testing, seems to offer little or no generalization to real-world scenarios, and the ViZDoom maps themselves are very simple. The ViZDoom test results under changed environmental parameters look positive, but the relatively simple nature of the environment needs to be considered when interpreting them. Also, when the floor is replaced by the ceiling, the tracker performs worst among the cases in the table, which seems to indicate that the model is somewhat over-fitted to the floor and ceiling. The tracker trained in the UE environment is tested against simulated trackers, and the results show that the proposed solution performs better. However, since the trackers are simulated using a camera control algorithm written for this specific comparison, further testing is required for benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which are inadequate for deployment in the real world because the tracker cannot adapt to the different moving speeds of the target. It is also believed that training the tracker in the UE simulator alone is sufficient for a successful real-world deployment. It is better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination condition of the scene, the trajectory of the target, and the speed of the target.
The results on the real-world videos are a positive sign for the generalization ability of the model in real-world settings. The overall approach presented in the paper is intuitive, and the results look promising.

=Future Work=
The authors extended this paper in several ways in follow-up work [1]. They deployed the tracker on a real robot and enhanced the system to deal with the virtual-to-real gap. Specifically, 1) more advanced environment augmentation techniques are proposed to boost environment diversity, which improves transferability to the real world; 2) a more appropriate action space than that of the conference paper is developed, and a continuous action space for active tracking is investigated; 3) a mapping from the neural network prediction to the robot control signal is established so as to deliver end-to-end tracking.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

[2] Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

[3] Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

[4] Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.

[5] Hare, Sam, Golodetz, Stuart, Saffari, Amir, Vineet, Vibhav, Cheng, Ming-Ming, Hicks, Stephen L, and Torr, Philip HS. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.

[6] Kalal, Zdenek, Mikolajczyk, Krystian, and Matas, Jiri. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.

[7] Wang, Naiyan and Yeung, Dit-Yan. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pp. 809–817, 2013.

[8] Wang, Lijun, Ouyang, Wanli, Wang, Xiaogang, and Lu, Huchuan. STCT: Sequentially training convolutional networks for visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 1373–1381, 2016.

End to end Active Object Tracking via Reinforcement Learning

2018-11-30T22:15:12Z

J385chen: /* A3C Algorithm */

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, meaning that there is no need for camera control during tracking. Although passive tracking is very useful and well researched, it is not applicable in situations such as tracking performed by a camera-mounted mobile robot or by a drone.
On the other hand, active tracking involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune the pipeline between these two separate subtasks. Object tracking may require human effort for bounding-box labeling. In addition, camera control is non-trivial, which can lead to many expensive trial-and-error runs in the real world.

To address these challenges, this paper presents an end-to-end active tracking solution via deep reinforcement learning. More specifically, the ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action. Then, the environment returns the updated state (the next visual frame). A3C, a modern reinforcement learning algorithm, is adopted to train the agent, and a customized reward function is designed to encourage the agent to follow the object closely.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to assess the generalizability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

As in state-of-the-art pipelines, if the action module and the object tracking module are completely separate, it is extremely difficult to train either one, because it is impossible to know which module caused the error observed at the end of an episode. At a high level, both modules serve the same purpose, as both aim for efficient navigation. It therefore makes sense to have a joint module that contains both the observation and the action-taking submodules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with common practice in deep reinforcement learning, where the CNNs used to extract features (for example, in Atari games) are combined with the Q-networks (in the case of DQN). The CNN is trained concurrently with the feedforward Q-network, and the error function is the difference between the observed Q values and the target Q values.

=Related Work=

In the domain of object tracking, there are both active and passive approaches. Advanced passive object tracking approaches are summarized below:

1) Subspace learning was adopted to update the appearance model of an object.

:Earlier object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. 2008 introduce a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking [2].

2) Multiple instance learning was employed to track an object.

:Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. So, Babenko et al. 2009 introduce a novel algorithm that updates its appearance model using a “bag” of positive and negative examples. Subsequently, they show that tracking algorithms using weaker classifiers can still obtain superior performance [3].

3) Correlation filter based object tracking has achieved success in real-time object tracking.

:Correlation filter based object tracking algorithms attempt to “model the appearance of an object using filters”. At each frame, a small tracking window representing the target object is produced, and the tracker will correlate the windows over the image sequences, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called Minimum Output Sum of Squared Error (MOSSE) filter [4].

4) Structured output prediction was used to constrain object tracking and to avoid converting positions to labels of training samples.

:Hare et al. 2016 argue that the “sliding-window” approach used by popular object tracking algorithms is flawed because “the objective of the classifier (predicting labels for sliding-windows) is decoupled from the objective of the tracker (estimating object position).” Instead, they introduce a novel algorithm that uses “a kernelized structured output support vector machine (SVM) to avoid the need for intermediate classification”. Subsequently, they show that the approach outperforms traditional trackers in various benchmarks [5].

5) Tracking, learning, and detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

:Long-term tracking is the task of recognizing and tracking an object as it “moves in and out of a camera’s field of view”. This task is made difficult by problems such as an object reappearing in the scene with a changed appearance, scale, or illumination. Kalal et al. 2012 proposed a unified tracking framework (TLD) that accomplishes long-term tracking by “decomposing the task into tracking, learning, and detection”. Specifically, “the tracker follows an object from frame-to-frame; the detector localizes the object’s appearances; and, the learner improves the detector by learning from errors.” Altogether, the TLD framework outperforms previous state-of-the-art tracking approaches [6].

6) Deep learning models such as stacked autoencoders have been used to learn good representations for object tracking.

:In recent years, deep learning approaches have gained prominence in the field of object tracking. For example, Wang et al. 2013 obtain outstanding results using a deep-learning-based algorithm that combines offline feature extraction and online tracking using stacked denoising autoencoders [7], whereas Wang et al. 2016 introduce a sequentially trained convolutional network that can efficiently transfer offline learned features to online visual tracking applications [8].

7) Pixel-level image classification.

:Object identification is essentially pixel-level classification, where each pixel in the image is given a label. It is a more general form of image classification. In recent years, CNNs have advanced many benchmarks in this field, and some AutoML methods, such as Neural Architecture Search, have been applied and have achieved state-of-the-art results.

For the active approaches, camera control and object tracking were considered as separate components, which makes these approaches difficult to tune. This paper tackles object tracking and camera control simultaneously in an end-to-end manner, which makes the system easier to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games like Go and Atari games. They have also been used in computer vision tasks such as object localization, region proposal, and visual tracking. All of these advancements pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which had not been tried before.

=Approach=
Virtual tracking scenes are generated for both training and testing. An Asynchronous Advantage Actor-Critic (A3C) model is used to train the tracker. For efficient training, data augmentation techniques and a customized reward function are used. An RGB screen frame from the first-person perspective is chosen as the state. The tracker observes a visual state and takes one action <math>a_t</math> from the following set of 6 actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the current reward as well as the updated screen frame <math>(r_t, s_{t+1}) </math>.
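
The interaction described above can be sketched in a few lines of Python. This is only an illustrative sketch: <code>DummyEnv</code>, <code>make_random_policy</code>, and <code>rollout</code> are hypothetical names, the environment returns placeholder frames and random rewards, and none of it reflects the actual ViZDoom or UE interfaces.

```python
import random
from enum import Enum


class Action(Enum):
    """The six discrete actions of the tracker."""
    TURN_LEFT = 0
    TURN_RIGHT = 1
    TURN_LEFT_AND_MOVE_FORWARD = 2
    TURN_RIGHT_AND_MOVE_FORWARD = 3
    MOVE_FORWARD = 4
    NO_OP = 5


class DummyEnv:
    """Hypothetical stand-in for a ViZDoom or UE environment."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def reset(self):
        return self._frame()

    def step(self, action):
        reward = self._rng.uniform(-1.0, 1.0)  # placeholder for r_t
        return reward, self._frame()           # (r_t, s_{t+1})

    def _frame(self):
        # Tiny 4x4 grayscale placeholder instead of a real RGB screen frame.
        return [[self._rng.random() for _ in range(4)] for _ in range(4)]


def make_random_policy(seed=0):
    """Return a policy that ignores the state and picks a random action."""
    rng = random.Random(seed)
    return lambda state: rng.choice(list(Action))


def rollout(env, policy, steps):
    """Collect a list of (s_t, a_t, r_t) triplets."""
    state = env.reset()
    trace = []
    for _ in range(steps):
        action = policy(state)
        reward, next_state = env.step(action)
        trace.append((state, action, reward))
        state = next_state
    return trace


trace = rollout(DummyEnv(), make_random_policy(), steps=5)
print([action.name for _, action, _ in trace])
```

In the paper's setting, the random policy would be replaced by the ConvNet-LSTM network and the dummy environment by one of the simulators.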
==Tracking Scenarios==
It is impossible to train the desired end-to-end active tracker in real-world scenarios. Therefore, the following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016; ViZ) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives a state and a reward from the environment at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), each composed of an object (a monster) and a background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by the ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to follow the object closely.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine with a broad influence in the game industry. It provides realistic scenarios that can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study, which provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.

==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step t, <math>s_{t} </math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by A, of size K = |A|. An action <math>a_{t} \in A</math> is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the Actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R} </math> according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step t+1 is subject to a certain but unknown state transition function <math> s_{t+1} = f(s_{t}, a_{t}) </math>, governed by the environment.
A trace consisting of a sequence of triplets can then be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R} </math> denotes the expected accumulated future reward given state <math>s_{t}</math> (referred to as the Critic). The policy function <math> \pi(\cdot)</math> and the value function <math>V (\cdot)</math> are jointly modeled by a neural network and are written as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math>, with parameters <math>\theta</math> and <math>{\theta}'</math> respectively. The parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression:

(1) <math>\theta\leftarrow\theta+\alpha(R_t-V(s_t))\nabla_\theta \log\pi(a_t|s_t)+\beta\nabla_\theta H(\pi(\cdot|s_t))</math>

(2) <math>\theta'\leftarrow\theta'-\alpha\nabla_{\theta'}\frac{1}{2}(R_t-V(s_t))^2</math>

where <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a discount factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H (\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.
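
As a small worked illustration of the quantities in the update rules, the following Python sketch computes the discounted return <math>R_t</math> over a window of T rewards and the factor <math>R_t - V(s_t)</math> that scales the policy gradient. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """R_t = sum over t' from t to t+T-1 of gamma^(t'-t) * r_t'.

    `rewards` holds the T rewards r_t, ..., r_{t+T-1}.
    """
    ret = 0.0
    # Accumulate backwards so each reward is discounted once per step.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def advantage(rewards, value_estimate, gamma):
    """The factor (R_t - V(s_t)) that scales the policy gradient."""
    return discounted_return(rewards, gamma) - value_estimate


rewards = [1.0, 0.5, -0.2]   # illustrative rewards for T = 3 steps
R = discounted_return(rewards, gamma=0.9)
adv = advantage(rewards, value_estimate=1.0, gamma=0.9)
print(R, adv)
```

Here <math>R_t = 1.0 + 0.9 \cdot 0.5 + 0.9^2 \cdot (-0.2) = 1.288</math>, so with a value estimate of 1.0 the factor is positive and the update makes the sampled action more likely.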

This is a multi-threaded training process in which each thread maintains an independent environment-agent interaction. Nevertheless, the network parameters are shared among threads and are updated asynchronously every T time steps, using Eqs. (1) and (2), in a lock-free manner in each thread.

[[File:A3C High Level Diagram.png|thumb|center|High-level diagram that depicts the A3C algorithm across <math>n</math> environments]]

==Network Architecture==
The tracker is a ConvNet-LSTM neural network as shown in Fig. 2, where the architecture specification is given in the following table. The FC6 and FC1 layers correspond to the 6-action policy <math>\pi (\cdot|s_{t})</math> and the value <math>V (s_{t})</math>, respectively. The screen is resized to an 84 × 84 × 3 RGB image as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function utilizes a two-dimensional local coordinate system (S). The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is where the agent is. System S is parallel to the floor. The object’s local coordinate (x, y) and orientation a are expressed with regard to system S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent at a distance d and exhibits no rotation.
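
Since the reward equation is only given as an image here, the following Python sketch assumes one functional form that is consistent with the description above: a distance penalty scaled by c plus a rotation penalty weighted by λ, both subtracted from A. The default parameter values are hypothetical and are not taken from the paper.

```python
import math


def tracking_reward(x, y, a, A=1.0, c=200.0, d=100.0, lam=0.5):
    """Assumed form: r = A - (sqrt(x^2 + (y - d)^2) / c + lam * |a|).

    (x, y) is the object's position and `a` its orientation in the agent's
    local coordinate system S. All default values are illustrative only.
    """
    distance_penalty = math.sqrt(x ** 2 + (y - d) ** 2) / c
    rotation_penalty = lam * abs(a)
    return A - (distance_penalty + rotation_penalty)


# Object exactly in front of the agent at distance d, with no rotation.
print(tracking_reward(0.0, 100.0, 0.0))
# Object displaced by 50 units from the ideal position.
print(tracking_reward(30.0, 140.0, 0.0))
```

Under this assumed form, the reward equals A only at (x, y) = (0, d) with a = 0, and it decreases as the object moves away from that point or rotates, which matches the statement about the maximum reward.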
==Environment Augmentation==
To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments.

For ViZDoom, (x, y, a) defines the system state. For augmentation, the initial system state is perturbed N times by editing the map with the ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. Flipping the screen frame left-right (and, accordingly, the left-right actions) is also allowed. As a result, 2N environments are obtained out of one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode, which improves the generalization ability of the tracker.
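
The ViZDoom augmentation above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the perturbation ranges, the base state, and all function names are hypothetical, and the real perturbation is done by editing the map with the ACS script rather than in Python.

```python
import random

# Flipping the frame left-right swaps the left and right actions.
MIRRORED_ACTION = {
    "turn-left": "turn-right",
    "turn-right": "turn-left",
    "turn-left-and-move-forward": "turn-right-and-move-forward",
    "turn-right-and-move-forward": "turn-left-and-move-forward",
    "move-forward": "move-forward",
    "no-op": "no-op",
}


def perturb_initial_states(base_state, n, rng, pos_range=50.0, angle_range=0.5):
    """Return n perturbed copies of the initial state (x, y, a)."""
    x, y, a = base_state
    return [
        (x + rng.uniform(-pos_range, pos_range),
         y + rng.uniform(-pos_range, pos_range),
         a + rng.uniform(-angle_range, angle_range))
        for _ in range(n)
    ]


def build_environments(initial_states):
    """Pair every initial state with both flip settings: 2N environments."""
    return [(state, flipped) for state in initial_states for flipped in (False, True)]


def flip_frame(frame):
    """Mirror a frame, given as a list of pixel rows, left-right."""
    return [row[::-1] for row in frame]


def to_env_action(agent_action, flipped):
    """Map the agent's action back to the unflipped environment."""
    return MIRRORED_ACTION[agent_action] if flipped else agent_action


rng = random.Random(0)
environments = build_environments(perturb_initial_states((0.0, 100.0, 0.0), 21, rng))
state, flipped = rng.choice(environments)  # sampled at the start of an episode
print(len(environments))
```

With N = 21 perturbed initial states, this yields 2N = 42 environments to sample from.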

For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, background objects are chosen at random and made invisible. In addition, every episode starts from the position where the agent failed in the previous episode. This makes the environment and the starting point differ from episode to episode, which increases the variation seen during training.

=Experimental Settings=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, the training map shown in the left column of Fig. 4 is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in the middle and right columns of Fig. 4. In all maps, the path of the target is pre-specified, as indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path. Instead, it sometimes moves randomly in a “zig-zag” way along the course, which is a built-in game engine behavior. This poses an additional difficulty for the tracking problem.
For UE, an environment named Square is generated for training, in which random background objects are made invisible and a target named Stefani walks along a fixed path. For testing, four more environments are made: Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2). As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set to -450 and the maximum length to 3000.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature. An AR that is too small terminates the episode, because it essentially means that tracking has failed. EL roughly measures the duration of good tracking and is analogous to the metric Successfully Tracked Frames in conventional tracking applications. With A = 1.0 in the reward function, the theoretical maximum of both AR and EL is 3000, because the per-step reward is at most A and an episode is terminated after 3000 steps.
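
The termination criterion and the two metrics can be made concrete with a short Python sketch. The function name <code>run_episode</code> is hypothetical, and the reward streams are artificial constants chosen to show the two extreme cases.

```python
import itertools


def run_episode(rewards, reward_threshold=-450.0, max_length=3000):
    """Consume per-step rewards until the episode terminates.

    Returns (AR, EL): the accumulated reward and the episode length.
    """
    accumulated_reward = 0.0
    episode_length = 0
    for r in rewards:
        accumulated_reward += r
        episode_length += 1
        # Stop on tracking failure or when the maximum length is reached.
        if accumulated_reward < reward_threshold or episode_length >= max_length:
            break
    return accumulated_reward, episode_length


# Perfect tracking: the maximum reward A = 1.0 at every step.
best_ar, best_el = run_episode(itertools.repeat(1.0))
# Persistent failure: a reward of -1.0 at every step.
worst_ar, worst_el = run_episode(itertools.repeat(-1.0))
print(best_ar, best_el, worst_ar, worst_el)
```

The first case reaches the theoretical maximum of 3000 for both AR and EL, while the second is cut short after 451 steps, as soon as the accumulated reward falls below -450.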

=Results=
Two training protocols were followed: RandomizedEnv (with environment augmentation) and SingleEnv (without it). Most of the reported results are for RandomizedEnv.
Only one table gives results for SingleEnv training, and it shows that SingleEnv performs worse than RandomizedEnv. Compared to RandomizedEnv, SingleEnv training does not exploit the capacity of the network as well. The variability in the test results is also very high for the non-augmented case.
[[File:table1.PNG|400px|center]]
The results for the testing environments are reported in Tab. 2. These 8 test environments are more challenging than the training environment: they present different target appearances, different backgrounds, more varied paths, and distracting targets.
[[File:msm_table2.PNG|400px|center]]
The testing results lead to the following findings:
<ol>
<li> The tracker generalizes well when the target appearance changes (Zombie, Cacodemon). </li>
<li> The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor). </li>
<li> The tracker does not lose the target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly. </li>
<li> The tracker is insensitive to a distracting object (Noise1), even when the “bait” is very close to the path (Noise2). </li>
</ol>

The proposed tracker is compared against several conventional trackers, each paired with a PID-like module for camera control to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module works as follows. In the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The module compares the two successive bounding boxes and decides on an action.
The results show that the proposed solution outperforms the simulated active trackers, which soon lose their targets. The Meanshift tracker works well only when there is no camera shift between consecutive frames. The KCF and Correlation trackers do not seem capable of handling such large camera shifts, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, but it drifts easily when the object turns suddenly.
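
The summary does not spell out the camera control rule, so the following Python sketch shows only one plausible way such a module could compare two successive bounding boxes. The decision rule, the tolerances, and the function name <code>choose_action</code> are assumptions made for illustration and are not the algorithm used in the paper.

```python
def choose_action(prev_box, box, shift_tol=0.1, area_tol=0.2):
    """Pick an action by comparing two successive boxes (x, y, w, h).

    Hypothetical rule: turn towards the side the box centre has drifted to,
    and move forward when the box has shrunk (the target looks farther away).
    """
    prev_cx = prev_box[0] + prev_box[2] / 2
    cx = box[0] + box[2] / 2
    # Horizontal drift of the centre, in units of the previous box width.
    shift = (cx - prev_cx) / prev_box[2]
    area_ratio = (box[2] * box[3]) / (prev_box[2] * prev_box[3])
    too_far = area_ratio < 1 - area_tol

    if shift < -shift_tol:
        return "turn-left-and-move-forward" if too_far else "turn-left"
    if shift > shift_tol:
        return "turn-right-and-move-forward" if too_far else "turn-right"
    return "move-forward" if too_far else "no-op"


reference = (100, 100, 40, 80)
print(choose_action(reference, (100, 100, 40, 80)))  # unchanged box
print(choose_action(reference, (120, 100, 40, 80)))  # drifted to the right
print(choose_action(reference, (110, 110, 20, 40)))  # centred but smaller
```

A rule of this kind depends on hand-set tolerances, which gives some intuition for why tuning a single camera-control module for every passive tracker is reported to be difficult.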

Results for the UE environments are tabulated in Table 5. Four different environments are tested, and the proposed tracker is compared with a simulated active tracker based on the long-term TLD tracker.
[[File:table5.PNG|400px|center]]
<ol>
<li> A comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained with the target Stefani, revealing that it does not overfit to a specialized appearance. </li>
<li> The active tracker performs well when the path is changed (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path. </li>
<li> When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track objects robustly (the EL value is comparable to that in the previous environments), which indicates good generalization potential. </li>
<li> In most cases, the proposed tracker outperforms the simulated active tracker, or achieves comparable results when it is not the best. The results of the simulated active tracker also suggest that it is difficult to tune a unified camera-control module for it, even when a long-term tracker is adopted (see the results of TLD). </li>
</ol>

Real-world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained in the UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part. The vertical axis indicates the size of the target, i.e., the area of the ground-truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows, 1) when the target resides on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center; 2) when the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

A video of the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

Action Saliency Map: An input frame is fed into the tracker and forwarded to output the policy function, from which an action is sampled. The gradient of this action is then propagated backwards to the input layer to generate the saliency map. The saliency map shows how the input image affects the tracker's action. Fig. 8 shows that the tracker indeed learns how to find the target, which improves the performance of the model.
[[File:fig8.PNG|400px|center]]
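
The saliency computation described above can be illustrated with a minimal sketch. It assumes a toy linear-softmax policy rather than the paper's ConvNet-LSTM, and it takes "the gradient of this action" to mean the gradient of the log-probability of the sampled action; the names <code>action_saliency</code>, <code>toy_weights</code> and <code>toy_frame</code> are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_saliency(weights, pixels, action):
    """Return |d log pi(action | pixels) / d pixel_j| for each input pixel.

    Assumption: the policy is softmax(W x), so the gradient has the closed
    form W[action][j] - sum_k pi_k * W[k][j].
    """
    logits = [sum(w * p for w, p in zip(row, pixels)) for row in weights]
    probs = softmax(logits)
    saliency = []
    for j in range(len(pixels)):
        expected_w = sum(probs[k] * weights[k][j] for k in range(len(weights)))
        saliency.append(abs(weights[action][j] - expected_w))
    return saliency

# Hypothetical 2-action policy over a 3-pixel "frame".
toy_weights = [[1.0, 0.0, 0.0],   # action 0 depends only on pixel 0
               [0.0, 0.0, 1.0]]   # action 1 depends only on pixel 2
toy_frame = [0.5, 0.9, 0.5]
toy_saliency = action_saliency(toy_weights, toy_frame, action=0)
```

In this toy case the middle pixel gets zero saliency because no action depends on it, which mirrors how a saliency map highlights only the image regions that influence the chosen action.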

=Conclusion=
In the paper, an end-to-end active tracker via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, saving the effort of human labeling and of trial and error in the real world. It shows good generalization to unseen environments, and its tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, which is object tracking.

The virtual environment ViZDoom, though used for training and testing, seems to have little or no generalization ability to real-world scenarios, and its maps are very simple. The comparison presented for ViZDoom testing under changed environmental parameters looks positive, but the relatively simple nature of the environment needs to be considered when looking at these results. Also, when the floor is replaced by the ceiling, the tracker performs worst compared with the other cases in the table, which seems to indicate that the model is somewhat over-fitted to the floor and ceiling parameters.

The tracker trained in the UE environment is tested against simulated trackers, and the results show that the proposed solution performs better. However, since the simulated trackers use a camera-control algorithm written for this specific comparison, further testing is required for benchmarking.

The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which are inadequate for real-world deployment because the tracker cannot adapt to different target speeds. Training the tracker in the UE simulator alone is also unlikely to be sufficient for a successful real-world deployment; it would be better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination of the scene, and the trajectory and speed of the target. Nevertheless, the results on the real-world videos are a positive sign for the generalization ability of the model in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.

=Future Work=
The authors have since extended this work in several ways [1]. They deployed the tracker on a real robot and enhanced the system to deal with the virtual-to-real gap. Specifically: 1) more advanced environment augmentation techniques are proposed to boost environment diversity, which improves transferability to the real world; 2) a more appropriate action space than the one in the conference paper is developed, and a continuous action space for active tracking is investigated; 3) a mapping from the neural network prediction to the robot control signal is established so that end-to-end tracking can be delivered successfully.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

[2] Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

[3] Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

[4] Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.

[5] Hare, Sam, Golodetz, Stuart, Saffari, Amir, Vineet, Vibhav, Cheng, Ming-Ming, Hicks, Stephen L, and Torr, Philip HS. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.

[6] Kalal, Zdenek, Mikolajczyk, Krystian, and Matas, Jiri. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.

[7] Wang, Naiyan and Yeung, Dit-Yan. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pp. 809–817, 2013.

[8] Wang, Lijun, Ouyang, Wanli, Wang, Xiaogang, and Lu, Huchuan. Stct: Sequentially training convolutional networks for visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 1373–1381, 2016.

Learning to Teach

2018-11-30T22:01:41Z

J385chen:

=Introduction=

This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.

In modern human society, teaching is central to the education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI), and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e. designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study of the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data, which corresponds to choosing appropriate teaching materials (e.g. textbooks of the right difficulty); design loss functions, which corresponds to setting targeted examinations; and define the hypothesis space, which corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.

Thus, the training phase of L2T consists of several episodes of interaction between the teacher and the student model. Based on the state information at each step, the teacher model updates its teaching actions so that the student model can perform better on the machine learning problem. The student model then provides reward signals back to the teacher model, which uses them to update its parameters through reinforcement learning; the paper uses a policy gradient algorithm for this. The process is end-to-end trainable, and the authors argue that, once converged, the teacher model can be applied to new learning scenarios and even new students without extra re-training effort.

To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), for different applications including image classification and text understanding. Furthermore, the teacher model obtained from one task can be smoothly transferred to other tasks. For example, with the teacher model trained on MNIST with an MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.

=Related Work=
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)

The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e. an oracle). The latter assumes an ordering of the data from easy instances to hard ones, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics on the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data.

The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.

=Learning to Teach=
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.

In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.

==Problem Definition==
The student model, denoted <math>\mu</math>, takes the set of training data <math> D </math>, the function class <math> \Omega </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{\omega^*} </math>, whose parameter <math>\omega^*</math> minimizes the risk <math>R(\omega)</math>, as in:

\begin{align*}
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)
\end{align*}

The teaching model, denoted <math>\phi</math>, tries to provide <math> D </math>, <math> L </math>, and <math> \Omega </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk <math>R(\omega)</math> or progresses as fast as possible. In contrast to traditional machine learning, which is concerned only with the student model, the learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk as efficiently as possible. The teacher can act on three components:

::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.
::'''Hypothesis Space''': Defining a good function class <math> \Omega </math> from which the student model can select. This is analogous to human teachers providing appropriate context, e.g. middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of <math> \Omega </math> lead to different errors and optimization problems (Mohri et al., 2012).
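
As a rough illustration of the student model <math>\mu(D, L, \Omega)</math> defined above, the following sketch performs the arg min by brute force. It assumes a finite hypothesis space given as a list of candidate parameters; the function <code>student</code>, the toy data, and the one-parameter linear model are all hypothetical and are not taken from the paper.

```python
def student(data, loss, hypothesis_space, predict):
    """mu(D, L, Omega): return the omega in Omega with the lowest summed loss on D."""
    def empirical_risk(omega):
        return sum(loss(y, predict(omega, x)) for x, y in data)
    return min(hypothesis_space, key=empirical_risk)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def linear(omega, x):
    # Hypothetical one-parameter model f_omega(x) = omega * x.
    return omega * x

# The three inputs that a teacher could control: D, L, and Omega.
D = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
Omega = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
omega_star = student(D, squared_loss, Omega, linear)
```

Changing any one of <code>D</code>, <code>squared_loss</code>, or <code>Omega</code> changes what the student learns, which is the lever the teacher model pulls in each of the three teaching tasks.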

==Framework==
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:

[[File: L2T_process.png | 500px|center]]

* <math> s_t \in S </math> represents the information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t-1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.
* <math> a_t \in A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.
* <math> \phi_\theta : S \to A </math> is the policy used by the teacher model to generate its action, <math> \phi_\theta(s_t) = a_t </math>.
* The student model takes <math> a_t </math> as input and outputs the function <math> f_t </math> using conventional ML techniques.

Mathematically, taking data teaching as an example, where <math>L</math> and <math>\Omega</math> are fixed, the objective of the teacher in the L2T framework is

<center> <math>\max\limits_{\theta}{\sum\limits_{t}{r_t}} = \max\limits_{\theta}{\sum\limits_{t}{r(f_t)}} = \max\limits_{\theta}{\sum\limits_{t}{r(\mu(\phi_{\theta}(s_t), L, \Omega))}}</math> </center>

Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.
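
The interaction between teacher and student can be sketched as one episode of the loop <math>s_t \to a_t \to f_t \to r_t</math>. This is only an illustration under simplifying assumptions: the student is a single number, the teacher is a fixed hand-written filter rather than a learned policy, and the update of the teacher parameters from the collected rewards is omitted. All names below are hypothetical.

```python
def run_episode(teacher_policy, student_update, make_state, reward_fn, batches, student):
    """One L2T episode: build s_t, pick a_t = phi(s_t), update the student, read r_t."""
    history, total_reward = [], 0.0
    for batch in batches:
        state = make_state(student, history, batch)   # s_t
        action = teacher_policy(state)                # a_t: instances kept for training
        student = student_update(student, action)     # f_t
        reward = reward_fn(student)                   # r_t
        history.append((state, action, reward))
        total_reward += reward
    return student, history, total_reward

# Toy setup: the student estimates a target value of 1.0 from noisy samples.
def make_state(student, history, batch):
    return (student, batch)

def teacher_policy(state):
    estimate, batch = state
    # Hand-written stand-in for phi_theta: drop samples far from the estimate.
    return [x for x in batch if abs(x - estimate) <= 2.0]

def student_update(student, kept):
    if not kept:
        return student
    return student + 0.5 * (sum(kept) / len(kept) - student)

def reward_fn(student):
    return -abs(student - 1.0)

batches = [[1.0, 1.2, 9.0], [0.8, 1.0, -7.0]]
final_student, history, total_reward = run_episode(
    teacher_policy, student_update, make_state, reward_fn, batches, student=0.0)
```

In the full framework the sum of rewards returned here is the quantity that the teacher parameters <math>\theta</math> are trained to maximize.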

=Application=

There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> \phi_\theta </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks, so the student feedback measure is classification accuracy. The student's learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of each mini-batch will be fed to the student. To encourage faster convergence, the reward is set to relate to how quickly the student model learns.

The authors also designed a state feature vector <math> g(s) </math> to efficiently represent the current state, which includes the arrived training data and the student model. The state features fall into three categories: data features, student model features, and features that combine the data and the learner model. The state features are computed each time a mini-batch of data arrives.

Data features contain information about each data instance, such as its label category, (for texts) the length of the sentence and linguistic features for text segments (Tsvetkov et al., 2016), or (for images) gradient histogram features (Dalal & Triggs, 2005).

Student model features include signals reflecting how well the current neural network is trained. The authors collect several simple features, such as the number of mini-batches passed (i.e., the iteration), the average historical training loss, and the historical validation accuracy.

Some additional features are collected to represent the combination of both data and learner model. With these features, the authors aim to represent how important the arrived training data is for the current learner. Three such signals are mainly used in the classification tasks: 1) the predicted probabilities of each class; 2) the loss value on that data, which appears frequently in self-paced learning (Kumar et al., 2010; Jiang et al., 2014a; Sachan & Xing, 2016); 3) the margin value.
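
The three combined data-and-model signals can be sketched as follows. The summary above does not define the margin, so this sketch assumes the common definition <math>P(y|x) - \max_{y' \neq y} P(y'|x)</math> and assumes the loss is the negative log-likelihood; the function name <code>combined_features</code> is hypothetical.

```python
import math

def combined_features(probs, label):
    """Features joining one data instance with the current student model.

    probs: the student's predicted class probabilities for the instance.
    label: index of the true class.
    Returns the probabilities followed by the loss and the margin.
    """
    loss = -math.log(probs[label])                               # assumed: negative log-likelihood
    runner_up = max(p for k, p in enumerate(probs) if k != label)
    margin = probs[label] - runner_up                            # assumed margin definition
    return list(probs) + [loss, margin]

features = combined_features([0.7, 0.2, 0.1], label=0)
```

A large positive margin indicates an instance the student already classifies confidently, while a negative margin indicates a currently misclassified instance.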

The teacher model is trained to maximize the expected reward:

\begin{align}
J(\theta) = E_{\phi_\theta(a|s)}[R(s,a)]
\end{align}

This objective is non-differentiable w.r.t. <math> \theta </math>, so a likelihood ratio policy gradient algorithm is used to optimize <math> J(\theta) </math> (Williams, 1992) [4].
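
For reference, the likelihood ratio identity behind this algorithm can be written out as a short derivation. It is given here as a sketch, assuming a discrete action set and a reward <math>R(s,a)</math> that does not itself depend on <math>\theta</math>; the sample index <math>n</math> and sample count <math>N</math> are notation introduced only for this illustration.

\begin{align*}
\nabla_\theta J(\theta) &= \nabla_\theta \sum_{a} \phi_\theta(a|s) R(s,a) \\
&= \sum_{a} \phi_\theta(a|s) \nabla_\theta \log \phi_\theta(a|s) R(s,a) \\
&= E_{\phi_\theta(a|s)}[\nabla_\theta \log \phi_\theta(a|s) R(s,a)] \\
&\approx \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log \phi_\theta(a_n|s_n) R(s_n,a_n)
\end{align*}

The second line uses <math>\nabla_\theta \phi_\theta = \phi_\theta \nabla_\theta \log \phi_\theta</math>, and the last line replaces the expectation with an average over sampled actions, so only the log-probability of the policy needs to be differentiated.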

==Experiments==

The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long Short-Term Memory (LSTM) network (RNN).

The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset.

The strategy will be benchmarked against the following teaching strategies:

::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss values to train the student with "easy" data, and gradually increases the hardness. Mathematically speaking, training data <math>d </math> with loss value <math>l(d) > \eta </math> are filtered out, where the threshold <math> \eta </math> grows during the training process. To improve the robustness of SPL, following a widely used trick in common SPL implementations (Jiang et al., 2014b), the authors filter training data using its loss rank within one mini-batch rather than the absolute loss value: they filter out the data instances with the top <math>K </math> largest training loss values within an <math>M</math>-sized mini-batch, where <math>K</math> drops linearly from <math>M - 1 </math> to 0 during training.

::'''L2T''': The Learning to Teach framework.
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).
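
The rank-based SPL filter described in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the step-based schedule, the rounding of <math>K</math>, and the name <code>spl_keep</code> are assumptions.

```python
def spl_keep(losses, step, total_steps):
    """Return the indices kept by rank-based self-paced filtering.

    The K instances with the largest loss in the mini-batch are dropped,
    where K decays linearly from M - 1 (first step) to 0 (last step).
    """
    m = len(losses)
    progress = step / max(total_steps - 1, 1)         # 0.0 at the start, 1.0 at the end
    k = round((m - 1) * (1.0 - progress))             # assumed rounding of the schedule
    ranked = sorted(range(m), key=lambda i: losses[i])  # easiest (lowest loss) first
    return sorted(ranked[: m - k])

batch_losses = [0.9, 0.1, 0.5, 0.3]
kept_first = spl_keep(batch_losses, step=0, total_steps=4)
kept_last = spl_keep(batch_losses, step=3, total_steps=4)
```

At the first step only the single easiest instance survives, and by the last step the whole mini-batch is used, which matches the easy-to-hard progression of SPL.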

For all teaching strategies, the authors make sure that the base neural network model is not updated until <math>M </math> untrained yet selected data instances have accumulated. This guarantees that the convergence speed is determined only by the quality of the taught data, not by different model updating frequencies. The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.
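
The accumulation rule above can be sketched with a small generator. It assumes the teacher's decision is available as a boolean function per instance; <code>accumulate_updates</code> and <code>keep</code> are hypothetical names.

```python
def accumulate_updates(stream, keep, batch_size):
    """Yield groups of exactly batch_size selected instances.

    Each yielded group corresponds to one student update, so every
    teaching strategy triggers updates on the same amount of data.
    """
    buffer = []
    for instance in stream:
        if keep(instance):
            buffer.append(instance)
        if len(buffer) == batch_size:
            yield buffer
            buffer = []

# Toy example: the "teacher" keeps even numbers, and M = 2.
updates = list(accumulate_updates(range(10), lambda x: x % 2 == 0, batch_size=2))
```

Instances left in the buffer when the stream ends (here the value 8) do not trigger an update in this sketch.
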
===Training a New Student===

In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher. This is done by having the teacher train a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:

[[File: L2T_speed.png | 1100px|center]]

===Filtration Number===

When investigating the number of filtered data instances per epoch, for the two image classification tasks the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for these two tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.

[[File: L2T_fig3.png | 1100px|center]]

===Teaching New Student with Different Model Architecture===

In this part, a teacher model is first trained by interacting with a student model. The teacher model is then used to teach another student model with a different architecture. The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.

[[File: L2T_fig4.png | 1100px|center]]

===Training Time Analysis===

The learning curves show that L2T reaches a given accuracy in less training time than the other strategies. This is especially evident during the earlier training stages.

[[File: L2T_fig5.png | 600px|center]]

===Accuracy Improvement===

When comparing accuracy on the IMDB sentiment classification task, the L2T teaching policy improves on NoTeach and SPL.

[[File: L2T_t1.png | 500px|center]]

Table 1 shows that, besides boosting convergence speed, the teacher model also improves final accuracy. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, the authors train the teacher model on half of the training data, and define the terminal reward as the dev set accuracy after the teacher model trains the student for 15 epochs. The teacher model is then applied to train the student model on the full dataset until convergence. The state features are kept the same as those in previous experiments. L2T achieves better classification accuracy when training the LSTM network, surpassing the SPL baseline by more than 0.6 points (with p value < 0.001).

=Future Work=

Several directions for future work follow from this paper:

1) Recent advances in multi-agent reinforcement learning could be tried on the reinforcement learning problem formulation of this paper.

2) Some human-in-the-loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) might give better results for the same framework.

3) It would be interesting to try out the L2T framework in imperfect-information and partially observable settings.

4) As the authors focused on data teaching, exploring loss function teaching would be interesting.

=Critique=

While the conceptual framework of L2T is sound, the paper only demonstrates efficacy experimentally for ''data teaching'', which would seem to be the simplest form to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. The paper also does not provide enough mathematical foundation to show that the approach generalizes to other datasets and other general problems. The method presented, in which the teacher model only filters data, does not seem to give the teacher a rich enough action space. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction cannot be assessed for loss function or hypothesis space teaching within the scope of this paper. Including larger datasets such as ImageNet and CIFAR-100 in the experiments would have provided more insight.

The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. Consider the CIFAR-10 training scenario: the L2T model achieves 85% accuracy after 2 million training data, which is only about 3% more accurate than the no-teacher model. Perhaps in the future the L2T framework can be improved to produce better performance.

Learning to Teach

2018-11-30T21:59:53Z

J385chen:

=Introduction=

This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.

In modern human society, teaching is central to the education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI), and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e. designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study of the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data, which corresponds to choosing appropriate teaching materials (e.g. textbooks of the right difficulty); design loss functions, which corresponds to setting targeted examinations; and define the hypothesis space, which corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.

Thus, the training phase of L2T consists of several episodes of interaction between the teacher and the student model. Based on the state information at each step, the teacher model updates its teaching actions so that the student model can perform better on the machine learning problem. The student model then provides reward signals back to the teacher model, which uses them to update its parameters through reinforcement learning; the paper uses a policy gradient algorithm for this. The process is end-to-end trainable, and the authors argue that, once converged, the teacher model can be applied to new learning scenarios and even new students without extra re-training effort.

To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), for different applications including image classification and text understanding. Furthermore, the teacher model obtained from one task can be smoothly transferred to other tasks. For example, with the teacher model trained on MNIST with an MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.

=Related Work=
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)

The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e. an oracle). The latter assumes an ordering of the data from easy instances to hard ones, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics on the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data.

The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.

=Learning to Teach=
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.

In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.

==Problem Definition==
The student model, denoted <math>\mu</math>, takes the set of training data <math> D </math>, the function class <math> \Omega </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{\omega^*} </math>, whose parameter <math>\omega^*</math> minimizes the risk <math>R(\omega)</math>, as in:

\begin{align*}
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)
\end{align*}

The teaching model, denoted <math>\phi</math>, tries to provide <math> D </math>, <math> L </math>, and <math> \Omega </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk <math>R(\omega)</math> or progresses as fast as possible. In contrast to traditional machine learning, which is concerned only with the student model, the learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk as efficiently as possible. The teacher can act on three components:

::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.
::'''Hypothesis Space''': Defining a good function class <math> \Omega </math> from which the student model can select. This is analogous to human teachers providing appropriate context, e.g. middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of <math> \Omega </math> lead to different errors and optimization problems (Mohri et al., 2012).

==Framework==
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:

[[File: L2T_process.png | 500px|center]]

* <math> s_t ∈ S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.
* <math> a_t ∈ A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.
* The student model takes <math> a_t </math> as input and outputs the function <math> f_t </math>, using conventional ML techniques.

Mathematically, taking data teaching as an example, where <math>L</math> and <math>\Omega</math> are fixed, the objective of the teacher in the L2T framework is

<center> <math>\max\limits_{\theta}{\sum\limits_{t}{r_t}} = \max\limits_{\theta}{\sum\limits_{t}{r(f_t)}} = \max\limits_{\theta}{\sum\limits_{t}{r(\mu(\phi_{\theta}(s_t), L, \Omega))}}</math> </center>

Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.

=Application=

There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks, so the student feedback measure is classification accuracy. The student's learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (a subset) of each mini-batch are fed to the student. In order to reach convergence faster, the reward is set to reflect how quickly the student model learns.

The authors also designed a state feature vector <math> g(s) </math> to efficiently represent the current state, which includes the arrived training data and the student model. The state features fall into three categories: data features, student model features, and features that combine the data and the student model. The state features are computed when each mini-batch of data arrives.

Data features contain information about each data instance, such as its label category, (for texts) the length of the sentence, linguistic features for text segments (Tsvetkov et al., 2016), or (for images) gradient histogram features (Dalal & Triggs, 2005).

The objective for training the teacher model is to maximize the expected reward:

\begin{align}
J(θ) = E_{φ_θ(a|s)}[R(s,a)]
\end{align}

This objective is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm (Williams, 1992) is used to optimize <math> J(θ) </math> [4].
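
The likelihood ratio estimate can be illustrated with a small sketch. It assumes, purely for illustration, a Bernoulli keep/filter policy whose keep probability is a sigmoid of <math> θ \cdot g(s) </math>, and a single terminal reward per episode. The function names and the optional baseline are hypothetical and are not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_prob_grad(theta, features, action):
    """Gradient of log pi_theta(a | s) for a Bernoulli policy with
    keep-probability p = sigmoid(theta . g(s)); action is 1 (keep) or 0 (filter)."""
    p = sigmoid(sum(t * g for t, g in zip(theta, features)))
    return [(action - p) * g for g in features]

def reinforce_gradient(theta, episode, reward, baseline=0.0):
    """Likelihood ratio estimate of grad J(theta) from one episode:
    (R - b) * sum_t grad log pi_theta(a_t | s_t)."""
    grad = [0.0] * len(theta)
    for features, action in episode:
        for i, g in enumerate(log_prob_grad(theta, features, action)):
            grad[i] += (reward - baseline) * g
    return grad

# One-step toy episode: state features g(s) = [1.0, 2.0], action "keep".
grad = reinforce_gradient([0.0, 0.0], [([1.0, 2.0], 1)], reward=1.0)
print(grad)
```

A gradient ascent step <math> θ \leftarrow θ + \alpha \hat{\nabla} J(θ) </math> then increases the probability of action sequences that led to high reward.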

==Experiments==

The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN).

The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset.

The strategy will be benchmarked against the following teaching strategies:

::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss values to train the student with "easy" data and gradually increases the hardness. Mathematically speaking, training data <math>d</math> satisfying <math>l(d) > \eta</math> are filtered out, where the threshold <math>\eta</math> grows during the training process. To improve the robustness of SPL, following a widely used trick in common SPL implementations (Jiang et al., 2014b), the authors filter training data using its loss rank within a mini-batch rather than the absolute loss value: they filter the <math>K</math> data instances with the largest training loss values within an <math>M</math>-sized mini-batch, where <math>K</math> drops linearly from <math>M − 1</math> to 0 during training.

::'''L2T''': The Learning to Teach framework.
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).
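
The rank-based SPL filter described above can be sketched as follows. The function name and the <code>progress</code> argument (the fraction of training completed, between 0 and 1) are assumptions introduced for this illustration.

```python
def spl_keep_indices(losses, progress):
    """Rank-based self-paced filter for one mini-batch.

    losses   : per-instance training losses of an M-sized mini-batch
    progress : fraction of training completed, in [0, 1] (assumed schedule input)
    Returns the sorted indices of the instances that are kept.
    """
    m = len(losses)
    # K drops linearly from M - 1 (start of training) to 0 (end of training).
    k = int(round((m - 1) * (1.0 - progress)))
    # Rank instances by ascending loss and drop the K largest.
    order = sorted(range(m), key=lambda i: losses[i])
    return sorted(order[: m - k])

batch_losses = [0.9, 0.1, 0.5, 0.3]
print(spl_keep_indices(batch_losses, progress=0.0))
print(spl_keep_indices(batch_losses, progress=1.0))
```

At the start of training only the single easiest instance in the batch is kept, and by the end every instance is kept.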

For all teaching strategies, the authors make sure that the base neural network model is not updated until <math>M</math> selected but not-yet-trained data instances have accumulated. This guarantees that the convergence speed is determined only by the quality of the taught data, not by different model updating frequencies. The model is implemented in Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.
===Training a New Student===

In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher: the teacher trains a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:

[[File: L2T_speed.png | 1100px|center]]

===Filtration Number===

When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for the two image classification tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.

[[File: L2T_fig3.png | 1100px|center]]

===Teaching New Student with Different Model Architecture===

In this part, a teacher model is first trained by interacting with one student model. The trained teacher is then used to teach another student model with a different architecture. The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. L2T obtains higher accuracies earlier than SPL, RandTeach, or NoTeach.

[[File: L2T_fig4.png | 1100px|center]]

===Training Time Analysis===

The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.

[[File: L2T_fig5.png | 600px|center]]

===Accuracy Improvement===

When comparing accuracy on the IMDB sentiment classification task, the L2T teaching policy improves over NoTeach and SPL.

[[File: L2T_t1.png | 500px|center]]

Table 1 shows that, in addition to boosting convergence speed, the teacher model improves final accuracy. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, the teacher model is trained on half of the training data, with the terminal reward defined as the held-out set accuracy after the teacher model trains the student for 15 epochs. The teacher model is then applied to train the student model on the full dataset until convergence. The state features are kept the same as in the previous experiments. L2T achieves better classification accuracy for training the LSTM network, surpassing the SPL baseline by more than 0.6 points (with p value < 0.001).

=Future Work=

There is some useful future work that can be extended from this work:

1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper.

2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework.

3) It would be interesting to try out the framework suggested in this paper (L2T) in imperfect-information and partially observable settings.

4) As the authors have focused on data teaching, exploring loss function teaching would be interesting.

=Critique=

While the conceptual framework of L2T is sound, the paper only demonstrates efficacy experimentally for ''data teaching'', which would seem to be the simplest case to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. The paper also does not provide enough mathematical foundation to show that the model generalizes to other datasets and other general problems. The method presented here, where the teacher model only filters data, does not seem to give the teacher a large enough action space. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction cannot be assessed for loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR-100 in their experiments, which would have provided more insight.

The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. Consider the CIFAR-10 training scenario: the L2T model achieves 85% accuracy after 2 million training instances, which is only about 3% more accurate than the no-teacher model. Perhaps in the future the L2T framework can be improved to produce better performance.

Reinforcement Learning of Theorem Proving

2018-11-30T03:35:46Z

J385chen:

== Introduction ==
Automated reasoning over mathematical proofs was a major motivation for the development of computer science. Automated theorem provers (ATPs) can in principle be used to attack any formally stated mathematical problem, and automated theorem proving is a research area that has been present since the early 20th century [1]. As of today, state-of-the-art ATP systems rely on fast implementations of complete proof calculi, such as resolution and tableau. However, they are still far weaker than trained mathematicians. Within current ATP systems, many heuristics are essential for performance. As a result, in recent years machine learning has been used to replace such heuristics and improve the performance of ATPs.

In this paper, the authors propose a reinforcement learning based ATP, rlCoP. The proposed ATP reasons within first-order logic. The underlying proof calculi are the connection calculi [2], and the reinforcement learning method is Monte Carlo tree search along with policy and value learning. It is shown that reinforcement learning results in a 42.1% performance increase compared to the base prover (without learning).

== Related Work ==
C. Kaliszyk and J. Urban proposed a supervised learning based ATP, FEMaLeCoP, in 2015 [3]; its underlying proof calculus is the same as the one used in this paper. Their algorithm learns from existing proofs to choose the next tableau extension step. Since the MaLARea [8] system, a number of iterations of a feedback loop between proving and learning have been explored, remarkably improving over human-designed heuristics when reasoning in large theories. However, such systems are known to learn only a high-level selection of relevant facts from a large knowledge base and to delegate the internal proof search to standard ATP systems. S. Loos, et al. developed a supervised learning ATP system in 2017 [4], with superposition as the proof calculus. However, they chose deep neural networks (CNNs and RNNs) as feature extractors. These systems are treated as black boxes in the literature, with little understanding of their performance possible.

In leanCoP [9], one of the simpler connection tableau systems, the next tableau extension step could be selected using supervised learning. In addition, the first experiments with Monte-Carlo guided proof search [5] have been done for connection tableau systems. That work is the closest to the authors' approach, but the improvement over the baseline measured there is much less significant than in this paper.

On a different note, A. Alemi, et al. proposed a deep sequence model for premise selection in 2016 [6], and they claim to be the first team to involve deep neural networks in ATPs. Although premise selection is not directly linked to automated reasoning, it is still an important component in ATPs, and their paper provides some insights into how to process datasets of formally stated mathematical problems.

== First Order Logic and Connection Calculi ==
Here we assume basic first-order logic and theorem proving terminology, and we will offer a brief introduction of the bare prover and connection calculi. Let us try to prove the following first-order sentence.

[[file:fof_sentence.png|frameless|450px|center]]

This sentence can be transformed into a formula in Skolemized Disjunctive Normal Form (DNF), which is referred to as the "matrix".

[[file:skolemized_dnf.png|frameless|450px|center]]
[[file:matrix.png|frameless|center]]

The original first-order sentence is valid if and only if the Skolemized DNF formula is a tautology. The connection calculi attempt to show that the Skolemized DNF formula is a tautology by constructing a tableau. We will start at the special node, root, which is an open leaf. At each step, we select a clause (for example, clause <math display="inline">P \wedge R</math> is selected in the first step), and add the literals as children for an existing open leaf. For every open leaf, examine the path from the root to this leaf. If two literals on this path are unifiable (for example, <math display="inline">Qx'</math> is unifiable with <math display="inline">\neg Qc</math>), this leaf is then closed. An example of a closed tableaux is shown in Figure 1. In standard terminology, it states that a connection is found on this branch.

[[file:tableaux_example.png|thumb|center|Figure 1. An example of closed tableaux. Adapted from [2]]]

The goal is to close every leaf, i.e. to find a connection on every branch. If such a state is reached, the Skolemized DNF formula has been shown to be a tautology, which proves the original first-order sentence. As we can see from the constructed tableaux, the example sentence is indeed valid.

In formal terms, the rules of the connection calculi are shown in Figure 2, and the formal tableaux for the example sentence is shown in Figure 3. Each leaf is denoted as <math display="inline">subgoal, M, path</math> where <math display="inline">subgoal</math> is a list of literals for which connections still need to be found, <math display="inline">M</math> stands for the matrix, and <math display="inline">path</math> stands for the path leading to this leaf.

[[file:formal_calculi.png|thumb|600px|center|Figure 2. Formal connection calculi. Adapted from [2].]]
[[file:formal_tableaux.png|thumb|600px|center|Figure 3. Formal tableaux constructed from the example sentence. Adapted from [2].]]

To sum up, the bare prover follows a very simple algorithm. Given a matrix, a non-negated clause is chosen as the first subgoal. The function ''prove(subgoal, M, path)'' is stated as follows:
* If ''subgoal'' is empty
** return ''TRUE''
* If reduction is possible
** Perform reduction, generating ''new_subgoal'', ''new_path''
** return ''prove(new_subgoal, M, new_path)''
* For all clauses in ''M''
** If a clause can do extension with ''subgoal''
*** Perform extension, generating ''new_subgoal1'', ''new_path'', ''new_subgoal2''
*** return ''prove(new_subgoal1, M, new_path)'' and ''prove(new_subgoal2, M, path)''
* return ''FALSE''
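
The procedure above can be sketched for the propositional case, where no unification or Skolem terms are needed. This is a simplified illustration and not the authors' prover: literals are plain strings with <code>~</code> marking negation, alternative extension clauses are tried in order with backtracking, and a depth bound (<code>max_depth</code>, an assumption added here) guarantees termination.

```python
def negate(lit):
    """Complement of a propositional literal written as 'P' or '~P'."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def prove(subgoal, matrix, path, max_depth=10):
    """Try to close every literal in `subgoal` given the active `path`."""
    if not subgoal:
        return True
    if len(path) > max_depth:          # assumed bound, avoids infinite branches
        return False
    lit, rest = subgoal[0], subgoal[1:]
    # Reduction: the complement of the literal is already on the path.
    if negate(lit) in path:
        return prove(rest, matrix, path, max_depth)
    # Extension: connect the literal to a clause containing its complement.
    for clause in matrix:
        if negate(lit) in clause:
            new_subgoal = [l for l in clause if l != negate(lit)]
            if (prove(new_subgoal, matrix, path + [lit], max_depth)
                    and prove(rest, matrix, path, max_depth)):
                return True
    return False

def is_tautology(matrix, max_depth=10):
    """Start from each clause without negated literals, as in the text."""
    starts = [c for c in matrix if all(not l.startswith("~") for l in c)]
    return any(prove(list(c), matrix, [], max_depth) for c in starts)

# DNF matrix of (P and (P -> Q)) -> Q, i.e. ~P or (P and ~Q) or Q.
print(is_tautology([["~P"], ["P", "~Q"], ["Q"]]))
```

With a first-order matrix, the membership tests would be replaced by unification and clauses would be copied with fresh variables before each extension.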

It is important to note that the bare prover implemented in this paper is incomplete. Here is a pathological example. Suppose the following matrix (which is trivially a tautology) is fed into the bare prover. Let clause <math display="inline">P(0)</math> be the first subgoal. Clearly, choosing <math display="inline">\neg P(0)</math> to extend will complete the proof.

[[file:pathological.png|frameless|400px|center]]

However, if we choose <math display="inline">\neg P(x) \lor P(s(x))</math> to do extension, the algorithm will generate an infinite branch <math display="inline">P(0), P(s(0)), P(s(s(0))) ...</math>. It is the task of reinforcement learning to guide the prover in such scenarios towards a successful proof.

A technique called iterative deepening can be used to avoid such infinite loops, making the bare prover complete. Iterative deepening forces the prover to try all shorter proofs before moving on to longer ones. It is effective, but it also wastes valuable computing resources enumerating all short proofs.

In addition, the provability of first-order sentences is undecidable in general (a result due to Church and Turing, often called Church's theorem), which sheds light on the difficulty of automated theorem proving.

== Mizar Math Library ==
Mizar Math Library (MML) [7, 10] is a library of mathematical theories. The axioms behind the library are those of Tarski-Grothendieck set theory, written in first-order logic. The library contains 57,000+ theorems and their proofs, along with many other lemmas, as well as unproven conjectures. Figure 4 shows a Mizar article of the theorem "If <math display="inline"> p </math> is prime, then <math display="inline"> \sqrt p </math> is irrational."

[[file:mizar_article.png|thumb|center|Figure 4. An article from MML. Adapted from [6].]]

The training and testing data for this paper is a subset of MML, Mizar40, which consists of 32,524 theorems proved by automated theorem provers. Below is an example from the Mizar40 library; it states that with ''d3_xboole_0'' and ''t3_xboole_0'' as premises, we can prove ''t5_xboole_0''.

[[file:mizar40_0.png|frameless|400px|center]]
[[file:mizar40_1.png|frameless|600px|center]]
[[file:mizar40_2.png|frameless|600px|center]]
[[file:mizar40_3.png|frameless|600px|center]]

== Monte Carlo Guidance ==

Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes. It focuses on analyzing the most promising moves, expanding the search tree based on random sampling of the search space. The outcomes of these expansions are then used to weight the nodes in the search tree.

In the reinforcement learning setting, the action is defined as one inference (either reduction or extension). The proof state is defined as the whole tableaux. To implement Monte-Carlo tree search, each proof state <math display="inline"> i </math> needs to maintain three parameters, its prior probability <math display="inline"> p_i </math>, its total reward <math display="inline"> w_i </math>, and number of its visits <math display="inline"> n_i </math>. If no policy learning is used, the prior probabilities are all equal to one.

A simple heuristic is used to estimate the future reward of leaf states: suppose leaf state <math display="inline"> i </math> has <math display="inline"> G_i </math> open subgoals, the reward is computed as <math display="inline"> 0.95 ^ {G_i} </math>. This will be replaced once value learning is implemented.

The standard UCT formula is chosen to select the next actions in the playouts
\begin{align}
{\frac{w_i}{n_i}} + 2 \cdot p_i \cdot {\sqrt{\frac{\log N}{n_i}}}
\end{align}
where <math display="inline"> N </math> stands for the total number of visits of the parent node.
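
A minimal sketch of this selection rule, together with the <math display="inline"> 0.95^{G_i} </math> leaf heuristic, is given below. The dictionary layout for child nodes and the handling of unvisited children (scored as infinity so they are tried first) are assumptions made for this illustration.

```python
import math

def heuristic_reward(open_subgoals):
    """Leaf reward estimate 0.95 ** G_i used before value learning."""
    return 0.95 ** open_subgoals

def uct_score(w_i, n_i, p_i, parent_visits, c=2.0):
    """w_i / n_i + c * p_i * sqrt(log N / n_i), with c = 2 as in the formula."""
    return w_i / n_i + c * p_i * math.sqrt(math.log(parent_visits) / n_i)

def select_child(children, parent_visits):
    """Return the index of the child with the highest UCT score.

    Each child is a dict with total reward 'w', visit count 'n', prior 'p'.
    Unvisited children get an infinite score (an assumption of this sketch).
    """
    def score(child):
        if child["n"] == 0:
            return math.inf
        return uct_score(child["w"], child["n"], child["p"], parent_visits)
    return max(range(len(children)), key=lambda i: score(children[i]))

children = [
    {"w": 3.0, "n": 4, "p": 1.0},   # well explored, good average reward
    {"w": 0.5, "n": 1, "p": 1.0},   # barely explored
]
print(select_child(children, parent_visits=5))
```

The exploration term dominates for rarely visited nodes, so the second child is selected here even though its average reward is lower.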

The bare prover is asked to play <math display="inline"> b </math> playouts of length <math display="inline"> d </math> from the empty tableaux; each playout backpropagates the values of the proof states it visits. After these <math display="inline"> b </math> playouts, a special action (inference) is made, corresponding to an actual move, resulting in a new bigstep tableaux. The next <math display="inline"> b </math> playouts start from this tableaux, followed by another bigstep, and so on.

== Policy Learning and Guidance ==

From many runs of MCT, we can estimate the optimal prior probability of actions (inferences) in particular proof states. We extract the frequency of each action <math display="inline"> a </math> and normalize it by dividing by the average action frequency at that state, resulting in a relative proportion <math display="inline"> r_a \in (0, \infty) </math>. We characterize the proof states for policy learning by extracting human-engineered features. We also characterize actions by extracting features from the chosen clause and the chosen literal. Thus we obtain a feature vector <math display="inline"> (f_s, f_a) </math>.

The feature vector <math display="inline"> (f_s, f_a) </math> is regressed against the associated <math display="inline"> r_a </math>.

During the proof search, the prior probabilities <math display="inline"> p_i </math> of the available actions <math display="inline"> a_i </math> in a state <math display="inline"> s </math> are computed as the softmax of their predictions.

Training examples are only extracted from big step states, making the amount of training data manageable.
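
The two computations in this section, the regression target <math display="inline"> r_a </math> and the softmax over predictions, can be sketched as follows. The function names are illustrative, and the learned regressor itself is omitted.

```python
import math

def relative_proportions(visit_counts):
    """Training targets r_a: each action's frequency divided by the
    average action frequency in the same state."""
    mean = sum(visit_counts) / len(visit_counts)
    return [count / mean for count in visit_counts]

def softmax(predictions):
    """Turn the regressor's predictions for the available actions into priors p_i."""
    top = max(predictions)                       # subtract max for numerical stability
    exps = [math.exp(x - top) for x in predictions]
    total = sum(exps)
    return [e / total for e in exps]

print(relative_proportions([1, 3]))
print(softmax([0.0, 0.0]))
```

An action visited more often than average gets a target above 1, and the softmax guarantees that the resulting priors are positive and sum to one.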

== Value Learning and Guidance ==

Bigstep states are also used for proof state evaluation. For a proof state <math display="inline"> s </math>, if it corresponds to a successful proof, the value is assigned as <math display="inline"> v_s = 1 </math>. If it corresponds to a failed proof, the value is assigned as <math display="inline"> v_s = 0 </math>. For other scenarios, denote the distance between state <math display="inline"> s </math> and a successful state as <math display="inline"> d_s </math>, then the value is assigned as <math display="inline"> v_s = 0.99^{d_s} </math>

The proof state feature <math display="inline"> f_s </math> is regressed against the value <math display="inline"> v_s </math>. During the proof search, the rewards of leaf states are computed from this prediction.
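
One way to read the value assignment is sketched below, assuming the bigstep states of a single proof attempt are stored in order and the distance <math display="inline"> d_s </math> is counted in bigsteps to the final state. Both the function name and this way of measuring distance are assumptions of the sketch.

```python
def value_targets(num_bigsteps, proof_found, discount=0.99):
    """Value targets v_s for the bigstep states of one proof attempt.

    A failed attempt gives 0 for every state; in a successful attempt the
    state at distance d from the final state gets discount ** d, so the
    final (successful) state itself gets 1.
    """
    if not proof_found:
        return [0.0] * num_bigsteps
    return [discount ** (num_bigsteps - 1 - i) for i in range(num_bigsteps)]

print(value_targets(3, proof_found=True))
print(value_targets(2, proof_found=False))
```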

== Features and Learners ==
For proof states, features are collected from the whole tableaux (subgoals, matrix, and paths). Each unique symbol is represented by an integer, and the tableaux can be represented as a sequence of integers. Term walk is implemented to combine a sequence of integers into a single integer by multiplying components by a fixed large prime and adding them up. Then the resulting integer is reduced to a smaller feature space by taking modulo by a large prime.
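
The hashing step can be sketched as follows. The walk length of 3 and the two prime constants are assumptions chosen for illustration, and the sketch walks over a flat sequence of symbol codes rather than a term tree.

```python
from collections import Counter

MULTIPLIER = 1_000_003   # assumed fixed large prime used to combine symbols
MODULUS = 65_521         # assumed prime that bounds the feature space

def hash_walk(symbols):
    """Combine a short sequence of symbol codes into one integer feature id."""
    h = 0
    for s in symbols:
        h = h * MULTIPLIER + s   # multiply by the prime, then add the next symbol
    return h % MODULUS           # reduce to the smaller feature space

def term_walk_features(sequence, length=3):
    """Count hashed walks of the given length over a sequence of symbol codes."""
    counts = Counter()
    for i in range(len(sequence) - length + 1):
        counts[hash_walk(sequence[i:i + length])] += 1
    return counts

features = term_walk_features([1, 2, 3, 1, 2, 3])
print(features)
```

The resulting sparse counts are the kind of input a gradient-boosted tree learner such as XGBoost can consume directly.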

For actions the feature extraction process is similar, but the term walk is over the chosen literal and the chosen clause.

In addition to the term walks, they also added several common features: number of goals, total symbol size of all goals, length of active paths, number of current variable instantiations, most common symbols.

The whole project is implemented in OCaml, and XGBoost is ported into OCaml as the learner.

== Experimental Results ==
The authors use the M2k dataset to compare the performance of mlCoP, the bare prover and rlCoP using only UCT.
*Performance without Learning
Table 3 shows the baseline result. The performance of the bare prover is significantly lower than that of mlCoP and of rlCoP without policy/value learning.
[[file:table3.png|550px|center]]
*Reinforcement Learning of Policy Only
In this experiment, the authors evaluated rlCoP with UCT on the dataset using policy learning only. They used the policy training data from previous iterations to train a new predictor after each iteration. This means that only the first iteration ran without a policy, while all remaining iterations used the previous policy training data.
From Table 4, rlCoP is better than mlCoP run with the much higher <math>4 \times 10^{6}</math> inference limit after the fourth iteration.
[[file:table4.png|550px|center]]
*Reinforcement Learning of Value Only
This experiment was similar to the last one; however, only value learning was used rather than a learned policy. From Table 5, the performance of rlCoP is close to mlCoP but below it after 20 iterations, and it is far below rlCoP using only policy learning.
[[file:table5.png|550px|center]]
*Reinforcement Learning of Policy and Value
From Table 6, the performance of rlCoP is 19.4% higher than mlCoP with <math>4 \times 10^{6}</math> inferences, 13.6% higher than the best iteration of rlCoP with policy only, and 44.3% higher than the best iteration of rlCoP with value only after 20 iterations.
[[file:table6.png|550px|center]]
The authors also evaluated the effect of joint reinforcement learning of both policy and value: replacing the final policy or the final value predictor with the best one from the policy-only or value-only runs decreased performance in both cases.

*Evaluation on the Whole Miz40 Dataset.
The authors split the Mizar40 dataset into 90% training examples and 10% testing examples. 200,000 inferences are allowed for each problem. 10 iterations of policy and value learning are performed (based on MCT). The training and testing results are shown as follows. In the table, ''mlCoP'' denotes the bare prover with iterative deepening (i.e. a complete automated theorem prover with connection calculi), and ''bare prover'' stands for the prover implemented in this paper, without MCT guidance.

[[file:atp_result0.jpg|frame|550px|center|Figure 5a. Experimental result on Mizar40 dataset]]
[[file:atp_result1.jpg|frame|550px|center|Figure 5b. More experimental result on Mizar40 dataset]]

As shown by these results, reinforcement learning leads to a significant performance increase for automated theorem proving. The 42.1% performance improvement is unusually high, since published improvements in this field are typically between 3% and 10%. [1]

Besides these results, the authors also found that some test problems could be solved easily by rlCoP but not by mlCoP.

== Conclusions ==
In this work, the authors developed an automated theorem prover that uses no domain engineering and instead relies on MCT guided by reinforcement learning. The resulting system is more than 40% stronger than the baseline system. The authors believe that this is a landmark in the field of automated reasoning, demonstrating that building general problem solvers by reinforcement learning is a viable approach. [1]

The authors propose that future research could include stronger learning algorithms to characterize mathematical data. The development of suitable deep learning architectures would help the algorithm characterize semantic and syntactic features of mathematical objects, which will be crucial for creating strong assistants for mathematics and the hard sciences.

== Critiques ==
Automated reasoning is still relatively new to the field of machine learning, and this paper shows a lot of promise in this research area.

The feature extraction part of this paper is less than optimal. It is my opinion that with proper neural network architecture, deep learning extracted features will be superior to human-engineered features, which is also shown in [4, 6].

Also, the policy-value learning iteration is quite inefficient. The learning loop is:
* Loop
** Run MCT with the previous model on an entire dataset
** Collect MCT data
** Train a new model
If we adapt this to an online learning scheme, learning as soon as MCT generates new data and updating the model immediately, there might be some performance increase.

The experimental design of this paper has some flaws. The authors compare the performance of ''mlCoP'' and ''rlCoP'' by limiting them to the same number of inference steps. However, every inference step of ''rlCoP'' requires additional machine learning prediction, which costs more time. A better way to compare their performance is to set a time limit.

It would also be interesting to study automated theorem proving in another logic system, such as higher-order logic, because many mathematical concepts can only be expressed in higher-order logic.

== References ==
[1] C. Kaliszyk, et al. Reinforcement Learning of Theorem Proving. NIPS 2018.

[2] J. Otten and W. Bibel. leanCoP: Lean Connection-Based Theorem Proving. Journal of Symbolic Computation, vol. 36, pp. 139-161, 2003.

[3] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly Efficient Machine Learning Connection Prover. Lecture Notes in Computer Science. vol. 9450. pp. 88-96, 2015.

[4] S. Loos, et al. Deep Network Guided Proof Search. LPAR-21, 2017.

[5] M. Färber, C. Kaliszyk, and J. Urban. Monte Carlo tableau proof search. In L. de Moura, editor, 26th International Conference on Automated Deduction (CADE), volume 10395 of LNCS, pages 563–579. Springer, 2017.

[6] A. Alemi, et al. DeepMath-Deep Sequence Models for Premise Selection. NIPS 2016.

[7] Mizar Math Library. http://mizar.org/library/

[8] J. Urban, G. Sutcliffe, P. Pudlák, and J. Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In A. Armando, P. Baumgartner, and G. Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441–456. Springer, 2008.

[9] J. Otten and W. Bibel. leanCoP: lean connection-based theorem proving. J. Symb. Comput., 36(1-2):139–161, 2003.

[10] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Reasoning, 3(2):153–245, 2010.

Reinforcement Learning of Theorem Proving

2018-11-30T03:32:34Z

J385chen:

== Introduction ==
Automated reasoning over mathematical proofs was a major motivation for the development of computer science. Automated theorem provers (ATPs) can in principle be used to attack any formally stated mathematical problem, and automated theorem proving is a research area that has been present since the early 20th century [1]. As of today, state-of-the-art ATP systems rely on fast implementations of complete proof calculi, such as resolution and tableau. However, they are still far weaker than trained mathematicians. Within current ATP systems, many heuristics are essential for performance. As a result, in recent years machine learning has been used to replace such heuristics and improve the performance of ATPs.

In this paper, the authors propose a reinforcement learning based ATP, rlCoP. The proposed ATP reasons within first-order logic. The underlying proof calculi are the connection calculi [2], and the reinforcement learning method is Monte Carlo tree search along with policy and value learning. It is shown that reinforcement learning results in a 42.1% performance increase compared to the base prover (without learning).

== Related Work ==
C. Kaliszyk and J. Urban proposed a supervised learning based ATP, FEMaLeCoP, in 2015 [3]; its underlying proof calculus is the same as the one used in this paper. Their algorithm learns from existing proofs to choose the next tableau extension step. Since the MaLARea [8] system, a number of iterations of a feedback loop between proving and learning have been explored, remarkably improving over human-designed heuristics when reasoning in large theories. However, such systems are known to learn only a high-level selection of relevant facts from a large knowledge base and to delegate the internal proof search to standard ATP systems. S. Loos, et al. developed a supervised learning ATP system in 2017 [4], with superposition as the proof calculus. However, they chose deep neural networks (CNNs and RNNs) as feature extractors. These systems are treated as black boxes in the literature, with little understanding of their performance possible.

In leanCoP [9], one of the simpler connection tableau systems, the next tableau extension step could be selected using supervised learning. In addition, the first experiments with Monte-Carlo guided proof search [5] have been done for connection tableau systems. That work is the closest to the authors' approach, but the improvement over the baseline measured there is much less significant than in this paper.

On a different note, A. Alemi, et al. proposed a deep sequence model for premise selection in 2016 [6], and they claim to be the first team to involve deep neural networks in ATPs. Although premise selection is not directly linked to automated reasoning, it is still an important component in ATPs, and their paper provides some insights into how to process datasets of formally stated mathematical problems.

== First Order Logic and Connection Calculi ==
Here we assume basic first-order logic and theorem proving terminology, and we will offer a brief introduction of the bare prover and connection calculi. Let us try to prove the following first-order sentence.

[[file:fof_sentence.png|frameless|450px|center]]

This sentence can be transformed into a formula in Skolemized Disjunctive Normal Form (DNF), which is referred to as the "matrix".

[[file:skolemized_dnf.png|frameless|450px|center]]
[[file:matrix.png|frameless|center]]

The original first-order sentence is valid if and only if the Skolemized DNF formula is a tautology. The connection calculi attempt to show that the Skolemized DNF formula is a tautology by constructing a tableau. We start at a special node, the root, which is an open leaf. At each step, we select a clause (for example, clause <math display="inline">P \wedge R</math> is selected in the first step) and add its literals as children of an existing open leaf. For every open leaf, we examine the path from the root to this leaf. If the path contains two literals of opposite polarity whose atoms are unifiable (for example, <math display="inline">Qx'</math> is unifiable with <math display="inline">\neg Qc</math>), the leaf is closed. In standard terminology, we say that a connection has been found on this branch. An example of a closed tableau is shown in Figure 1.

[[file:tableaux_example.png|thumb|center|Figure 1. An example of closed tableaux. Adapted from [2]]]

The goal is to close every leaf, i.e. to find a connection on every branch. If such a state is reached, we have shown that the Skolemized DNF formula is a tautology, thus proving the original first-order sentence. As we can see from the constructed tableau, the example sentence is indeed valid.

In formal terms, the rules of the connection calculus are shown in Figure 2, and the formal tableau for the example sentence is shown in Figure 3. Each leaf is denoted as <math display="inline">subgoal, M, path</math>, where <math display="inline">subgoal</math> is a list of literals for which a connection still has to be found, <math display="inline">M</math> stands for the matrix, and <math display="inline">path</math> stands for the path leading to this leaf.

[[file:formal_calculi.png|thumb|600px|center|Figure 2. Formal connection calculi. Adapted from [2].]]
[[file:formal_tableaux.png|thumb|600px|center|Figure 3. Formal tableaux constructed from the example sentence. Adapted from [2].]]

To sum up, the bare prover follows a very simple algorithm. Given a matrix, a non-negated clause is chosen as the first subgoal. The function ''prove(subgoal, M, path)'' is stated as follows:
* If ''subgoal'' is empty
** return ''TRUE''
* If reduction is possible
** Perform reduction, generating ''new_subgoal'', ''new_path''
** return ''prove(new_subgoal, M, new_path)''
* For all clauses in ''M''
** If a clause can do extension with ''subgoal''
*** Perform extension, generating ''new_subgoal1'', ''new_path'', ''new_subgoal2''
*** return ''prove(new_subgoal1, M, new_path)'' and ''prove(new_subgoal2, M, path)''
* return ''FALSE''
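
The recursion above can be illustrated with a small propositional sketch in Python. It is not the authors' OCaml implementation: literals are plain strings with a leading "-" for negation, so unification reduces to string comparison, and the helper ''complement'', the explicit backtracking over alternatives, and the ''depth'' bound on extension steps are assumptions added here to keep the example short and terminating.

```python
def complement(lit):
    """Return the negation of a propositional literal such as "P" or "-P"."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def prove(subgoal, matrix, path, depth=20):
    """Try to close every literal in `subgoal` (propositional sketch).

    matrix: list of clauses, each clause a list of literal strings.
    path:   literals on the branch leading to the current leaf.
    depth:  assumed bound on nested extension steps (not in the source).
    """
    if not subgoal:
        return True
    if depth == 0:
        return False
    lit, rest = subgoal[0], subgoal[1:]
    neg = complement(lit)
    # Reduction: the complement of the current literal is already on the path.
    if neg in path and prove(rest, matrix, path, depth):
        return True
    # Extension: connect the literal to a clause containing its complement.
    for clause in matrix:
        if neg in clause:
            new_subgoal = [l for l in clause if l != neg]
            if (prove(new_subgoal, matrix, path + [lit], depth - 1)
                    and prove(rest, matrix, path, depth)):
                return True
    return False
```

For the matrix of the tautology <math display="inline">(P \wedge Q) \lor \neg P \lor \neg Q</math>, the call ''prove(["P", "Q"], [["P", "Q"], ["-P"], ["-Q"]], [])'' succeeds with two extension steps. The depth bound only stands in for a proper termination strategy; without it, a matrix whose clauses keep reintroducing the same literals could recurse forever.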

It is important to note that the bare prover implemented in this paper is incomplete. Here is a pathological example. Suppose the following matrix (which is trivially a tautology) is fed into the bare prover. Let clause <math display="inline">P(0)</math> be the first subgoal. Clearly, choosing <math display="inline">\neg P(0)</math> for the extension step completes the proof.

[[file:pathological.png|frameless|400px|center]]

However, if we choose <math display="inline">\neg P(x) \lor P(s(x))</math> for the extension step, the algorithm will generate an infinite branch <math display="inline">P(0), P(s(0)), P(s(s(0))), \ldots</math>. It is the task of reinforcement learning to guide the prover in such scenarios towards a successful proof.

In addition, the provability of first-order sentences is undecidable in general (a result due to Church and Turing, commonly known as Church's theorem), which sheds light on the difficulty of automated theorem proving.

== Mizar Math Library ==
Mizar Math Library (MML) [7, 10] is a library of mathematical theories. The axioms behind the library are those of Tarski-Grothendieck set theory, written in first-order logic. The library contains 57,000+ theorems and their proofs, along with many other lemmas, as well as unproven conjectures. Figure 4 shows a Mizar article on the theorem "If <math display="inline"> p </math> is prime, then <math display="inline"> \sqrt p </math> is irrational."

[[file:mizar_article.png|thumb|center|Figure 4. An article from MML. Adapted from [6].]]

The training and testing data for this paper is a subset of MML, Mizar40, which consists of 32,524 theorems proved by automated theorem provers. Below is an example from the Mizar40 library. It states that with ''d3_xboole_0'' and ''t3_xboole_0'' as premises, we can prove ''t5_xboole_0''.

[[file:mizar40_0.png|frameless|400px|center]]
[[file:mizar40_1.png|frameless|600px|center]]
[[file:mizar40_2.png|frameless|600px|center]]
[[file:mizar40_3.png|frameless|600px|center]]

== Monte Carlo Guidance ==

Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes. It focuses on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. The outcomes of these expansions are then used to weight the nodes in the search tree.

In the reinforcement learning setting, an action is defined as one inference (either reduction or extension), and a proof state is defined as the whole tableau. To implement Monte Carlo tree search, each proof state <math display="inline"> i </math> needs to maintain three parameters: its prior probability <math display="inline"> p_i </math>, its total reward <math display="inline"> w_i </math>, and its number of visits <math display="inline"> n_i </math>. If no policy learning is used, the prior probabilities are all equal to one.

A simple heuristic is used to estimate the future reward of leaf states: suppose leaf state <math display="inline"> i </math> has <math display="inline"> G_i </math> open subgoals, the reward is computed as <math display="inline"> 0.95 ^ {G_i} </math>. This will be replaced once value learning is implemented.

The standard UCT formula is chosen to select the next actions in the playouts
\begin{align}
{\frac{w_i}{n_i}} + 2 \cdot p_i \cdot {\sqrt{\frac{\log N}{n_i}}}
\end{align}
where <math display="inline"> N </math> stands for the total number of visits of the parent node.
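
The selection rule and the heuristic leaf reward can be sketched in Python as follows. The function names and the tuple layout used for children are hypothetical, and giving unvisited children (<math display="inline"> n_i = 0 </math>) an infinite score is an assumption made here to avoid a division by zero; the text above does not say how such children are handled.

```python
import math


def leaf_reward(open_subgoals):
    """Heuristic reward 0.95 ** G_i for a leaf state with G_i open subgoals."""
    return 0.95 ** open_subgoals


def uct_score(w_i, n_i, p_i, parent_visits, c=2.0):
    """UCT value w_i / n_i + c * p_i * sqrt(log(N) / n_i)."""
    if n_i == 0:
        # Assumption: unvisited children are always tried first.
        return math.inf
    return w_i / n_i + c * p_i * math.sqrt(math.log(parent_visits) / n_i)


def select_child(children, parent_visits):
    """Pick the child with the highest UCT score.

    children: list of (name, w_i, n_i, p_i) tuples (hypothetical layout).
    """
    best = max(children, key=lambda ch: uct_score(ch[1], ch[2], ch[3], parent_visits))
    return best[0]
```

With all priors equal to one, a rarely visited child can outrank a well-explored one because the square-root term grows as <math display="inline"> n_i </math> shrinks, which is what drives exploration.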

The bare prover is asked to play <math display="inline"> b </math> playouts of length <math display="inline"> d </math> from the empty tableau. Each playout backpropagates the values of the proof states it visits. After these <math display="inline"> b </math> playouts, a special action (inference) is made, corresponding to an actual move and resulting in a new bigstep tableau. The next <math display="inline"> b </math> playouts start from this tableau, followed by another bigstep, and so on.

== Policy Learning and Guidance ==

From many runs of MCTS, we obtain an estimate of the optimal prior probability of actions (inferences) in particular proof states. We extract the frequency of each action <math display="inline"> a </math> and normalize it by dividing by the average action frequency at that state, resulting in a relative proportion <math display="inline"> r_a \in (0, \infty) </math>. We characterize the proof states for policy learning by extracting human-engineered features. Likewise, we characterize actions by extracting features from the chosen clause and the chosen literal. This yields a feature vector <math display="inline"> (f_s, f_a) </math>.

The feature vector <math display="inline"> (f_s, f_a) </math> is regressed against the associated <math display="inline"> r_a </math>.

During the proof search, the prior probabilities <math display="inline"> p_i </math> of the available actions <math display="inline"> a_i </math> in a state <math display="inline"> s </math> are computed as the softmax of their predictions.
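
As a rough illustration of the two computations described in this section, the following Python sketch turns action frequencies into the regression targets <math display="inline"> r_a </math> and turns model predictions into priors with a softmax. The function names and the dictionary of counts are hypothetical, and the regression model itself is omitted.

```python
import math


def relative_proportions(action_counts):
    """Map each action to r_a = frequency / average frequency at this state.

    action_counts: dict from action name to a positive visit count
    (hypothetical input format).
    """
    mean = sum(action_counts.values()) / len(action_counts)
    return {action: count / mean for action, count in action_counts.items()}


def softmax(predictions):
    """Turn a list of real-valued predictions into probabilities."""
    shift = max(predictions)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in predictions]
    total = sum(exps)
    return [e / total for e in exps]
```

An action chosen more often than average gets a target above one and an action chosen less often gets a target below one, so the targets stay comparable across states with different numbers of available actions.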

Training examples are extracted only from bigstep states, which keeps the amount of training data manageable.

== Value Learning and Guidance ==

Bigstep states are also used for proof state evaluation. For a proof state <math display="inline"> s </math>, if it corresponds to a successful proof, the value is assigned as <math display="inline"> v_s = 1 </math>. If it corresponds to a failed proof, the value is assigned as <math display="inline"> v_s = 0 </math>. Otherwise, let <math display="inline"> d_s </math> denote the distance between state <math display="inline"> s </math> and a successful state; the value is then assigned as <math display="inline"> v_s = 0.99^{d_s} </math>.

The proof state feature vector <math display="inline"> f_s </math> is regressed against the value <math display="inline"> v_s </math>. During the proof search, the reward of leaf states is computed from this prediction.
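
The value targets described above amount to a small case distinction, sketched here in Python. The outcome labels "proved" and "failed" and the function name are hypothetical; only the constants 1, 0 and 0.99 come from the text.

```python
def value_target(outcome, distance=None, discount=0.99):
    """Training value v_s for a bigstep state.

    outcome:  "proved", "failed", or any other label for an intermediate state
              (hypothetical labels).
    distance: d_s, the distance from the state to a successful state;
              only used for intermediate states.
    """
    if outcome == "proved":
        return 1.0
    if outcome == "failed":
        return 0.0
    return discount ** distance
```

States close to a finished proof therefore receive targets just below one, and the target decays geometrically with the distance to success.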

== Features and Learners ==
For proof states, features are collected from the whole tableau (subgoals, matrix, and paths). Each unique symbol is represented by an integer, so the tableau can be represented as a sequence of integers. A term walk combines a sequence of integers into a single integer by multiplying the components by a fixed large prime and adding them up. The resulting integer is then reduced to a smaller feature space by taking it modulo a large prime.
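
One possible reading of this hashing scheme is sketched below in Python. The constants, the window length of 3, the Horner-style combination, and the function names are all assumptions chosen for illustration; the text above does not fix any of them.

```python
def walk_hash(symbols, prime=1000003, modulus=131071):
    """Combine a short sequence of symbol ids into one feature index.

    Each step multiplies the running value by a fixed prime and adds the
    next symbol id; the result is reduced modulo another prime.
    Both constants are illustrative, not taken from the paper.
    """
    h = 0
    for s in symbols:
        h = h * prime + s
    return h % modulus


def term_walk_features(sequence, length=3):
    """Count hashed windows of `length` consecutive symbol ids."""
    counts = {}
    for i in range(len(sequence) - length + 1):
        feature = walk_hash(sequence[i:i + length])
        counts[feature] = counts.get(feature, 0) + 1
    return counts
```

The result is a sparse mapping from feature index to count, which is a convenient input format for a gradient boosted tree learner.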

For actions the feature extraction process is similar, but the term walk is over the chosen literal and the chosen clause.

In addition to the term walks, they also added several common features: number of goals, total symbol size of all goals, length of active paths, number of current variable instantiations, most common symbols.

The whole project is implemented in OCaml, and XGBoost is ported into OCaml as the learner.

== Experimental Results ==
The authors use the M2k dataset to compare the performance of mlCoP, the bare prover and rlCoP using only UCT.
*Performance without Learning
Table 3 shows the baseline result. The performance of the bare prover is significantly lower than that of mlCoP and of rlCoP without policy/value learning.
[[file:table3.png|550px|center]]
*Reinforcement Learning of Policy Only
In this experiment, the authors evaluated rlCoP with UCT on the dataset using policy learning only. After each iteration, they used the policy training data from the previous iterations to train a new predictor. This means that only the first iteration ran without a policy, while all subsequent iterations used a policy trained on data from previous iterations.
From Table 4, after the fourth iteration rlCoP is better than mlCoP run with the much higher <math>4 \times 10^{6}</math> inference limit.
[[file:table4.png|550px|center]]
*Reinforcement Learning of Value Only
This experiment was similar to the last one, except that only value learning was used rather than a learned policy. From Table 5, the performance of rlCoP after 20 iterations is close to but below that of mlCoP, and far below that of rlCoP using only policy learning.
[[file:table5.png|550px|center]]
*Reinforcement Learning of Policy and Value
From Table 6, after 20 iterations the performance of rlCoP is 19.4% higher than that of mlCoP with <math>4 \times 10^{6}</math> inferences, 13.6% higher than the best iteration of rlCoP with policy only, and 44.3% higher than the best iteration of rlCoP with value only.
[[file:table6.png|550px|center]]
The authors also evaluated the effect of the joint reinforcement learning of both policy and value. Replacing the final policy or value with the best one from the policy-only or value-only runs decreased performance in both cases.

*Evaluation on the Whole Miz40 Dataset
The authors split the Mizar40 dataset into 90% training examples and 10% testing examples. 200,000 inferences are allowed for each problem, and 10 iterations of policy and value learning are performed (based on MCTS). The training and testing results are shown below. In the table, ''mlCoP'' stands for the bare prover with iterative deepening (i.e. a complete automated theorem prover with connection calculi), and ''bare prover'' stands for the prover implemented in this paper, without MCTS guidance.

[[file:atp_result0.jpg|frame|550px|center|Figure 5a. Experimental results on the Mizar40 dataset]]
[[file:atp_result1.jpg|frame|550px|center|Figure 5b. More experimental results on the Mizar40 dataset]]

As shown by these results, reinforcement learning leads to a significant performance increase for automated theorem proving. The 42.1% performance improvement is unusually high, since published improvements in this field are typically between 3% and 10% [1].

In addition, some test problems were found that rlCoP could solve easily but mlCoP could not.

== Conclusions ==
In this work, the authors developed an automated theorem prover that uses no domain engineering and instead relies on MCTS guided by reinforcement learning. The resulting system is more than 40% stronger than the baseline system. The authors believe that this is a landmark in the field of automated reasoning, demonstrating that building general problem solvers by reinforcement learning is a viable approach. [1]

The authors suggest that future research could include stronger learning algorithms to characterize mathematical data. The development of suitable deep learning architectures would help the algorithm characterize semantic and syntactic features of mathematical objects, which will be crucial for creating strong assistants for mathematics and the hard sciences.

== Critiques ==
Automated reasoning is still relatively new to the field of machine learning, and this paper shows a lot of promise in this research area.

The feature extraction part of this paper is less than optimal. It is my opinion that with proper neural network architecture, deep learning extracted features will be superior to human-engineered features, which is also shown in [4, 6].

Also, the policy-value learning iteration is quite inefficient. The learning loop is:
* Loop
** Run MCT with the previous model on an entire dataset
** Collect MCT data
** Train a new model
If we adapt this to an online learning scheme, learning as soon as MCTS generates new data and updating the model immediately, there might be some performance increase.

The experimental design of this paper has some flaws. The authors compare the performance of ''mlCoP'' and ''rlCoP'' by limiting them to the same number of inference steps. However, every inference step of ''rlCoP'' requires additional machine learning prediction, which costs more time. A better way to compare their performance is to set a time limit.

It would also be interesting to study automated theorem proving in another logic system, such as higher-order logic, because many mathematical concepts can only be expressed in higher-order logic.

== References ==
[1] C. Kaliszyk, et al. Reinforcement Learning of Theorem Proving. NIPS 2018.

[2] J. Otten and W. Bibel. leanCoP: Lean Connection-Based Theorem Proving. Journal of Symbolic Computation, vol. 36, pp. 139-161, 2003.

[3] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly Efficient Machine Learning Connection Prover. Lecture Notes in Computer Science. vol. 9450. pp. 88-96, 2015.

[4] S. Loos, et al. Deep Network Guided Proof Search. LPAR-21, 2017.

[5] M. Färber, C. Kaliszyk, and J. Urban. Monte Carlo tableau proof search. In L. de Moura, editor, 26th International Conference on Automated Deduction (CADE), volume 10395 of LNCS, pages 563–579. Springer, 2017.

[6] A. Alemi, et al. DeepMath-Deep Sequence Models for Premise Selection. NIPS 2016.

[7] Mizar Math Library. http://mizar.org/library/

[8] J. Urban, G. Sutcliffe, P. Pudlák, and J. Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In A. Armando, P. Baumgartner, and G. Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441–456. Springer, 2008.

[9] J. Otten and W. Bibel. leanCoP: lean connection-based theorem proving. J. Symb. Comput., 36(1-2):139–161, 2003.

[10] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Reasoning, 3(2):153–245, 2010.

Reinforcement Learning of Theorem Proving

2018-11-30T03:31:06Z

J385chen:

== Introduction ==
Automated reasoning over mathematical proofs was a major motivation for the development of computer science. Automated theorem provers (ATPs) can in principle be used to attack any formally stated mathematical problem, and automated theorem proving has been a research area since the early 20th century [1]. Today, state-of-the-art ATP systems rely on fast implementations of complete proof calculi such as resolution and tableau. However, they are still far weaker than trained mathematicians. Within current ATP systems, many heuristics are essential for performance. As a result, in recent years machine learning has been used to replace such heuristics and improve the performance of ATPs.

In this paper, the authors propose a reinforcement learning based ATP, rlCoP. The proposed ATP reasons within first-order logic. The underlying proof calculi are the connection calculi [2], and the reinforcement learning method is Monte Carlo tree search along with policy and value learning. It is shown that reinforcement learning results in a 42.1% performance increase compared to the base prover (without learning).

== Related Work ==
In 2015, C. Kaliszyk and J. Urban proposed FEMaLeCoP, a supervised learning based ATP whose underlying proof calculus is the same as in this paper [3]. Their algorithm learns from existing proofs to choose the next tableau extension step. Since the MaLARea [8] system, a number of iterations of a feedback loop between proving and learning have been explored, remarkably improving over human-designed heuristics when reasoning in large theories. However, such systems are known to learn only a high-level selection of relevant facts from a large knowledge base and to delegate the internal proof search to standard ATP systems. S. Loos, et al. developed a supervised learning ATP system in 2017 [4], with superposition as its proof calculus. However, they chose deep neural networks (CNNs and RNNs) as feature extractors. These systems are treated as black boxes in the literature, with little understanding of their performance possible.

In leanCoP [9], one of the simpler connection tableau systems, the next tableau extension step could be selected using supervised learning. In addition, the first experiments with Monte-Carlo guided proof search [5] have been done for connection tableau systems. That work is the closest to the authors' approach, but the improvement over the baseline measured there is much less significant than in this paper.

On a different note, A. Alemi, et al. proposed a deep sequence model for premise selection in 2016 [6], and they claim to be the first team to involve deep neural networks in ATPs. Although premise selection is not directly linked to automated reasoning, it is still an important component in ATPs, and their paper provides some insights into how to process datasets of formally stated mathematical problems.

== First Order Logic and Connection Calculi ==
Here we assume basic first-order logic and theorem proving terminology, and we will offer a brief introduction of the bare prover and connection calculi. Let us try to prove the following first-order sentence.

[[file:fof_sentence.png|frameless|450px|center]]

This sentence can be transformed into a formula in Skolemized Disjunctive Normal Form (DNF), which is referred to as the "matrix".

[[file:skolemized_dnf.png|frameless|450px|center]]
[[file:matrix.png|frameless|center]]

The original first-order sentence is valid if and only if the Skolemized DNF formula is a tautology. The connection calculi attempt to show that the Skolemized DNF formula is a tautology by constructing a tableau. We start at a special node, the root, which is an open leaf. At each step, we select a clause (for example, clause <math display="inline">P \wedge R</math> is selected in the first step) and add its literals as children of an existing open leaf. For every open leaf, we examine the path from the root to this leaf. If the path contains two literals of opposite polarity whose atoms are unifiable (for example, <math display="inline">Qx'</math> is unifiable with <math display="inline">\neg Qc</math>), the leaf is closed. In standard terminology, we say that a connection has been found on this branch. An example of a closed tableau is shown in Figure 1.

[[file:tableaux_example.png|thumb|center|Figure 1. An example of closed tableaux. Adapted from [2]]]

The goal is to close every leaf, i.e. to find a connection on every branch. If such a state is reached, we have shown that the Skolemized DNF formula is a tautology, thus proving the original first-order sentence. As we can see from the constructed tableau, the example sentence is indeed valid.

In formal terms, the rules of the connection calculus are shown in Figure 2, and the formal tableau for the example sentence is shown in Figure 3. Each leaf is denoted as <math display="inline">subgoal, M, path</math>, where <math display="inline">subgoal</math> is a list of literals for which a connection still has to be found, <math display="inline">M</math> stands for the matrix, and <math display="inline">path</math> stands for the path leading to this leaf.

[[file:formal_calculi.png|thumb|600px|center|Figure 2. Formal connection calculi. Adapted from [2].]]
[[file:formal_tableaux.png|thumb|600px|center|Figure 3. Formal tableaux constructed from the example sentence. Adapted from [2].]]

To sum up, the bare prover follows a very simple algorithm. Given a matrix, a non-negated clause is chosen as the first subgoal. The function ''prove(subgoal, M, path)'' is stated as follows:
* If ''subgoal'' is empty
** return ''TRUE''
* If reduction is possible
** Perform reduction, generating ''new_subgoal'', ''new_path''
** return ''prove(new_subgoal, M, new_path)''
* For all clauses in ''M''
** If a clause can do extension with ''subgoal''
*** Perform extension, generating ''new_subgoal1'', ''new_path'', ''new_subgoal2''
*** return ''prove(new_subgoal1, M, new_path)'' and ''prove(new_subgoal2, M, path)''
* return ''FALSE''

It is important to note that the bare prover implemented in this paper is incomplete. Here is a pathological example. Suppose the following matrix (which is trivially a tautology) is fed into the bare prover. Let clause <math display="inline">P(0)</math> be the first subgoal. Clearly, choosing <math display="inline">\neg P(0)</math> for the extension step completes the proof.

[[file:pathological.png|frameless|400px|center]]

However, if we choose <math display="inline">\neg P(x) \lor P(s(x))</math> for the extension step, the algorithm will generate an infinite branch <math display="inline">P(0), P(s(0)), P(s(s(0))), \ldots</math>. It is the task of reinforcement learning to guide the prover in such scenarios towards a successful proof.

In addition, the provability of first-order sentences is undecidable in general (a result due to Church and Turing, commonly known as Church's theorem), which sheds light on the difficulty of automated theorem proving.

== Mizar Math Library ==
Mizar Math Library (MML) [7, 10] is a library of mathematical theories. The axioms behind the library are those of Tarski-Grothendieck set theory, written in first-order logic. The library contains 57,000+ theorems and their proofs, along with many other lemmas, as well as unproven conjectures. Figure 4 shows a Mizar article on the theorem "If <math display="inline"> p </math> is prime, then <math display="inline"> \sqrt p </math> is irrational."

[[file:mizar_article.png|thumb|center|Figure 4. An article from MML. Adapted from [6].]]

The training and testing data for this paper is a subset of MML, Mizar40, which consists of 32,524 theorems proved by automated theorem provers. Below is an example from the Mizar40 library. It states that with ''d3_xboole_0'' and ''t3_xboole_0'' as premises, we can prove ''t5_xboole_0''.

[[file:mizar40_0.png|frameless|400px|center]]
[[file:mizar40_1.png|frameless|600px|center]]
[[file:mizar40_2.png|frameless|600px|center]]
[[file:mizar40_3.png|frameless|600px|center]]

== Monte Carlo Guidance ==

Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes. It focuses on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. The outcomes of these expansions are then used to weight the nodes in the search tree.

In the reinforcement learning setting, an action is defined as one inference (either reduction or extension), and a proof state is defined as the whole tableau. To implement Monte Carlo tree search, each proof state <math display="inline"> i </math> needs to maintain three parameters: its prior probability <math display="inline"> p_i </math>, its total reward <math display="inline"> w_i </math>, and its number of visits <math display="inline"> n_i </math>. If no policy learning is used, the prior probabilities are all equal to one.

A simple heuristic is used to estimate the future reward of leaf states: suppose leaf state <math display="inline"> i </math> has <math display="inline"> G_i </math> open subgoals, the reward is computed as <math display="inline"> 0.95 ^ {G_i} </math>. This will be replaced once value learning is implemented.

The standard UCT formula is chosen to select the next actions in the playouts
\begin{align}
{\frac{w_i}{n_i}} + 2 \cdot p_i \cdot {\sqrt{\frac{\log N}{n_i}}}
\end{align}
where <math display="inline"> N </math> stands for the total number of visits of the parent node.

The bare prover is asked to play <math display="inline"> b </math> playouts of length <math display="inline"> d </math> from the empty tableau. Each playout backpropagates the values of the proof states it visits. After these <math display="inline"> b </math> playouts, a special action (inference) is made, corresponding to an actual move and resulting in a new bigstep tableau. The next <math display="inline"> b </math> playouts start from this tableau, followed by another bigstep, and so on.

== Policy Learning and Guidance ==

From many runs of MCTS, we obtain an estimate of the optimal prior probability of actions (inferences) in particular proof states. We extract the frequency of each action <math display="inline"> a </math> and normalize it by dividing by the average action frequency at that state, resulting in a relative proportion <math display="inline"> r_a \in (0, \infty) </math>. We characterize the proof states for policy learning by extracting human-engineered features. Likewise, we characterize actions by extracting features from the chosen clause and the chosen literal. This yields a feature vector <math display="inline"> (f_s, f_a) </math>.

The feature vector <math display="inline"> (f_s, f_a) </math> is regressed against the associated <math display="inline"> r_a </math>.

During the proof search, the prior probabilities <math display="inline"> p_i </math> of the available actions <math display="inline"> a_i </math> in a state <math display="inline"> s </math> are computed as the softmax of their predictions.

Training examples are extracted only from bigstep states, which keeps the amount of training data manageable.

== Value Learning and Guidance ==

Bigstep states are also used for proof state evaluation. For a proof state <math display="inline"> s </math>, if it corresponds to a successful proof, the value is assigned as <math display="inline"> v_s = 1 </math>. If it corresponds to a failed proof, the value is assigned as <math display="inline"> v_s = 0 </math>. Otherwise, let <math display="inline"> d_s </math> denote the distance between state <math display="inline"> s </math> and a successful state; the value is then assigned as <math display="inline"> v_s = 0.99^{d_s} </math>.

The proof state feature vector <math display="inline"> f_s </math> is regressed against the value <math display="inline"> v_s </math>. During the proof search, the reward of leaf states is computed from this prediction.

== Features and Learners ==
For proof states, features are collected from the whole tableau (subgoals, matrix, and paths). Each unique symbol is represented by an integer, so the tableau can be represented as a sequence of integers. A term walk combines a sequence of integers into a single integer by multiplying the components by a fixed large prime and adding them up. The resulting integer is then reduced to a smaller feature space by taking it modulo a large prime.

For actions the feature extraction process is similar, but the term walk is over the chosen literal and the chosen clause.

In addition to the term walks, they also added several common features: number of goals, total symbol size of all goals, length of active paths, number of current variable instantiations, most common symbols.

The whole project is implemented in OCaml, and XGBoost is ported into OCaml as the learner.

== Experimental Results ==
The authors use the M2k dataset to compare the performance of mlCoP, the bare prover and rlCoP using only UCT.
*Performance without Learning
Table 3 shows the baseline result. The performance of the bare prover is significantly lower than that of mlCoP and of rlCoP without policy/value learning.
[[file:table3.png|550px|center]]
*Reinforcement Learning of Policy Only
In this experiment, the authors evaluated rlCoP with UCT on the dataset using policy learning only. After each iteration, they used the policy training data from the previous iterations to train a new predictor. This means that only the first iteration ran without a policy, while all subsequent iterations used a policy trained on data from previous iterations.
From Table 4, after the fourth iteration rlCoP is better than mlCoP run with the much higher <math>4 \times 10^{6}</math> inference limit.
[[file:table4.png|550px|center]]
*Reinforcement Learning of Value Only
This experiment was similar to the last one, except that only value learning was used rather than a learned policy. From Table 5, the performance of rlCoP after 20 iterations is close to but below that of mlCoP, and far below that of rlCoP using only policy learning.
[[file:table5.png|550px|center]]
*Reinforcement Learning of Policy and Value
From Table 6, after 20 iterations the performance of rlCoP is 19.4% higher than that of mlCoP with <math>4 \times 10^{6}</math> inferences, 13.6% higher than the best iteration of rlCoP with policy only, and 44.3% higher than the best iteration of rlCoP with value only.
[[file:table6.png|550px|center]]
The authors also evaluated the effect of the joint reinforcement learning of both policy and value. Replacing the final policy or value with the best one from the policy-only or value-only runs decreased performance in both cases.

*Evaluation on the Whole Miz40 Dataset
The authors split the Mizar40 dataset into 90% training examples and 10% testing examples. 200,000 inferences are allowed for each problem, and 10 iterations of policy and value learning are performed (based on MCTS). The training and testing results are shown below. In the table, ''mlCoP'' stands for the bare prover with iterative deepening (i.e. a complete automated theorem prover with connection calculi), and ''bare prover'' stands for the prover implemented in this paper, without MCTS guidance.

[[file:atp_result0.jpg|frame|550px|center|Figure 5a. Experimental result on Mizar40 dataset]]
[[file:atp_result1.jpg|frame|550px|center|Figure 5b. More experimental result on Mizar40 dataset]]

As these results show, reinforcement learning leads to a significant performance increase for automated theorem proving. The 42.1% performance improvement is unusually high, since published improvements in this field are typically between 3% and 10%. [1]

It was also found that some test problems that rlCoP solved easily could not be solved by mlCoP.

== Conclusions ==
In this work, the authors developed an automated theorem prover that uses no domain engineering and instead relies on MCT guided by reinforcement learning. The resulting system is more than 40% stronger than the baseline system. The authors believe that this is a landmark in the field of automated reasoning, demonstrating that building general problem solvers by reinforcement learning is a viable approach. [1]

The authors suggest that future research could include strong learning algorithms to characterize mathematical data. The development of suitable deep learning architectures will help the algorithm characterize semantic and syntactic features of mathematical objects, which will be crucial for creating strong assistants for mathematics and the hard sciences.

== Critiques ==
Automated reasoning is still relatively new to the field of machine learning, and this paper shows a lot of promise in this research area.

The feature extraction part of this paper is less than optimal. It is my opinion that with proper neural network architecture, deep learning extracted features will be superior to human-engineered features, which is also shown in [4, 6].

Also, the policy-value learning iteration is quite inefficient. The learning loop is:
* Loop
** Run MCT with the previous model on an entire dataset
** Collect MCT data
** Train a new model
If this were adapted to an online learning scheme, in which the model is updated as soon as MCT generates new data, there might be some performance increase.

The experimental design of this paper has some flaws. The authors compare the performance of ''mlCoP'' and ''rlCoP'' by limiting them to the same number of inference steps. However, every inference step of ''rlCoP'' requires additional machine learning prediction, which costs more time. A better way to compare their performance is to set a time limit.

It would also be interesting to study automated theorem proving in another logic system, such as higher-order logic.

== References ==
[1] C. Kaliszyk, et al. Reinforcement Learning of Theorem Proving. NIPS 2018.

[2] J. Otten and W. Bibel. leanCoP: Lean Connection-Based Theorem Proving. Journal of Symbolic Computation, vol. 36, pp. 139-161, 2003.

[3] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly Efficient Machine Learning Connection Prover. Lecture Notes in Computer Science. vol. 9450. pp. 88-96, 2015.

[4] S. Loos, et al. Deep Network Guided Proof Search. LPAR-21, 2017.

[5] M. Färber, C. Kaliszyk, and J. Urban. Monte Carlo tableau proof search. In L. de Moura, editor, 26th International Conference on Automated Deduction (CADE), volume 10395 of LNCS, pages 563–579. Springer, 2017.

[6] A. Alemi, et al. DeepMath-Deep Sequence Models for Premise Selection. NIPS 2016.

[7] Mizar Math Library. http://mizar.org/library/

[8] J. Urban, G. Sutcliffe, P. Pudlák, and J. Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In A. Armando, P. Baumgartner, and G. Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441–456. Springer, 2008.

[9] J. Otten and W. Bibel. leanCoP: lean connection-based theorem proving. J. Symb. Comput., 36(1-2):139–161, 2003.

[10] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Reasoning, 3(2):153–245, 2010.

Fix your classifier: the marginal value of training the last weight layer

2018-11-30T03:28:43Z

J385chen:

The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.

=Introduction=

Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) being used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.

=Brief Overview=

In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (up to a global scale constant). They argue that, with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix, inference could be made faster as well.

=Previous Work=

Training NN models and using them for inference requires large amounts of memory and computational resources; thus, an extensive amount of research has been done lately to reduce the size of networks, including:

* Weight sharing and sparsification (Han et al., 2015)

* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)

* Low-rank approximations to speed up CNN (Tai et al., 2015)

* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)

Some past works have also shown that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many classification tasks. The authors' proposal in the current paper is the reverse: the representation is learned, while the final transformation is fixed.

=Background=

Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at a later stage of the network, presumably to allow classification based on global features of an image.

== Shortcomings of the Final Classification Layer and its Solution ==

Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).

It has been shown previously that these layers could be easily compressed and reduced after a model was trained by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized by the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), which leads to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works showed that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer was introduced and the objective regarded only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, where they kept the original initialization of the classification layer fixed with no negative impact on performance on the CIFAR-10 dataset.

=Proposed Method=

The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors suggest that the parameters used for the final classification transform are completely redundant and can be replaced with a predetermined linear transform. This holds even for large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).

==Using a Fixed Classifier==

Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math>, where <math>F</math> is assumed to be a deep neural network with input <math>z</math> and parameters <math>\theta</math>, e.g., a convolutional network, trained by backpropagation.

In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math>, where <math>W</math> and <math>b</math> are also trained by back-propagation.

For input <math>x</math> of length <math>N</math>, and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of size <math>N \times C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation

<math>
v_i = \frac{e^{y_i}}{\sum_{j=1}^{C}{e^{y_j}}}, \quad i \in \{1, \ldots, C\}
</math>

and reducing the expected negative log likelihood with respect to the ground-truth target <math>t \in \{1, \ldots, C\}</math>, by minimizing the loss function:

<math>
L(x, t) = -\log v_t = -w_t \cdot x - b_t + \log \left( \sum_{j=1}^{C} e^{w_j \cdot x + b_j} \right)
</math>

where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.
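The equivalence between the two forms of the loss above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' code; the helper names and the toy values of <math>W</math>, <math>b</math> and <math>x</math> are assumptions chosen only for illustration.

```python
import math

def softmax(y):
    """Numerically stable softmax over a list of logits."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W_cols, b, x):
    """y_i = w_i . x + b_i, where W_cols[i] is the i-th column of W."""
    return [sum(wk * xk for wk, xk in zip(w, x)) + bi
            for w, bi in zip(W_cols, b)]

def nll_from_softmax(y, t):
    """L = -log v_t, computed through the softmax output."""
    return -math.log(softmax(y)[t])

def nll_expanded(y, t):
    """L = -y_t + log(sum_j exp(y_j)), the expanded form of the same loss."""
    m = max(y)
    return -y[t] + m + math.log(sum(math.exp(v - m) for v in y))

# Toy example (made-up values): N = 2 features, C = 3 classes.
W_cols = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.1, -0.1]
x = [0.5, -0.2]

y = affine(W_cols, b, x)
loss_a = nll_from_softmax(y, 0)
loss_b = nll_expanded(y, 0)
```

Both functions return the same value up to floating-point error, since <math>y_t = w_t \cdot x + b_t</math>.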

==Choosing the Projection Matrix==

To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math>Q \in \mathbb{R}^{N \times C}</math>, such that <math>\forall i \neq j : q_i \cdot q_j = 0</math> and <math>\| q_i \|_{2} = 1</math>, where <math>q_i</math> is the <math>i</math>-th column of <math>Q</math>. This is ensured by simple random sampling followed by a singular-value decomposition.
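One way to obtain such a matrix is sketched below. Note the substitution: this sketch orthonormalizes random Gaussian vectors with Gram-Schmidt rather than the singular-value decomposition mentioned in the text, purely so that it runs with the standard library alone. The function name and the sizes are hypothetical, and it assumes <math>C \le N</math>.

```python
import math
import random

def random_orthonormal_columns(n, c, seed=0):
    """Return c orthonormal vectors of length n (requires c <= n)."""
    if c > n:
        raise ValueError("cannot have more than n orthonormal vectors in R^n")
    rng = random.Random(seed)
    cols = []
    while len(cols) < c:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the vectors chosen so far
        # (two passes improve numerical orthogonality).
        for _ in range(2):
            for q in cols:
                proj = sum(vk * qk for vk, qk in zip(v, q))
                v = [vk - proj * qk for vk, qk in zip(v, q)]
        norm = math.sqrt(sum(vk * vk for vk in v))
        if norm > 1e-12:  # resample in the unlikely degenerate case
            cols.append([vk / norm for vk in v])
    return cols

# Q[i] plays the role of the column q_i; N = 8 and C = 3 are arbitrary.
Q = random_orthonormal_columns(n=8, c=3)
```

The resulting vectors satisfy <math>q_i \cdot q_j = 0</math> for <math>i \neq j</math> and <math>\| q_i \|_2 = 1</math> up to floating-point error, and they are never updated during training.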

As the rows of the classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, the authors find it beneficial to also restrict the representation <math>x</math> by normalizing it to reside on the <math>n</math>-dimensional sphere:

<center><math>
\hat{x} = \frac{x}{||x||_{2}}
</math></center>

This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, <math>q_i \cdot \hat{x}</math> is now bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive and the network can no longer re-scale its input. The issue can be amended with a fixed scale <math>T</math> applied to the softmax inputs, <math>f(y) = \text{softmax}(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. The authors therefore propose a single learned scalar parameter <math>\alpha</math> for the softmax scale, which effectively functions as the inverse softmax temperature <math>\frac{1}{T}</math>. The normalized weights and an additional scale coefficient are also used, specifically a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b \in \mathbb{R}^{C}</math> is kept, and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:

<center>
<math>
v_i=\frac{e^{\alpha q_i \cdot \hat{x} + b_i}}{\sum_{j=1}^{C} e^{\alpha q_j \cdot \hat{x} + b_j}}, \quad i \in \{1, \ldots, C\} </math>
</center>

and the loss to be minimized is:

<center><math>
L(x, t) = -\alpha q_t \cdot \frac{x}{\|x\|_{2}} - b_t + \log \left( \sum_{i=1}^{C} \exp\left( \alpha q_i \cdot \frac{x}{\|x\|_{2}} + b_i \right) \right)
</math></center>

where <math>x</math> is the final representation obtained by the network for a specific sample, and <math>t \in \{1, \ldots, C\}</math> is the ground-truth label for that sample. The behaviour of the parameter <math>\alpha</math> over time, which is logarithmic in nature and matches the behaviour exhibited by the norm of a learned classifier, is shown in
[[Media: figure1_log_behave.png| Figure 1]].

<center>[[File:figure1_log_behave.png]]</center>
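The effect of the normalization and of the scale <math>\alpha</math> can be seen in a small forward-pass sketch. This is an illustrative plain-Python version under assumed toy values, not the authors' implementation; in the real model <math>\alpha</math> and <math>b</math> are learned while <math>Q</math> stays fixed.

```python
import math

def fixed_classifier_probs(x, Q_cols, b, alpha):
    """Softmax over alpha * (q_i . x_hat) + b_i, with x_hat = x / ||x||_2."""
    norm = math.sqrt(sum(v * v for v in x))
    x_hat = [v / norm for v in x]
    logits = [alpha * sum(q * v for q, v in zip(q_i, x_hat)) + b_i
              for q_i, b_i in zip(Q_cols, b)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup (assumed values): N = 3 features, C = 2 classes,
# and the fixed columns of Q are two standard basis vectors.
Q_cols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [0.0, 0.0]
x = [3.0, 4.0, 0.0]  # x_hat = [0.6, 0.8, 0.0]

p_small = fixed_classifier_probs(x, Q_cols, b, alpha=1.0)
p_large = fixed_classifier_probs(x, Q_cols, b, alpha=10.0)
p_rescaled = fixed_classifier_probs([10 * v for v in x], Q_cols, b, alpha=1.0)
```

With <math>\alpha = 1</math> the two logits are 0.6 and 0.8, so the output stays close to uniform; with <math>\alpha = 10</math> the same representation yields a much more confident prediction. Rescaling <math>x</math> itself changes nothing, which is why the network needs <math>\alpha</math> to control the sharpness of the softmax.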

When <math> -1 \le q_i \cdot \hat{x} \le 1 </math>, a possible cosine angle loss is

<center>[[File:caloss.png]]</center>

However, its final validation accuracy is slightly lower than that of the original models.

==Using a Hadamard Matrix==

To recall, a Hadamard matrix (Hedayat et al., 1978) <math>H</math> is an <math>n \times n</math> matrix in which every entry is either +1 or −1.
Furthermore, <math>H</math> is orthogonal, such that <math>HH^{T} = nI_n</math>, where <math>I_n</math> is the identity matrix. Instead of using the entire Hadamard matrix <math>H</math>, the authors use a truncated version <math>\hat{H} \in \{-1, 1\}^{C \times N}</math>, in which all <math>C</math> rows are orthogonal, as the final classification layer, such that:

<center><math>
y = \hat{H} \hat{x} + b
</math></center>

This usage allows two main benefits:
* A deterministic, low-memory and easily generated matrix that can be used for classification.
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.

Here, <math>n</math> must be a multiple of 4, but the matrix can easily be truncated to fit normally defined networks. Also, as the classifier weights are fixed and need only 1-bit precision, attention can now be focused on the features preceding it.
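The two benefits listed above can be illustrated with a short sketch. It builds the matrix with Sylvester's construction, which is an assumption made here for simplicity: it only covers orders that are powers of two, and the source does not say which construction the authors use. The function names are hypothetical.

```python
def sylvester_hadamard(n):
    """Hadamard matrix of order n (a power of two) via Sylvester's construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("Sylvester's construction needs n to be a power of two")
    H = [[1]]
    while len(H) < n:
        # [[H, H], [H, -H]] doubles the order and keeps rows orthogonal.
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def truncated_hadamard(c, n):
    """First c rows of the order-n matrix: entries are +/-1, rows stay orthogonal."""
    return sylvester_hadamard(n)[:c]

def sign_matvec(H_rows, x):
    """Compute H x using only additions and subtractions."""
    out = []
    for row in H_rows:
        acc = 0.0
        for sign, v in zip(row, x):
            acc = acc + v if sign > 0 else acc - v
        out.append(acc)
    return out

H_hat = truncated_hadamard(c=3, n=4)           # C = 3 classes, N = 4 features
y = sign_matvec(H_hat, [1.0, 2.0, 3.0, 4.0])   # [10.0, -2.0, -4.0]
```

The matrix is regenerated deterministically from <math>C</math> and <math>N</math>, so it never needs to be stored, and the product requires no multiplications.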

=Experimental Results=

The authors have evaluated their proposed model on the following datasets:

==CIFAR-10/100==

===About the Dataset===

CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of the same size, but contains 100 different classes.

===Training Details===

The authors trained a residual network (He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation, is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].

<center>[[File: figure1_resnet_cifar10.png]]</center>

These results demonstrate that although the training error is considerably lower for the network with a learned classifier, both models achieve the same classification accuracy on the validation set. The authors' conjecture is that with the new fixed parameterization, the network can no longer increase the norm of a given sample’s representation, so learning its label requires more effort. As this may happen only for specific seen samples, it affects only the training error.

The authors also compared using a fixed scale variable <math>\alpha</math> at different values against the learned parameter. Results for <math>\alpha \in \{0.1, 1, 10\}</math> are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error. As can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha = 1</math> or <math>10</math> will suffice), at the expense of another hyper-parameter to search over. In all further experiments the scaling parameter <math>\alpha</math> was regularized with the same weight decay coefficient used on the original classifier. Although learning the scale is not necessary, it helps convergence during training.

<center>[[File: figure3_alpha_resnet_cifar.png]]</center>

The authors then trained the model on the CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with a depth of 100 layers and <math>k = 12</math>. The higher number of classes caused the number of classifier parameters to grow to about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained as good as that of the original model, and the same training curve was observed as earlier.

==IMAGENET==

===About the Dataset===

The Imagenet dataset introduced by Deng et al. (2009) spans over 1000 visual classes, and over 1.2 million samples. This is supposedly a more challenging dataset to work on as compared to CIFAR-10/100.

===Experiment Details===

The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and the Densenet169 model (Huang et al., 2017), as described in the original works. Using a fixed classifier removed approximately 2 million parameters from the models, accounting for about 8% and 12% of the model parameters respectively. The experiments revealed similar trends to those observed on CIFAR-10.

For a stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed for low-memory and limited-compute platforms and whose classifier parameters make up the majority of the model. They were able to reduce the model to 0.86 million parameters, compared with 0.96 million parameters in the final layer alone of the original model. Again, the proposed modification gave similar convergence results on validation accuracy. Interestingly, this method allowed Imagenet training in an under-specified regime, where there are more training samples than parameters. This is an unconventional regime for modern deep networks, which are usually over-specified to have many more parameters than training samples (Zhang et al., 2017a).

The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].

<center>[[File: table1_fixed_results.png]]</center>

==Language Modelling==

Recent works have empirically found that using the same weights for both the word embedding and the classifier can yield equal or better results than using a separate pair of weights. The authors therefore experimented with fixed classifiers on language modelling, as it also requires classification over all possible tokens in the task vocabulary. They trained a recurrent model with two LSTM layers (Hochreiter & Schmidhuber, 1997) and an embedding and hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using the same settings as in Merity et al. (2017). The WikiText2 dataset contains about 33K different words, so the number of parameters expected in the embedding and classifier layers was about 34 million. This is about 89% of the 38 million parameters used for the whole model. However, using a random orthogonal transform yielded poor results compared to a learned embedding. This was suspected to be due to the semantic relationships captured in the embedding layer of language models, which is not the case in image classification tasks. The intuition was further supported by the much better results obtained when pre-trained embeddings were used, either from the word2vec algorithm by Mikolov et al. (2013) or from PMI factorization as suggested by Levy & Goldberg (2014).

<center>[[File: language.png]]</center>
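As a rough check of the parameter counts quoted for the language model, take "about 33K" as a vocabulary of 33,000 words (an approximation made here) and count one embedding matrix plus one classifier matrix, each of size vocabulary × 512:

<center><math>
2 \times 33{,}000 \times 512 = 33{,}792{,}000 \approx 34 \text{ million}, \qquad \frac{34}{38} \approx 0.89
</math></center>

This agrees with the roughly 34 million parameters and 89% share stated in the summary.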

=Discussion=

==Implications and Use Cases==

With the increasing number of classes in the benchmark datasets, computational demands for the final classifier will increase as well. To understand the problem better, the authors consider the work by Sun et al. (2017), which introduced JFT-300M, an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016) with a 2048-sized representation led to a model with over 36M parameters, meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty of distributing so many parameters over the training servers, which involves a non-trivial overhead during synchronization of the model for each update. The authors claim that the fixed classifier would help considerably in this kind of scenario, since it removes the need for any gradient synchronization for the final layer. Furthermore, using a Hadamard matrix removes the need to save the transformation altogether, making it more efficient and allowing considerable memory and computational savings.
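The proportions quoted above can be reproduced with simple arithmetic. The sketch below rests on two assumptions that are not from the source: the commonly quoted total of 25,557,032 parameters for a standard 1000-class ResNet50, and 18,000 as a round lower bound for "over 18K" classes. The function name is hypothetical.

```python
def affine_param_count(n_features, n_classes, bias=True):
    """Parameters in an affine layer mapping n_features inputs to n_classes outputs."""
    return n_features * n_classes + (n_classes if bias else 0)

# ASSUMPTION (not from the source): commonly quoted size of a 1000-class ResNet50.
RESNET50_TOTAL = 25_557_032
N_FEATURES = 2048  # representation size stated in the text

# ImageNet: 1000 classes on top of the 2048-dimensional representation.
imagenet_head = affine_param_count(N_FEATURES, 1000)   # 2,049,000
backbone = RESNET50_TOTAL - imagenet_head
imagenet_fraction = imagenet_head / RESNET50_TOTAL

# JFT-300M: "over 18K" classes, with 18,000 used as a round lower bound.
jft_head = affine_param_count(N_FEATURES, 18_000)      # 36,882,000
jft_fraction = jft_head / (backbone + jft_head)
```

Under these assumptions the ImageNet classifier is roughly 8% of the model and the JFT-300M classifier roughly 61%, in line with the figures quoted in the text. The classifier grows linearly with the number of classes while the rest of the network stays the same size.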

==Possible Caveats==

The good performance of fixed classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when the ratio between the number of learned features and the number of classes is small, that is, when <math> C > N</math>. However, the authors tested their method in such cases and the model still performed well.
Another factor that can affect the performance of a model using a fixed classifier is highly correlated classes. In that case, the fixed classifier cannot support correlated classes, and the network could have some difficulty learning. In a language model, word classes tend to have highly correlated instances, which also makes learning difficult.

Also, the proposed approach only eliminates the computation of the classifier weights, so when there are few classes, the computational savings will not be readily apparent.

==Future Work==

The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there would be no need for a per-sample normalization.
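The constant-norm claim follows in one line. For a hidden vector <math>x \in \{-1, +1\}^{N}</math>:

<center><math>
\|x\|_2 = \sqrt{\sum_{k=1}^{N} x_k^2} = \sqrt{\sum_{k=1}^{N} 1} = \sqrt{N}
</math></center>

so that <math>\alpha \, q_i \cdot \hat{x} = \frac{\alpha}{\sqrt{N}} \, q_i \cdot x</math>, and the factor <math>\frac{1}{\sqrt{N}}</math> is the same for every sample and can be folded into the learned scale.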

Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.

A related paper claims that fixing most of the parameters of a neural network achieves results comparable to learning all of them (A. Rosenfeld and J. K. Tsotsos, 2018).

=Conclusion=

In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change leads to little or almost no decline in the performance of the architecture. Furthermore, using a Hadamard matrix as the classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on a large number of transformation coefficients.

Another possible direction for future research is to find new efficient methods to create pre-defined word embeddings, which require a huge number of parameters that could possibly be avoided when learning a new task. More emphasis should therefore be given to the representations learned by the non-linear parts of neural networks, up to the final classifier, as the classifier itself seems highly redundant.

=Critique=

The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).

=References=

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6(6):1184–1238, 1978.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.

Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.

Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017, pp. 157, 2017.

Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.

Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.

Fix your classifier: the marginal value of training the last weight layer

2018-11-30T03:24:16Z

J385chen:

The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.

=Introduction=

Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) being used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.

=Brief Overview=

In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (up to a global scale constant). They argue that, with little or no loss of accuracy on most classification tasks, the method provides significant memory and computational benefits. In addition, they show that initializing the classifier with a Hadamard matrix can make inference faster as well.

=Previous Work=

Training NN models and using them for inference requires large amounts of memory and computational resources; thus, an extensive amount of research has been done lately to reduce the size of networks, including:

* Weight sharing and sparsification (Han et al., 2015)

* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)

* Low-rank approximations to speed up CNN (Tai et al., 2015)

* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016; Zhou et al., 2016)

Some past works have also shown that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many classification tasks. The authors' proposal in the current paper is the reverse: the representation is learned, while the final affine transformation is kept fixed.

=Background=

Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of parameterized convolutional layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier CNN architectures (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at a later stage of the network, presumably to allow classification based on global features of an image.

== Shortcomings of the Final Classification Layer and its Solution ==

Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).

It has been shown previously that these layers can be easily compressed and reduced after a model is trained, by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized by the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), which leads to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works have shown that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer is introduced and the objective concerns only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant in which the original initialization of the classification layer is kept fixed, with no negative impact on performance on the CIFAR-10 dataset.

=Proposed Method=

The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors suggest that the parameters used for the final classification transform are completely redundant and can be replaced with a predetermined linear transform. This holds even for large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).

==Using a Fixed Classifier==

Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math>, where <math>F</math> is assumed to be a deep neural network with input <math>z</math> and parameters <math>\theta</math>, e.g., a convolutional network, trained by backpropagation.

In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math>, where <math>W</math> and <math>b</math> are also trained by backpropagation.

For an input <math>x</math> of length <math>N</math> and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of size <math>N \times C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation

<math>
v_i = \frac{e^{y_i}}{\sum_{j=1}^{C}{e^{y_j}}}, \quad i \in \{1, \ldots, C\}
</math>

and reducing the expected negative log likelihood with respect to the ground-truth target <math>t \in \{1, \ldots, C\}</math>,
by minimizing the loss function:

<math>
L(x, t) = -\log v_t = -w_t \cdot x - b_t + \log \left( \sum_{j=1}^{C} e^{w_j \cdot x + b_j} \right)
</math>

where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.
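
The two forms of the loss above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code; the function name <code>softmax_cross_entropy</code>, the toy sizes and the random inputs are assumptions made for illustration. It evaluates the expanded form with a shifted log-sum-exp and compares it with <math>-\log v_t</math> computed directly from the softmax.

```python
import numpy as np

def softmax_cross_entropy(x, W, b, t):
    """Loss -log v_t for y = W^T x + b, in the expanded form
    -w_t.x - b_t + log(sum_j exp(w_j.x + b_j))."""
    y = W.T @ x + b                      # per-class scores, shape (C,)
    y_max = y.max()                      # shift for numerical stability
    log_sum_exp = y_max + np.log(np.exp(y - y_max).sum())
    return -y[t] + log_sum_exp

rng = np.random.default_rng(0)
N, C = 8, 5                              # toy sizes, chosen arbitrarily
x = rng.normal(size=N)
W = rng.normal(size=(N, C))              # columns w_i, one per class
b = rng.normal(size=C)
t = 2                                    # hypothetical ground-truth class

# Direct computation through the softmax probabilities v_i.
y = W.T @ x + b
v = np.exp(y) / np.exp(y).sum()
direct = -np.log(v[t])
expanded = softmax_cross_entropy(x, W, b, t)
```

Both values agree up to floating-point error; the shifted version is the one usually preferred in practice because it avoids overflow for large scores.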

==Choosing the Projection Matrix==

To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math>Q \in \mathbb{R}^{N \times C}</math>, such that <math>\forall i \neq j : q_i \cdot q_j = 0</math> and <math>\| q_i \|_{2} = 1</math>, where <math>q_i</math> is the <math>i</math>-th column of <math>Q</math>. This is ensured by simple random sampling followed by a singular-value decomposition.
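
One way to obtain such a matrix is sketched below, assuming NumPy and assuming <math>C \le N</math> (otherwise <math>C</math> mutually orthogonal columns cannot exist in <math>\mathbb{R}^N</math>). The function name <code>random_orthonormal_projection</code> and the Gaussian sampling are illustrative choices, not necessarily what the authors' code does.

```python
import numpy as np

def random_orthonormal_projection(N, C, seed=0):
    """Return Q of shape (N, C) with orthonormal columns (requires C <= N)."""
    if C > N:
        raise ValueError("at most N mutually orthogonal vectors exist in R^N")
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(N, C))          # random Gaussian matrix
    # Thin SVD: U is (N, C) with orthonormal columns, Vt is (C, C) orthogonal.
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt                        # singular values replaced by ones

Q = random_orthonormal_projection(N=64, C=10)
gram = Q.T @ Q                           # Gram matrix of the columns q_i
```

The Gram matrix <code>gram</code> equals the <math>C \times C</math> identity up to rounding, which is exactly the pair of conditions <math>q_i \cdot q_j = 0</math> and <math>\| q_i \|_2 = 1</math>.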

As the columns <math>q_i</math> of the classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, the authors find it beneficial
to also restrict the representation <math>x</math> by normalizing it to reside on the unit sphere in <math>\mathbb{R}^{N}</math>:

<center><math>
\hat{x} = \frac{x}{||x||_{2}}
</math></center>

This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, <math>q_i \cdot \hat{x}</math> is now bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive and the network can no longer re-scale its input. The issue can be amended with a fixed scale <math>T</math> applied to the softmax inputs, <math>f(y) = \text{softmax}(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. Instead, the authors introduce a single learned scalar parameter <math>\alpha</math> for the softmax scale, which effectively functions as the inverse of the softmax temperature, <math>\frac{1}{T}</math>. Normalized weights are thus combined with an additional scale coefficient, with a single scale shared by all entries in the weight matrix. The additional vector of bias parameters <math>b \in \mathbb{R}^{C}</math> is kept unchanged, and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:

<center>
<math>
v_i=\frac{e^{\alpha q_i \cdot \hat{x} + b_i}}{\sum_{j=1}^{C} e^{\alpha q_j \cdot \hat{x} + b_j}}, \quad i \in \{1, \ldots, C\}
</math>
</center>

and the loss to be minimized is:

<center><math>
L(x, t) = -\alpha q_t \cdot \frac{x}{\|x\|_{2}} - b_t + \log \left( \sum_{i=1}^{C} \exp \left( \alpha q_i \cdot \frac{x}{\|x\|_{2}} + b_i \right) \right)
</math></center>

where <math>x</math> is the final representation obtained by the network for a specific sample, and <math>t \in \{1, \ldots, C\}</math> is the ground-truth label for that sample. The behaviour of the parameter <math>\alpha</math> over time, which is logarithmic in nature and matches the behaviour exhibited by the norm of a learned classifier, is shown in
[[Media: figure1_log_behave.png| Figure 1]].

<center>[[File:figure1_log_behave.png]]</center>
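
The role of the scale <math>\alpha</math> can be illustrated with a small NumPy sketch of the fixed-classifier loss. This is an illustration under stated assumptions rather than the authors' implementation: the function name <code>fixed_classifier_loss</code> is hypothetical, <math>Q</math> is taken to be the first <math>C</math> columns of the identity (the simplest orthonormal choice), the bias is zero, and <math>\alpha</math> is passed in as a constant instead of being learned.

```python
import numpy as np

def fixed_classifier_loss(x, Q, b, alpha, t):
    """-log v_t with logits alpha * q_i . x_hat + b_i, where x_hat = x / ||x||_2."""
    x_hat = x / np.linalg.norm(x)        # normalize the representation
    y = alpha * (Q.T @ x_hat) + b        # cosine similarities scaled by alpha
    y_max = y.max()                      # shift for numerical stability
    log_sum_exp = y_max + np.log(np.exp(y - y_max).sum())
    return -y[t] + log_sum_exp

N, C = 64, 10
Q = np.eye(N)[:, :C]                     # orthonormal columns, for illustration only
b = np.zeros(C)
x = 3.7 * Q[:, 0]                        # representation exactly aligned with class 0

loss_unscaled = fixed_classifier_loss(x, Q, b, alpha=1.0, t=0)
loss_scaled = fixed_classifier_loss(x, Q, b, alpha=10.0, t=0)
loss_rescaled_input = fixed_classifier_loss(100.0 * x, Q, b, alpha=1.0, t=0)
```

For this sample, which is aligned exactly with its class vector, the loss with <math>\alpha = 1</math> is <math>\log(e + 9) - 1 \approx 1.46</math>, and multiplying <math>x</math> by any positive constant does not change it. With <math>\alpha = 10</math> the loss for the same sample is close to zero, which is the effect the scale parameter is meant to provide.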

When <math>-1 \le q_i \cdot \hat{x} \le 1</math>, a possible cosine angle loss is

<center>[[File:caloss.png]]</center>

However, its final validation accuracy is slightly lower than that of the original models.

==Using a Hadamard Matrix==

To recall, a Hadamard matrix (Hedayat et al., 1978) <math>H</math> is an <math>n \times n</math> matrix whose entries are all either +1 or −1.
Furthermore, <math>H</math> is orthogonal, such that <math>HH^{T} = nI_n</math>, where <math>I_n</math> is the identity matrix. Instead of using the entire Hadamard matrix <math>H</math>, the authors use a truncated version <math>\hat{H} \in \{-1, 1\}^{C \times N}</math>, in which all <math>C</math> rows are orthogonal, as the final classification layer, such that:

<center><math>
y = \hat{H} \hat{x} + b
</math></center>

This usage allows two main benefits:
* A deterministic, low-memory and easily generated matrix that can be used for classification.
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.

Here, <math>n</math> must be a multiple of 4, but it can be easily truncated to fit normally defined networks. Also, as the classifier weights are fixed to need only 1-bit precision, it is now possible to focus our attention on the features preceding it.
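
A minimal NumPy sketch of the idea follows. It uses Sylvester's construction, which only covers orders that are powers of two (a subset of the admissible orders), and the names <code>sylvester_hadamard</code> and <code>truncated_hadamard_logits</code> are hypothetical. The second function computes <math>y = \hat{H}\hat{x} + b</math> using only sign flips and additions.

```python
import numpy as np

def sylvester_hadamard(n):
    """Hadamard matrix of order n (n a power of two) via Sylvester's construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("Sylvester's construction needs n to be a power of two")
    H = np.array([[1]], dtype=np.int64)
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])  # doubles the order, keeps rows orthogonal
    return H

def truncated_hadamard_logits(x_hat, C, b):
    """Compute y = H_hat x_hat + b, where H_hat is the first C rows of H."""
    N = x_hat.shape[0]
    H_hat = sylvester_hadamard(N)[:C]    # shape (C, N), entries in {-1, +1}
    # Each logit is a signed sum of the features: no multiplications are needed.
    y = np.where(H_hat > 0, x_hat, -x_hat).sum(axis=1)
    return y + b

N, C = 16, 10
H = sylvester_hadamard(N)
x_hat = np.arange(1.0, N + 1)
x_hat /= np.linalg.norm(x_hat)           # unit-norm representation
y = truncated_hadamard_logits(x_hat, C, np.zeros(C))
```

Note that the rows of <math>\hat{H}</math> have norm <math>\sqrt{N}</math> rather than 1; in this sketch that constant is left in the logits, and it could equally be folded into the scale <math>\alpha</math>.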

=Experimental Results=

The authors have evaluated their proposed model on the following datasets:

==CIFAR-10/100==

===About the Dataset===

CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of same size, but contains 100 different classes.

===Training Details===

The authors trained a residual network (He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation, is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].

<center>[[File: figure1_resnet_cifar10.png]]</center>

These results demonstrate that although the training error is considerably lower for the network with learned classifier, both models achieve the same classification accuracy on the validation set. The authors' conjecture is that with the new fixed parameterization, the network can no longer increase the norm of a given sample’s representation - thus learning its label requires more effort. As this may happen for specific seen samples - it affects only training error.

The authors also compared using a fixed scale variable <math>\alpha</math> at different values against the learned parameter. Results for <math>\alpha \in \{0.1, 1, 10\}</math> are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error. As can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha = 1</math> or <math>\alpha = 10</math> will suffice), at the expense of another hyper-parameter to tune. In all further experiments the scaling parameter <math>\alpha</math> was regularized with the same weight decay coefficient used on the original classifier. Although learning the scale is not necessary, it helps convergence during training.

<center>[[File: figure3_alpha_resnet_cifar.png]]</center>

The authors then trained the model on the CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with a depth of 100 layers and k = 12. The higher number of classes caused the number of classifier parameters to grow to about 4% of the whole model. However, the validation accuracy of the fixed-classifier model remained as good as that of the original model, and the same training curve was observed as earlier.

==IMAGENET==

===About the Dataset===

The Imagenet dataset introduced by Deng et al. (2009) spans 1000 visual classes and over 1.2 million samples. It is a more challenging dataset to work on than CIFAR-10/100.

===Experiment Details===

The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and the Densenet169 model (Huang et al., 2017) as described in the original work. Using a fixed classifier removed approximately 2 million parameters from each model, accounting for about 8% and 12% of the model parameters respectively. The experiments revealed similar trends to those observed on CIFAR-10.

For a stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed for low-memory and limited-compute platforms and in which the final-layer parameters make up the majority of the model. With a fixed classifier, the number of trainable parameters was reduced to 0.86 million, compared with 0.96 million parameters in the final layer of the original model alone. Again, the proposed modification gave similar convergence results on validation accuracy.

The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].

<center>[[File: table1_fixed_results.png]]</center>

==Language Modelling==

Recent works have empirically found that using the same weights for both the word embedding and the classifier can yield equal or better results than using a separate pair of weights. The authors therefore experimented with fixed classifiers on language modelling, as it also requires classification over all possible tokens in the task vocabulary. They trained a recurrent model with two LSTM layers (Hochreiter & Schmidhuber, 1997) and an embedding and hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using the same settings as in Merity et al. (2017). The WikiText2 dataset contains about 33K different words, so the number of parameters expected in the embedding and classifier layers was about 34 million. This is about 89% of the 38 million parameters of the whole model. However, using a random orthogonal transform yielded poor results compared to a learned embedding. This was suspected to be due to semantic relationships captured in the embedding layer of language models, which is not the case in image classification tasks. The intuition was further supported by the much better results obtained when pre-trained embeddings were used, either from the word2vec algorithm of Mikolov et al. (2013) or from PMI factorization as suggested by Levy & Goldberg (2014).

<center>[[File: language.png]]</center>
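
The parameter counts quoted above can be reproduced with simple arithmetic. The sketch below assumes a vocabulary of exactly 33,000 words (the text only says "about 33K"), ignores bias terms, and counts the embedding and the classifier as two separate matrices of the same shape.

```python
vocab_size = 33_000          # "about 33K different words" (rounded, an assumption)
hidden_size = 512            # embedding and hidden size
total_params = 38_000_000    # whole model, as quoted above

embedding_params = vocab_size * hidden_size   # input embedding matrix
classifier_params = vocab_size * hidden_size  # output projection (bias ignored)
embedding_and_classifier = embedding_params + classifier_params
share = embedding_and_classifier / total_params
```

Under these assumptions the two layers hold roughly 33.8 million parameters, about 89% of the model, which matches the figures in the text.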

=Discussion=

==Implications and Use Cases==

With the increasing number of classes in benchmark datasets, the computational demands of the final classifier will increase as well. To understand the problem better, the authors consider the work of Sun et al. (2017), which introduced JFT-300M, an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016) with a 2048-sized representation led to a final classification layer with over 36M parameters, meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty of distributing so many parameters over the training servers, which involves a non-trivial overhead during synchronization of the model for each update. The authors claim that a fixed classifier would help considerably in this kind of scenario, since it removes the need for any gradient synchronization for the final layer. Furthermore, using a Hadamard matrix removes the need to store the transformation altogether, making the approach more efficient and allowing considerable memory and computational savings.
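
As a rough check of these numbers, the size of an affine classifier can be computed directly from the representation width and the number of classes. In the sketch below the helper <code>classifier_params</code> is hypothetical, 18,000 stands in for "over 18K" classes, and the size of the non-classifier part of Resnet50 (about 23.5M parameters) is an assumption that is not stated in this article.

```python
def classifier_params(num_features, num_classes, bias=True):
    """Parameters of an affine classifier y = W^T x + b with W of shape N x C."""
    return num_features * num_classes + (num_classes if bias else 0)

imagenet = classifier_params(2048, 1_000)     # Resnet50 classifier on Imagenet
jft = classifier_params(2048, 18_000)         # same representation, ~18K classes

# Assumption: the rest of Resnet50 has about 23.5M parameters
# (a commonly quoted figure, not taken from the article).
backbone = 23_500_000
jft_share = jft / (jft + backbone)
```

This gives about 2.05M classifier parameters for Imagenet and about 36.9M for 18K classes, so under the stated assumption roughly 61% of the parameters sit in the final layer.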

==Possible Caveats==

The good performance of fixed classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when the ratio between the number of learned features and the number of classes is small, that is, when <math>C > N</math>. However, the authors tested their method in such cases and the model still performed well.
Another factor that can affect the performance of a fixed classifier is high correlation between classes. In that case, the fixed classifier cannot support correlated classes, and the network may have some difficulty learning. For a language model, word classes tend to have highly correlated instances, which also leads to a difficult learning process.

Also, this proposed approach will only eliminate the computation of the classifier weights, so when the classes are fewer, the computation saving effect will not be readily apparent.

==Future Work==

The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagation. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there would be no need for per-sample normalization.
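
To spell out this step, assume a last hidden layer of width <math>N</math> whose activations satisfy <math>x \in \{-1, +1\}^{N}</math> (the symbol <math>N</math> follows the notation used earlier for the representation length). Then

<center><math>
\|x\|_{2} = \sqrt{\sum_{i=1}^{N} x_i^{2}} = \sqrt{N}, \qquad \alpha\, q_i \cdot \frac{x}{\|x\|_{2}} = \frac{\alpha}{\sqrt{N}}\, q_i \cdot x ,
</math></center>

so the normalization reduces to a fixed rescaling of <math>\alpha</math> that is the same for every sample.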

Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.

A related paper claims that fixing most of the parameters of a neural network achieves results comparable to learning all of them (Rosenfeld & Tsotsos, 2018).

=Conclusion=

In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change leads to little or almost no decline in the performance of the architecture. Furthermore, using a Hadamard matrix as the classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on a large number of transformation coefficients.

Another possible direction for future research is to find new efficient methods to create pre-defined word embeddings, which require a huge number of parameters that can possibly be avoided when learning a new task. Therefore, more emphasis should be given to the representations learned by the non-linear parts of neural networks, up to the final classifier, as the classifier itself seems highly redundant.

=Critique=

The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).

=References=

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6(6):1184–1238, 1978.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.

Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.

Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017, pp. 157, 2017.

Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.

Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.

Fix your classifier: the marginal value of training the last weight layer

2018-11-30T03:23:25Z

J385chen:

The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.

=Introduction=

Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) being used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.

=Brief Overview=

In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (up to a global scale constant). They argue that, with little or no loss of accuracy on most classification tasks, the method provides significant memory and computational benefits. In addition, they show that initializing the classifier with a Hadamard matrix can make inference faster as well.

=Previous Work=

Training NN models and using them for inference requires large amounts of memory and computational resources; thus, an extensive amount of research has been done lately to reduce the size of networks, including:

* Weight sharing and sparsification (Han et al., 2015)

* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)

* Low-rank approximations to speed up CNN (Tai et al., 2015)

* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016; Zhou et al., 2016)

Some past works have also shown that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many classification tasks. The authors' proposal in the current paper is the reverse: the representation is learned, while the final affine transformation is kept fixed.

=Background=

Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of parameterized convolutional layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier CNN architectures (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at a later stage of the network, presumably to allow classification based on global features of an image.

== Shortcomings of the Final Classification Layer and its Solution ==

Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).

It has been shown previously that these layers can be easily compressed and reduced after a model is trained, by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architectures are characterized by the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), which leads to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works have shown that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer is introduced and the objective considers only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant in which the classification layer was kept fixed at its original initialization, with no negative impact on performance on the CIFAR-10 dataset.

=Proposed Method=

The aforementioned works provide evidence that fully connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors suggest that the parameters used for the final classification transform are completely redundant and can be replaced with a predetermined linear transform. This holds even for large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).

==Using a Fixed Classifier==

Suppose the final representation obtained by the network (the last hidden layer) is <math>x = F(z;\theta)</math>, where <math>F</math> is assumed to be a deep neural network with input <math>z</math> and parameters <math>\theta</math>, e.g., a convolutional network, trained by backpropagation.

In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math>, where <math>W</math> and <math>b</math> are also trained by backpropagation.

For an input <math>x</math> of length <math>N</math> and <math>C</math> possible outputs, <math>W</math> is required to be an <math>N \times C</math> matrix. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation

<math>
v_i = \frac{e^{y_i}}{\sum_{j=1}^{C} e^{y_j}}, \quad i \in \{1, \dots, C\}
</math>

and reducing the expected negative log-likelihood with respect to the ground-truth target <math>t \in \{1, \dots, C\}</math> by minimizing the loss function:

<math>
L(x, t) = -\log v_t = -w_t \cdot x - b_t + \log \left( \sum_{j=1}^{C} e^{w_j \cdot x + b_j} \right)
</math>

where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.
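
As a minimal sketch of the loss above (not the authors' implementation), the following NumPy snippet evaluates <math>L(x, t)</math> for a single sample. The sizes, the random values, and the function name <code>softmax_cross_entropy</code> are illustrative assumptions; labels are 0-based here, unlike the 1-based indices in the text.

```python
import numpy as np

def softmax_cross_entropy(x, W, b, t):
    """Loss L(x, t) for one sample with a learned affine classifier.

    x: (N,) final representation, W: (N, C) weights, b: (C,) biases,
    t: integer class label in [0, C).
    """
    y = W.T @ x + b                               # logits y = W^T x + b
    m = y.max()                                   # subtract the max for numerical stability
    log_sum_exp = m + np.log(np.exp(y - m).sum())
    return -y[t] + log_sum_exp                    # -w_t.x - b_t + log sum_j exp(w_j.x + b_j)

# Illustrative sizes and random values (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
N, C = 8, 3
x = rng.normal(size=N)
W = rng.normal(size=(N, C))
b = rng.normal(size=C)
loss = softmax_cross_entropy(x, W, b, t=1)
```

Here <code>W</code> holds <math>N \cdot C</math> trainable values, which is the parameter count that grows linearly with the number of classes.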

==Choosing the Projection Matrix==

To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math>Q \in \mathbb{R}^{N \times C}</math> such that <math>\forall i \neq j : q_i \cdot q_j = 0</math> and <math>\| q_i \|_{2} = 1</math>, where <math>q_i</math> is the <math>i</math>-th column of <math>Q</math>. This is ensured by simple random sampling followed by singular-value decomposition.
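
One way to obtain such a matrix by random sampling and singular-value decomposition is sketched below. This is an assumption about the construction, not code from the paper: the Gaussian sampling, the function name <code>random_orthonormal</code>, and the sizes are hypothetical, and the construction requires <math>C \le N</math>.

```python
import numpy as np

def random_orthonormal(N, C, seed=0):
    """Return Q of shape (N, C) whose columns are orthonormal (needs C <= N)."""
    if C > N:
        raise ValueError("at most N orthonormal vectors exist in R^N")
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(N, C))                        # random sample
    U, _, Vt = np.linalg.svd(A, full_matrices=False)   # U: (N, C), Vt: (C, C)
    return U @ Vt                                      # singular values replaced by ones

Q = random_orthonormal(N=64, C=10)
gram = Q.T @ Q   # should be the C x C identity
```

Dropping the singular values keeps the column space of the random sample while forcing every column to unit norm and every pair of columns to be orthogonal.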

As the class vectors <math>q_i</math> of the classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, the authors find it beneficial to also restrict the representation <math>x</math> by normalizing it to reside on the unit sphere in <math>\mathbb{R}^N</math>:

<center><math>
\hat{x} = \frac{x}{||x||_{2}}
</math></center>

This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, <math>q_i \cdot \hat{x}</math> is now bounded between −1 and 1. This causes convergence issues, because the softmax function is scale sensitive and the network can no longer re-scale its input. The issue can be amended with a fixed scale <math>T</math> applied to the softmax inputs, <math>f(y) = \text{softmax}(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. The authors therefore introduce a single learned scalar parameter <math>\alpha</math> for the softmax scale, which effectively functions as the inverse temperature <math>\frac{1}{T}</math>. The classifier thus combines normalized weights with an additional scale coefficient, using a single scale for all entries in the weight matrix. The vector of bias parameters <math>b \in \mathbb{R}^{C}</math> is kept as before, and the model is trained using the traditional negative log-likelihood criterion. Explicitly, the classifier output is now:

<center>
<math>
v_i=\frac{e^{\alpha q_i \cdot \hat{x} + b_i}}{\sum_{j=1}^{C} e^{\alpha q_j \cdot \hat{x} + b_j}}, \quad i \in \{1, \dots, C\}
</math>
</center>

and the loss to be minimized is:

<center><math>
L(x, t) = -\alpha q_t \cdot \frac{x}{\|x\|_{2}} - b_t + \log \left( \sum_{i=1}^{C} \exp\left( \alpha q_i \cdot \frac{x}{\|x\|_{2}} + b_i \right) \right)
</math></center>

where <math>x</math> is the final representation obtained by the network for a specific sample, and <math>t \in \{1, \dots, C\}</math> is the ground-truth label for that sample. The behaviour of the parameter <math>\alpha</math> over time, which is logarithmic in nature and matches the behaviour exhibited by the norm of a learned classifier, is shown in
[[Media: figure1_log_behave.png| Figure 1]].

<center>[[File:figure1_log_behave.png]]</center>
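
The fixed-classifier loss can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: <code>Q</code> is built with a QR factorization as a stand-in for the fixed orthonormal projection, <math>\alpha</math> is passed as a plain number instead of a trained parameter, and all sizes and values are hypothetical.

```python
import numpy as np

def fixed_classifier_loss(x, Q, b, alpha, t):
    """-log v_t for a fixed projection Q (N, C), bias b (C,), and scale alpha."""
    x_hat = x / np.linalg.norm(x)           # project the representation onto the unit sphere
    y = alpha * (Q.T @ x_hat) + b           # scaled cosine similarities plus bias
    m = y.max()                             # stabilise the log-sum-exp
    return -y[t] + m + np.log(np.exp(y - m).sum())

rng = np.random.default_rng(0)
N, C = 16, 4
Q, _ = np.linalg.qr(rng.normal(size=(N, C)))   # orthonormal columns (stand-in for the fixed Q)
b = np.zeros(C)
x = rng.normal(size=N)

loss_a = fixed_classifier_loss(x, Q, b, alpha=5.0, t=2)
loss_b = fixed_classifier_loss(10.0 * x, Q, b, alpha=5.0, t=2)   # same direction, larger norm
```

Because <code>x</code> is normalized, <code>loss_a</code> and <code>loss_b</code> coincide: the network cannot lower the loss by inflating the norm of a representation, so <math>\alpha</math> is the only way to change the scale of the softmax inputs.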

Since <math>-1 \le q_i \cdot \hat{x} \le 1</math>, a possible cosine angle loss is

<center>[[File:caloss.png]]</center>

However, its final validation accuracy is slightly lower than that of the original models.

==Using a Hadamard Matrix==

To recall, a Hadamard matrix (Hedayat et al., 1978) <math>H</math> is an <math>n \times n</math> matrix in which all entries are either +1 or −1.
Furthermore, the rows of <math>H</math> are mutually orthogonal, so that <math>HH^{T} = nI_n</math>, where <math>I_n</math> is the identity matrix. Instead of the entire Hadamard matrix <math>H</math>, a truncated version <math>\hat{H} \in \{-1, 1\}^{C \times N}</math>, in which all <math>C</math> rows are orthogonal, is used as the final classification layer:

<center><math>
y = \hat{H} \hat{x} + b
</math></center>

This offers two main benefits:
* A deterministic, low-memory and easily generated matrix that can be used for classification.
* Removal of the need to perform a full matrix-matrix multiplication, as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.

Here, <math>n</math> must be a multiple of 4, but the matrix can easily be truncated to fit normally defined networks. Also, as the classifier weights are fixed and need only 1-bit precision, attention can be focused on the features preceding it.
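
A minimal sketch of a truncated Hadamard classifier is given below. It uses Sylvester's construction, which is a swapped-in choice that only covers orders that are powers of two (the summary does not say which construction the authors use), and it assumes <math>C \le N</math> with <math>N</math> a power of two so that keeping the first <math>C</math> rows preserves orthogonality. All names and sizes are hypothetical.

```python
import numpy as np

def sylvester_hadamard(n):
    """Hadamard matrix of order n via Sylvester's construction (n a power of two)."""
    if n < 1 or (n & (n - 1)) != 0:
        raise ValueError("n must be a power of two")
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])     # doubles the order at each step
    return H

N, C = 16, 5                                # assumed sizes with C <= N
H_hat = sylvester_hadamard(N)[:C, :]        # truncated classifier of shape (C, N)

rng = np.random.default_rng(0)
x = rng.normal(size=N)
x_hat = x / np.linalg.norm(x)
b = np.zeros(C)

y = H_hat @ x_hat + b                       # classifier output as a matrix product

# The same output using only sign selection and addition, with no multiplications.
y_signs = np.array([x_hat[row == 1].sum() - x_hat[row == -1].sum() for row in H_hat]) + b
```

Since every entry is ±1, the matrix can be regenerated on demand instead of being stored, and each output is a signed sum of the inputs.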

=Experimental Results=

The authors have evaluated their proposed model on the following datasets:

==CIFAR-10/100==

===About the Dataset===

CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of the same size, but contains 100 different classes.

===Training Details===

The authors trained a residual network (He et al., 2016) on the CIFAR-10 dataset. The network depth was 56, and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation, is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].

<center>[[File: figure1_resnet_cifar10.png]]</center>

These results demonstrate that although the training error is considerably lower for the network with the learned classifier, both models achieve the same classification accuracy on the validation set. The authors' conjecture is that with the new fixed parameterization, the network can no longer increase the norm of a given sample's representation, so learning its label requires more effort. As this may happen only for specific seen samples, it affects only the training error.

The authors also compared fixing the scale variable <math>\alpha</math> at different values against learning it. Results for <math>\alpha \in \{0.1, 1, 10\}</math> are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error. Similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha = 1</math> or <math>10</math> suffices), at the expense of another hyper-parameter to tune. In all further experiments, the scaling parameter <math>\alpha</math> was regularized with the same weight decay coefficient used on the original classifier. Although learning the scale is not necessary, it helps convergence during training.

<center>[[File: figure3_alpha_resnet_cifar.png]]</center>

The authors then trained the model on the CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with a depth of 100 layers and k = 12. The higher number of classes caused the number of classifier parameters to grow to about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained as good as that of the original model, and the same training curve was observed as earlier.

==IMAGENET==

===About the Dataset===

The ImageNet dataset introduced by Deng et al. (2009) spans over 1000 visual classes and over 1.2 million samples. It is considered a more challenging dataset than CIFAR-10/100.

===Experiment Details===

The authors evaluated their fixed-classifier method on ImageNet using Resnet50 (He et al., 2016) and Densenet169 (Huang et al., 2017), as described in the original works. Using a fixed classifier removed approximately 2 million parameters from the models, accounting for about 8% and 12% of the model parameters respectively. The experiments revealed trends similar to those observed on CIFAR-10.

For a stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed for low-memory and limited-compute platforms and whose final classification layer accounts for the majority of the model's parameters. Fixing the classifier, which holds 0.96 million parameters in the original model, left a model with 0.86 million parameters. Again, the proposed modification gave validation accuracy and convergence similar to the original model.

The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].

<center>[[File: table1_fixed_results.png]]</center>

==Language Modelling==

Recent works have empirically found that using the same weights for both the word embedding and the classifier can yield equal or better results than using a separate pair of weights. The authors therefore experimented with fixed classifiers on language modelling, as it also requires classification over all possible tokens in the task vocabulary. They trained a recurrent model with 2 LSTM layers (Hochreiter & Schmidhuber, 1997) and an embedding and hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using the same settings as in Merity et al. (2017). The WikiText2 dataset contains about 33K different words, so the expected number of parameters in the embedding and classifier layers was about 34 million. This is about 89% of the 38 million parameters used by the whole model. However, using a random orthogonal transform yielded poor results compared to a learned embedding. This was suspected to be due to the semantic relationships captured in the embedding layer of language models, which have no counterpart in image classification tasks. The intuition was further supported by the much better results obtained when pre-trained embeddings were used, produced either by the word2vec algorithm of Mikolov et al. (2013) or by PMI factorization as suggested by Levy & Goldberg (2014).

<center>[[File: language.png]]</center>
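
The parameter counts quoted for the language model can be checked with a few lines of arithmetic. This is a rough sketch: the vocabulary is rounded to 33,000 as in the text, bias terms are ignored, and the variable names are illustrative.

```python
vocab_size = 33_000      # "about 33K different words" (rounded, as in the text)
hidden_size = 512        # embedding and hidden size

embedding_params = vocab_size * hidden_size     # input word embedding
classifier_params = hidden_size * vocab_size    # output classifier over the vocabulary
combined = embedding_params + classifier_params

share = 34e6 / 38e6      # quoted 34M out of the quoted 38M total

print(f"embedding + classifier: {combined:,} parameters")
print(f"share of the whole model: {share:.0%}")
```

The product gives roughly 33.8 million parameters, consistent with the "about 34 million" and "about 89%" figures reported above.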

=Discussion=

==Implications and Use Cases==

As the number of classes in benchmark datasets increases, the computational demands of the final classifier increase as well. To illustrate the problem, the authors consider the work of Sun et al. (2017), which introduced JFT-300M, an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016) with a 2048-sized representation, this led to over 36M parameters in the final classification layer, meaning that over 60% of the model parameters resided there. Sun et al. (2017) also describe the difficulty of distributing so many parameters over the training servers, which involves a non-trivial overhead when synchronizing the model for each update. The authors claim that a fixed classifier would help considerably in this kind of scenario, since it removes the need for any gradient synchronization for the final layer. Furthermore, using a Hadamard matrix removes the need to store the transformation altogether, allowing considerable memory and computational savings.
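
The 36M figure can be reproduced from the class count and the representation size. The snippet below is a back-of-the-envelope sketch that rounds the class count to 18,000 and ignores biases; the implied bound on the remaining parameters follows only from the "over 60%" statement and is not a number reported in the summary.

```python
num_classes = 18_000        # "over 18K different classes" (rounded)
representation = 2048       # size of the final representation

classifier_params = num_classes * representation

# If the classifier holds more than 60% of all parameters, the rest of the
# network holds less than (0.4 / 0.6) = 2/3 of the classifier's count.
rest_upper_bound = classifier_params * 2 // 3

print(f"classifier weights: {classifier_params:,}")
print(f"implied bound on remaining parameters: < {rest_upper_bound:,}")
```

With a fixed classifier, none of these roughly 36.9 million weights would need gradients or synchronization between training servers.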

==Possible Caveats==

The good performance of fixed classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when the ratio between the number of learned features and the number of classes is small, that is, when <math>C > N</math>. However, the authors tested their method in such cases and report that the model still performed well.
Another factor that can affect performance is highly correlated classes. A fixed orthogonal classifier cannot represent correlations between classes, so the network may have difficulty learning. In language models, word classes tend to be highly correlated, which likewise makes learning difficult.

Also, the proposed approach only eliminates the computation associated with the classifier weights, so when there are few classes the savings will not be readily apparent.

==Future Work==

The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagation. In that case, the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there would be no need for per-sample normalization.
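
The constant-norm observation is easy to verify numerically. The snippet below is a small illustration with random ±1 activations; the width and sample count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64                                       # assumed hidden layer width
h = rng.choice([-1.0, 1.0], size=(5, N))     # five binarized hidden representations

norms = np.linalg.norm(h, axis=1)            # L2 norm of each sample
expected = np.sqrt(N)                        # identical for every +/-1 vector of width N

# Normalizing each sample is therefore just a division by one shared constant.
h_hat = h / expected
```

Every row has norm <math>\sqrt{N}</math>, so dividing by it is equivalent to rescaling <math>\alpha</math> by a fixed factor.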

Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.

A related paper (Rosenfeld & Tsotsos, 2018) claims that fixing most of the parameters of a neural network achieves results comparable to learning all of them.

=Conclusion=

In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the trainable parameters from the classification layer. The empirical results from experiments on the CIFAR and ImageNet datasets suggest that such a change leads to little or no decline in the performance of the architecture. Furthermore, using a Hadamard matrix as the classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on a large number of transformation coefficients.

Another possible direction for future research is to find new efficient methods for creating pre-defined word embeddings, which require a huge number of parameters that could possibly be avoided when learning a new task. More emphasis should therefore be given to the representations learned by the non-linear parts of neural networks, up to the final classifier, as the classifier itself seems highly redundant.

=Critique=

The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).

=References=

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6(6):1184–1238, 1978.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.

Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.

Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017, pp. 157, 2017.

Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.

Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.

File:language.png

2018-11-30T03:21:14Z

J385chen:

Fix your classifier: the marginal value of training the last weight layer

2018-11-30T03:21:03Z

J385chen:

The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.

=Introduction=

Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) being used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.

=Brief Overview=

To alleviate this problem, the authors propose that the final layer of the classifier be fixed (up to a global scale constant). They argue that, with little or no loss of accuracy on most classification tasks, this provides significant memory and computational benefits. In addition, they show that initializing the classifier with a Hadamard matrix can also make inference faster.

=Previous Work=

Training NN models and using them for inference requires large amounts of memory and computational resources. Consequently, an extensive amount of recent research has aimed to reduce the size of networks, including:

* Weight sharing and specification (Han et al., 2015)

* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)

* Low-rank approximations to speed up CNN (Tai et al., 2015)

* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)

Some past works have also shown that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many classification tasks. The authors' proposal in the current paper is the reverse: the earlier layers are learned, while the final affine transformation is fixed.

=Background=

Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of parameterized convolutional layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier CNN architectures (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully connected layers at a later stage of the network, presumably to allow classification based on global features of an image.

== Shortcomings of the Final Classification Layer and its Solution ==

Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).

It has been shown previously that these layers can be easily compressed and reduced after a model is trained, by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architectures are characterized by the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), which leads to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works have shown that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer is introduced and the objective considers only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant in which the classification layer was kept fixed at its original initialization, with no negative impact on performance on the CIFAR-10 dataset.

=Proposed Method=

The aforementioned works provide evidence that fully connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors suggest that the parameters used for the final classification transform are completely redundant and can be replaced with a predetermined linear transform. This holds even for large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).

==Using a Fixed Classifier==

Suppose the final representation obtained by the network (the last hidden layer) is <math>x = F(z;\theta)</math>, where <math>F</math> is assumed to be a deep neural network with input <math>z</math> and parameters <math>\theta</math>, e.g., a convolutional network, trained by backpropagation.

In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math>, where <math>W</math> and <math>b</math> are also trained by backpropagation.

For an input <math>x</math> of length <math>N</math> and <math>C</math> possible outputs, <math>W</math> is required to be an <math>N \times C</math> matrix. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation

<math>
v_i = \frac{e^{y_i}}{\sum_{j=1}^{C} e^{y_j}}, \quad i \in \{1, \dots, C\}
</math>

and reducing the expected negative log-likelihood with respect to the ground-truth target <math>t \in \{1, \dots, C\}</math> by minimizing the loss function:

<math>
L(x, t) = -\log v_t = -w_t \cdot x - b_t + \log \left( \sum_{j=1}^{C} e^{w_j \cdot x + b_j} \right)
</math>

where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.

==Choosing the Projection Matrix==

To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math>Q \in \mathbb{R}^{N \times C}</math> such that <math>\forall i \neq j : q_i \cdot q_j = 0</math> and <math>\| q_i \|_{2} = 1</math>, where <math>q_i</math> is the <math>i</math>-th column of <math>Q</math>. This is ensured by simple random sampling followed by singular-value decomposition.

As the columns of the classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, the authors find it beneficial
to also restrict the representation <math>x</math> by normalizing it to reside on the <math>N</math>-dimensional unit sphere:

<center><math>
\hat{x} = \frac{x}{||x||_{2}}
</math></center>

This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, it introduces the issue that <math>q_i \cdot \hat{x}</math> is bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive and the network can no longer re-scale its input. The issue can be amended with a fixed scale <math>T</math> applied to the softmax inputs, <math>f(y) = \text{softmax}(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. The authors therefore propose a single learned scalar parameter <math>\alpha</math> for the softmax scale, which effectively functions as the inverse softmax temperature <math>\frac{1}{T}</math>. Normalized weights with an additional scale coefficient are thus used, specifically with a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b \in \mathbb{R}^{C}</math> is kept, and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:

<center>
<math>
v_i=\frac{e^{\alpha q_i \cdot \hat{x} + b_i}}{\sum_{j=1}^{C} e^{\alpha q_j \cdot \hat{x} + b_j}}, \quad i \in \{1, \ldots, C\}
</math>
</center>

and the loss to be minimized is:

<center><math>
L(x, t) = -\alpha q_t \cdot \frac{x}{\|x\|_{2}} - b_t + \log \left( \sum_{i=1}^{C} \exp \left( \alpha q_i \cdot \frac{x}{\|x\|_{2}} + b_i \right) \right)
</math></center>

where <math>x</math> is the final representation obtained by the network for a specific sample, and <math>t \in \{1, \ldots, C\}</math> is the ground-truth label for that sample. The behaviour of the parameter <math>\alpha</math> over time, which is logarithmic in nature and matches the behaviour exhibited by the norm of a learned classifier, is shown in
[[Media: figure1_log_behave.png| Figure 1]].

<center>[[File:figure1_log_behave.png]]</center>
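
The effect of the scale <math>\alpha</math> can be illustrated with a small NumPy sketch of the fixed classifier. The projection built from identity columns, the zero bias and the two values of <math>\alpha</math> are assumptions chosen to keep the example short, and <math>\alpha</math> is passed in as a constant rather than learned.

```python
import numpy as np

def fixed_classifier(x, Q, b, alpha, t):
    """Return (v, loss) for a fixed projection Q (N x C), bias b (C,), scale alpha, target t."""
    x_hat = x / np.linalg.norm(x)                            # normalize onto the unit sphere
    y = alpha * (Q.T @ x_hat) + b                            # scores alpha * q_i . x_hat + b_i
    log_sum = y.max() + np.log(np.exp(y - y.max()).sum())    # stable log-sum-exp
    v = np.exp(y - log_sum)                                  # softmax output
    loss = -y[t] + log_sum                                   # negative log likelihood of target t
    return v, loss

N, C = 16, 10
Q = np.eye(N)[:, :C]      # hypothetical orthonormal projection: first C basis vectors
b = np.zeros(C)
x = 3.0 * Q[:, 3]         # representation perfectly aligned with class 3

v_low, loss_low = fixed_classifier(x, Q, b, alpha=1.0, t=3)
v_high, loss_high = fixed_classifier(x, Q, b, alpha=10.0, t=3)
v_scaled, _ = fixed_classifier(100.0 * x, Q, b, alpha=10.0, t=3)   # rescaled input
```

Even for a perfectly aligned sample, the bounded scores keep the target probability low when <math>\alpha = 1</math>, while a larger <math>\alpha</math> lets the softmax saturate. Rescaling <math>x</math> has no effect on the output because of the normalization.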

When <math> -1 \le q_i · \hat{x} \le 1 </math>, a possible cosine angle loss is

<center>[[File:caloss.png]]</center>

However, its final validation accuracy is slightly lower than that of the original models.

==Using a Hadamard Matrix==

Recall that a Hadamard matrix (Hedayat et al., 1978) <math> H </math> is an <math> n \times n </math> matrix in which all entries are either +1 or −1.
Furthermore, <math> H </math> is orthogonal, such that <math> HH^{T} = nI_n </math>, where <math>I_n</math> is the identity matrix. Instead of using the entire Hadamard matrix <math>H</math>, a truncated version <math> \hat{H} \in \{-1, 1\}^{C \times N}</math>, in which all <math>C</math> rows are orthogonal, is used as the final classification layer, such that:

<center><math>
y = \hat{H} \hat{x} + b
</math></center>

This usage allows two main benefits:
* A deterministic, low-memory and easily generated matrix that can be used for classification.
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.

Here, <math>n</math> must be a multiple of 4, but the matrix can easily be truncated to fit normally defined networks. Also, as the classifier weights are fixed and need only 1-bit precision, it is now possible to focus attention on the features preceding the classifier.
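
A truncated Hadamard classifier can be sketched as follows. The sketch uses Sylvester's construction, which only covers sizes that are powers of 2 (a subset of the admissible sizes), and the sizes and function name are assumptions made for illustration.

```python
import numpy as np

def sylvester_hadamard(n):
    """Build an n x n Hadamard matrix by Sylvester's construction (n must be a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])     # doubling step: [[H, H], [H, -H]]
    return h

N, C = 64, 10
H = sylvester_hadamard(N)
H_hat = H[:C, :]                            # truncated classifier: first C rows, shape C x N

rng = np.random.default_rng(0)
x = rng.normal(size=N)
x_hat = x / np.linalg.norm(x)               # normalized representation
b = np.zeros(C)

y = H_hat @ x_hat + b                       # classifier output via a matrix product
# Same result using only sign flips and additions, no multiplications by weights:
y_signs = np.where(H_hat > 0, x_hat, -x_hat).sum(axis=1) + b
```

Because the matrix is generated deterministically, it never has to be stored or synchronized, and the second form shows why no real multiplication is needed.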

=Experimental Results=

The authors have evaluated their proposed model on the following datasets:

==CIFAR-10/100==

===About the Dataset===

CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of same size, but contains 100 different classes.

===Training Details===

The authors trained a residual network (He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation, is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].

<center>[[File: figure1_resnet_cifar10.png]]</center>

These results demonstrate that although the training error is considerably lower for the network with the learned classifier, both models achieve the same classification accuracy on the validation set. The authors' conjecture is that with the new fixed parameterization, the network can no longer increase the
norm of a given sample's representation, so learning its label requires more effort. As this may happen for specific seen samples, it affects only the training error.

The authors also compared using a fixed scale variable <math>\alpha</math> at different values against the learned parameter. Results for <math>\alpha \in \{0.1, 1, 10\}</math> are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error. As can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha = 1</math> or <math>10</math> suffices), at the expense of another hyper-parameter to tune. In all further experiments, the scaling parameter <math>\alpha</math> was regularized with the same weight decay coefficient used on the original classifier. Although learning the scale is not necessary, it helps convergence during training.

<center>[[File: figure3_alpha_resnet_cifar.png]]</center>

The authors then trained the model on the CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with a depth of 100 layers and k = 12. The higher number of classes caused the number of classifier parameters to grow to about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained as good as that of the original model, and the same training curve was observed as earlier.

==IMAGENET==

===About the Dataset===

The ImageNet dataset introduced by Deng et al. (2009) spans over 1000 visual classes and over 1.2 million samples. It is considered a more challenging dataset than CIFAR-10/100.

===Experiment Details===

The authors evaluated their fixed-classifier method on ImageNet using the Resnet50 model of He et al. (2016) and the Densenet169 model (Huang et al., 2017), as described in the original work. Using a fixed classifier removed approximately 2 million parameters from the models, accounting for about 8% and 12% of the model parameters respectively. The experiments revealed trends similar to those observed on CIFAR-10.

For a stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed for low-memory and limited-compute platforms and in which the final-layer parameters make up the majority of the model. They were able to reduce the parameters to 0.86 million, as compared to 0.96 million parameters in the final layer of the original model. Again, the proposed modification of the original model gave similar convergence results on validation accuracy.

The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].

<center>[[File: table1_fixed_results.png]]</center>

==Language Modelling==

Recent works have empirically found that using the same weights for both the word embedding and the classifier can yield equal or better results than using a separate pair of weights. The authors therefore experimented with fixed classifiers on language modelling, as it also requires classification over all possible tokens in the task vocabulary. They trained a recurrent model with two LSTM layers (Hochreiter & Schmidhuber, 1997) and an embedding and hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using the same settings as in Merity et al. (2017). The WikiText2 dataset contains about 33K different words, so the number of parameters expected in the embedding and classifier layers was about 34 million. This is about 89% of the 38 million parameters used for the whole model. However, using a random orthogonal transform yielded poor results compared to a learned embedding. This was suspected to be due to the semantic relationships captured in the embedding layer of language models, which have no counterpart in image classification tasks. The intuition was further supported by the much better results obtained when pre-trained embeddings were used, from either the word2vec algorithm of Mikolov et al. (2013) or PMI factorization as suggested by Levy & Goldberg (2014).

<center>[[File: language.png]]</center>
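
The parameter counts quoted for the language model can be checked with a few lines of arithmetic. The vocabulary is rounded to 33,000 as in the text, and biases are ignored; both are simplifications of this sketch.

```python
vocab_size = 33_000            # "about 33K different words" (rounded, as in the text)
hidden_size = 512              # embedding and hidden size

embedding_params = vocab_size * hidden_size       # input embedding matrix
classifier_params = vocab_size * hidden_size      # output classifier weights (bias ignored)
embed_plus_classifier = embedding_params + classifier_params

total_model_params = 38_000_000                   # whole-model figure quoted above
fraction = embed_plus_classifier / total_model_params
print(f"{embed_plus_classifier:,} parameters, {fraction:.1%} of the model")
```

This gives roughly 34 million parameters and a share of about 89%, in line with the figures above.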

=Discussion=

==Implications and Use Cases==

With the increasing number of classes in benchmark datasets, the computational demands of the final classifier increase as well. To understand the problem better, the authors consider the work of Sun et al. (2017), which introduced JFT-300M, an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016) with a 2048-sized representation led to a model with over 36M parameters, meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty of distributing so many parameters over the training servers, which involves a non-trivial overhead during synchronization of the model for each update. The authors claim that a fixed classifier would help considerably in this kind of scenario, since it removes the need for any gradient synchronization for the final layer. Furthermore, using a Hadamard matrix removes the need to store the transformation altogether, allowing considerable memory and computational savings.
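
A rough check of these figures is sketched below. The class count uses the lower bound of 18,000, and the size of the Resnet50 backbone without its final layer (about 23.5M parameters) is an assumption that is not stated in the summary.

```python
feature_dim = 2048             # size of the final representation
num_classes = 18_000           # lower bound for "over 18K different classes"

classifier_weights = feature_dim * num_classes    # weights of the final layer, bias ignored

backbone_params = 23_500_000   # assumption: approximate Resnet50 size without the final layer
fraction = classifier_weights / (classifier_weights + backbone_params)
print(f"{classifier_weights:,} classifier weights, {fraction:.1%} of all parameters")
```

Under these assumptions the classifier alone holds over 36M weights and slightly more than 60% of all parameters, which matches the proportions quoted above.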

==Possible Caveats==

The good performance of fixed classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when the ratio between the number of learned features and the number of classes is small, that is, when <math> C > N</math>. However, the authors tested their method in such cases and the model still performed well.
Another factor that can affect the performance of a fixed classifier is highly correlated classes. In that case, the fixed classifier cannot represent the correlations between classes, and the network may have difficulty learning. In a language model, word classes tend to have highly correlated instances, which also makes learning difficult.

Also, the proposed approach only eliminates the computation of the classifier weights, so when there are few classes, the computational savings will not be readily apparent.

==Future Work==

The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there would be no need for per-sample normalization.
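
The constant-norm property is easy to verify numerically. The layer width and the random ±1 activations below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 256                                        # hypothetical hidden layer width
h = rng.choice([-1.0, 1.0], size=(5, width))       # five samples of binarized activations
norms = np.linalg.norm(h, axis=1)                  # L2 norm of each sample
print(norms, np.sqrt(width))
```

Every entry squares to 1, so each norm is exactly <math>\sqrt{256} = 16</math> regardless of the sample.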

Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.

A related paper claims that fixing most of the parameters of a neural network achieves results comparable to learning all of them (Rosenfeld & Tsotsos, 2018).

=Conclusion=

In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and ImageNet datasets suggest that such a change leads to little or no decline in the performance of the architecture. Furthermore, using a Hadamard matrix as the classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on a large number of transformation coefficients.

Another possible direction for future research is to find new, efficient methods for creating pre-defined word embeddings, which require a huge number of parameters that could possibly be avoided when learning a new task. More emphasis should therefore be given to the representations learned by the non-linear parts of neural networks, up to the final classifier, as the classifier itself seems highly redundant.

=Critique=

The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).

=References=

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6(6):1184–1238, 1978.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.

Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.

Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017, pp. 157, 2017.

Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.

Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

A Rosenfeld and JK Tsotsos. Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing. arXiv preprint arXiv:1802.00844, 2018.

CapsuleNets

2018-11-30T03:15:46Z

J385chen:

The paper "Dynamic Routing Between Capsules" was written by three researchers at Google Brain: Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. This paper was published and presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach, California. The same three researchers recently published a highly related paper "[https://openreview.net/pdf?id=HJWLfGWRb Matrix Capsules with EM Routing]" for ICLR 2018.

=Motivation=

Ever since AlexNet eclipsed the performance of competing architectures in the 2012 ImageNet challenge, convolutional neural networks have maintained their dominance in computer vision applications. Despite the recent successes and innovations brought about by convolutional neural networks, some assumptions made in these networks are perhaps unwarranted and deficient. Using a novel neural network architecture, the authors create CapsuleNets, a network that they claim is able to learn image representations in a more robust, human-like manner. With only a 3 layer capsule network, they achieved near state-of-the-art results on MNIST.

The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. This paper explores an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity. The length of the vector output of a capsule cannot exceed 1 because of an application of a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.

The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a "prediction vector" by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for other parents. This increases the contribution that the capsule makes to that parent, thus further increasing the scalar product of the capsule's prediction with the parent's output. This type of "routing-by-agreement" should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. The authors demonstrate that their dynamic routing mechanism is an effective way to implement the "explaining away" that is needed for segmenting highly overlapping objects.
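
The routing-by-agreement loop described above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' code: the prediction vectors are given directly instead of being computed from weight matrices, and the toy inputs, three iterations and the small <code>eps</code> term are assumptions of the sketch.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink vectors to length below 1 while keeping their orientation."""
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat, n_iters=3):
    """u_hat: (n_child, n_parent, dim) prediction vectors. Returns parent outputs v and couplings c."""
    n_child, n_parent, _ = u_hat.shape
    b = np.zeros((n_child, n_parent))                    # routing logits, start uniform
    for _ in range(n_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)             # couplings of each child sum to 1
        s = (c[:, :, None] * u_hat).sum(axis=0)          # weighted sum of predictions per parent
        v = squash(s)                                    # parent output vectors
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)     # agreement: scalar product with parent output
    return v, c

# Four child capsules, two parents. Predictions for parent 0 agree, those for parent 1 conflict.
u_hat = np.zeros((4, 2, 3))
u_hat[:, 0, :] = [1.0, 0.0, 0.0]
u_hat[:, 1, :] = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]]
v, c = route(u_hat)
```

In this toy case the conflicting predictions for parent 1 cancel out, so the couplings shift towards parent 0, whose output vector grows in length.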

==Adversarial Examples==

First discussed by Christian Szegedy et al. in late 2013, adversarial examples have been heavily discussed by the deep learning community as a potential security threat to AI learning. Adversarial examples are defined as inputs that an attacker intentionally creates to fool a machine learning model. An adversarial example is shown below:

[[File:adversarial_img_1.png ‎|center]]
To the human eye, the image appears to be a panda both before and after noise is injected into the image, whereas the trained ConvNet model classifies the noisy image as a gibbon with almost 100% certainty. The fact that the network is unable to classify the above image as a panda after the epsilon perturbation leads to many potential security risks in AI-dependent systems such as self-driving vehicles. Although various methods have been suggested to combat adversarial examples, robust defences are hard to construct due to the inherent difficulties in constructing theoretical models for the adversarial example crafting process. However, beyond the fact that these examples may serve as a security threat, they emphasize that convolutional neural networks do not learn image classification and object detection patterns the same way that a human would. Rather than identifying the core features of a panda, such as its eyes, mouth, nose, and the gradient changes in its black and white fur, the convolutional neural network seems to be learning image representations in a completely different manner. Deep learning researchers often attempt to model neural networks after human learning, and it is clear that further steps must be taken to make ConvNets robust against targeted noise perturbations.

==Drawbacks of CNNs==
Hinton claims that the key fault with traditional CNNs lies within the pooling function. Although pooling builds translational invariance into the network, it fails to preserve spatial relationships between objects. When we pool, we effectively reduce a <math>k \times k</math> window of convolved cells into a scalar input. This results in a desired local invariance without inhibiting the network's ability to detect features, but causes valuable spatial information to be lost.
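
The loss of spatial information can be seen in a small NumPy sketch of 2×2 max pooling. The two 4×4 feature maps are made-up inputs chosen so that the same activations sit at different positions within each pooling window.

```python
import numpy as np

def max_pool_2x2(img):
    """Non-overlapping 2x2 max pooling for an array with even height and width."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Same activation values, placed differently inside each 2x2 window.
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 7],
              [0, 0, 0, 0],
              [5, 0, 0, 3]])
b = np.array([[0, 0, 7, 0],
              [0, 9, 0, 0],
              [0, 5, 3, 0],
              [0, 0, 0, 0]])

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
```

Both inputs pool to the same 2×2 output, so later layers cannot tell where inside each window the features were located.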

Also, in CNNs, higher-level features combine lower-level features as a weighted sum: activations of a previous layer multiplied by current layer's weight, then passed to another activation function. In this process, pose relationship between simpler features is not part of the higher-level feature.

In the example below, the network is able to detect similar features (eyes, mouth, nose, etc.) within both images, but fails to recognize that one image is a human face while the other is a Picasso-esque image, due to the CNN's inability to encode spatial relationships after multiple pooling layers.
In deep learning, the activation level of a neuron is often interpreted as the likelihood of detecting a specific feature. CNNs are good at detecting features but less effective at exploring the spatial relationships among features (perspective, size, orientation).

[[File:Equivariance Face.png ‎|center]]

Here, the CNN could wrongly activate the neuron for face detection. Without recognizing the mismatch in spatial orientation and size, the activation for face detection will be too high.

Conversely, we hope that a CNN can recognize that both of the following pictures contain a kitten. Unfortunately, when we feed the two images into a ResNet50 architecture, only the first image is correctly classified, while the second image is predicted to be a guinea pig.

[[File:kitten.jpeg ‎|center]]

[[File:kitten-rotated-180.jpg ‎|center]]

For a more in depth discussion on the problems with ConvNets, please listen to Geoffrey Hinton's talk "What is wrong with convolutional neural nets?" given at MIT during the Brain & Cognitive Sciences - Fall Colloquium Series (December 4, 2014).

==Intuition for Capsules==
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Hinton argues that our brains reason about visual information by deconstructing it into a hierarchical representation, which we then match to familiar patterns and relationships from memory. The key difference between this understanding and the functionality of CNNs is that recognition of an object should not depend on the angle from which it is viewed.

To enforce rotational and translational equivariance, Capsule Networks store and preserve hierarchical pose relationships between objects. The core idea behind capsule theory is the explicit numerical representation of relative relationships between different objects within an image. By building these relationships into the Capsule Networks model, the network is able to recognize a newly seen object as a rotated view of a previously seen object. For example, the image below shows the Statue of Liberty from five different angles. Even if a person had only seen the Statue of Liberty from one angle, they would still be able to ascertain that all five pictures below contain the same object (just from a different angle).

[[File:Rotational Invariance.jpeg ‎|center]]

Building on this idea of hierarchical representation of spatial relationships between key entities within an image, the authors introduce Capsule Networks. Compared with traditional CNNs, Capsule Networks are better equipped to classify objects correctly under rotation. Furthermore, the authors managed to achieve state-of-the-art results on MNIST using a fraction of the training samples that alternative state-of-the-art networks require.

=Background, Notation, and Definitions=

==What is a Capsule==
"Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold."

In essence, capsules store object properties in a vector form; probability of detection is encoded as the vector's length, while spatial properties are encoded as the individual vector components. Thus, when a feature is present but the image captures it under a different angle, the probability of detection remains unchanged.

A brief overview/understanding of capsules can be found in other papers from the author. To quote from [https://openreview.net/pdf?id=HJWLfGWRb this paper]:

<blockquote>
A capsule network consists of several layers of capsules. The set of capsules in layer L is denoted
as <math>\Omega_L</math>. Each capsule has a 4x4 pose matrix, <math>M</math>, and an activation probability, <math>a</math>. These are like the
activities in a standard neural net: they depend on the current input and are not stored. In between
each capsule i in layer L and each capsule j in layer L + 1 is a 4x4 trainable transformation matrix,
<math>W_{ij}</math> . These <math>W_{ij}</math>'s (and two learned biases per capsule) are the only stored parameters and they
are learned discriminatively. The pose matrix of capsule i is transformed by <math>W_{ij}</math> to cast a vote
<math>V_{ij} = M_iW_{ij}</math> for the pose matrix of capsule j. The poses and activations of all the capsules in layer
L + 1 are calculated by using a non-linear routing procedure which gets as input <math>V_{ij}</math> and <math>a_i</math> for all
<math>i \in \Omega_L, j \in \Omega_{L+1}</math>
</blockquote>

==Notation==

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The paper applies a non-linear squashing operation to ensure that the vector length falls between 0 and 1, with short vectors (entities that are unlikely to be present) shrunk towards 0 and long vectors shrunk to a length slightly below 1.

\begin{align} \mathbf{v}_j &= \frac{||\mathbf{s}_j||^2}{1+ ||\mathbf{s}_j||^2} \frac{\mathbf{s}_j}{||\mathbf{s}_j||} \end{align}

where <math>\mathbf{v}_j</math> is the vector output of capsule <math>j</math> and <math>s_j</math> is its total input.
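
The squashing non-linearity above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the small constant <code>eps</code> is an assumption added here to avoid dividing by zero when the input is the zero vector.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink s to a length in [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(sq_norm + eps)  # eps is an assumed numerical safeguard
    return (sq_norm / (1.0 + sq_norm)) * (s / norm)

short_vec = squash(np.array([0.1, 0.0]))   # short input: length shrinks towards 0
long_vec = squash(np.array([10.0, 0.0]))   # long input: length approaches 1
print(np.linalg.norm(short_vec), np.linalg.norm(long_vec))
```

A short input vector ends up with a length close to 0, while a long one ends up with a length just under 1, which is what lets the length be read as a probability.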

For all but the first layer of capsules, the total input to a capsule <math>\mathbf{s}_j</math> is a weighted sum over all “prediction vectors” <math>\hat{\mathbf{u}}_{j|i}</math> from the capsules in the layer below. Each prediction vector is produced by multiplying the output <math>\mathbf{u}_i</math> of a capsule in the layer below by a weight matrix <math>\mathbf{W}_{ij}</math>:

\begin{align}
\mathbf{s}_j = \sum_i c_{ij}\hat{\mathbf{u}}_{j|i}, ~\hspace{0.5em} \hat{\mathbf{u}}_{j|i}= \mathbf{W}_{ij}\mathbf{u}_i
\end{align}
where the <math>c_{ij}</math> are coupling coefficients that are determined by the iterative dynamic routing process.

The coupling coefficients between capsule <math>i</math> and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits <math>b_{ij}</math> are the log prior probabilities that capsule <math>i</math> should be coupled to capsule <math>j</math>.

\begin{align}
c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}
\end{align}
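
The three equations in this section can be traced with a small NumPy sketch. The layer sizes, the random values, and the variable names are illustrative assumptions only; the logits <math>b_{ij}</math> are set to zero, which corresponds to a uniform prior over the capsules in the layer above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, d_in, d_out = 6, 3, 8, 16            # illustrative sizes only

u = rng.normal(size=(n_in, d_in))                 # outputs u_i of the lower capsules
W = rng.normal(size=(n_in, n_out, d_out, d_in))   # one matrix W_ij per capsule pair
b = np.zeros((n_in, n_out))                       # routing logits b_ij

# Prediction vectors: u_hat[i, j] = W[i, j] @ u[i]
u_hat = np.einsum("ijkl,il->ijk", W, u)

# Routing softmax over the capsules j in the layer above
e = np.exp(b - b.max(axis=1, keepdims=True))
c = e / e.sum(axis=1, keepdims=True)

# Total input: s_j = sum_i c_ij * u_hat_{j|i}
s = np.einsum("ij,ijk->jk", c, u_hat)
print(u_hat.shape, c.sum(axis=1), s.shape)
```

Each row of <code>c</code> sums to 1, so every lower-level capsule distributes its output across the capsules above it.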

=Network Training and Dynamic Routing=

==Understanding Capsules==
The notation can get somewhat confusing, so I will provide intuition behind the computational steps within a capsule. The following image is taken from naturomic's talk on Capsule Networks.

[[File:CapsuleNets.jpeg|center|800px]]

The above image illustrates the key mathematical operations happening within a capsule (and compares them to the structure of a neuron). Although the operations are rather straightforward, it's crucial to note that the capsule applies an affine transformation to each input vector. The length of each input vector <math>\mathbf{u}_{i}</math> represents the probability of entity <math>i</math> existing in a lower level. This vector is then reoriented with an affine transform using the matrices <math>\mathbf{W}_{ij}</math>, which encode spatial relationships between the lower-level entity <math>i</math> and the higher-level feature <math>j</math>.

We illustrate the intuition behind vector-to-vector matrix multiplication within capsules using the following example: if vectors <math>\mathbf{u}_{1}</math>, <math>\mathbf{u}_{2}</math>, and <math>\mathbf{u}_{3}</math> represent detection of eyes, nose, and mouth respectively, then after multiplication with trained weight matrices <math>\mathbf{W}_{ij}</math> (where <math>j</math> denotes existence of a face), each of them yields a prediction of the location of the higher level feature (face), similar to the image below.

[[File:Predictions.jpeg ‎|center]]

==Dynamic Routing==
A capsule <math>i</math> in a lower-level layer needs to decide how to send its output vector to higher-level capsules <math>j</math>. This decision is made with probability proportional to <math>c_{ij}</math>. If there are <math>M</math> capsules in the level that capsule <math>i</math> routes to, then we know the following properties about <math>c_{ij}</math>: <math>\sum_{j=1}^M c_{ij} = 1, c_{ij} \geq 0</math>

In essence, the <math>\{c_{ij}\}_{j=1}^M</math> denotes a discrete probability distribution with respect to capsule <math>i</math>'s output location. Lower level capsules decide which higher level capsules to send vectors into by adjusting the corresponding routing weights <math>\{c_{ij}\}_{j=1}^M</math>. After a few iterations in training, numerous vectors will have already been sent to all higher level capsules. Based on the similarity between the current vector being routed and all vectors already sent into the higher level capsules, we decide which capsule to send the current vector into.
[[File:Dynamic Routing.png|center|900px]]

From the image above, we notice that a cluster of points similar to the current vector has already been routed into capsule K, while most points in capsule J are highly dissimilar. It thus makes more sense to route the current observations into capsule K; we adjust the corresponding weights upward during training.

These weights are determined through the dynamic routing procedure:

[[File:Routing Algo.png‎|900px]]
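
As a rough companion to the procedure, the sketch below follows routing-by-agreement as described by Sabour et al. (2017): coupling coefficients come from a softmax over the logits, each higher-level capsule squashes its weighted input, and every logit is increased by the dot product between the prediction vector and the resulting output. The function names, the toy prediction vectors, and the <code>eps</code> constant are assumptions made for illustration.

```python
import numpy as np

def squash(s, eps=1e-9):
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat, n_iter=3):
    """u_hat has shape (n_in, n_out, d_out). Returns outputs v and couplings c."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                      # logits start at zero
    for _ in range(n_iter):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)         # routing softmax
        s = np.einsum("ij,ijk->jk", c, u_hat)        # weighted sum of predictions
        v = squash(s)
        b = b + np.einsum("ijk,jk->ij", u_hat, v)    # reward agreement
    return v, c

# Toy case: four lower capsules agree on capsule 0 and cancel out on capsule 1.
u_hat = np.zeros((4, 2, 2))
u_hat[:, 0] = [1.0, 0.0]
u_hat[:, 1] = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
v, c = route(u_hat)
print(c[:, 0])
```

In this toy case the coupling towards capsule 0 grows with each iteration because its predictions agree, while the conflicting predictions for capsule 1 produce no agreement signal.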

Note that the convergence of this routing procedure has been questioned. Although it is empirically shown that this procedure converges, the convergence has not been proven.

Although dynamic routing is not the only manner in which we can encode relationships between capsules, the premise of the paper is to demonstrate the capabilities of capsules under a simple implementation. Since the paper was released in 2017, numerous alternative routing implementations have been released including an EM matrix routing algorithm by the same authors (ICLR 2018).

=Architecture=
The capsule network architecture used for MNIST has about 8.2 million trainable parameters (6.8 million without the reconstruction decoder); the 11.36 million figure reported in the paper refers to the larger network used for MultiMNIST. The paper itself is not very detailed on the exact implementation of each architectural layer, and hence it leaves some degree of ambiguity on coding various aspects of the original network. The capsule network has 6 overall layers, with the first three layers denoting components of the encoder, and the last three denoting components of the decoder.

==Loss Function==
[[File:Loss Function.png‎|900px]]

The cost function looks complicated, but can be broken down into intuitive components. Before diving into the equation, remember that the length of a capsule's output vector denotes the probability that the corresponding object exists. The loss is computed separately for each digit capsule and then summed. The left term is active only for the capsule of the class that is actually present in the image and is zero for all other capsules: it penalizes that capsule when its vector norm falls below a fixed quantity <math>m^+ := 0.9</math>, so we subtract the vector norm from <math>m^+</math>. The right term is active only for the capsules of absent classes: it penalizes the network's confidence in an incorrect label, so we subtract <math>m^- := 0.1</math> from the vector norm. Both differences are clipped at zero before being squared.

A graphical representation of loss function values under varying vector norms is given below.
[[File:Loss function chart.png|900px]]
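
For concreteness, here is a small sketch of the margin loss from Sabour et al. (2017), <math>L_k = T_k \max(0, m^+ - ||\mathbf{v}_k||)^2 + \lambda (1 - T_k) \max(0, ||\mathbf{v}_k|| - m^-)^2</math>, with <math>T_k = 1</math> only when class <math>k</math> is present and <math>\lambda = 0.5</math>. The function name and the example capsule lengths are made up for illustration.

```python
import numpy as np

def margin_loss(lengths, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """lengths: norms of the digit capsule outputs; target: index of the true class."""
    T = np.zeros_like(lengths)
    T[target] = 1.0
    present = T * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - T) * np.maximum(0.0, lengths - m_neg) ** 2
    return float(np.sum(present + absent))

confident = np.full(10, 0.05)
confident[3] = 0.95            # long vector for the true class 3
wrong = np.full(10, 0.05)
wrong[7] = 0.95                # long vector for class 7 while the truth is 3

loss_good = margin_loss(confident, target=3)
loss_bad = margin_loss(wrong, target=3)
print(loss_good, loss_bad)
```

The first case incurs no loss because every capsule length is on the right side of its margin; the second is penalized both for the short true-class vector and for the long vector of an absent class.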

==Encoder Layers==
The main experiments in this paper were conducted on the MNIST dataset, and thus the architecture is built to classify that dataset. For more complex datasets, the results were less promising.

[[File:Architecture.png|center|900px]]

The encoder layer takes in a 28x28 MNIST image and learns a 16 dimensional representation of instantiation parameters.

'''Layer 1: Convolution''':
This layer is a standard convolution layer. Using 256 kernels of size 9x9x1, a stride of 1, and a ReLU activation function, it detects basic 2D features in the image.

'''Layer 2: PrimaryCaps''':
We represent the low level features detected during convolution as 32 channels of primary capsules. Each primary capsule applies eight convolutional kernels (9x9, stride 2) to the output of the convolution layer, producing an 8-dimensional vector, and feeds the resulting tensors into the DigitCaps layer.

'''Layer 3: DigitCaps''':
This layer contains 10 digit capsules, one for each digit. As explained in the dynamic routing procedure, each input vector from the PrimaryCaps layer has its own corresponding weight matrix <math>W_{ij}</math>. Using the routing coefficients <math>c_{ij}</math> and temporary coefficients <math>b_{ij}</math>, we train the DigitCaps layer to output ten 16-dimensional vectors. The length of the <math>i^{th}</math> vector in this layer corresponds to the probability of detection of digit <math>i</math>.

==Decoder Layers==
The decoder layer aims to train the capsules to extract meaningful features for image detection/classification. During training, it takes the 16-dimensional instantiation vector of the correct (not predicted) digit capsule in the DigitCaps layer, and attempts to recreate the 28x28 MNIST image as well as possible. Setting the loss function as reconstruction error (Euclidean distance between the reconstructed image and original image), we tune the capsules to encode features that are meaningful within the actual image.

[[File:Decoder.png|center|900px]]

The layer consists of three fully connected layers, and transforms a 16x1 vector from the encoder layer into a 28x28 image.
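
To make the layer sizes concrete, the following sketch counts trainable parameters for the encoder and decoder. It assumes the sizes reported by Sabour et al. (2017), which this summary does not spell out in full: 256 9x9 kernels in the first layer, 32 channels of 8D primary capsules, one 8x16 matrix <math>W_{ij}</math> per capsule pair, and fully connected decoder layers of 512, 1024, and one output unit per pixel, fed with the masked 160-dimensional DigitCaps output. The helper names are hypothetical.

```python
def conv_params(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out + c_out    # weights plus biases

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

def capsnet_params(side, n_classes=10):
    """Return (encoder, decoder) parameter counts for a side x side grey image."""
    conv1_side = side - 9 + 1                        # 9x9 kernel, stride 1, no padding
    prim_side = (conv1_side - 9) // 2 + 1            # 9x9 kernel, stride 2
    n_primary = prim_side * prim_side * 32           # number of primary capsules
    encoder = (conv_params(9, 1, 256)
               + conv_params(9, 256, 32 * 8)
               + n_primary * n_classes * 8 * 16)     # routing matrices W_ij
    decoder = (dense_params(16 * n_classes, 512)
               + dense_params(512, 1024)
               + dense_params(1024, side * side))
    return encoder, decoder

enc, dec = capsnet_params(28)
print(enc, dec, enc + dec)
print(sum(capsnet_params(36)))
```

Under these assumptions the 28x28 MNIST network comes to roughly 6.8 million encoder parameters and 8.2 million in total, while the same layers applied to a 36x36 input (the MultiMNIST image size) give about 11.36 million.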

In addition to the margin loss on the DigitCaps layer, we add reconstruction error as a form of regularization. We minimize the sum of squared differences between the outputs of the decoder's logistic units and the pixel intensities of the original image. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated below, reconstructions from the 16D output of the CapsNet are robust while keeping only important details.

[[File:Reconstruction.png|center|900px]]
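
The way the two loss terms are combined can be written out directly. This is a minimal sketch with made-up inputs; <code>total_loss</code> is a hypothetical helper, and the only value taken from the text is the 0.0005 scaling factor.

```python
import numpy as np

def total_loss(margin, image, reconstruction, scale=0.0005):
    """Margin loss plus the scaled sum of squared pixel differences."""
    recon = np.sum((reconstruction - image) ** 2)
    return margin + scale * recon

image = np.zeros((28, 28))
reconstruction = np.full((28, 28), 0.5)      # deliberately poor reconstruction
loss = total_loss(0.2, image, reconstruction)
print(loss)
```

Even a reconstruction that is off by 0.5 at every pixel contributes under 0.1 to the total, which illustrates why the scaling keeps the margin loss dominant.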

=MNIST Experimental Results=

==Accuracy==
The paper tests on the MNIST dataset with 60K training examples and 10K testing examples. Wan et al. [2013] achieve 0.21% test error with ensembling and augmenting the data with rotation and scaling, and 0.39% without them. As shown in Table 1, the authors manage to achieve 0.25% test error with only a 3 layer network; the previous state of the art only beat this number with very deep networks. This result shows the importance of routing and the reconstruction regularizer, which boost performance. Moreover, while the accuracy is very high, the number of parameters is much smaller than in the baseline model.

[[File:Accuracies.png|center|900px]]

==What Capsules Represent for MNIST==
The following figure shows the digit representation under capsules. Each row shows the reconstruction when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 in the range [−0.25, 0.25]. By tweaking the values, we notice how the reconstruction changes, and thus get a sense for what each dimension is representing. The authors found that some dimensions represent global properties of the digits, while others represent localized properties.
[[File:CapsuleReps.png|center|900px]]

One example the authors provide is: different dimensions are used for the length of the ascender of a 6 and the size of the loop. The variations include stroke thickness, skew and width, as well as digit-specific variations. The authors are able to show dimension representations using a decoder network by feeding a perturbed vector.
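
The perturbation experiment can be outlined as follows. This sketch only builds the perturbed capsule vectors; passing each row through the trained decoder is left out, and the function name and the example vector are assumptions.

```python
import numpy as np

def perturbations(capsule_vec, dim, lo=-0.25, hi=0.25, step=0.05):
    """Copies of capsule_vec with one dimension shifted by each offset in [lo, hi]."""
    n_steps = round((hi - lo) / step) + 1
    deltas = lo + step * np.arange(n_steps)
    out = np.tile(capsule_vec, (n_steps, 1))
    out[:, dim] += deltas
    return deltas, out

vec = np.linspace(-0.3, 0.3, 16)             # stand-in for a 16D DigitCaps output
deltas, rows = perturbations(vec, dim=5)
print(len(deltas), rows.shape)
```

With a step of 0.05 over [−0.25, 0.25] this gives 11 perturbed vectors per dimension, one for each image in a row of the figure.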

==Robustness of CapsNet==
The authors conclude that DigitCaps capsules learn more robust representations for each digit class than traditional CNNs. The trained CapsNet becomes moderately robust to small affine transformations in the test data.

To compare the robustness of CapsNet to affine transformations against traditional CNNs, both models (CapsNet and a traditional CNN with MaxPooling and DropOut) were trained on a padded and translated MNIST training set, in which each example is an MNIST digit placed randomly on a black background of 40 × 40 pixels. The networks were then tested on the [http://www.cs.toronto.edu/~tijmen/affNIST/ affNIST] dataset (MNIST digits with random affine transformation). An under-trained CapsNet which achieved 99.23% accuracy on the MNIST test set achieved a corresponding 79% accuracy on the affNIST test set. A traditional CNN achieved similar accuracy (99.22%) on the MNIST test set, but only 66% on the affNIST test set.

=MultiMNIST & Other Experiments=

==MultiMNIST==
To evaluate the performance of the model on highly overlapping digits, the authors generate a 'MultiMNIST' dataset. In MultiMNIST, each image consists of two overlaid MNIST digits from the same set (train or test) but of different classes. The results indicate a classification error rate of 5%. Additionally, CapsNet can be used to segment the image into the two digits that compose it. Moreover, the model is able to deal with the overlaps and reconstruct digits correctly, since each digit capsule can learn the style from the votes of the PrimaryCapsules layer (Figure 5).

There are some additional steps to generating the MultiMNIST dataset.

1. Both images are shifted by up to 4 pixels in each direction resulting in a 36 × 36 image. Bounding boxes of digits in MNIST overlap by approximately 80%, so this is used to make both digits identifiable (since there is no RGB difference learnable by the network to separate the digits)

2. The label becomes a vector of two numbers, representing the original digit and the randomly generated (and overlaid) digit.

[[File:CapsuleNets MultiMNIST.PNG|600px|thumb|center|Figure 5: Sample reconstructions of a CapsNet with 3 routing iterations on MultiMNIST test dataset.
The two reconstructed digits are overlayed in green and red as the lower image. The upper image
shows the input image. L:(l1; l2) represents the label for the two digits in the image and R:(r1; r2)
represents the two digits used for reconstruction. The two right most columns show two examples
with wrong classification reconstructed from the label and from the prediction (P). In the (2; 8)
example the model confuses 8 with a 7 and in (4; 9) it confuses 9 with 0. The other columns have
correct classifications and show that the model accounts for all the pixels while being able to assign
one pixel to two digits in extremely difficult scenarios (column 1 − 4). Note that in dataset generation
the pixel values are clipped at 1. The two columns with the (*) mark show reconstructions from a
digit that is neither the label nor the prediction. These columns suggest that the model is not just
finding the best fit for all the digits in the image including the ones that do not exist. Therefore in case
of (5; 0) it cannot reconstruct a 7 because it knows that there is a 5 and 0 that fit best and account for
all the pixels. Also, in the case of (8; 1) the loop of 8 has not triggered 0 because it is already accounted
for by 8. Therefore it will not assign one pixel to two digits if one of them does not have any other
support.]]

==Other datasets==
The authors also tested the proposed capsule model on the CIFAR10 dataset and achieved an error rate of 10.6%. The model tested was an ensemble of 7 models. Each of the models in the ensemble had the same architecture as the model used for MNIST (apart from using three color input channels and 64 different types of primary capsules). These 7 models were trained on 24x24 patches of the training images with 3 routing iterations. During experimentation, the authors also found that adding an additional none-of-the-above category helped improve the overall performance. The error rate achieved is comparable to the error rate achieved by a standard CNN model. According to the authors, one of the reasons for the low performance is that the backgrounds in CIFAR10 images are too varied to be adequately modeled by a reasonably sized capsule net.

The proposed model was also evaluated using a small subset of SVHN dataset. The network trained was much smaller and trained using only 73257 training images. The network still managed to achieve an error rate of 4.3% on the test set.

=Critique=
Although the network performs very favorably in the authors' experiments, it has a long way to go on more complex datasets. On CIFAR10, the network achieved subpar results, and the experimental results seem to get worse as the problem becomes more complex. This is to be expected, since these networks are still at an early stage; later innovations might come in the upcoming years or decades. It would also be wise to apply the model to other, larger datasets to make its claims more convincing. MNIST has simple patterns, and if the model had to be presented on only one dataset, MNIST was a poor choice: the paper's motivation is human-like visual recognition, and the objects seen in real life are far less regular than handwritten digits.

Hinton talks about CapsuleNets revolutionizing areas such as self-driving, but such groundbreaking innovations are far away from CIFAR10, and even further from MNIST. Only time can tell if CapsNets will live up to their hype.

Moreover, there is no underlying intuition provided on the main point of the paper which is that capsule nets preserve relations between extracted features from the proposed architecture. An explanation on the intuition behind this idea will go a long way in arguing against CNN networks.

Capsules inherently segment images and learn a lower dimensional embedding in a new manner, which makes them likely to perform well on segmentation and computer vision tasks once further research is done.

Additionally, these networks are more interpretable than CNNs, and have strong theoretical reasoning for why they could work. Naturally, it would be hard for a new architecture to beat the heavily researched/modified CNNs.

* From the OpenReview discussion ([https://openreview.net/forum?id=HJWLfGWRb]): it is not fully clear how effectively the method can be performed or how scalable it is. Evaluation is performed on a small dataset for shape recognition. The approach will need to be tested on larger, more challenging datasets.

=Future Work=
The same authors (Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst) presented another paper, "Matrix Capsules with EM Routing", at ICLR 2018, which achieved better results than the work presented in this paper. They presented a new multi-layered capsule network architecture, implemented an EM routing procedure, and introduced "Coordinate Addition". This new type of capsule network reduced the number of errors by 45% and performed better than a standard CNN on white box adversarial attacks. Capsule architectures are gaining interest because of their ability to achieve equivariance of parts and because they employ a new form of pooling called "routing" (as opposed to max pooling), which groups parts that make similar predictions of the whole to which they belong, rather than relying on spatial co-locality.
Moreover, the authors hint towards changing the curvature and sensitivities to various factors by introducing a new form of loss function. This may improve the performance of the model on more complicated datasets, which is one of the model's drawbacks.

Moreover, as mentioned in the critique, a good direction for future work would be to make the model more robust across datasets and to achieve acceptable performance on datasets containing the kinds of images regularly seen in real life.

=References=
#Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
#S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. arXiv preprint arXiv:1710.09829v2, 2017.
#G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. 2011.
#Geoffrey Hinton's talk: What is wrong with convolutional neural nets? Talk given at MIT, Brain & Cognitive Sciences, Fall Colloquium Series. [https://www.youtube.com/watch?v=rTawFwUvnLE]
#Understanding Hinton’s Capsule Networks, Max Pechyonkin's series. [https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b]
#Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
#Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
#Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583, 2015.
#Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
#Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-30T03:12:34Z

J385chen:

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series with multivariate and noisy signals, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained from different sources (e.g. financial news, an investment bank, a financial analyst) asynchronously, and each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than one with fewer clients, which means the significance of each past observation may depend on other factors that change in time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model), and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance motivates combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The proposed model is also inspired by the gating mechanism, which has been successful in RNNs and Highway Networks.

Time series forecasting is focused on modeling the predictors of future values of time series given their past. As in many cases the relationship between past and future observations is not deterministic, this amounts to expressing the conditional probability distribution as a function of the past observations:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
This forecasting problem has been approached almost independently by the econometrics and machine learning communities.

The reasons that financial time series are particularly challenging:
* Low signal-to-noise ratio and heavy-tailed distributions.
* Being observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments of time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial datasets remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that their model which combines linear models with deep learning models could perform better than just DL models like CNN, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
From recent proceedings in the main machine learning venues (i.e. ICML, NIPS, AISTATS, UAI), we can see that time series are often forecast using Gaussian processes[3,4], especially irregularly sampled time series[5]. Though the econometrics and machine learning approaches are still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to build such a combined model.

Although deep neural networks have been applied in many fields and produced satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons in modeling price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, a household electricity consumption dataset, and real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a model of order p (relating the current state to the p last states), the equation of the model is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.
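
A short NumPy sketch may help to fix the indexing in the AR(p) equation. The coefficient values, the Gaussian noise, and the zero initial history are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def simulate_ar(phi, c, sigma, n, rng):
    """Generate n values of X_t = c + sum_i phi_i * X_{t-i} + eps_t."""
    p = len(phi)
    x = np.zeros(n + p)                               # assumed zero initial history
    for t in range(p, n + p):
        lags = x[t - p:t][::-1]                       # X_{t-1}, X_{t-2}, ..., X_{t-p}
        x[t] = c + np.dot(phi, lags) + rng.normal(0.0, sigma)
    return x[p:]

def predict_next(phi, c, history):
    """One-step-ahead conditional mean given the observed history."""
    p = len(phi)
    return c + np.dot(phi, np.asarray(history)[-p:][::-1])

phi = np.array([0.5, 0.2])
rng = np.random.default_rng(0)
noiseless = simulate_ar(phi, c=1.0, sigma=0.0, n=3, rng=rng)
forecast = predict_next(phi, 1.0, noiseless)
print(noiseless, forecast)
```

Setting the noise scale to zero makes the recursion easy to check by hand: the series starts 1, 1.5, 1.95, and the next predicted value is 1 + 0.5 × 1.95 + 0.2 × 1.5 = 2.275.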

===Gating and weighting mechanisms===
Gating mechanisms for neural networks can help overcome the problem of vanishing gradients, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid non-linearity that controls the amount of output passed to the next layer. Different compositions of functions of this type have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].
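
A minimal sketch of such a gate is given below. The candidate function (a tanh layer), the weight matrices, and the input are arbitrary choices made for illustration; only the form <math>f(x)=c(x) \otimes \sigma(x)</math> comes from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_layer(x, W_c, W_g):
    candidate = np.tanh(W_c @ x)      # candidate output c(x), an assumed tanh layer
    gate = sigmoid(W_g @ x)           # values in [0, 1] controlling what passes through
    return candidate * gate           # element-wise product

x = np.array([1.0, -2.0])
W_c = np.eye(2)
W_g = 10.0 * np.eye(2)                # gate pre-activations of 10 and -20
out = gated_layer(x, W_c, W_g)
print(out)
```

The first component passes almost unchanged because its gate is close to 1, while the second is suppressed because its gate is close to 0.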

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks, and is most closely related to softmax gating used in MuFuRu(Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRu) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of inputs.
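
The softmax gating formula can be illustrated in the same way. Here the candidate outputs and the scores are supplied directly as arrays instead of being computed from an input, which is an assumption made to keep the sketch short; the softmax is taken across the <math>L</math> candidates separately for each output coordinate.

```python
import numpy as np

def softmax_gate(candidates, scores):
    """candidates, scores: arrays of shape (L, d) holding f^l(x) and p_hat^l(x)."""
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)              # weights sum to 1 over l
    return np.sum(p * candidates, axis=0), p

candidates = np.array([[1.0, 2.0], [3.0, 6.0]])
mixed, p = softmax_gate(candidates, np.zeros((2, 2)))
print(mixed, p)
```

With equal scores the weights are uniform and the output is the plain average of the candidates; unequal scores shift the weight towards the preferred candidate.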

This idea is also successfully used in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. The difference is that these functions are modelled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
There are mainly five motivations that are stated in the paper by the authors:
#The forecasting problem considered in this paper has been approached almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics focuses more on explaining variables than on improving out-of-sample prediction power. Such models tend to 'over-fit' on financial time series: their parameters are unstable and they perform poorly on out-of-sample prediction.
#It is difficult for learning algorithms to deal with time series data where the observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that is able to handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series of higher order with more past observations, the expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general may not allow a closed-form expression.
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously. Sampling such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents separate dimensions as a single one with dimension and duration indicators as additional features (Figure 2(b)).
#Consider a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that an LSTM could memorize the input values in each step and weight them at the output according to the durations, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to many more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional expectation of future values of a subset of elements of <math>x_n</math>:
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m},</math></div>
that is, the estimate is the sum of the columns of the matrix in brackets. Here:
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network, i.e. it is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.
#<math>\sigma</math> is a normalized activation function applied independently to each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>,
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
#<math>\otimes</math> is element-wise matrix multiplication (also known as the Hadamard product).
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix A.

Substituting the definition of <math>F</math> and writing the sum column by column, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S_{\cdot,m}(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network (SOCNN); <math>\text{off}</math> and <math>S</math> in the equation correspond to "Offset" and "Significance" in the name, respectively.
Figure 3 shows the scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]
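
As an illustrative special case (a sketch under simplifying assumptions that are not taken from the paper), suppose <math>d_I = 1</math>, the offset network outputs zero, and the significance network is constant, so that <math>\sigma</math> assigns weight <math>1/M</math> to every step. Writing <math>w_m</math> for the m-th entry of <math>W</math>, the estimator reduces to
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M w_m \cdot (0 + x_{n-m}^I) \cdot \frac{1}{M} = \sum_{m=1}^M \frac{w_m}{M} x_{n-m}^I,</math></div>
which has the form of an AR(M) predictor with coefficients <math>\varphi_m = w_m / M</math> and no constant term. In this reading, the offset and significance networks add data-dependent corrections to a linear autoregressive baseline.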

The form of <math>\hat{y}_n</math> separates the temporal dependence (captured by the weights <math>W_m</math>) from the data-dependent parts of the model. <math>S</math>, which represents the local significance of observations, is determined by its filters, which capture local dependencies and are independent of the relative position in time, and the predictors <math>\text{off}(x_{n-m})</math> are completely independent of position in time. Each past observation provides an adjusted single regressor for the target variable through the offset network. Because in asynchronous sampling consecutive values of x come from different signals and might be heterogeneous, the adjustment made by the offset network is important. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.
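
The weighting step described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the normalized activation <math>\sigma</math> is assumed to be a row-wise softmax, and the arrays <code>offsets</code> and <code>significance</code> stand in for the outputs of the offset and significance networks.

```python
import numpy as np

def softmax_rows(a):
    """Row-wise softmax, so that every row of the result sums to 1."""
    shifted = a - a.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def socnn_combine(W, offsets, x_target, significance):
    """Combine offset and significance outputs into the estimate y_hat.

    All arguments have shape (d_I, M); column m refers to the observation x_{n-m}.
    Returns an array of shape (d_I,).
    """
    weights = softmax_rows(significance)   # sigma(S(x)), assumed to be a softmax
    regressors = offsets + x_target        # off(x_{n-m}) + x^I_{n-m}
    return (W * regressors * weights).sum(axis=1)  # sum over the M columns

# Toy example with placeholder values (not data from the paper).
rng = np.random.default_rng(0)
d_I, M = 2, 5
W = np.ones((d_I, M))
offsets = np.zeros((d_I, M))
x_target = np.full((d_I, M), 2.0)
significance = rng.normal(size=(d_I, M))
y_hat = socnn_combine(W, offsets, x_target, significance)
```

With unit weights, zero offsets and a constant input, the estimate equals that constant whatever the significance scores are, because the softmax weights in each row sum to one.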

===Relation to asynchronous data===
One common problem with time series is that the durations between consecutive observations vary. The paper describes two ways to address this problem:
#Data preprocessing: aligning the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as noted in the figure, this approach tends to lose information and to enlarge the dataset and the model complexity.
#Additional features: treating the duration or time of the observations as additional features. This is the core of SOCNN and is shown in Figure 2(b).
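
The second option can be sketched as follows. This is a hypothetical helper, not code from the paper's repository: it assumes the raw data is a list of <code>(timestamp, source_index, value)</code> events and produces one row per observation containing the value, one-hot source indicators and the duration since the previous observation.

```python
import numpy as np

def to_async_representation(events, n_sources):
    """Build the single-series representation of asynchronous observations.

    events: list of (timestamp, source_index, value) tuples, in any order.
    Returns an array of shape (N, n_sources + 2) with columns
    [value, indicator_0, ..., indicator_{n_sources-1}, duration].
    """
    events = sorted(events, key=lambda e: e[0])  # order by observation time
    rows = np.zeros((len(events), n_sources + 2))
    prev_t = events[0][0] if events else 0.0
    for n, (t, src, value) in enumerate(events):
        rows[n, 0] = value          # the observed value
        rows[n, 1 + src] = 1.0      # which source produced it
        rows[n, -1] = t - prev_t    # time since the previous observation
        prev_t = t
    return rows

# Toy example: two sources quoting at irregular times (placeholder values).
events = [(0.0, 0, 1.5), (2.0, 1, 1.7), (0.5, 1, 1.6)]
rows = to_async_representation(events, n_sources=2)
```

Each row has <code>n_sources + 2</code> columns, which matches the statement in the Datasets section that a series with K sources is (K + 2)-dimensional in the asynchronous case.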

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason the authors add an auxiliary loss function, equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.
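
The two loss terms can be written out directly in NumPy. The sketch below is illustrative only: the function name and argument layout are assumptions, with <code>offsets</code> and <code>x_target</code> holding <math>\text{off}(x_{n-m})</math> and <math>x_{n-m}^I</math> as columns.

```python
import numpy as np

def total_loss(y_hat, y, offsets, x_target, alpha):
    """L^tot = L^2(y_hat, y) + alpha * L^aux for a single sample.

    y_hat, y: arrays of shape (d_I,).
    offsets, x_target: arrays of shape (d_I, M); column m refers to x_{n-m}.
    """
    l2 = np.sum((y_hat - y) ** 2)
    # Column m of (offsets + x_target) is the m-th intermediate predictor of y.
    residuals = offsets + x_target - y[:, None]
    aux = np.mean(np.sum(residuals ** 2, axis=0))  # average over the M steps
    return l2 + alpha * aux

# Toy example with placeholder numbers.
loss = total_loss(
    y_hat=np.array([2.0]),
    y=np.array([1.0]),
    offsets=np.array([[0.0, 0.0]]),
    x_target=np.array([[1.0, 3.0]]),
    alpha=0.5,
)
```

Setting <code>alpha=0</code> recovers the plain L2 loss, which corresponds to one of the auxiliary weights compared in the experiments.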

=Experiments=
The paper evaluates the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance is compared with a simple CNN, single- and multi-layer LSTMs, and a 25-layer ResNet. Apart from evaluating the SOCNN architecture, the paper also discusses the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: The authors generated 4 artificial series <math> X_{K \times N}</math> with <math>K \in \{16,64\} </math> sources: one synchronous and one asynchronous series for each value of K. Note that a series with K sources is (K + 1)-dimensional in the synchronous case and (K + 2)-dimensional in the asynchronous case. The base series in all processes was a stationary AR(10) series. Although that series has a true order of 10, in the experimental setting the input data included the past 60 observations. The rationale is twofold: the data is observed at irregular random times, and in real-life problems the order of the model is unknown.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that each observation contains the value of only one of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

[[File:async.png | 520px|center|]]

==Training details==
They applied a grid search over some hyperparameters in order to assess the significance of the model's components. The hyperparameters include the depth of the offset sub-network and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they used 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
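
For reference, this activation is a one-liner in NumPy. The function name and the <code>slope</code> argument are illustrative; 0.1 is the slope given in the formula above.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """LeakyReLU: identity for x >= 0, slope * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, slope * x)

activated = leaky_relu([-2.0, 0.0, 3.0])
```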
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.

Table 1 presents the configuration of network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio of 3 to 1. The remaining 20% of data was used as a test set.
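
One way to read this split is sketched below. It is a hypothetical helper under stated assumptions (samples indexed in time order, the split applied per series), not the authors' code: the last 20% of timesteps form the test set, and the first 80% are randomly divided into training and validation sets with a 3 to 1 ratio.

```python
import numpy as np

def chronological_split(n_samples, rng, holdout_fraction=0.2, train_to_val=3):
    """Return (train_idx, val_idx, test_idx) for samples indexed in time order."""
    n_dev = int(round(n_samples * (1 - holdout_fraction)))  # first 80% of timesteps
    shuffled = rng.permutation(n_dev)                       # random sampling within them
    n_train = n_dev * train_to_val // (train_to_val + 1)    # 3 parts train, 1 part validation
    train_idx = np.sort(shuffled[:n_train])
    val_idx = np.sort(shuffled[n_train:])
    test_idx = np.arange(n_dev, n_samples)                  # last 20%, kept in time order
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = chronological_split(100, np.random.default_rng(0))
```

Keeping the test set strictly after the training and validation period means no future test observations are used as training targets.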

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments in TensorFlow and Keras and used different GPUs to optimize the model for different datasets.

==Results==
Table 2 shows the results for all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other networks on the asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results as LSTM. Phased LSTM and ResNet performed very poorly on the artificial asynchronous dataset and the quotes dataset respectively. Notice that having more than one layer in the offset network has a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous dataset; see Table 3. However, for the other datasets its impact was negligible. This makes it hard to justify the introduction of the auxiliary loss function <math>L^{aux}</math>.

Also, relying on an artificial dataset for experimental results is not good practice in this paper. This is essentially an application paper, and such a dataset makes the results hard to reproduce and cannot support the performance claims made for the model.

[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, the robustness of the proposed SOCNN model is tested by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
In Figure 6, the purple and green lines seem to stay at the same position in both training and testing. SOCNN and single-layer LSTM are the most robust and least prone to overfitting compared to the other networks.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. The architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

SOCNN can be extended further by adding intermediate weighting layers of the same type to the network structure. Another possible extension, which needs further empirical study, is to consider convolutional kernels other than <math>1 \times 1</math> in the offset sub-network. Also, the architecture might be tested in the future on other real-life datasets with relevant characteristics, especially econometric datasets, and more generally on time series (stochastic process) regression.

=Critiques=
#The paper is essentially an application paper, and the proposed architecture shows improved performance over baselines on asynchronous time series.
#The quote data is proprietary and cannot be accessed. As a result, only two of the datasets are available.
#The 'Significance' network is described as critical to the model, but the paper does not show how the performance of SOCNN depends on the significance network.
#The transformation of the original data into asynchronous data is not clearly described.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way the training and test data were split is unclear. This could be important in the case of the financial dataset.
#Although the auxiliary loss function is presented as an important part of the model, its advantages are not made clear in the paper. The paper would benefit from describing its effectiveness in more detail.
#The paper does not state clearly whether model training was done on a rolling basis for time series forecasting.
#The model robustness analysis in Section 5 uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing only against neural network models such as CNNs, which are known to work badly on time series data, it would be much better if the authors also compared against well-known econometric time series models such as GARCH and VAR.
#The paper does not specify in detail how the training and test sets are separated, which is quite important in time-series problems. Moreover, rolling or online learning schemes should be included in the comparison, since they are standard in time-series prediction tasks.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-30T03:10:50Z

J385chen:

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series with multivariate and noisy signals, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained from different sources (e.g. financial news, an investment bank, a financial analyst etc.) asynchronously. Each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than an investment bank with fewer clients, which means the significance of each past observation may depend on other factors that change in time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance suggests combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The proposed model is inspired by the gating mechanism that has been successful in RNNs and Highway Networks.

Time series forecasting is focused on modeling the predictors of future values of a time series given its past. As in many cases the relationship between past and future observations is not deterministic, this amounts to expressing the conditional probability distribution as a function of the past observations:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
This forecasting problem has been approached almost independently by econometrics and machine learning communities. In this paper, the authors focus on modeling the predictors of future values of time series given their past values.

The reasons that financial time series are particularly challenging:
* Low signal-to-noise ratio and heavy-tailed distributions.
* Being observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments in time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial datasets remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that their model which combines linear models with deep learning models could perform better than just DL models like CNN, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
From recent proceedings of the main machine learning venues (i.e. ICML, NIPS, AISTATS, UAI), we can see that time series are often forecast using Gaussian processes[3,4], especially irregularly sampled time series[5]. Though the econometrics and machine learning approaches remain largely independent, combined models have started to appear, for example the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models and neural networks to achieve such a combined model.

Although deep neural networks have been applied in many fields and have produced satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, a household electricity consumption dataset, and real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a p-th order model (relating the current state to the p last states), the equation is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.
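
The AR(p) equation can be illustrated with a short NumPy sketch. The function names and the example coefficients are hypothetical and chosen only for illustration; Gaussian noise is an assumption made for the simulation.

```python
import numpy as np

def ar_predict(history, phi, c=0.0):
    """One-step AR(p) forecast: c + sum_i phi[i-1] * X_{t-i}."""
    p = len(phi)
    # Take the most recent p values, newest first, so that lags[i-1] == X_{t-i}.
    lags = np.asarray(history[-p:], dtype=float)[::-1]
    return c + float(np.dot(phi, lags))

def ar_simulate(phi, c, n_steps, noise_std, rng):
    """Simulate an AR(p) series started from zeros, with Gaussian noise."""
    p = len(phi)
    x = np.zeros(n_steps + p)
    for t in range(p, n_steps + p):
        x[t] = ar_predict(x[:t], phi, c) + rng.normal(0.0, noise_std)
    return x[p:]

# Forecast from the history X_{t-3}=1, X_{t-2}=2, X_{t-1}=3 with an AR(2) model.
forecast = ar_predict([1.0, 2.0, 3.0], phi=[0.5, 0.25], c=1.0)
series = ar_simulate(phi=[0.5, 0.25], c=0.0, n_steps=200, noise_std=0.1,
                     rng=np.random.default_rng(0))
```

In the forecast example only the two most recent values are used, giving 1 + 0.5·3 + 0.25·2 = 3.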

===Gating and weighting mechanisms===
Gating mechanisms for neural networks are able to overcome the problem of vanishing gradients, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid non-linearity that controls the amount of output passed to the next layer. Different compositions of functions of the type described above have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks. It is most closely related to the softmax gating used in MuFuRU (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRU) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.

This idea is also successfully used in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. One difference is that these functions are modelled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
The authors state five main motivations in the paper:
#The forecasting problem has been approached almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics tends to focus on explaining variables rather than improving out-of-sample prediction power. As a result, econometric models tend to 'over-fit' on financial time series: their parameters are unstable and their out-of-sample performance is poor.
#It is difficult for learning algorithms to deal with time series whose observations are made at irregular intervals. Although Gaussian processes provide a useful theoretical framework that can handle asynchronous data, they are not well suited to financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For higher-order AR series with more past observations, the conditional expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general do not admit a closed-form expression.
#In practice, the dimensions of a multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents the separate dimensions as a single series, with dimension and duration indicators as additional features (Figure 2(b)).
#Consider a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One might expect an LSTM to memorize the input values at each step and weight them at the output according to the durations, but this approach may lead to an imbalance between the need for memory and the need for linearity. The weights assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while the past observations themselves might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to many more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional expectation of future values of a subset of elements of <math>x_n</math>:
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m},</math></div>
that is, the estimate is the sum of the columns of the matrix in brackets. Here:
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network, i.e. it is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.
#<math>\sigma</math> is a normalized activation function applied independently to each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>,
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
#<math>\otimes</math> is element-wise matrix multiplication (also known as the Hadamard product).
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix A.

Substituting the definition of <math>F</math> and writing the sum column by column, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S_{\cdot,m}(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network (SOCNN); <math>\text{off}</math> and <math>S</math> in the equation correspond to "Offset" and "Significance" in the name, respectively.
Figure 3 shows the scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> separates the temporal dependence (captured by the weights <math>W_m</math>) from the data-dependent parts of the model. <math>S</math>, which represents the local significance of observations, is determined by its filters, which capture local dependencies and are independent of the relative position in time, and the predictors <math>\text{off}(x_{n-m})</math> are completely independent of position in time. Each past observation provides an adjusted single regressor for the target variable through the offset network. Because in asynchronous sampling consecutive values of x come from different signals and might be heterogeneous, the adjustment made by the offset network is important. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.

===Relation to asynchronous data===
One common problem with time series is that the durations between consecutive observations vary. The paper states two ways to address this problem:
#Data preprocessing: aligning the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach tends to lose information or to enlarge the dataset and increase model complexity.
#Adding features: treating the duration or time of the observations as additional features. This is the core of SOCNN and is shown in Figure 2(b).
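The second option can be sketched as follows. This is only an illustration of the data layout described in Figure 2(b); the function name <code>to_single_series</code>, the input tuple format and the choice of a zero duration for the first observation are assumptions, not details taken from the paper.

```python
def to_single_series(observations, num_sources):
    """Merge observations from several sources into one series.

    observations: list of (time, source_index, value) tuples in any order.
    Each output row is [value, one-hot source indicator..., duration],
    i.e. num_sources + 2 entries per observation.
    """
    rows = []
    prev_time = None
    for time, source, value in sorted(observations):  # sort by time
        indicator = [1.0 if k == source else 0.0 for k in range(num_sources)]
        # Assumption: the first observation gets a duration of zero.
        duration = 0.0 if prev_time is None else time - prev_time
        rows.append([value] + indicator + [duration])
        prev_time = time
    return rows

# Two sources quoting the same signal at irregular times.
obs = [(0.0, 0, 10.0), (1.5, 1, 10.4), (2.0, 0, 10.1)]
rows = to_single_series(obs, num_sources=2)
```

No observation is duplicated or interpolated: the series keeps exactly one row per original observation, and the source identity and the elapsed time are carried as extra features.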

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason an auxiliary loss function is used, equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.
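As a worked example, the two loss terms can be computed directly from their definitions. The sketch below uses plain Python lists; the function names and the toy numbers are illustrative assumptions.

```python
def squared_error(a, b):
    """Squared Euclidean distance ||a - b||^2 between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def aux_loss(regressors, y):
    """Mean squared error of the M intermediate predictions.

    regressors: list of M vectors, each off(x_{n-m}) + x_{n-m}^I.
    """
    return sum(squared_error(r, y) for r in regressors) / len(regressors)

def total_loss(y_hat, regressors, y, alpha):
    """L2 error of the final prediction plus alpha times the auxiliary loss."""
    return squared_error(y_hat, y) + alpha * aux_loss(regressors, y)

# Toy numbers: one target feature, M = 3 intermediate predictions.
y = [2.0]
regressors = [[1.0], [2.0], [4.0]]
y_hat = [2.5]
aux = aux_loss(regressors, y)                        # (1 + 0 + 4) / 3
loss = total_loss(y_hat, regressors, y, alpha=0.1)   # 0.25 + 0.1 * aux
```

Setting <code>alpha=0</code> recovers the plain L2 loss, which corresponds to training without the auxiliary term.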

=Experiments=
The paper evaluated the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance was compared with a simple CNN, single- and multi-layer LSTMs, and a 25-layer ResNet. Apart from the evaluation of the SOCNN architecture, the paper also discussed the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>: a synchronous and an asynchronous series for each value of <math>K</math>. Note that a series with <math>K</math> sources is <math>(K+1)</math>-dimensional in the synchronous case and <math>(K+2)</math>-dimensional in the asynchronous case. The base series in all processes was a stationary AR(10) series. Although that series has a true order of 10, in the experimental setting the input data included the past 60 observations. The rationale behind that is twofold: not only is the data observed at irregular random times, but also in real-life problems the order of the model is unknown.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that each observation contains the value of only one of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

[[File:async.png | 520px|center|]]

==Training details==
They applied a grid search over some hyperparameters in order to assess the importance of the model's components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they used 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.

Table 1 presents the configuration of network hyperparameters used in comparison

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio of 3 to 1. The remaining 20% of data was used as a test set.
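The split described above can be sketched as follows. The function name <code>split_indices</code>, the fixed seed and the use of integer indices are assumptions made for illustration; the paper's own code may organize the split differently.

```python
import random

def split_indices(num_steps, seed=0):
    """Split timestep indices into train, validation and test sets.

    The first 80% of timesteps are shuffled and divided 3:1 into
    training and validation indices; the last 20% form the test set.
    """
    cutoff = (4 * num_steps) // 5          # end of the first 80%
    head = list(range(cutoff))
    random.Random(seed).shuffle(head)      # random sampling within the head
    num_train = (3 * cutoff) // 4          # 3:1 train/validation ratio
    train = sorted(head[:num_train])
    valid = sorted(head[num_train:])
    test = list(range(cutoff, num_steps))  # strictly later in time
    return train, valid, test

train, valid, test = split_indices(1000)
```

With this scheme the test set always lies after the training and validation data in time, while training and validation samples are interleaved within the first 80% of the series.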

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments with TensorFlow and Keras and used different GPUs to optimize the model for different datasets.

==Results==
Table 2 shows the results for all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other networks on the asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results as LSTM. Phased LSTM and ResNet performed very poorly on the artificial asynchronous dataset and the quotes dataset respectively. Notice that having more than one layer in the offset network had a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous dataset; see Table 3. However, for other datasets, its impact was negligible. This makes it hard to justify the introduction of the auxiliary loss function <math>L^{aux}</math>.

Also, relying on artificial datasets for experimental results is not good practice in this paper. This is essentially an application paper, and such datasets make the results hard to reproduce and cannot support the performance claims of the model.

[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, the robustness of the proposed SOCNN model was tested by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
From Figure 6, the purple lines and green lines seem to stay at the same position in the training and testing process. SOCNN and single-layer LSTM are the most robust and the least prone to overfitting compared to the other networks.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. This new architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

The SOCNN can be extended further by adding intermediate weighting layers of the same type to the network structure. Another possible extension, which needs further empirical study, is to consider more than just <math>1 \times 1</math> convolutional kernels in the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.

=Critiques=
#The paper is most likely an application paper, and the proposed architecture shows improved performance over baselines on asynchronous time series.
#The quote data cannot be accessed because they are proprietary. Also, only two datasets are available.
#The 'Significance' network was described as critical to the model in the paper, but the authors did not show how the performance of SOCNN depends on the significance network.
#The transformation of the original data into asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that the train and test data were split is unclear. This could be important in the case of the financial dataset.
#Although the auxiliary loss function was mentioned as an important part, its advantages were not made clear in the paper. It would be better if the paper described its effectiveness in more detail.
#The paper did not state clearly whether the model training was done on a rolling basis for time series forecasting.
#The model robustness analysis in Section 5 uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing only against neural network models such as CNNs, which are known to work badly on time series data, it would be much better if the authors compared against well-known econometric time series models such as GARCH and VAR.
#The paper does not specify in detail how the training and test sets are separated, which is quite important in time-series problems. Moreover, rolling or online learning schemes should be used in the comparison, since they are standard in time-series prediction tasks.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances In Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-30T03:09:28Z

J385chen:

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series with multivariate and noisy signals, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained from different sources (e.g. financial news, an investment bank, a financial analyst etc.) asynchronously. Each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than an investment bank with fewer clients, which means the significance of each past observation may depend on other factors that change in time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance suggests combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The model is inspired by the gating mechanism, which has been successful in RNNs and Highway Networks.

Time series forecasting is focused on modeling the predictors of future values of time series given their past. As in many cases the relationship between past and future observations is not deterministic, this amounts to expressing the conditional probability distribution as a function of the past observations:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
This forecasting problem has been approached almost independently by econometrics and machine learning communities. In this paper, the authors focus on modeling the predictors of future values of time series given their past values.

The reasons that financial time series are particularly challenging:
* Low signal-to-noise ratio and heavy-tailed distributions.
* Being observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments in time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial datasets remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that their model which combines linear models with deep learning models could perform better than just DL models like CNN, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
From recent proceedings of the main machine learning venues, i.e. ICML, NIPS, AISTATS and UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though the econometric and machine learning approaches are still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to obtain such a combined model.

Although deep neural networks have been applied in many fields with satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in limit order books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, on a household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a p-th order model (relating the current state to the p last states), the equation is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

with parameters/coefficients <math>\varphi_i</math>, constant <math>c</math>, and noise <math>\varepsilon_t</math>. This can be extended to vector form to create the VAR model mentioned in the paper.
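To make the recursion concrete, the following sketch simulates an AR(p) series directly from the equation above. The function name <code>simulate_ar</code>, the zero initial history and the Gaussian noise are assumptions chosen for illustration.

```python
import random

def simulate_ar(phi, c, sigma, num_steps, seed=0):
    """Simulate X_t = c + sum_i phi[i-1] * X_{t-i} + eps_t.

    phi   -- list of p autoregressive coefficients
    c     -- constant term
    sigma -- standard deviation of the Gaussian noise eps_t (assumed)
    """
    rng = random.Random(seed)
    p = len(phi)
    series = [0.0] * p  # assumption: start from an all-zero history
    for _ in range(num_steps):
        past = series[-1:-p - 1:-1]  # X_{t-1}, X_{t-2}, ..., X_{t-p}
        value = c + sum(f * x for f, x in zip(phi, past))
        value += rng.gauss(0.0, sigma)
        series.append(value)
    return series[p:]  # drop the artificial initial history

# Noise-free AR(1): each value is 1 + 0.5 * previous value.
noiseless = simulate_ar([0.5], c=1.0, sigma=0.0, num_steps=3)
# Noisy AR(2) series of length 100.
noisy = simulate_ar([0.6, -0.2], c=0.0, sigma=1.0, num_steps=100)
```

With the noise switched off, the AR(1) example produces 1.0, 1.5, 1.75 and converges towards <math>c/(1-\varphi_1) = 2</math>.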

===Gating and weighting mechanisms===
Gating mechanisms for neural networks can overcome the problem of vanishing gradients and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid non-linearity that controls the amount of output passed to the next layer. Different compositions of functions of the type described above have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks. It is most closely related to the softmax gating used in MuFuRu (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRu) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.
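The softmax gating formula can be illustrated with a small sketch. It assumes that the softmax is taken across the <math>L</math> candidates separately for each output coordinate, and it takes the candidate outputs and scores as given numbers; the names <code>softmax</code> and <code>softmax_gate</code> are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_gate(candidates, scores):
    """Combine L candidate output vectors with softmax weights.

    candidates -- L vectors f^l(x)
    scores     -- L vectors p-hat^l(x) of the same length
    Assumption: the softmax runs across the L candidates,
    independently for each coordinate j.
    """
    dim = len(candidates[0])
    out = []
    for j in range(dim):
        weights = softmax([s[j] for s in scores])
        out.append(sum(w * c[j] for w, c in zip(weights, candidates)))
    return out

# Two candidates with equal scores: the gate returns their average.
candidates = [[1.0, 0.0], [3.0, 4.0]]
scores = [[0.0, 0.0], [0.0, 0.0]]
gated = softmax_gate(candidates, scores)
```

Raising the score of one candidate in a given coordinate shifts that coordinate of the output towards that candidate, which is the sense in which the gate "selects" among the candidate outputs.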

This idea is also successfully used in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. The difference is that these functions are modelled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
The authors state five main motivations in the paper:
#The forecasting problem in this paper has been approached almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics tends to focus on explaining variables rather than improving out-of-sample prediction power. Such models tend to 'over-fit' on financial time series: their parameters are unstable and they perform poorly on out-of-sample prediction.
#It is difficult for learning algorithms to deal with time series data where the observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that is able to handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For AR time series of higher order with more past observations, the expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general may not allow a closed-form expression.
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents separate dimensions as a single one, with dimension and duration indicators as additional features (Figure 2(b)).
#Consider a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that an LSTM would memorize the input values in each step and weight them at the output according to the duration, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional future values of a subset of elements of <math>x_n</math>:
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\tilde{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m},</math></div>
The estimate is the summation of the columns of the matrix in brackets. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network which is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.

#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as Hadamard matrix multiplication).
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix <math>A</math>.

Since <math>\sum_{m=1}^M W_{\cdot,m}=W\cdot(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S_{\cdot,m}=S\cdot(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S(\textbf{x}_n^{-M}))_{\cdot,m}</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network (SOCNN); <math>\text{off}</math> and <math>S</math> in the equation correspond to "Offset" and "Significance" in the name, respectively.
Figure 3 shows the scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\tilde{y}_n</math> separates three kinds of dependence. The temporal dependence is captured by the weights <math>W_m</math>. The local significance of observations is represented by <math>S</math>, whose filters capture local dependencies and are independent of the relative position in time. The predictors <math>\text{off}(x_{n-m})</math> are completely independent of the position in time. Each past observation provides an adjusted single regressor for the target variable through the offset network. This adjustment is important because, under asynchronous sampling, consecutive values of <math>x</math> come from different signals and might be heterogeneous. The significance network, in turn, provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.

===Relation to asynchronous data===
One common problem with time series is that the durations between consecutive observations vary. The paper states two ways to address this problem:
#Data preprocessing: aligning the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach tends to lose information or to enlarge the dataset and increase model complexity.
#Adding features: treating the duration or time of the observations as additional features. This is the core of SOCNN and is shown in Figure 2(b).

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason an auxiliary loss function is used, equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.

=Experiments=
The paper evaluated the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance was compared with a simple CNN, single- and multi-layer LSTMs, and a 25-layer ResNet. Apart from the evaluation of the SOCNN architecture, the paper also discussed the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>: a synchronous and an asynchronous series for each value of <math>K</math>. Note that a series with <math>K</math> sources is <math>(K+1)</math>-dimensional in the synchronous case and <math>(K+2)</math>-dimensional in the asynchronous case. The base series in all processes was a stationary AR(10) series. Although that series has a true order of 10, in the experimental setting the input data included the past 60 observations. The rationale behind that is twofold: not only is the data observed at irregular random times, but also in real-life problems the order of the model is unknown.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that each observation contains the value of only one of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

[[File:async.png | 520px|center|]]

==Training details==
They applied grid search over some hyperparameters in order to assess the significance of the model's components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they use 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.
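
The LeakyReLU activation defined above can be sketched in a few lines. This is only an illustrative NumPy version, not the authors' Keras code; the function name <code>leaky_relu</code> and the <code>negative_slope</code> argument are names chosen here, with the slope 0.1 taken from the formula.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.1):
    """LeakyReLU as in the formula above: x if x >= 0, negative_slope * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, negative_slope * x)

# Negative inputs are scaled down instead of being zeroed out as in plain ReLU.
print(leaky_relu([-2.0, 0.0, 3.0]))
```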

Table 1 presents the configuration of the network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio of 3 to 1. The remaining 20% of data was used as a test set.
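
A minimal sketch of this split, assuming the series is indexed by integer timesteps. The function name <code>chronological_split</code>, its arguments and the fixed seed are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def chronological_split(n_timesteps, holdout_frac=0.2, train_to_val=3, seed=0):
    """Return (train, val, test) index arrays.

    The last `holdout_frac` of the timesteps forms the test set; the earlier
    timesteps are shuffled and divided into train/validation with ratio
    `train_to_val` : 1.
    """
    rng = np.random.default_rng(seed)
    n_dev = int(round(n_timesteps * (1 - holdout_frac)))  # first 80% of the series
    dev_idx = rng.permutation(n_dev)                      # random sampling within it
    n_train = n_dev * train_to_val // (train_to_val + 1)  # 3 parts out of 4
    train_idx = np.sort(dev_idx[:n_train])
    val_idx = np.sort(dev_idx[n_train:])
    test_idx = np.arange(n_dev, n_timesteps)              # remaining 20%, strictly later
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = chronological_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the test indices all come after the training and validation indices, no future observation of the test period is used for fitting, although training and validation samples are interleaved in time.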

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments with TensorFlow and Keras and used different GPUs to optimize the model for different datasets.

==Results==
Table 2 shows the results on all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other networks on the asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results. Phased LSTM and ResNet performed very poorly on the artificial asynchronous datasets and the quotes dataset respectively. Notice that having more than one layer in the offset network has a negative impact on results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous datasets (see Table 3). However, for other datasets, its impact was negligible. This makes it hard to justify the introduction of the auxiliary loss function <math>L^{aux}</math>.

Also, relying on artificial datasets for experimental results is not good practice for this paper. It is essentially an application paper, and such datasets make the results hard to reproduce and cannot support the performance claims made for the model.

[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, to test the robustness of the proposed SOCNN model, noise terms were added to the Asynchronous 16 dataset and the performance of the networks was compared. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
From Figure 6, the purple lines and green lines seem to stay at the same position in the training and testing process. SOCNN and single-layer LSTM are the most robust and least prone to overfitting compared to the other networks.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. This new architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

The SOCNN can be extended further by adding intermediate weighting layers of the same type to the network structure. Another possible extension, which needs further empirical study, is to use kernels other than <math>1 \times 1</math> convolutions in the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic process) regression.

=Critiques=
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.
#The quote data cannot be accessed as they are proprietary. Also, only two datasets are available.
#The 'Significance' network was described as critical to the model in the paper, but the authors did not show how the performance of SOCNN depends on the significance network.
#The transform of the original data to asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that train and test data were split is unclear. This could be important in the case of the financial data set.
#Although the auxiliary loss function was mentioned as an important part, its advantages were not made clear in the paper. It would be better if the paper described its effectiveness in more detail.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing only against neural network models such as CNNs, which are known to work poorly on time series data, it would be much better if the authors had also compared against well-known econometric time series models such as GARCH and VAR.
#The paper does not specify in detail how the training and test sets are separated, which is quite important in time-series problems. Moreover, rolling or online learning schemes should be used in the comparison, since they are standard in time-series prediction tasks.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using Gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

ShakeDrop Regularization

2018-11-30T03:00:31Z

J385chen:

=Introduction=
Current state-of-the-art techniques for object classification are deep neural networks based on the residual block, first published by He et al. (2016). This technique has been the foundation of several improved networks, including Wide ResNet (Zagoruyko & Komodakis, 2016), PyramidNet (Han et al., 2017) and ResNeXt (Xie et al., 2017). They have been further improved by regularization methods, such as Stochastic Depth (ResDrop) (Huang et al., 2016) and Shake-Shake (Gastaldi, 2017), which can mitigate problems such as vanishing gradients. Shake-Shake applied to ResNeXt has achieved one of the lowest error rates on the CIFAR-10 and CIFAR-100 datasets. However, it is only applicable to multi-branch architectures and is not memory efficient since it requires two branches of residual blocks to apply. Note that the authors of Shake-Shake reject the claim of memory inefficiency: they argue that having <math>2\times</math> the branches does not mean Shake-Shake needs <math>2\times</math> the memory, as it can use less memory to achieve the same performance.

To address this problem, the paper proposes ShakeDrop regularization, which realizes a disturbance similar to that of Shake-Shake on a single residual block. ShakeDrop disturbs learning more strongly by multiplying the output of a convolutional layer by a factor that may even be negative in the forward training pass. In addition, a factor different from the one in the forward pass is applied in the backward training pass. As a byproduct, however, the learning process becomes unstable, so the authors use ResDrop to stabilize it. This paper seeks to formulate a general expansion of Shake-Shake that can be applied to any residual-block-based network.

=Existing Methods=

'''Deep Approaches'''

'''ResNet''' was the first use of residual blocks, a foundational feature in many modern state-of-the-art convolutional neural networks. A residual block can be formulated as <math>G(x) = x + F(x)</math> where <math>x</math> and <math>G(x)</math> are the input and output of the residual block, and <math>F(x)</math> is the output of the residual branch of the residual block. A residual block typically performs a convolution operation and then passes the result plus its input on to the next block.

Intuition behind Residual blocks:
If the identity mapping is optimal, it is easier to push the residual to zero (<math>F(x) = 0</math>) than to fit an identity mapping (output equal to input) with a stack of non-linear layers. In simple terms, it is much easier for a stack of non-linear CNN layers to arrive at a solution like <math>F(x) = 0</math> than at <math>F(x) = x</math>. This function <math>F(x)</math> is what the authors call the residual function ([https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624 Reference]).

[[File:ResidualBlock.png|580px|centre|thumb|An example of a simple residual block from Deep Residual Learning for Image Recognition by He et al., 2016]]

ResNet is constructed out of a large number of these residual blocks sequentially stacked. It is interesting to note that having too many layers can cause overfitting, as pointed out by He et al. (2016) with the high error rates for the 1,202-layer ResNet on CIFAR datasets. Another paper (Veit et al., 2016) empirically showed that the cause of the high error rates can be mostly attributed to specific residual blocks whose channels increase greatly.

'''PyramidNet''' is an important iteration that built on ResNet and WideResNet by gradually increasing channels on each residual block. The residual block is similar to those used in ResNet. It has been used to generate some of the first successful convolution neural networks with very large depth, at 272 layers. Amongst unmodified residual network architectures, it performs the best on the CIFAR datasets.

[[File:ResidualBlockComparison.png|980px|centre|thumb|A simple illustration of different residual blocks from Deep Pyramidal Residual Networks by Han et al., 2017. The width of a block reflects the number of channels used in that layer.]]

'''Non-Deep Approaches'''

'''Wide ResNet''' modified ResNet by increasing channels in each layer, having a wider and shallower structure. Similarly to PyramidNet, this architecture avoids some of the pitfalls in the original formulation of ResNet.

'''ResNeXt''' achieved performance beyond that of Wide ResNet with only a small increase in the number of parameters. It can be formulated as <math>G(x) = x + F_1(x)+F_2(x)</math>. In this case, <math>F_1(x)</math> and <math>F_2(x)</math> are the outputs of two paired convolution operations in a single residual block. The number of branches is not limited to 2, and will control the result of this network.

[[File:SimplifiedResNeXt.png|600px|centre|thumb|Simplified ResNeXt Convolution Block. Yamada et al., 2018]]

'''Regularization Methods For Residual Blocks'''

'''Stochastic Depth''' works by randomly dropping paths in the residual blocks. On the <math>l^{th}</math> residual block the Stochastic Depth process is given as <math>G(x)=x+b_lF(x)</math> where <math>b_l \in \{0,1\}</math> is a Bernoulli random variable that equals 1 with probability <math>p_l</math>. Unlike sequential networks, there are many paths from the input to the output in these networks. By dropping some of the connections, the network is forced to flow through different paths to get the final deep layer representation. In a way it is similar to dropout, but for paths in multi-path networks. Using a constant value for <math>p_l</math> didn't work well, so instead a linear decay rule <math>p_l = 1 - \frac{l}{L}(1-p_L)</math> was used. In this equation, <math>L</math> is the number of layers, and <math>p_L</math> is the survival probability of the last block, which is chosen as a hyperparameter. Essentially, the probability that a residual branch is dropped increases linearly with its depth in the network.
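
The linear decay rule can be illustrated with a short sketch. The helper names <code>survival_probabilities</code> and <code>stochastic_depth_block</code>, the value <code>p_L = 0.5</code>, and the toy residual function are assumptions made here for illustration; only the training-time behaviour <math>G(x)=x+b_lF(x)</math> is shown.

```python
import numpy as np

def survival_probabilities(num_blocks, p_L=0.5):
    """Linear decay rule p_l = 1 - (l / L) * (1 - p_L) for l = 1, ..., L."""
    l = np.arange(1, num_blocks + 1)
    return 1.0 - (l / num_blocks) * (1.0 - p_L)

def stochastic_depth_block(x, residual_fn, p_l, rng):
    """Training-time forward pass G(x) = x + b_l * F(x), with b_l ~ Bernoulli(p_l)."""
    b_l = rng.random() < p_l
    return x + residual_fn(x) if b_l else x

probs = survival_probabilities(4, p_L=0.5)
print(probs)  # early blocks are kept almost always, the last one with probability p_L

rng = np.random.default_rng(0)
x = np.ones(3)
for p_l in probs:
    x = stochastic_depth_block(x, lambda v: 0.1 * v, p_l, rng)  # toy residual branch
```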

'''Shake-Shake''' is a regularization method that specifically improves the ResNeXt (multiple residual connections) architecture. It is given as <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math>, where <math>\alpha \in [0,1]</math> is a random coefficient. Essentially, the two parallel residual branches are blended with random weights in the forward direction, so that each branch is randomly attenuated. This is similar to stochastic depth regularization, but a residual path always exists.
Moreover, on the backward pass a similar random variable <math>\beta</math> is used to independently reweight the gradient flowing into each branch. This has the effect of adding noise to the gradient update process and improves performance over the vanilla ResNeXt network.
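
A minimal NumPy sketch of this forward and backward behaviour, assuming the two branch outputs <math>F_1(x)</math> and <math>F_2(x)</math> have already been computed. The function names are illustrative, and only the gradients passed into the branch outputs and the skip connection are modelled, not the gradients inside the branches.

```python
import numpy as np

def shake_shake_forward(x, f1, f2, rng):
    """G(x) = x + alpha * F1(x) + (1 - alpha) * F2(x), with alpha ~ U[0, 1]."""
    alpha = rng.uniform(0.0, 1.0)
    return x + alpha * f1 + (1.0 - alpha) * f2, alpha

def shake_shake_backward(grad_out, rng):
    """Reweight the incoming gradient with an independent beta ~ U[0, 1]."""
    beta = rng.uniform(0.0, 1.0)
    grad_f1 = beta * grad_out            # gradient sent into branch 1
    grad_f2 = (1.0 - beta) * grad_out    # gradient sent into branch 2
    grad_skip = grad_out                 # the identity path is left untouched
    return grad_skip, grad_f1, grad_f2, beta

rng = np.random.default_rng(0)
x, f1, f2 = np.ones(4), np.full(4, 2.0), np.full(4, -1.0)
out, alpha = shake_shake_forward(x, f1, f2, rng)
grad_skip, grad_f1, grad_f2, beta = shake_shake_backward(np.ones(4), rng)
```

Because <math>\beta</math> is drawn independently of <math>\alpha</math>, the gradient seen by each branch generally does not match the weight it had in the forward pass, which is the source of the gradient noise.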

[[File:Paper 32.jpg|600px|centre|thumb| Shake-Shake (ResNeXt + Shake-Shake) (Gastaldi, 2017), in which some processing layers omitted for conciseness.]]

=Proposed Method=
The authors give an intuitive interpretation of the forward pass of Shake-Shake regularization. To the best of their knowledge, none had been given before, whereas the phenomenon in the backward pass was experimentally investigated by Gastaldi (2017). In the forward pass, Shake-Shake interpolates the outputs of two residual branches with a random variable <math>\alpha</math> that controls the degree of interpolation. Since DeVries & Taylor (2017a) demonstrated that interpolating two data points in the feature space can synthesize reasonable augmented data, the interpolation of the two residual branches of Shake-Shake in the forward pass can be interpreted as synthesizing data. The use of a random variable <math>\alpha</math> generates many different augmented data points. In the backward pass, on the other hand, a different random variable <math>\beta</math> is used to disturb learning so that the network can keep learning for a long time. Gastaldi (2017) demonstrated how the difference between <math>\alpha</math> and <math>\beta</math> affects performance.

The regularization mechanism of Shake-Shake relies on two or more residual branches, so it can be applied only to multi-branch network architectures. In addition, 2-branch network architectures consume more memory than 1-branch network architectures. One may think that the number of learnable parameters of ResNeXt can be kept the same in 1-branch and 2-branch network architectures by controlling its cardinality and the number of channels (filters). For example, a 1-branch network (e.g., ResNeXt 1-64d) and its corresponding 2-branch network (e.g., ResNeXt 2-40d) have almost the same number of learnable parameters. Even so, the 2-branch network consumes more memory due to overhead such as keeping the inputs of the residual blocks. Comparing ResNeXt 1-64d and 2-40d, the latter requires 8% more memory than the former in theory (for one layer) and 11% more in measured values (for 152 layers).

This paper seeks to generalize the method proposed in Shake-Shake so that it can be applied to any residual network structure. The initial formulation, 1-branch shake, is <math>G(x) = x + \alpha F(x)</math>. In this case, <math>\alpha</math> is a coefficient that disturbs the forward pass, but it is not necessarily constrained to lie in [0,1]. Another corresponding coefficient <math>\beta</math> is used in the backward pass. Applying this simple adaptation of Shake-Shake to a 110-layer version of PyramidNet with <math>\alpha \in [0,1]</math> and <math>\beta \in [0,1]</math> performs abysmally, with an error rate of 77.99%.

This failure is a result of the setup causing too much perturbation. A trick is needed to promote learning with large perturbations, to preserve the regularization effect. The idea of the authors is to borrow from ResDrop and combine that with Shake-Shake. This works by randomly deciding whether to apply 1-branch shake. This creates in effect two networks, the original network without a regularization component, and a regularized network. When mixing up two networks, we expected the following effects: When the non regularized network is selected, learning is promoted; when the perturbed network is selected, learning is disturbed. Achieving good performance requires a balance between the two.

'''ShakeDrop''' is given as

<div align="center">
<math>G(x) = x + (b_l + \alpha - b_l \alpha)F(x)</math>,
</div>

where <math>b_l</math> is a Bernoulli random variable following the linear decay rule used in Stochastic Depth. An alternative presentation is

<div align="center">
<math>
G(x) = \begin{cases}
x + F(x) ~~ \text{if } b_l = 1 \\
x + \alpha F(x) ~~ \text{otherwise}
\end{cases}
</math>
</div>

If <math>b_l = 1</math> then ShakeDrop is equivalent to the original network, otherwise it is the network + 1-branch Shake. The authors also found that the linear decay rule of ResDrop works well, compared with the uniform rule. Regardless of the value of <math>\beta</math> on the backwards pass, network weights will be updated.
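
The training-time forward rule can be sketched as follows. This is an illustrative NumPy version that assumes the residual branch output <math>F(x)</math> is given; the function name and arguments are chosen here and are not the authors' implementation. The backward pass, which would use <math>\beta</math> in place of <math>\alpha</math>, is not modelled.

```python
import numpy as np

def shakedrop_forward(x, f_x, p_l, rng, alpha_range=(-1.0, 1.0)):
    """G(x) = x + (b_l + alpha - b_l * alpha) * F(x).

    b_l ~ Bernoulli(p_l) selects the unperturbed network (b_l = 1) or the
    1-branch shake (b_l = 0); alpha is drawn uniformly from alpha_range.
    """
    b_l = float(rng.random() < p_l)
    alpha = rng.uniform(*alpha_range)
    coeff = b_l + alpha - b_l * alpha   # equals 1 if b_l = 1, alpha otherwise
    return x + coeff * f_x, b_l, alpha

rng = np.random.default_rng(0)
x, f_x = np.ones(4), np.full(4, 0.5)
out, b_l, alpha = shakedrop_forward(x, f_x, p_l=0.75, rng=rng)
expected = x + f_x if b_l == 1.0 else x + alpha * f_x
print(np.allclose(out, expected))
```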

=Experiments=

'''Parameter Search'''

The authors' experiments began with a hyperparameter search utilizing ShakeDrop on pyramidal networks. The PyramidNet used was made up of a total of 110 layers, which included a convolutional layer and a final fully connected layer. It had 54 additive pyramidal residual blocks, and the final residual block had 286 channels. The results of this search are presented below.

[[File:ShakeDropHyperParameterSearch.png|600px|centre|thumb|Average Top-1 errors (%) of “PyramidNet + ShakeDrop” with several ranges of parameters of 4 runs at the final (300th) epoch on CIFAR-100 dataset in the “Batch” level. In some settings, it is equivalent to PyramidNet and PyramidDrop. Borrowed from ShakeDrop Regularization by Yamada et al., 2018.]]

The settings used throughout the rest of the experiments are then <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>. Cases H and F outperform PyramidNet, suggesting that the strong perturbations imposed by ShakeDrop are functioning as intended. However, fully applying the perturbations in the backward pass appears to destabilize the network, resulting in performance that is worse than standard PyramidNet.

[[File:ParameterUpdateShakeDrop.png|400px|centre]]

Following this initial parameter decision, the authors tested 4 different levels at which the scaling coefficients are drawn: "Batch" (same scaling coefficients for all images in a mini-batch for each residual block), "Image" (same scaling coefficients for each image for each residual block), "Channel" (same scaling coefficients for each channel for each residual block), and "Pixel" (same scaling coefficients for each element for each residual block). While Pixel was the best in terms of error rate, it is not very memory efficient, so Image was selected as it had the second best performance without the memory drawback.
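
One way to picture the four levels is through the shape of the coefficient tensor that is broadcast against a residual output of shape (N, C, H, W). The sketch below is an assumption about how such sampling could be written, not the authors' code; in particular, it assumes that "Channel" means one coefficient per channel of each image.

```python
import numpy as np

def sample_alpha(shape, level, rng, low=-1.0, high=1.0):
    """Draw scaling coefficients that broadcast against an (N, C, H, W) tensor."""
    n, c, h, w = shape
    level_shapes = {
        "batch": (1, 1, 1, 1),    # one coefficient shared by the whole mini-batch
        "image": (n, 1, 1, 1),    # one coefficient per image
        "channel": (n, c, 1, 1),  # one coefficient per channel of each image
        "pixel": (n, c, h, w),    # one coefficient per element
    }
    return rng.uniform(low, high, size=level_shapes[level])

rng = np.random.default_rng(0)
f_x = np.ones((8, 16, 4, 4))  # toy residual-branch output
for level in ("batch", "image", "channel", "pixel"):
    alpha = sample_alpha(f_x.shape, level, rng)
    print(level, alpha.shape, (alpha * f_x).shape)
```

The number of stored coefficients grows from 1 to N·C·H·W across the levels, which is consistent with the memory concern raised for "Pixel".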

'''Comparison with Regularization Methods'''

For these experiments, a few modifications were made to assist with training. For ResNeXt, the EraseReLU formulation is used so that each residual block ends in batch normalization. Wide ResNet is also compared both with and without batch normalization. Batch normalization keeps the outputs of residual blocks in a certain range, as otherwise <math>\alpha</math> and <math>\beta</math> could cause perturbations that are too large, causing learning to diverge. There is also a comparison of ResDrop/ShakeDrop Type A (where the regularization unit is inserted before the add unit for a residual branch) and Type B (where the regularization unit is inserted after the add unit for a residual branch).

These experiments are performed on the CIFAR-100 dataset.

[[File:ShakeDropArchitectureComparison1.png|800px|centre|thumb|]]

[[File:ShakeDropArchitectureComparison2.png|800px|centre|thumb|]]

[[File:ShakeDropArchitectureComparison3.png|800px|centre|thumb|]]

For a final round of testing, the training setup was modified to incorporate other techniques used in state-of-the-art methods. For most of the tests, the learning rate for the 300-epoch version started at 0.1 and was multiplied by 0.1 at 1/2 and 3/4 of the way through training. The alternative was cosine annealing, based on the presentation by Loshchilov and Hutter in their paper SGDR: Stochastic Gradient Descent with Warm Restarts. This is indicated in the Cos column, with a check indicating cosine annealing.
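
The two learning-rate schedules can be compared with a small sketch. The function names are illustrative, and the cosine version shows a single annealing cycle down to an assumed minimum of 0 without warm restarts, which may differ from the exact configuration used in the experiments.

```python
import math

def step_lr(epoch, total_epochs=300, base_lr=0.1, gamma=0.1):
    """Start at base_lr and multiply by gamma at 1/2 and 3/4 of training."""
    lr = base_lr
    if epoch >= total_epochs // 2:
        lr *= gamma
    if epoch >= (3 * total_epochs) // 4:
        lr *= gamma
    return lr

def cosine_lr(epoch, total_epochs=300, base_lr=0.1, min_lr=0.0):
    """Cosine annealing: min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * t / T))."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / total_epochs))

for epoch in (0, 149, 150, 225, 299):
    print(epoch, step_lr(epoch), round(cosine_lr(epoch), 5))
```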

[[File:CosineAnnealing.png|400px|centre|thumb|]]

The Reg column indicates the regularization method used: none, ResDrop (RD), Shake-Shake (SS), or ShakeDrop (SD). Finally, the Fil column indicates the type of data augmentation used: none, Cutout (CO) (DeVries & Taylor, 2017b), or Random Erasing (RE) (Zhong et al., 2017).

[[File:ShakeDropComparison.png|800px|centre|thumb|Top-1 Errors (%) at final epoch on CIFAR-10/100 datasets]]

'''State-of-the-Art Comparisons'''

A direct comparison with state of the art methods is favorable for this new method.

# Fair comparison of ResNeXt + Shake-Shake with PyramidNet + ShakeDrop gives an improvement of 0.19% on CIFAR-10 and 1.86% on CIFAR-100. Under these conditions, the final error rate is then 2.67% for CIFAR-10 and 13.99% for CIFAR-100.
# Fair comparison of ResNeXt + Shake-Shake + Cutout with PyramidNet + ShakeDrop + Random Erasing gives an improvement of 0.25% on CIFAR-10 and 3.01% on CIFAR 100. Under these conditions, the final error rate is then 2.31% for CIFAR-10 and 12.19% for CIFAR-100.
# Comparison with the state of the art: PyramidNet + ShakeDrop gives an improvement of 0.25% on CIFAR-10 over ResNeXt + Shake-Shake + Cutout, and an improvement of 2.85% on CIFAR-100 over Coupled Ensemble.

=Implementation details=

'''CIFAR-10/100 datasets'''

All the images in these datasets were color normalized and then horizontally flipped with a probability of 50%. All of them were then zero-padded to a size of 40 by 40 pixels.
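
These preprocessing steps can be sketched as below for a single 32 by 32 CIFAR image stored in height × width × channel layout. The function name, the per-channel <code>mean</code> and <code>std</code> arguments, and the symmetric padding of 4 pixels on each side are assumptions made for illustration.

```python
import numpy as np

def preprocess(image, mean, std, rng):
    """Color-normalize, randomly flip horizontally, then zero-pad 32x32 to 40x40."""
    out = (image - mean) / std            # per-channel color normalization
    if rng.random() < 0.5:                # horizontal flip with probability 50%
        out = out[:, ::-1, :]
    # 4 zero pixels on every side of the spatial dimensions: 32 + 4 + 4 = 40
    return np.pad(out, ((4, 4), (4, 4), (0, 0)), mode="constant")

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(32, 32, 3))   # stand-in for a CIFAR image
mean, std = image.mean(axis=(0, 1)), image.std(axis=(0, 1))
padded = preprocess(image, mean, std, rng)
print(padded.shape)
```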

=Conclusion=
The paper proposes a new form of regularization that is an extension of "Shake-Shake" regularization [Gastaldi, 2017]. The original "Shake-Shake" proposes using two residual paths adding to the same output and, during training, considering different randomly selected convex combinations of the two paths (while using an equally weighted combination at test time). This paper contends that this requires additional memory, and attempts to achieve similar regularization with a single path. To do so, the authors train a network with a single residual path, where the residual is included without attenuation with some fixed probability, and attenuated randomly (or even inverted) otherwise. The paper contends that this achieves superior performance to simply choosing a random attenuation for every sample (although this can be seen as choosing an attenuation under a distribution with some fixed probability mass).

Their stochastic regularization method, ShakeDrop, outperforms previous state-of-the-art methods while maintaining similar memory efficiency. It demonstrates that heavily perturbing a network can help to overcome issues with overfitting. It is also an effective way to regularize residual networks for image classification. The method was tested on the CIFAR-10/100 and Tiny ImageNet datasets and showed strong performance.

=Critique=

The novelty of this paper is low, as pointed out by the reviewers. Also, it is unclear whether the results can be replicated, as <math>\alpha</math> and <math>\beta</math> are chosen randomly. The proposed ShakeDrop regularization is essentially a combination of the PyramidDrop and Shake-Shake regularization. The most surprising part is that the forward weight can be negative, thus inverting the output of a convolution. The mathematical justification for ShakeDrop regularization is limited, relying on intuition and empirical evidence instead.

One downside of this method (as was identified in the presentation as well) is that training the cosine annealing variation of the model takes 1800 epochs, which is time intensive compared to the other methods that were used as baselines. This can limit practical implementation of this algorithm.

As pointed out above, the method relies heavily on intuition. This means that the performance of the algorithm is not guaranteed to extend beyond the CIFAR datasets and, to exaggerate somewhat, may vary considerably depending on the characteristics of the datasets used. However, the performance is still impressive since it is better than that of known algorithms. It is not clear how the proposed technique would work with a non-residual architecture.
The paper also lacks conclusive evidence that ShakeDrop is a generically useful regularization technique. For one, the method is evaluated only on small toy datasets: CIFAR-10 and CIFAR-100. Evaluation on ImageNet would perhaps have been valuable.

=References=
[Yamada et al., 2018] Yamada Y, Iwamura M, Kise K. ShakeDrop regularization. arXiv preprint arXiv:1802.02375. 2018 Feb 7.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

[Zagoruyko & Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proc. BMVC, 2016.

[Han et al., 2017] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proc. CVPR, 2017a.

[Xie et al., 2017] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 2017.

[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382v3, 2016.

[Gastaldi, 2017] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485v2, 2017.

[Loshchilov & Hutter, 2016] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[DeVries & Taylor, 2017b] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.

[Zhong et al., 2017] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.

[Dutt et al., 2017] Anuvabh Dutt, Denis Pellerin, and Georges Quénot. Coupled ensembles of neural networks. arXiv preprint arXiv:1709.06053v1, 2017.

[Veit et al., 2016] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems 29, 2016.

ShakeDrop Regularization

2018-11-30T02:58:57Z

J385chen:

=Introduction=
Current state-of-the-art techniques for object classification are deep neural networks based on the residual block, first published by He et al. (2016). This technique has been the foundation of several improved networks, including Wide ResNet (Zagoruyko & Komodakis, 2016), PyramidNet (Han et al., 2017) and ResNeXt (Xie et al., 2017). They have been further improved by regularization methods, such as Stochastic Depth (ResDrop) (Huang et al., 2016) and Shake-Shake (Gastaldi, 2017), which can mitigate problems such as vanishing gradients. Shake-Shake applied to ResNeXt has achieved one of the lowest error rates on the CIFAR-10 and CIFAR-100 datasets. However, it is only applicable to multi-branch architectures and is not memory efficient since it requires two branches of residual blocks to apply.

To address this problem, the paper proposes ShakeDrop regularization, which realizes a disturbance similar to that of Shake-Shake on a single residual block. ShakeDrop disturbs learning more strongly by multiplying the output of a convolutional layer by a factor that may even be negative in the forward training pass. In addition, a factor different from the one in the forward pass is applied in the backward training pass. As a byproduct, however, the learning process becomes unstable, so the authors use ResDrop to stabilize it. This paper seeks to formulate a general expansion of Shake-Shake that can be applied to any residual-block-based network.

=Existing Methods=

'''Deep Approaches'''

'''ResNet''' was the first use of residual blocks, a foundational feature in many modern state-of-the-art convolutional neural networks. A residual block can be formulated as <math>G(x) = x + F(x)</math> where <math>x</math> and <math>G(x)</math> are the input and output of the residual block, and <math>F(x)</math> is the output of the residual branch of the residual block. A residual block typically performs a convolution operation and then passes the result plus its input on to the next block.

Intuition behind residual blocks:
If the identity mapping is optimal, it is easier to push the residual to zero (<math>F(x) = 0</math>) than to fit an identity mapping (output equal to input) with a stack of non-linear layers. In other words, a stack of non-linear convolutional layers can learn <math>F(x) = 0</math> much more easily than <math>F(x) = x</math>. This function <math>F(x)</math> is what the authors call the residual function ([https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624 Reference]).

[[File:ResidualBlock.png|580px|centre|thumb|An example of a simple residual block from Deep Residual Learning for Image Recognition by He et al., 2016]]

ResNet is constructed out of a large number of these residual blocks sequentially stacked. It is interesting to note that having too many layers can cause overfitting, as pointed out by He et al. (2016) with the high error rates for the 1,202-layer ResNet on CIFAR datasets. Another paper (Veit et al., 2016) empirically showed that the cause of the high error rates can be mostly attributed to specific residual blocks whose channels increase greatly.

'''PyramidNet''' is an important iteration that built on ResNet and Wide ResNet by gradually increasing the number of channels on each residual block. The residual block is similar to those used in ResNet. It has been used to build some of the first successful convolutional neural networks with very large depth, at 272 layers. Amongst unmodified residual network architectures, it performs the best on the CIFAR datasets.

[[File:ResidualBlockComparison.png|980px|centre|thumb|A simple illustration of different residual blocks from Deep Pyramidal Residual Networks by Han et al., 2017. The width of a block reflects the number of channels used in that layer.]]

'''Non-Deep Approaches'''

'''Wide ResNet''' modified ResNet by increasing channels in each layer, having a wider and shallower structure. Similarly to PyramidNet, this architecture avoids some of the pitfalls in the original formulation of ResNet.

'''ResNeXt''' achieved performance beyond that of Wide ResNet with only a small increase in the number of parameters. It can be formulated as <math>G(x) = x + F_1(x)+F_2(x)</math>. In this case, <math>F_1(x)</math> and <math>F_2(x)</math> are the outputs of two paired convolution operations in a single residual block. The number of branches is not limited to 2, and it affects the performance of the network.

[[File:SimplifiedResNeXt.png|600px|centre|thumb|Simplified ResNeXt Convolution Block. Yamada et al., 2018]]

'''Regularization Methods For Residual Blocks'''

'''Stochastic Depth''' works by randomly dropping paths in the residual blocks. On the <math>l^{th}</math> residual block the Stochastic Depth process is given as <math>G(x)=x+b_lF(x)</math>, where <math>b_l \in \{0,1\}</math> is a Bernoulli random variable with <math>P(b_l = 1) = p_l</math>. Unlike sequential networks, there are many paths from the input to the output in these networks. By dropping some of the connections, the network is forced to flow through different paths to get the final deep layer representation. In a way it is similar to dropout, but for paths in multi-path networks. Using a constant value for <math>p_l</math> did not work well, so instead a linear decay rule <math>p_l = 1 - \frac{l}{L}(1-p_L)</math> was used. In this equation, <math>L</math> is the number of layers, and <math>p_L</math> is the initial parameter, which is the value <math>p_l</math> takes at the last block (<math>l = L</math>). Essentially, the probability of dropping a connection, <math>1 - p_l = \frac{l}{L}(1-p_L)</math>, increases linearly with its depth in the network.
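
As a rough illustration of the linear decay rule above, the following Python sketch computes <math>p_l</math> for each block and applies one Stochastic Depth block to a scalar input. The function names, the scalar stand-in for a feature map, and the example values <math>L = 10</math> and <math>p_L = 0.5</math> are assumptions made for illustration; they are not taken from the paper.

```python
import random


def survival_probability(l, L, p_L):
    """Linear decay rule: p_l = 1 - (l / L) * (1 - p_L)."""
    return 1.0 - (l / L) * (1.0 - p_L)


def stochastic_depth_block(x, residual, l, L, p_L, rng):
    """Return G(x) = x + b_l * F(x), where b_l = 1 with probability p_l."""
    p_l = survival_probability(l, L, p_L)
    b_l = 1 if rng.random() < p_l else 0
    return x + b_l * residual(x)


# Hypothetical example values: 10 blocks, last block kept half of the time.
L, p_L = 10, 0.5
probs = [survival_probability(l, L, p_L) for l in range(1, L + 1)]

# The residual branch is a toy function here; a real one would be a conv stack.
rng = random.Random(0)
output = stochastic_depth_block(1.0, lambda v: 2.0 * v, l=L, L=L, p_L=p_L, rng=rng)
```

Early blocks are kept almost always, while the last block is kept with probability <math>p_L</math>.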

'''Shake-Shake''' is a regularization method that specifically improves the ResNeXt (multiple residual connections) architecture. It is given as <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math>, where <math>\alpha \in [0,1]</math> is a random coefficient. Essentially, the outputs of the parallel residual branches are randomly mixed in the forward direction. This is similar to stochastic depth regularization, but a residual path always exists.
Moreover, on the backward pass a similar random variable <math>\beta</math> is used to independently reweight the paths for gradient flow. This has the effect of adding noise to the gradient update process and improved performance over the vanilla ResNeXt network.
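
The forward formula above can be sketched in Python as follows. This is a minimal sketch on scalars that covers only the forward pass; the function name and the toy branches are assumptions, and the backward coefficient <math>\beta</math> is not modelled. The fixed weight of 0.5 outside training follows the equally weighted combination at test time described in the conclusion of this summary.

```python
import random


def shake_shake_forward(x, branch_1, branch_2, rng, training=True):
    """Return G(x) = x + alpha * F_1(x) + (1 - alpha) * F_2(x).

    During training alpha is drawn uniformly from [0, 1];
    otherwise both branches are weighted equally.
    """
    alpha = rng.uniform(0.0, 1.0) if training else 0.5
    return x + alpha * branch_1(x) + (1.0 - alpha) * branch_2(x)


# Toy residual branches standing in for two convolution stacks.
f1 = lambda v: 2.0 * v
f2 = lambda v: 4.0 * v

rng = random.Random(1)
train_out = shake_shake_forward(1.0, f1, f2, rng, training=True)
test_out = shake_shake_forward(1.0, f1, f2, rng, training=False)
```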

[[File:Paper 32.jpg|600px|centre|thumb| Shake-Shake (ResNeXt + Shake-Shake) (Gastaldi, 2017), in which some processing layers are omitted for conciseness.]]

=Proposed Method=
The authors give an intuitive interpretation of the forward pass of Shake-Shake regularization. To the best of their knowledge, none had been given before, while the phenomenon in the backward pass was experimentally investigated by Gastaldi (2017). In the forward pass, Shake-Shake interpolates the outputs of two residual branches with a random variable <math>\alpha</math> that controls the degree of interpolation. Since DeVries & Taylor (2017a) demonstrated that interpolating two data points in feature space can synthesize reasonable augmented data, the interpolation of the two residual branches of Shake-Shake in the forward pass can be interpreted as synthesizing data. Using a random variable <math>\alpha</math> generates many different augmented data points. In the backward pass, on the other hand, a different random variable <math>\beta</math> is used to disturb learning so that the network remains learnable for a long time. Gastaldi (2017) demonstrated how the difference between <math>\alpha</math> and <math>\beta</math> affects performance.

The regularization mechanism of Shake-Shake relies on two or more residual branches, so it can be applied only to multi-branch network architectures. In addition, 2-branch network architectures consume more memory than 1-branch network architectures. One might think that the number of learnable parameters of ResNeXt can be kept the same in 1-branch and 2-branch network architectures by controlling the cardinality and the number of channels (filters). For example, a 1-branch network (e.g., ResNeXt 1-64d) and its corresponding 2-branch network (e.g., ResNeXt 2-40d) have almost the same number of learnable parameters. Even so, the 2-branch network consumes more memory because of overhead such as keeping the inputs of residual blocks. Comparing ResNeXt 1-64d and 2-40d, the latter requires 8% more memory in theory (for one layer) and 11% more in measured values (for 152 layers).

This paper seeks to generalize the method proposed in Shake-Shake so that it can be applied to any residual network structure. The initial formulation of 1-branch shake is <math>G(x) = x + \alpha F(x)</math>. In this case, <math>\alpha</math> is a coefficient that disturbs the forward pass, but it is not necessarily constrained to lie in [0,1]. Another corresponding coefficient <math>\beta</math> is used in the backward pass. Applying this simple adaptation of Shake-Shake to a 110-layer version of PyramidNet with <math>\alpha \in [0,1]</math> and <math>\beta \in [0,1]</math> performs abysmally, with an error rate of 77.99%.

This failure is a result of the setup causing too much perturbation. A trick is needed to promote learning under large perturbations while preserving the regularization effect. The idea of the authors is to borrow from ResDrop and combine it with Shake-Shake. This works by randomly deciding whether to apply 1-branch shake. In effect this creates two networks: the original network without a regularization component, and a regularized network. By mixing the two networks, the authors expected the following effects: when the non-regularized network is selected, learning is promoted; when the perturbed network is selected, learning is disturbed. Achieving good performance requires a balance between the two.

'''ShakeDrop''' is given as

<div align="center">
<math>G(x) = x + (b_l + \alpha - b_l \alpha)F(x)</math>,
</div>

where <math>b_l</math> is a Bernoulli random variable following the linear decay rule used in Stochastic Depth. An alternative presentation is

<div align="center">
<math>
G(x) = \begin{cases}
x + F(x) ~~ \text{if } b_l = 1 \\
x + \alpha F(x) ~~ \text{otherwise}
\end{cases}
</math>
</div>

If <math>b_l = 1</math> then ShakeDrop is equivalent to the original network, otherwise it is the network + 1-branch Shake. The authors also found that the linear decay rule of ResDrop works well, compared with the uniform rule. Regardless of the value of <math>\beta</math> on the backwards pass, network weights will be updated.
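
To make the two cases concrete, here is a minimal Python sketch of the ShakeDrop forward coefficient. It operates on scalars, draws <math>\alpha</math> uniformly from a given range, and covers only the forward pass; the backward coefficient <math>\beta</math> would need a custom gradient and is left out. All function names are assumptions for illustration and do not come from the authors' code.

```python
import random


def shakedrop_coefficient(b_l, alpha):
    """Coefficient b_l + alpha - b_l * alpha applied to F(x)."""
    return b_l + alpha - b_l * alpha


def shakedrop_forward(x, residual, p_l, alpha_range, rng):
    """Return (G(x), b_l, alpha) for one training-time forward pass.

    b_l = 1 with probability p_l (original network, coefficient 1);
    b_l = 0 otherwise (1-branch shake, coefficient alpha).
    """
    b_l = 1 if rng.random() < p_l else 0
    alpha = rng.uniform(*alpha_range)
    return x + shakedrop_coefficient(b_l, alpha) * residual(x), b_l, alpha


rng = random.Random(0)
# Toy residual branch; alpha in [-1, 1] allows the branch output to be inverted.
out, b_l, alpha = shakedrop_forward(2.0, lambda v: 3.0 * v, 0.5, (-1.0, 1.0), rng)
```

When <code>b_l</code> is 1 the coefficient collapses to 1 regardless of <code>alpha</code>, which matches the first case of the piecewise form.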

=Experiments=

'''Parameter Search'''

The authors' experiments began with a hyperparameter search using ShakeDrop on pyramidal networks. The PyramidNet used was made up of a total of 110 layers, which included a convolutional layer and a final fully connected layer. It had 54 additive pyramidal residual blocks, and the final residual block had 286 channels. The results of this search are presented below.

[[File:ShakeDropHyperParameterSearch.png|600px|centre|thumb|Average Top-1 errors (%) of “PyramidNet + ShakeDrop” with several ranges of parameters of 4 runs at the final (300th) epoch on CIFAR-100 dataset in the “Batch” level. In some settings, it is equivalent to PyramidNet and PyramidDrop. Borrowed from ShakeDrop Regularization by Yamada et al., 2018.]]

The settings used throughout the rest of the experiments are then <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>. Cases H and F outperform PyramidNet, suggesting that the strong perturbations imposed by ShakeDrop are functioning as intended. However, fully applying the perturbations in the backward pass appears to destabilize the network, resulting in performance that is worse than standard PyramidNet.

[[File:ParameterUpdateShakeDrop.png|400px|centre]]

Following this initial parameter decision, the authors tested 4 different strategies for parameter update: "Batch" (same coefficients for all images in a minibatch for each residual block), "Image" (same scaling coefficients for each image for each residual block), "Channel" (same scaling coefficients for each channel for each residual block), and "Pixel" (same scaling coefficients for each element for each residual block). While Pixel was the best in terms of error rate, it is not very memory efficient, so Image was selected as it had the second best performance without the memory drawback.

'''Comparison with Regularization Methods'''

For these experiments, a few modifications were made to assist with training. For ResNeXt, the EraseReLU formulation is used, in which each residual block ends in batch normalization. Wide ResNet is also compared in its vanilla form with and without batch normalization. Batch normalization keeps the outputs of residual blocks in a certain range, as otherwise <math>\alpha</math> and <math>\beta</math> could cause perturbations that are too large, leading to divergent learning. There is also a comparison of ResDrop/ShakeDrop Type A (where the regularization unit is inserted before the add unit for a residual branch) and Type B (where the regularization unit is inserted after the add unit for a residual branch).

These experiments are performed on the CIFAR-100 dataset.

[[File:ShakeDropArchitectureComparison1.png|800px|centre|thumb|]]

[[File:ShakeDropArchitectureComparison2.png|800px|centre|thumb|]]

[[File:ShakeDropArchitectureComparison3.png|800px|centre|thumb|]]

For a final round of testing, the training setup was modified to incorporate other techniques used in state-of-the-art methods. For most of the tests, the learning rate for the 300-epoch version started at 0.1 and was multiplied by 0.1 at 1/2 and 3/4 of the way through training. The alternative was cosine annealing, based on the presentation by Loshchilov and Hutter in their paper SGDR: Stochastic Gradient Descent with Warm Restarts. This is indicated in the Cos column, with a check indicating cosine annealing.
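
The two learning-rate schedules can be sketched in Python as below. The step schedule follows the description above (start at 0.1, multiply by 0.1 at 1/2 and 3/4 of 300 epochs). The cosine schedule is a simplified single-cycle form without restarts and with a minimum learning rate of 0; both of these simplifications are assumptions, and the function names are illustrative.

```python
import math


def step_lr(epoch, total_epochs=300, base_lr=0.1, factor=0.1):
    """Step schedule: multiply by `factor` at 1/2 and 3/4 of training."""
    lr = base_lr
    if epoch >= total_epochs // 2:
        lr *= factor
    if epoch >= (3 * total_epochs) // 4:
        lr *= factor
    return lr


def cosine_lr(epoch, total_epochs=300, base_lr=0.1):
    """Single-cycle cosine annealing from base_lr down to 0 (assumed minimum)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Learning rates at a few representative epochs.
checkpoints = [0, 150, 225, 300]
step_values = [step_lr(e) for e in checkpoints]
cosine_values = [cosine_lr(e) for e in checkpoints]
```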

[[File:CosineAnnealing.png|400px|centre|thumb|]]

The Reg column indicates the regularization method used, either none, ResDrop (RD), Shake-Shake (SS), or ShakeDrop (SD). Finally, the Fil column indicates the type of data augmentation used, either none, cutout (CO) (DeVries & Taylor, 2017b), or Random Erasing (RE) (Zhong et al., 2017).

[[File:ShakeDropComparison.png|800px|centre|thumb|Top-1 Errors (%) at final epoch on CIFAR-10/100 datasets]]

'''State-of-the-Art Comparisons'''

A direct comparison with state of the art methods is favorable for this new method.

# Fair comparison of ResNeXt + Shake-Shake with PyramidNet + ShakeDrop gives an improvement of 0.19% on CIFAR-10 and 1.86% on CIFAR-100. Under these conditions, the final error rate is then 2.67% for CIFAR-10 and 13.99% for CIFAR-100.
# Fair comparison of ResNeXt + Shake-Shake + Cutout with PyramidNet + ShakeDrop + Random Erasing gives an improvement of 0.25% on CIFAR-10 and 3.01% on CIFAR-100. Under these conditions, the final error rate is then 2.31% for CIFAR-10 and 12.19% for CIFAR-100.
# Compared with the state of the art, PyramidNet + ShakeDrop gives an improvement of 0.25% on CIFAR-10 over ResNeXt + Shake-Shake + Cutout, and an improvement of 2.85% on CIFAR-100 over Coupled Ensemble.

=Implementation details=

'''CIFAR-10/100 datasets'''

All the images in these datasets were color normalized and then horizontally flipped with a probability of 50%. All of them were then zero padded to a size of 40 by 40 pixels.

=Conclusion=
The paper proposes a new form of regularization that is an extension of "Shake-Shake" regularization [Gastaldi, 2017]. The original "Shake-Shake" proposes using two residual paths adding to the same output and, during training, considering different randomly selected convex combinations of the two paths (while using an equally weighted combination at test time). This paper contends that this requires additional memory, and attempts to achieve similar regularization with a single path. To do so, the authors train a network with a single residual path, where with some fixed probability the residual is included without attenuation, and otherwise it is attenuated randomly (or even inverted). The paper contends that this achieves better performance than simply choosing a random attenuation for every sample (although this can be seen as choosing an attenuation under a distribution with some fixed probability mass).

Their stochastic regularization method, ShakeDrop, outperforms previous state-of-the-art methods while maintaining similar memory efficiency. It demonstrates that heavily perturbing a network can help to overcome issues with overfitting. It is also an effective way to regularize residual networks for image classification. The method was tested on the CIFAR-10/100 and Tiny ImageNet datasets and showed strong performance.

=Critique=

The novelty of this paper is low, as pointed out by the reviewers. It is also unclear whether the results can be replicated, as <math>\alpha</math> and <math>\beta</math> are chosen randomly. The proposed ShakeDrop regularization is essentially a combination of the PyramidDrop and Shake-Shake regularization. The most surprising part is that the forward weight can be negative, thus inverting the output of a convolution. The mathematical justification for ShakeDrop regularization is limited, relying on intuition and empirical evidence instead.

One downside of this method (as was identified in the presentation as well) is that training the cosine annealing variation of the model takes 1800 epochs, which is time intensive compared to the other methods used as baselines. This can limit practical implementation of this algorithm.

As pointed out above, the method relies heavily on intuition. This means that its performance cannot be assumed to extend beyond the CIFAR datasets and may vary considerably with the characteristics of the data set it is applied to. However, the performance is still impressive, since it is better than that of known algorithms. It is not clear how the proposed technique would work with a non-residual architecture.
There is no conclusive proof that ShakeDrop is a generically useful regularization technique. For one, the method is evaluated only on small datasets, CIFAR-10 and CIFAR-100. An evaluation on ImageNet would have been valuable.

=References=
[Yamada et al., 2018] Yamada Y, Iwamura M, Kise K. ShakeDrop regularization. arXiv preprint arXiv:1802.02375. 2018 Feb 7.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

[Zagoruyko & Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proc. BMVC, 2016.

[Han et al., 2017] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proc. CVPR, 2017.

[Xie et al., 2017] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 2017.

[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382v3, 2016.

[Gastaldi, 2017] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485v2, 2017.

[Loshchilov & Hutter, 2016] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[DeVries & Taylor, 2017b] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.

[Zhong et al., 2017] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.

[Dutt et al., 2017] Anuvabh Dutt, Denis Pellerin, and Georges Quénot. Coupled ensembles of neural networks. arXiv preprint arXiv:1709.06053v1, 2017.

[Veit et al., 2016] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems 29, 2016.

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-30T01:03:42Z

J385chen:

Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size'''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is common practice not to use a single fixed learning rate throughout the training of neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition is that when we are far from the minimum, it is beneficial to take large steps towards it, as fewer steps are then needed to converge; as we approach the minimum, the step size should decrease, otherwise we may just keep oscillating around it. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during gradient descent while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the number of parameter updates required to reach the minimum, leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used in deep learning training because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to speed up optimization by taking bigger steps, and hence to reduce the number of parameter updates needed to train a model, by using large-batch training, which can be divided across many machines.

However, increasing the batch size decreases the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the number of training samples, and <math>B</math> is the batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and an optimal fluctuation scale <math>g</math> that maximizes test set accuracy.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: with the same number of training epochs it decreases the scale of random fluctuations in the same way, but with remarkably fewer parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math>, or an even larger reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on the CIFAR-10 dataset using a wide ResNet and on the ImageNet dataset using the Inception-ResNet-V2 and ResNet-50 architectures.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.

These conditions indicate that the learning rate must decay during training, and the second condition is valid only when the batch size is constant. To allow the batch size to change, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where <math>C</math> is the cost function, <math>w</math> the parameters, and <math>\eta</math> Gaussian random noise. Furthermore, they showed that the noise scale <math>g</math> controls the magnitude of random fluctuations in the training dynamics through the formula <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the training set size and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise scale <math>g</math>, enabling us to converge to the minimum of the cost function. Increasing the batch size has the same effect: it makes <math>g</math> decay while the learning rate stays constant. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.
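
A small Python sketch can make the equivalence concrete. It evaluates the noise scale for a baseline setting, then for a learning rate divided by 5 and for a batch size multiplied by 5. The training set size of 50,000 matches CIFAR-10 as used later in this summary; the baseline values (learning rate 0.1, batch size 128) and the function names are assumptions chosen for illustration.

```python
def noise_scale(lr, n_train, batch):
    """Exact noise scale g = lr * (N / B - 1)."""
    return lr * (n_train / batch - 1.0)


def noise_scale_approx(lr, n_train, batch):
    """Approximation g ~ lr * N / B, valid when B << N."""
    return lr * n_train / batch


N = 50_000
baseline = noise_scale(0.1, N, 128)

# Two ways to shrink the noise scale by (roughly) a factor of 5.
decayed_lr = noise_scale(0.1 / 5, N, 128)
bigger_batch = noise_scale(0.1, N, 128 * 5)

# Under the B << N approximation the two are identical.
decayed_lr_approx = noise_scale_approx(0.1 / 5, N, 128)
bigger_batch_approx = noise_scale_approx(0.1, N, 128 * 5)
```

The exact values differ slightly because of the <math>-1</math> term, which matters more as <math>B</math> approaches <math>N</math>.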

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.

Smith and Le (2017) found that there is an optimal batch size, which corresponds to an optimal noise scale <math>g</math> <math> (g \approx \epsilon \frac{N}{B}) </math>, and concluded that the batch size maximizing test set accuracy satisfies <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well.

Simulated annealing is a well-known technique in non-convex optimization. Starting with noise in the training process helps to explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, more and more researchers prefer sharper decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature (which corresponds to the noise scale here) helps the system converge to the global minimum, which is sharp. Decaying the temperature in discrete steps, by contrast, can leave the system stuck in a local minimum with higher cost and lower curvature. The authors suggest that the same intuition applies to deep learning.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation when <math>m</math> goes to 0. They found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

We can see that the accumulation variable <math>A</math>, which is initially set to 0, approaches its steady-state value exponentially over <math> \frac{B}{N(1-m)} </math> training epochs. During this time <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face the following challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stage, a large batch size can lead to instabilities.
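
The warm-up of the accumulation variable can be sketched in Python by iterating the update <math>\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}</math> with a constant gradient. The constant gradient of 1.0, the 90% threshold, and the function names are assumptions made purely for illustration; a real gradient changes at every step. The time scale of <math>\frac{1}{1-m}</math> updates corresponds to <math>\frac{B}{N(1-m)}</math> epochs, since one epoch contains <math>\frac{N}{B}</math> updates.

```python
def accumulate(m, grad=1.0, steps=500):
    """Iterate A <- A + (-(1 - m) * A + grad), starting from A = 0."""
    A = 0.0
    history = []
    for _ in range(steps):
        A += -(1.0 - m) * A + grad
        history.append(A)
    return history


def updates_to_reach(m, fraction=0.9, grad=1.0):
    """Count updates until A reaches `fraction` of its steady state grad / (1 - m)."""
    steady_state = grad / (1.0 - m)
    A, updates = 0.0, 0
    while A < fraction * steady_state:
        A += -(1.0 - m) * A + grad
        updates += 1
    return updates


final_value = accumulate(0.9)[-1]       # approaches grad / (1 - m) = 10
warmup_low = updates_to_reach(0.9)      # momentum 0.9
warmup_high = updates_to_reach(0.98)    # momentum 0.98 needs about five times longer
```

Higher momentum lengthens the warm-up in proportion to <math>\frac{1}{1-m}</math>, which is consistent with the first two challenges listed above.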

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Schedules used as in the below figure:'''

- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant

- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: At the beginning, the learning rate is constant and the batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule to use when hardware limits the maximum batch size.
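
The three schedules can be seen as one rule with different caps on the batch size, as in the Python sketch below: at each step, grow the batch by the factor if the cap allows it, otherwise decay the learning rate by the same factor. The starting values (learning rate 0.1, batch size 128), the three steps, and the cap values are assumptions for illustration, as is the function name.

```python
def schedule(lr, batch, factor, steps, max_batch):
    """Return the (lr, batch) pair for each phase of training.

    At every step the ratio lr / batch shrinks by `factor`, either by
    growing the batch (if it stays within max_batch) or by decaying lr.
    """
    phases = [(lr, batch)]
    for _ in range(steps):
        if batch * factor <= max_batch:
            batch *= factor
        else:
            lr /= factor
        phases.append((lr, batch))
    return phases


decaying_lr = schedule(0.1, 128, 5, 3, max_batch=128)         # cap forbids any growth
increasing_batch = schedule(0.1, 128, 5, 3, max_batch=10**9)  # effectively no cap
hybrid = schedule(0.1, 128, 5, 3, max_batch=640)              # one growth step, then decay

# lr / batch per phase; the noise scale g ~ lr * N / B is proportional to it.
ratios = [[lr / b for lr, b in s] for s in (decaying_lr, increasing_batch, hybrid)]
```

All three schedules produce the same sequence of <code>lr / batch</code> ratios, so they follow the same noise-scale decay.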

[[File:Paper_40_Fig_1.png | 800px|center]]

As shown in the figure below, in the left panel (2a) the three learning curves on the training set are essentially the same, while in figure 2b increasing the batch size has the large advantage of reducing the number of parameter updates.
This suggests that it is the noise scale, and not the learning rate itself, that needs to be decayed.
[[File:Paper_40_Fig_2.png | 800px|center]]

To check that these results also hold on the test set, figure 3 shows that the three learning curves are essentially the same for SGD with momentum and for Nesterov momentum.
[[File:Paper_40_Fig_3.png | 800px|center]]

To check other optimizers as well, the figure below repeats the experiment of figure 3 (the three learning curves on the test set) for vanilla SGD and Adam.
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.

Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.
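
As a quick check on the four schedules, the Python sketch below computes the initial noise scale <math> g \approx \frac{\epsilon N}{B(1-m)} </math> for each of them. The values come from the descriptions above, with one assumption: the momentum coefficient of the "increased initial learning rate" schedule is not stated there and is taken to be 0.9. The dictionary keys and function name are illustrative.

```python
def noise_scale(lr, momentum, batch, n_train):
    """Approximate noise scale with momentum: g ~ lr * N / (B * (1 - m))."""
    return lr * n_train / (batch * (1.0 - momentum))


N = 50_000  # CIFAR-10 training images

# (initial learning rate, momentum coefficient, initial batch size)
schedules = {
    "original": (0.1, 0.9, 128),
    "increasing batch size": (0.1, 0.9, 128),
    "increased initial learning rate": (0.5, 0.9, 640),   # momentum assumed
    "increased momentum coefficient": (0.5, 0.98, 3200),
}

initial_noise = {
    name: noise_scale(lr, m, batch, N)
    for name, (lr, m, batch) in schedules.items()
}
```

Under this assumption all four schedules start from the same noise scale, which is what makes their learning curves comparable.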

The results of all training schedules, which are presented in the below figure, are documented in the following table:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where batch size is fixed and the learning rate is decayed

“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.
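
These figures can be reproduced approximately with a short Python calculation, since each epoch needs <math>\frac{N}{B}</math> parameter updates. The sketch assumes 1.28 million training images, 90 epochs, and that the "increasing batch size" schedule jumps from 8192 to 81920 in a single step at epoch 30; the exact point of the increase and the function name are assumptions.

```python
def parameter_updates(n_train, phases):
    """Total updates for a list of (epochs, batch_size) phases."""
    return sum(epochs * n_train / batch for epochs, batch in phases)


N = 1_280_000  # ImageNet training images

# Fixed batch size for all 90 epochs.
decaying_lr_updates = parameter_updates(N, [(90, 8192)])
# 30 epochs at 8192, then 60 epochs at 81920.
increasing_batch_updates = parameter_updates(N, [(30, 8192), (60, 81920)])

print(decaying_lr_updates)       # 14062.5
print(increasing_batch_updates)  # 5625.0
```

The results are close to the rounded values of 14,000 and 6,000 quoted in the conclusion above.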

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The table below shows the number of parameter updates and the accuracy for different sets of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate. For each configuration, the authors measured the time to complete 90 epochs and the resulting accuracy. The experiments are summarized below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to include decaying learning rates.

- Mandt et al. (2017) analyzed how to modify SGD for the task of Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once the training is started.

- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017), who used it to train ResNet-50 on ImageNet in one hour.

- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is appling different learning rates to train ImageNet in 14 minutes and 74.9% accuracy.

- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing the batch size during training has the same benefits as decaying the learning rate, while also reducing the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets and with various optimizers and training schedules to support this result. The paper proposes increasing the learning rate and the momentum parameter <math>m</math> while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which requires fewer parameter updates but gives slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. A notable finding of this paper is that hyperparameters from the literature were used, and no hyperparameter tuning was needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- The paper includes several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results do not seem too surprising in comparison with other recent papers on the subject.

- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.

- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and the paper does not provide sufficient theory, its applicability to other types of data is limited.

- Also, in the experimental setting, only a single training run from one random initialization is used. It would be better to take the best of many runs or to show confidence intervals.

- It is proposed that we should compare learning rate decay with batch-size increase under the setting that total budget / number of training samples is fixed.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.

- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-30T00:59:23Z

J385chen:

Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size'''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
It is now common practice not to use a single fixed learning rate throughout the training of neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition is that when we are far from a minimum, it is beneficial to take large steps towards it, since fewer steps are then needed to converge; as we approach the minimum, the step size should decrease, otherwise we may keep oscillating around it. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during training while keeping the learning rate constant throughout. In addition, the paper argues that this approach reduces the number of parameter updates required to reach the minimum, leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used to train deep learning models because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. This has motivated researchers (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a) to speed up optimization by taking bigger steps, reducing the number of parameter updates needed to train a model through large-batch training, which can be divided across many machines.

However, increasing the batch size tends to decrease the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the number of training samples, and <math>B</math> is the batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and an optimal fluctuation scale <math>g</math> that maximizes test set accuracy.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: for the same number of training epochs, both reduce the scale of random fluctuations, but increasing the batch size requires far fewer parameter updates. Moreover, a further reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math>, and a still larger reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on CIFAR-10 using a wide ResNet and on ImageNet using the Inception-ResNet-V2 and ResNet-50 architectures.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD compared with full-batch training is the gradient noise it introduces, which hinders optimization. According to Robbins & Monro (1951), two conditions on the learning rate govern convergence to the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This condition guarantees that we will reach the minimum.

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This condition, which is valid only for a fixed batch size, guarantees that the learning rate decays fast enough for us to reach the minimum rather than bouncing around it due to noise.

These conditions indicate that the learning rate must decay during training; the second condition applies only when the batch size is constant. To allow the batch size to change, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where <math>C</math> represents the cost function, <math>w</math> represents the parameters, and <math>\eta</math> represents Gaussian random noise. Furthermore, they showed that the noise scale <math>g</math> controls the magnitude of random fluctuations in the training dynamics through the formula <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the training set size and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise scale <math>g</math>, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.
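As a rough numerical illustration of the noise scale formula above, the following Python sketch compares decaying the learning rate by a factor of 5 with increasing the batch size by a factor of 5. The function names are hypothetical, and the values <math>N = 50000</math>, <math>\epsilon = 0.1</math> and <math>B = 128</math> are borrowed from the CIFAR-10 experiments described later, purely as an assumption for the example.

```python
import math


def noise_scale(lr, n_train, batch_size):
    """Exact noise scale g = lr * (N / B - 1) as defined in the text."""
    return lr * (n_train / batch_size - 1)


def approx_noise_scale(lr, n_train, batch_size):
    """Approximation g ~ lr * N / B, reasonable when B << N."""
    return lr * n_train / batch_size


N = 50_000   # assumed training set size (CIFAR-10)
LR = 0.1     # assumed initial learning rate
B = 128      # assumed initial batch size

g_start = approx_noise_scale(LR, N, B)
g_decayed_lr = approx_noise_scale(LR / 5, N, B)       # decay the learning rate
g_bigger_batch = approx_noise_scale(LR, N, B * 5)     # grow the batch instead

# Under the approximation the two options give the same noise scale.
same_noise = math.isclose(g_decayed_lr, g_bigger_batch)

# With the exact formula they differ slightly because of the "- 1" term.
exact_gap = noise_scale(LR / 5, N, B) - noise_scale(LR, N, B * 5)
```

Under the approximation <math> g \approx \epsilon \frac{N}{B} </math> both options reduce the noise scale by the same factor of 5; the exact formula gives a small gap that shrinks as <math>B</math> becomes small relative to <math>N</math>.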

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.

Smith and Le (2017) found that there is an optimal batch size corresponding to an optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math>, and concluded that the batch size that maximizes test set accuracy satisfies <math> B_{opt} \propto \epsilon N </math>. This suggests that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well.

Simulated annealing is a well-known technique in non-convex optimization. Starting with a high level of noise helps the training process explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, researchers increasingly prefer sharper decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (decaying) the temperature, which plays the role of the noise scale here, helps the system converge to the global minimum, which is sharp. Decaying the temperature in discrete steps, by contrast, can leave the system stuck in a local minimum, which has higher cost and lower curvature. The authors think that the same intuition holds in deep learning.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation as m goes to 0. The paper finds that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

We can see that the accumulation variable A, which is initially set to 0, grows exponentially towards its steady-state value over roughly <math> \frac{B}{N(1-m)} </math> training epochs. During this time <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face several challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stages of training, a large batch size can lead to instabilities.
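The behaviour of the accumulation variable can be sketched with a toy simulation. This is not the authors' code: it assumes a constant gradient of 1 so that the update <math>\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}</math> can be iterated directly, and the helper names are hypothetical.

```python
def accumulate(m, grad=1.0, steps=200):
    """Iterate Delta A = -(1 - m) * A + grad, starting from A = 0."""
    A = 0.0
    history = []
    for _ in range(steps):
        A += -(1.0 - m) * A + grad
        history.append(A)
    return history


def updates_to_reach(m, fraction=0.63, grad=1.0, max_steps=10_000):
    """Count updates until A reaches `fraction` of its steady state grad / (1 - m)."""
    target = fraction * grad / (1.0 - m)
    A = 0.0
    for step in range(1, max_steps + 1):
        A += -(1.0 - m) * A + grad
        if A >= target:
            return step
    return None


# Assumed constant gradient: A approaches grad / (1 - m), i.e. 10 for m = 0.9.
steady_09 = accumulate(0.9)[-1]

# Higher momentum needs more updates before the accumulation catches up.
fast = updates_to_reach(0.9)
slow = updates_to_reach(0.98)
```

With a constant gradient, A settles at <math>\frac{1}{1-m}</math> times the gradient, so each weight update has size <math>\frac{\epsilon}{1-m}</math> times the gradient, which is the effective learning rate. Reaching that steady state takes on the order of <math>\frac{1}{1-m}</math> updates; since one update covers <math>\frac{B}{N}</math> of an epoch, this corresponds to the <math>\frac{B}{N(1-m)}</math> epochs quoted above.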

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Schedules used as in the below figure:'''

- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant

- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: At the beginning, the learning rate is constant and the batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This schedule is useful when hardware imposes a maximum batch size.
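The three schedules can be written down as sequences of (learning rate, batch size) pairs, one pair per phase. The sketch below is illustrative only: the starting values (learning rate 0.1, batch size 128), the number of phases, and the function and parameter names are assumptions, not values stated for this experiment.

```python
import math


def build_schedules(lr0=0.1, batch0=128, factor=5, n_phases=4, hybrid_growth_steps=1):
    """Return (decaying_lr, increasing_batch, hybrid) as lists of (lr, batch) per phase.

    hybrid_growth_steps is the number of steps at which the hybrid schedule
    grows the batch before it switches to decaying the learning rate.
    """
    decaying_lr = [(lr0 / factor**k, batch0) for k in range(n_phases)]
    increasing_batch = [(lr0, batch0 * factor**k) for k in range(n_phases)]
    hybrid = []
    for k in range(n_phases):
        grow = min(k, hybrid_growth_steps)   # steps spent growing the batch
        decay = k - grow                     # remaining steps decay the learning rate
        hybrid.append((lr0 / factor**decay, batch0 * factor**grow))
    return decaying_lr, increasing_batch, hybrid


decaying_lr, increasing_batch, hybrid = build_schedules()

# The ratio lr / batch drives the noise scale g ~ lr * N / B, so compare it per phase.
ratios_match = all(
    math.isclose(a[0] / a[1], b[0] / b[1]) and math.isclose(a[0] / a[1], c[0] / c[1])
    for a, b, c in zip(decaying_lr, increasing_batch, hybrid)
)
```

In every phase the three schedules share the same ratio of learning rate to batch size, so under <math> g \approx \epsilon \frac{N}{B} </math> they decay the noise scale identically while differing in how many parameter updates each phase needs.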

[[File:Paper_40_Fig_1.png | 800px|center]]

As shown in the figure below, panel 2a shows that the three learning curves on the training set are exactly the same, while panel 2b shows that increasing the batch size greatly reduces the number of parameter updates.
This suggests that it is the noise scale, not the learning rate itself, that needs to be decayed.
[[File:Paper_40_Fig_2.png | 800px|center]]

To confirm that these results also hold on the test set, figure 3 shows that the three learning curves are exactly the same for SGD with momentum and for Nesterov momentum.
[[File:Paper_40_Fig_3.png | 800px|center]]

To check other optimizers as well, the figure below repeats the experiment of figure 3 (the three test-set learning curves) for vanilla SGD and Adam.
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.

Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.
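A quick check, assuming the momentum noise scale <math> g \approx \frac{\epsilon N}{B(1-m)} </math> from the earlier section, suggests why these initial batch sizes were chosen: each schedule appears to start from the same noise scale. The function name below is hypothetical, and the momentum of 0.9 for the "increased initial learning rate" schedule is an assumption, since the text does not state it.

```python
import math


def momentum_noise_scale(lr, n_train, batch_size, momentum):
    """Approximate noise scale g ~ lr * N / (B * (1 - m))."""
    return lr * n_train / (batch_size * (1.0 - momentum))


N = 50_000  # CIFAR-10 training images, as stated for this experiment

# (initial learning rate, initial batch size, momentum coefficient)
initial_settings = {
    "original / increasing batch size": (0.1, 128, 0.9),
    "increased initial learning rate": (0.5, 640, 0.9),   # momentum assumed
    "increased momentum coefficient": (0.5, 3200, 0.98),
}

scales = {
    name: momentum_noise_scale(lr, N, batch, m)
    for name, (lr, batch, m) in initial_settings.items()
}

reference = scales["original / increasing batch size"]
all_equal = all(math.isclose(g, reference, rel_tol=1e-9) for g in scales.values())
```

If the assumption about momentum holds, a 5x larger learning rate is offset by a 5x larger batch, and raising the momentum from 0.9 to 0.98 (a 5x increase in <math>\frac{1}{1-m}</math>) is offset by another 5x increase in batch size.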

The results of all four training schedules are summarized in the table below and plotted in the figure that follows:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where batch size is fixed and the learning rate is decayed

“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.
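The reduction in parameter updates can be checked with back-of-the-envelope arithmetic. The sketch below assumes 1.28 million training images, a single batch size increase at epoch 30 that is then held for the remaining 60 epochs, and that a final partial batch counts as one update; the function name is hypothetical.

```python
import math


def parameter_updates(n_train, phases):
    """Total updates for a list of (epochs, batch_size) phases.

    Assumes each epoch takes ceil(n_train / batch_size) updates.
    """
    return sum(epochs * math.ceil(n_train / batch) for epochs, batch in phases)


N = 1_280_000  # approximate ImageNet training set size quoted in the text

# Decaying learning rate: the batch size stays at 8192 for all 90 epochs.
fixed_batch_updates = parameter_updates(N, [(90, 8192)])

# Increasing batch size: 8192 for 30 epochs, then 81920 for the remaining 60.
growing_batch_updates = parameter_updates(N, [(30, 8192), (60, 81920)])
```

Under these assumptions the fixed-batch schedule needs roughly 14,000 updates and the increasing-batch schedule a little under 6,000, which is in line with the figures quoted in the conclusion above.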

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The table below shows the number of parameter updates and the accuracy for different sets of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate. For each configuration it measured the time to complete 90 epochs and the resulting accuracy, as reported in the table below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to account for a decaying learning rate.

- Mandt et al. (2017) analyzed how to modify SGD for the task of Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once the training is started.

- Goyal et al. (2017) first discovered the proportional relationship between batch size and learning rate, and used it to train ResNet-50 on ImageNet in one hour.

- You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, and trained ImageNet in 14 minutes with 74.9% accuracy.

- Finally, Asynchronous SGD (Recht et al., 2011; Dean et al., 2012) is another strategy, which allows multiple GPUs to be used even with small batch sizes.

== CONCLUSIONS ==
Increasing the batch size during training has the same benefits as decaying the learning rate, while also reducing the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets and with various optimizers and training schedules to support this result. The paper proposes increasing the learning rate and the momentum parameter <math>m</math> while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which requires fewer parameter updates but gives slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. A notable finding of this paper is that hyperparameters from the literature were used, and no hyperparameter tuning was needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- The paper includes several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results do not seem too surprising in comparison with other recent papers on the subject.

- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.

- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and the paper does not provide sufficient theory, its applicability to other types of data is limited.

- Also, in the experimental setting, only a single training run from one random initialization is used. It would be better to take the best of many runs or to show confidence intervals.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.

- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-30T00:56:53Z

J385chen:

Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size'''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is common practice not to use a single fixed learning rate throughout the training of neural network models. Instead, the learning rate is adapted during standard gradient descent. The intuition is that when we are far from a minimum, it is beneficial to take large steps towards it, since fewer steps are then needed to converge; as we approach the minimum, the step size should decrease, otherwise we may just keep oscillating around it. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during training while keeping the learning rate constant throughout. In addition, the paper argues that this approach reduces the number of parameter updates required to reach the minimum, leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used for training deep learning models because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. This has motivated researchers (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a) to speed up optimization by taking bigger steps, reducing the number of parameter updates needed to train a model by using large-batch training, which can be divided across many machines.

However, increasing the batch size tends to decrease the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the number of training samples, and <math>B</math> is the batch size. They concluded that there is an optimal fluctuation scale <math>g</math> that maximizes the test set accuracy and, when <math> B \ll N </math>, a corresponding optimal batch size proportional to the learning rate.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: for the same number of training epochs, both decrease the scale of random fluctuations, but increasing the batch size requires far fewer parameter updates. Moreover, the number of parameter updates can be reduced further by increasing the learning rate and scaling <math> B \propto \epsilon </math>, or reduced even more by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on CIFAR-10 using a wide ResNet and on ImageNet using the Inception-ResNet-V2 and ResNet-50 architectures.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD compared to full-batch training is that it introduces noise, which hinders optimization. According to Robbins & Monro (1951), two equations govern how to reach the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we can reach the minimum, no matter how far away the parameters are initialized.

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that the learning rate decays fast enough for us to reach the minimum rather than bouncing around it due to noise.
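
As a quick illustration (a standard textbook example, not taken from the paper), the schedule <math> \epsilon_i = \frac{\epsilon_0}{i} </math> satisfies both equations: <math> \sum_{i=1}^{\infty} \frac{\epsilon_0}{i} = \infty </math> because the harmonic series diverges, while <math> \sum_{i=1}^{\infty} \frac{\epsilon_0^2}{i^2} = \frac{\pi^2 \epsilon_0^2}{6} < \infty </math>. A constant learning rate <math> \epsilon_i = \epsilon_0 </math> satisfies the first equation but violates the second.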

These equations indicate that the learning rate must decay during training, and the second equation only applies when the batch size is constant. To allow the batch size to change, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where <math>C</math> is the cost function, <math>w</math> represents the parameters, and <math> \eta </math> is Gaussian random noise. Furthermore, they showed that the noise scale <math>g</math> controls the magnitude of random fluctuations in the training dynamics: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the training set size, and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise scale <math>g</math>, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.
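
The equivalence can be checked numerically with a minimal sketch. The function name <code>noise_scale</code> and the example values (CIFAR-10 training set size, learning rate 0.1, batch size 128, factor 5) are chosen here for illustration only; the formula is the one quoted above.

```python
def noise_scale(lr, n_train, batch_size):
    """SGD noise scale g = lr * (N / B - 1), as quoted above."""
    return lr * (n_train / batch_size - 1)


N = 50_000  # CIFAR-10 training set size, used here only as an example

base = noise_scale(0.1, N, 128)
decayed_lr = noise_scale(0.1 / 5, N, 128)    # decay the learning rate by 5
bigger_batch = noise_scale(0.1, N, 128 * 5)  # increase the batch size by 5

print(f"baseline g:             {base:.4f}")
print(f"learning rate / 5:      {decayed_lr:.4f}")
print(f"batch size * 5:         {bigger_batch:.4f}")
```

Both changes shrink the noise scale by roughly the same factor; the small remaining gap comes from the <math>-1</math> term, which is negligible while <math> B \ll N </math>.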

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.

Smith and Le (2017) found that there is an optimal batch size that corresponds to an optimal noise scale <math>g</math> <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that the batch size that maximizes test set accuracy scales as <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well.

Simulated annealing is a well-known technique in non-convex optimization. Starting with a large amount of noise in the training process helps us explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, more and more researchers use sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature (which plays the role of the noise scale here) helps the system converge to the global minimum, which may be sharp. Decaying the temperature in discrete steps, by contrast, can leave the system stuck in a local minimum whose cost is higher but whose curvature is lower. The authors suspect that a similar intuition holds in deep learning.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation when <math>m</math> goes to 0. They found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze the momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

We can see that the accumulation variable <math>A</math>, which is initially set to 0, approaches its steady-state value exponentially over a timescale of <math> \frac{B}{N(1-m)} </math> training epochs. During this period <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face the following challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stages, a large batch size can lead to instabilities.
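
The warm-up of the accumulation variable can be illustrated with a small simulation. This is a sketch under a simplifying assumption that is not in the paper: the gradient estimate is held constant at 1, so the update <math> \Delta A = -(1-m)A + 1 </math> has the steady state <math> \frac{1}{1-m} </math>. Since one epoch contains <math> \frac{N}{B} </math> updates, <math> \frac{B}{N(1-m)} </math> epochs correspond to <math> \frac{1}{1-m} </math> updates. The function name <code>accumulate</code> is hypothetical.

```python
def accumulate(momentum, n_updates, grad=1.0):
    """Iterate Delta A = -(1 - m) * A + grad from A = 0 with a constant gradient."""
    a = 0.0
    history = []
    for _ in range(n_updates):
        a += -(1.0 - momentum) * a + grad
        history.append(a)
    return history


for m in (0.9, 0.98):
    steady_state = 1.0 / (1.0 - m)      # fixed point of the update for grad = 1
    timescale = round(1.0 / (1.0 - m))  # in updates, i.e. B / (N (1 - m)) epochs
    hist = accumulate(m, 5 * timescale)
    frac_one = hist[timescale - 1] / steady_state
    frac_five = hist[-1] / steady_state
    print(f"m={m}: timescale={timescale} updates, "
          f"fraction of steady state after 1 timescale={frac_one:.2f}, "
          f"after 5 timescales={frac_five:.2f}")
```

Raising the momentum coefficient from 0.9 to 0.98 makes the warm-up five times longer in updates, which is consistent with the extra epochs mentioned in the list above.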

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training schedules (shown in the figure below):'''

- Decaying learning rate: the learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant.

- Increasing batch size: the learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: at the first step, the learning rate is constant and the batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule to use when hardware imposes a maximum batch size.

[[File:Paper_40_Fig_1.png | 800px|center]]
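
The three schedules can be written down as a short sketch. The initial learning rate (0.1), initial batch size (128), and factor of 5 follow the values quoted in this summary; the phase lengths of 60, 60, 40, and 40 epochs are an illustrative assumption, and the helper names <code>make_schedules</code> and <code>parameter_updates</code> are hypothetical.

```python
import math


def make_schedules(lr0=0.1, batch0=128, factor=5, n_phases=4, max_increases=1):
    """Return a list of (learning_rate, batch_size) per phase for each schedule."""
    decaying_lr = [(lr0 / factor**k, batch0) for k in range(n_phases)]
    increasing_batch = [(lr0, batch0 * factor**k) for k in range(n_phases)]
    hybrid = []
    for k in range(n_phases):
        grow = min(k, max_increases)  # steps spent growing the batch size
        decay = k - grow              # remaining steps decay the learning rate
        hybrid.append((lr0 / factor**decay, batch0 * factor**grow))
    return {
        "decaying_lr": decaying_lr,
        "increasing_batch": increasing_batch,
        "hybrid": hybrid,
    }


def parameter_updates(schedule, epochs_per_phase, n_train):
    """Count parameter updates: one update per batch, ceil(N / B) batches per epoch."""
    return sum(
        epochs * math.ceil(n_train / batch)
        for (_, batch), epochs in zip(schedule, epochs_per_phase)
    )


N_TRAIN = 50_000                    # CIFAR-10 training images
EPOCHS_PER_PHASE = (60, 60, 40, 40)  # assumed phase lengths, for illustration

schedules = make_schedules()
updates = {
    name: parameter_updates(sched, EPOCHS_PER_PHASE, N_TRAIN)
    for name, sched in schedules.items()
}
for name, sched in schedules.items():
    print(name, sched, "updates:", updates[name])
```

In every phase the ratio of learning rate to batch size is the same across the three schedules, so the approximate noise scale <math> g \approx \epsilon \frac{N}{B} </math> follows the same trajectory, while the number of parameter updates differs.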

As shown in the figure below, in the left panel (2a) the three learning curves on the training set are essentially identical, while figure 2b shows that increasing the batch size has the large advantage of reducing the number of parameter updates.
This suggests that it is the noise scale, not the learning rate itself, that needs to be decayed.
[[File:Paper_40_Fig_2.png | 800px|center]]

To check that these results also hold on the test set, figure 3 shows that the three learning curves are again essentially identical for both SGD with momentum and Nesterov momentum.
[[File:Paper_40_Fig_3.png | 800px|center]]

To check other optimizers as well, the figure below repeats the experiment of figure 3 (the three test-set learning curves) for vanilla SGD and Adam.
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent.

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.

Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.
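
As a sanity check on these settings, the sketch below evaluates the approximate noise scale with momentum, <math> g \approx \frac{\epsilon N}{B(1-m)} </math>, at the start of training. It assumes the "increased initial learning rate" schedule keeps the momentum coefficient at 0.9, which the list above does not state explicitly; the function and dictionary names are chosen here for illustration.

```python
import math


def noise_scale_momentum(lr, momentum, n_train, batch_size):
    """Approximate noise scale g ~ lr * N / (B * (1 - m)), valid for B << N."""
    return lr * n_train / (batch_size * (1.0 - momentum))


N_TRAIN = 50_000  # CIFAR-10 training images

# (initial learning rate, momentum coefficient, initial batch size)
initial_settings = {
    "original / increasing batch size": (0.1, 0.9, 128),
    "increased initial learning rate": (0.5, 0.9, 640),   # momentum 0.9 assumed
    "increased momentum coefficient": (0.5, 0.98, 3200),
}

initial_noise = {
    name: noise_scale_momentum(lr, m, N_TRAIN, batch)
    for name, (lr, m, batch) in initial_settings.items()
}
for name, g in initial_noise.items():
    print(f"{name}: g = {g:.3f}")
```

All settings start from the same noise scale, which is what scaling <math> B \propto \frac{\epsilon}{1-m} </math> is meant to achieve: the batch size grows by 5 when the learning rate grows by 5, and by another 5 when <math> 1-m </math> shrinks from 0.1 to 0.02.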

The results of all training schedules are documented in the following table and plotted in the figure below it:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in a further reduction in the number of parameter updates.

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where batch size is fixed and the learning rate is decayed

“Increasing batch size”, where the batch size is increased to 81920, then the learning rate is decayed at two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.
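
These counts can be roughly reproduced with simple arithmetic. The sketch below assumes, based on the parameters listed above, that the batch size jumps from 8192 to 81920 at epoch 30 and stays there for the remaining 60 epochs, and it rounds the training set to 1.28 million images; the function name <code>total_updates</code> is hypothetical.

```python
def total_updates(phases, n_train):
    """phases: list of (n_epochs, batch_size); returns the approximate update count."""
    return sum(epochs * n_train / batch for epochs, batch in phases)


N_TRAIN = 1_280_000  # ImageNet training images, rounded as in the text

decaying_lr = [(90, 8192)]                    # fixed batch size for all 90 epochs
increasing_batch = [(30, 8192), (60, 81920)]  # assumed: batch grows at epoch 30

updates_decay = total_updates(decaying_lr, N_TRAIN)
updates_increase = total_updates(increasing_batch, N_TRAIN)

print(f"decaying learning rate: about {updates_decay:.0f} updates")
print(f"increasing batch size:  about {updates_increase:.0f} updates")
```

The results land close to the rounded figures of 14,000 and 6,000 quoted in the conclusion above.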

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The table below shows the number of parameter updates and the accuracy for different sets of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate, then measured the time to complete 90 epochs and the resulting accuracy. The experiments are summarized below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to include decaying learning rates.

- Mandt et al. (2017) analyzed how to modify SGD for the task of Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once the training is started.

- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017), who used it to successfully train ResNet-50 on ImageNet in one hour.

- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, to train ImageNet in 14 minutes with 74.9% accuracy.

- Finally, another strategy called Asynchronous-SGD allowed Recht et al. (2011) and Dean et al. (2012) to use multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing the batch size during training has the same benefits as decaying the learning rate, while also reducing the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets, with various optimizers and training schedules, to support this result. The paper also proposed increasing the learning rate and the momentum coefficient <math>m</math> while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which requires fewer parameter updates at the cost of slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation accuracy on TPU in less than 30 minutes. One notable finding of this paper is that hyperparameters from the literature were used directly, and no hyperparameter tuning was needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- The paper includes several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results do not seem very surprising in comparison with other recent papers on the subject.

- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.

- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and the paper does not provide sufficient theory, its applicability to other types of data is limited.

- Also, in the experimental setting, only a single training run from one random initialization is used. It would be better to report the best of many runs or to show confidence intervals.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.

- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-30T00:54:32Z

J385chen:

Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size'''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is common practice not to use a single fixed learning rate throughout the training of neural network models. Instead, the learning rate is adapted during standard gradient descent. The intuition is that when we are far from a minimum, it is beneficial to take large steps towards it, since fewer steps are then needed to converge; as we approach the minimum, the step size should decrease, otherwise we may just keep oscillating around it. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during training while keeping the learning rate constant throughout. In addition, the paper argues that this approach reduces the number of parameter updates required to reach the minimum, leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used for training deep learning models because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. This has motivated researchers (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a) to speed up optimization by taking bigger steps, reducing the number of parameter updates needed to train a model by using large-batch training, which can be divided across many machines.

However, increasing the batch size tends to decrease the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the number of training samples, and <math>B</math> is the batch size. They concluded that there is an optimal fluctuation scale <math>g</math> that maximizes the test set accuracy and, when <math> B \ll N </math>, a corresponding optimal batch size proportional to the learning rate.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate: for the same number of training epochs, both decrease the scale of random fluctuations, but increasing the batch size requires far fewer parameter updates. Moreover, the number of parameter updates can be reduced further by increasing the learning rate and scaling <math> B \propto \epsilon </math>, or reduced even more by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on CIFAR-10 using a wide ResNet and on ImageNet using the Inception-ResNet-V2 and ResNet-50 architectures.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD compared to full-batch training is that it introduces noise, which hinders optimization. According to Robbins & Monro (1951), two equations govern how to reach the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we can reach the minimum, no matter how far away the parameters are initialized.

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that the learning rate decays fast enough for us to reach the minimum rather than bouncing around it due to noise.

These equations indicate that the learning rate must decay during training, and the second equation only applies when the batch size is constant. To allow the batch size to change, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where <math>C</math> is the cost function, <math>w</math> represents the parameters, and <math> \eta </math> is Gaussian random noise. Furthermore, they showed that the noise scale <math>g</math> controls the magnitude of random fluctuations in the training dynamics: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the training set size, and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why decreasing the learning rate decreases the noise scale <math>g</math>, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.

Smith and Le (2017) found that there is an optimal batch size that corresponds to an optimal noise scale <math>g</math> <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that the batch size that maximizes test set accuracy scales as <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well.

Simulated annealing is a well-known technique in non-convex optimization. Starting with a large amount of noise in the training process helps us explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, more and more researchers use sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature (which plays the role of the noise scale here) helps the system converge to the global minimum, which may be sharp. Decaying the temperature in discrete steps, by contrast, can leave the system stuck in a local minimum whose cost is higher but whose curvature is lower. The authors suspect that a similar intuition holds in deep learning.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation when <math>m</math> goes to 0. They found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze the momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

We can see that the accumulation variable <math>A</math>, which is initially set to 0, approaches its steady-state value exponentially over a timescale of <math> \frac{B}{N(1-m)} </math> training epochs. During this period <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face the following challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stages, a large batch size can lead to instabilities.

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training schedules (shown in the figure below):'''

- Decaying learning rate: the learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant.

- Increasing batch size: the learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: at the first step, the learning rate is constant and the batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule to use when hardware imposes a maximum batch size.

[[File:Paper_40_Fig_1.png | 800px|center]]

As shown in the figure below, in the left panel (2a) the three learning curves on the training set are essentially identical, while figure 2b shows that increasing the batch size has the large advantage of reducing the number of parameter updates.
This suggests that it is the noise scale, not the learning rate itself, that needs to be decayed.
[[File:Paper_40_Fig_2.png | 800px|center]]

To check that these results also hold on the test set, figure 3 shows that the three learning curves are again essentially identical for both SGD with momentum and Nesterov momentum.
[[File:Paper_40_Fig_3.png | 800px|center]]

To check other optimizers as well, the figure below repeats the experiment of figure 3 (the three test-set learning curves) for vanilla SGD and Adam.
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

Four training schedules are compared, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: an initial learning rate of 0.1 that decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: a learning rate of 0.1, a momentum coefficient of 0.9, and an initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: an initial learning rate of 0.5 and an initial batch size of 640 that increases during training.

Increased momentum coefficient: an initial learning rate of 0.5, an initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.

The results of the four training schedules are summarized in the table and figure below:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size accordingly further reduces the number of parameter updates.

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where the batch size is fixed and the learning rate is decayed.

“Increasing batch size”, where the batch size is first increased to 81920 and the learning rate is then decayed at the remaining two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size reduced the number of parameter updates from roughly 14,000 to roughly 6,000.

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The table below shows the number of parameter updates and the accuracy for different sets of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate, and measured both the time to complete 90 epochs and the resulting accuracy. The experiments are summarized below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to include decaying learning rates.

- Mandt et al. (2017) analyzed how to modify SGD for the task of Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once training has started.

- Goyal et al. (2017) first discovered the proportional relationship between batch size and learning rate and used it to train ResNet-50 on ImageNet in one hour.

- You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, and used it to train ImageNet in 14 minutes with 74.9% accuracy.

- Finally, asynchronous SGD (Recht et al., 2011; Dean et al., 2012) is another strategy that makes it possible to use multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing the batch size during training has the same benefits as decaying the learning rate, while also reducing the number of parameter updates, which corresponds to faster training. Experiments on different image datasets, with various optimizers and training schedules, support this result. The paper also proposed increasing the learning rate and the momentum parameter m while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which requires fewer parameter updates but gives slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. A notable finding of this paper is that hyperparameters from the literature were used directly, with no hyperparameter tuning needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- The paper includes several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results do not seem too surprising in comparison with other recent papers on the subject.

- The paper could benefit from using other models to demonstrate its claim, and from comparisons with other recent methods for increasing the batch size.

- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and little supporting theory is provided, the paper's applicability to other types of data is limited.

- Also, in the experimental setting, only a single training run from one random initialization is used. It would be better to report the best of many runs or to show confidence intervals.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.

- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-30T00:49:25Z

J385chen:

Summary of the ICLR 2018 paper: '''Don't Decay the Learning Rate, Increase the Batch Size'''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is common practice not to use a single fixed learning rate during the training of neural network models. Instead, the learning rate is adapted over the course of gradient descent. The intuition is that when we are far from the minimum, it is beneficial to take large steps towards it, since fewer steps are then needed to converge; as we approach the minimum, the step size should decrease, otherwise we may just keep oscillating around it. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during gradient descent while keeping the learning rate constant throughout. In addition, the paper argues that such an approach reduces the number of parameter updates required to reach the minimum, leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used for training deep learning models because it finds minima that generalize well (Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow. This has motivated researchers to speed up optimization by taking bigger steps, reducing the number of parameter updates needed to train a model through large batch training, which can be divided across many machines (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a).

However, increasing the batch size tends to decrease the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the number of training samples, and B is the batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and an optimal fluctuation scale <math>g</math> that maximizes test set accuracy.

In this paper, the authors' main goal is to provide evidence that, for the same number of training epochs, increasing the batch size is quantitatively equivalent to decreasing the learning rate as a way of reducing the scale of random fluctuations, but requires far fewer parameter updates. Moreover, a further reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math>, or an even greater reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This is demonstrated by several experiments on CIFAR-10 with a wide ResNet and on ImageNet with the Inception-ResNet-V2 and ResNet-50 architectures.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD compared to full-batch training is that the noise it introduces hinders optimization. According to Robbins & Monro (1951), two conditions govern how to reach the minimum of a convex function (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update):

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This condition guarantees that we can reach the minimum, no matter how far away the parameters are initialized.

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This condition, which is valid only for a fixed batch size, guarantees that the learning rate decays fast enough for us to converge to the minimum rather than bouncing around it due to noise.
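
As a quick illustration of the two sums above (a standard textbook example, not taken from the paper), consider the schedule <math> \epsilon_i = \frac{\epsilon_0}{i} </math> for some assumed initial learning rate <math> \epsilon_0 > 0 </math>:

<math> \sum_{i=1}^{\infty} \frac{\epsilon_0}{i} = \infty, \qquad \sum_{i=1}^{\infty} \frac{\epsilon_0^2}{i^2} = \frac{\pi^2 \epsilon_0^2}{6} < \infty </math>

Both requirements are met. A constant learning rate <math> \epsilon_i = \epsilon_0 </math>, by contrast, satisfies the first but violates the second, since <math> \sum_{i=1}^{\infty} \epsilon_0^2 = \infty </math>.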

These conditions indicate that the learning rate must decay during training, and the second condition applies only when the batch size is constant. To handle a changing batch size, Smith and Le (2017) proposed interpreting SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents the cost function, w represents the parameters, and <math> \eta </math> represents Gaussian random noise. Furthermore, they showed that the noise scale g controls the magnitude of random fluctuations in the training dynamics through the formula <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can approximate <math> g \approx \epsilon \frac{N}{B} </math>. This explains why the noise <math>g</math> decreases when the learning rate decreases, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.
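
The noise scale formula can be checked numerically. The following is a minimal sketch, not code from the paper: the function name <code>noise_scale</code> and the example values (learning rate 0.1, batch size 128, N = 50,000) are assumptions chosen for illustration. It compares dividing the learning rate by 5 with multiplying the batch size by 5.

```python
def noise_scale(lr, n_train, batch_size):
    """SGD noise scale g = lr * (N / B - 1), the formula quoted in the text."""
    return lr * (n_train / batch_size - 1.0)


N = 50_000  # illustrative training set size (assumption)

baseline = noise_scale(0.1, N, 128)
decayed_lr = noise_scale(0.1 / 5, N, 128)     # learning rate divided by 5
bigger_batch = noise_scale(0.1, N, 128 * 5)   # batch size multiplied by 5

print(f"baseline:     g = {baseline:.3f}")
print(f"decayed lr:   g = {decayed_lr:.3f}")
print(f"bigger batch: g = {bigger_batch:.3f}")
```

Both changes lower g by nearly the same amount. The small remaining difference comes from the −1 term, which becomes negligible when <math> B \ll N </math>.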

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.

Smith and Le (2017) found that there is an optimal batch size, which corresponds to an optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math>, and concluded that the batch size that maximizes test set accuracy satisfies <math> B_{opt} \propto \epsilon N </math>. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well.

Simulated annealing is a well-known technique in non-convex optimization. Starting with noise in the training process helps us explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, researchers increasingly favor sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature (which corresponds to the noise scale here) helps the system converge to the global minimum, which is sharp. Decaying the temperature in discrete steps, by contrast, can leave the system stuck in a local minimum, which has higher cost and lower curvature. The authors suggest that the same intuition holds in deep learning.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation as m goes to 0. The authors of this paper found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but that the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze the momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

The accumulation variable A, which is initially set to 0, increases exponentially until it reaches its steady-state value over a timescale of <math> \frac{B}{N(1-m)} </math> training epochs. During this time <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we face four challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- The accumulation needs more time, <math> \frac{B}{N(1-m)} </math> epochs, to forget old gradients.

3- Once this time spans several epochs, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stages of training, a large batch size leads to instabilities.
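
The warm-up of the accumulation variable can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the authors' code: the gradient is held constant at 1, and the helper names <code>accumulation_trace</code> and <code>updates_to_reach</code> are hypothetical. It iterates the update <math> \Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} </math> and counts the updates needed to reach about 63% of the steady-state value, which should be close to <math> \frac{1}{1-m} </math> updates, or <math> \frac{B}{N(1-m)} </math> epochs, since one update covers <math> \frac{B}{N} </math> of an epoch.

```python
def accumulation_trace(m, grad=1.0, steps=500):
    """Iterate Delta A = -(1 - m) * A + grad, with A initialised to 0."""
    a = 0.0
    trace = []
    for _ in range(steps):
        a += -(1.0 - m) * a + grad
        trace.append(a)
    return trace


def updates_to_reach(trace, target):
    """Number of updates until the accumulation first reaches `target`."""
    for step, a in enumerate(trace, start=1):
        if a >= target:
            return step
    return None


results = {}
for m in (0.9, 0.98):
    steady_state = 1.0 / (1.0 - m)  # fixed point of the update when grad = 1
    trace = accumulation_trace(m)
    # 63% (about 1 - 1/e) of the steady state marks roughly one timescale.
    results[m] = updates_to_reach(trace, 0.63 * steady_state)
    print(f"m = {m}: {results[m]} updates, versus 1 / (1 - m) = {steady_state:.0f}")
```

Raising the momentum coefficient from 0.9 to 0.98 lengthens the warm-up roughly fivefold, which is consistent with the first two challenges listed above.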

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training schedules (shown in the figure below):'''

- Decaying learning rate: the learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant.

- Increasing batch size: the learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: At the first step, the learning rate is held constant and the batch size is increased by a factor of 5. At each subsequent step, the learning rate decays by a factor of 5 and the batch size is held constant. This schedule is useful when hardware limits the maximum batch size.
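
The three schedules can be written down explicitly. The sketch below is illustrative only: the starting values (learning rate 0.1, batch size 128), the choice of three steps, and the function name <code>three_schedules</code> are assumptions, not settings stated for this experiment. It shows that all three schedules keep the ratio of learning rate to batch size, and therefore the approximate noise scale <math> g \approx \epsilon \frac{N}{B} </math>, identical in every phase.

```python
import math


def three_schedules(lr0=0.1, batch0=128, factor=5, n_steps=3):
    """Return (learning_rate, batch_size) per phase for the three schedules."""
    decaying_lr, increasing_batch, hybrid = [], [], []
    for k in range(n_steps + 1):  # phase 0 is the phase before the first step
        decaying_lr.append((lr0 / factor**k, batch0))
        increasing_batch.append((lr0, batch0 * factor**k))
        # Hybrid: only the first step grows the batch; later steps decay the rate.
        grow = min(k, 1)
        hybrid.append((lr0 / factor ** (k - grow), batch0 * factor**grow))
    return decaying_lr, increasing_batch, hybrid


decaying_lr, increasing_batch, hybrid = three_schedules()
for phase in range(4):
    ratios = [s[phase][0] / s[phase][1] for s in (decaying_lr, increasing_batch, hybrid)]
    # lr / B is proportional to the noise scale g ~ lr * N / B.
    same = all(math.isclose(r, ratios[0], rel_tol=1e-9) for r in ratios)
    print(phase, decaying_lr[phase], increasing_batch[phase], hybrid[phase], same)
```

The schedules differ only in how the noise scale is lowered, which is why the number of parameter updates per epoch differs between them while the noise scale does not.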

[[File:Paper_40_Fig_1.png | 800px|center]]

As shown in the figure below, the three training-set learning curves in the left panel (2a) are nearly identical, while panel 2b shows that increasing the batch size greatly reduces the number of parameter updates.
This suggests that it is the noise scale that needs to be decayed, not the learning rate itself.
[[File:Paper_40_Fig_2.png | 800px|center]]

To confirm that the same holds on the test set, figure 3 shows that the three learning curves are also nearly identical for SGD with momentum and for Nesterov momentum.
[[File:Paper_40_Fig_3.png | 800px|center]]

To check other optimizers as well, the figure below repeats the experiment of figure 3 (the three test-set learning curves) for vanilla SGD and Adam.
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

Four training schedules are compared, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: an initial learning rate of 0.1 that decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: a learning rate of 0.1, a momentum coefficient of 0.9, and an initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: an initial learning rate of 0.5 and an initial batch size of 640 that increases during training.

Increased momentum coefficient: an initial learning rate of 0.5, an initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.
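
A quick calculation shows why these particular batch sizes were paired with these learning rates and momentum coefficients. The sketch below evaluates the approximate noise scale <math> g \approx \frac{\epsilon N}{B(1-m)} </math> at the start of each schedule. The function name is hypothetical, and the momentum coefficient of 0.9 for the “increased initial learning rate” schedule is an assumption, since the text above does not state it.

```python
import math

N = 50_000  # CIFAR-10 training set size


def noise_scale_momentum(lr, momentum, batch_size, n_train=N):
    """Approximate noise scale g ~ lr * N / (B * (1 - m)), valid for B << N."""
    return lr * n_train / (batch_size * (1.0 - momentum))


# (initial learning rate, momentum coefficient, initial batch size)
initial_settings = {
    "original": (0.1, 0.9, 128),
    "increasing batch size": (0.1, 0.9, 128),
    "increased initial learning rate": (0.5, 0.9, 640),  # m = 0.9 is assumed
    "increased momentum coefficient": (0.5, 0.98, 3200),
}

scales = {name: noise_scale_momentum(*cfg) for name, cfg in initial_settings.items()}
for name, g in scales.items():
    print(f"{name}: g ~ {g:.3f}")
```

All four schedules start from the same noise scale, so they differ only in how many parameter updates each epoch requires.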

The results of the four training schedules are summarized in the table and figure below:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size accordingly further reduces the number of parameter updates.

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where the batch size is fixed and the learning rate is decayed.

“Increasing batch size”, where the batch size is first increased to 81920 and the learning rate is then decayed at the remaining two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size reduced the number of parameter updates from roughly 14,000 to roughly 6,000.
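
These update counts can be reproduced approximately from the configuration above. The following is a back-of-the-envelope sketch, not the authors' code: it assumes the batch size jumps from 8192 to 81920 at epoch 30 and stays there for the remaining 60 epochs, and the function name <code>parameter_updates</code> is hypothetical.

```python
def parameter_updates(n_train, phases):
    """Total updates for a list of (epochs, batch_size) phases."""
    return sum(epochs * n_train / batch_size for epochs, batch_size in phases)


N = 1_280_000  # ImageNet training images, as stated above

# Fixed batch size for all 90 epochs.
fixed_batch = parameter_updates(N, [(90, 8192)])

# Batch size 8192 for 30 epochs, then 81920 for the remaining 60 epochs.
increasing_batch = parameter_updates(N, [(30, 8192), (60, 81920)])

print(f"decaying learning rate: {fixed_batch:.0f} updates")
print(f"increasing batch size:  {increasing_batch:.0f} updates")
```

Most of the saving comes from the last 60 epochs, which need one tenth as many updates per epoch once the batch size is ten times larger.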

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The table below shows the number of parameter updates and the accuracy for different sets of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate, and measured both the time to complete 90 epochs and the resulting accuracy. The experiments are summarized below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to include decaying learning rates.

- Mandt et al. (2017) analyzed how SGD performs in Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once training has started.

- Goyal et al. (2017) first discovered the proportional relationship between batch size and learning rate and used it to train ResNet-50 on ImageNet in one hour.

- You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, and used it to train ImageNet in 14 minutes with 74.9% accuracy.

- Finally, asynchronous SGD (Recht et al., 2011; Dean et al., 2012) is another strategy that makes it possible to use multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing the batch size during training has the same benefits as decaying the learning rate, while also reducing the number of parameter updates, which corresponds to faster training. Experiments on different image datasets, with various optimizers and training schedules, support this result. The paper also proposed increasing the learning rate and the momentum parameter m while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which requires fewer parameter updates but gives slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. A notable finding of this paper is that hyperparameters from the literature were used directly, with no hyperparameter tuning needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- The paper includes several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results do not seem too surprising in comparison with other recent papers on the subject.

- The paper could benefit from using other models to demonstrate its claim, and from comparisons with other recent methods for increasing the batch size.

- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the core idea. Since the experiments are primarily on image datasets and little supporting theory is provided, the paper's applicability to other types of data is limited.

- Also, in the experimental setting, only a single training run from one random initialization is used. It would be better to report the best of many runs or to show confidence intervals.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.

- Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping - but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS

2018-11-30T00:33:30Z

J385chen:

=Introduction=

It has been commonly believed that one major advantage of neural networks is their capability of modelling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.

With the growth in available computational power, neural networks have been able to solve many complex tasks in a wide variety of fields. This is mainly due to their ability to model complex, non-linear interactions. Neural networks have traditionally been treated as “black box” models, which has prevented their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. In several areas, such as computational social science, interpretability is of utmost importance. Since it is not understood how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models such as linear regression and decision trees, which are much more interpretable. This paper presents one way of bringing interpretability to a neural network.

Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].

In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix.

Note that this paper considers only one specific type of neural network, the feed-forward neural network. The authors suggest that the methodology discussed here could be used to build interpretation methods for other types of networks as well.

=Related Work=

1. Interaction Detection approaches:
* Conduct individual tests for all combinations of features, as in ANOVA and Additive Groves.
* Define all interaction forms of interest in advance, then find the important ones.
- The paper's goal is to detect interactions without restricting their functional form. The method can detect higher-order interactions while avoiding a high false positive or false discovery rate.

2. Interpretability: A lot of work has also been done in this area, and it can be divided into the following broad categories:
* Feature Importance through Decomposition: Methods like Input Gradient (Sundararajan et al., 2017) learn the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al. (2017), Murdoch (2017) and Murdoch (2018) study the interpretability of LSTMs by looking at phrase-level and word-level importance scores. Bach et al. (2015) and Shrikumar et al. (2016) (DeepLift) study pixel importance in CNNs.
* Studying Visualizations in Models: Karpathy et al. (2015) worked with character-generating LSTMs and studied the activation and firing of certain hidden units for meaningful attributes. Yosinski et al. (2015) study feature map visualizations.
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models, which use attention modules (in different architectures) to help the neural network decide which parts of the input to look at more closely or give more importance to. Looking at the results of these types of models gives an indirect sense of interpretability.

The approach in this paper is to extract non-additive interactions between variables from the neural network weights.

=Notations=
Before we dive into the methodology, we define some notation. Most of it is standard.

1. Vector: Vectors are denoted by bold lowercase letters, '''v, w'''

2. Matrix: Matrices are denoted by bold uppercase letters, '''V, W'''

3. Integer Set: For some integer p <math>\in</math> Z, we define [p] := {1,2,3,...,p}

=Interaction=
First of all, in order to explain the model, we need to be able to explain the interactions and their effects on the output. Therefore, we define an 'interaction' between variables as below.

[[File:def_interaction.PNG|900px|center]]

From the definition above, a function such as <math>x_1x_2 + \sin(x_3 + x_4 + x_5)</math> has the interactions <math>\{x_1, x_2\}</math> and <math>\{x_3, x_4, x_5\}</math>. The latter is a 3-way interaction.

Note that from the definition above, we can naturally deduce that a d-way interaction can exist only if all of its (d-1)-way sub-interactions exist. For example, the 3-way interaction above implies the 2-way interactions <math>\{3,4\}, \{4,5\}</math> and <math>\{3,5\}</math>.
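
As a small illustration of this hierarchy, the following Python sketch lists the (d-1)-way subsets that must all exist before a given d-way interaction can exist. The function name <code>sub_interactions</code> is hypothetical and not from the paper; features are represented by their integer indices.

```python
from itertools import combinations


def sub_interactions(interaction):
    """Return every (d-1)-way subset of a d-way interaction as sorted tuples."""
    features = sorted(interaction)
    return list(combinations(features, len(features) - 1))


# The 3-way interaction {3, 4, 5} from sin(x_3 + x_4 + x_5).
print(sub_interactions({3, 4, 5}))  # [(3, 4), (3, 5), (4, 5)]
```

If any one of the listed pairs were absent, the 3-way interaction could not exist under the definition above.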

One thing to keep in mind is that, for models like neural networks, most interactions happen within the hidden layers. This means that we need a proper way of measuring interaction strength.

The key observation is that, for any interaction, there is some hidden unit in some hidden layer that has all of the interacting features as ancestors. In graph-theoretical language, the network can be viewed as an associated directed graph, and for any interaction <math>\Gamma \subseteq [p]</math> there exists at least one vertex that has all of the features of <math>\Gamma</math> as ancestors. The statement can be made rigorous as follows:

[[File:prop2.PNG|900px|center]]

The statement above guarantees that interaction strengths can be measured at any hidden layer. For example, if we want to study interactions at some specific hidden layer, we now know that there exist corresponding vertices between that hidden layer and the output layer. Therefore, all we need to do is find an appropriate measure that summarizes the information between those two layers.

Before doing so, let's think about a network with a single hidden layer. Any one hidden unit <math>i</math> can create as many as <math>2^{\|W_{i,:}\|_0}</math> interactions, where <math>\|W_{i,:}\|_0</math> is the number of nonzero input weights of that unit. This means that the search space might be too large for multi-layered networks, so we need a decent way of approximating it. The authors obtain fast interaction detection by limiting the search to interactions created at the first hidden layer.
[[File:network1.PNG|500px|center]]
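
To get a feel for the size of this search space, the sketch below counts the candidate feature subsets per hidden unit, assuming the exponent is the number of nonzero weights in that unit's row of the first-layer weight matrix. The helper names and the example weights are made up for illustration.

```python
def count_nonzero(row):
    """Number of nonzero input weights of one hidden unit (the L0 'norm')."""
    return sum(1 for w in row if w != 0)


def max_interactions(row):
    """Upper bound 2 ** ||row||_0 on the feature subsets this unit can combine."""
    return 2 ** count_nonzero(row)


# Hypothetical first-layer weights: 2 hidden units, 5 input features.
first_layer = [
    [0.8, 0.0, -0.3, 0.0, 0.5],  # 3 nonzero weights
    [0.1, 0.2, 0.3, 0.4, 0.5],   # 5 nonzero weights
]
print([max_interactions(row) for row in first_layer])  # [8, 32]
```

A dense unit over p features gives 2^p subsets, so the count grows quickly unless the weights are sparse.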

==Measuring influence in hidden layers==
As discussed above, in order to consider interactions between units in any layer, we need to think about their outgoing paths. However, for a fully-connected multi-layer neural network the search space might be too large to explore. Therefore, we summarize the outgoing paths with an upper bound on the gradient. To represent the influence of the outgoing paths at the <math>l</math>-th hidden layer, we define the cumulative impact of the weights between the output layer and layer <math>l+1</math>. The aggregated weights are defined as,

[[File:def3.PNG|900px|center]]

Note that <math>z^{(l)} \in R^{p_l}</math>, where <math>p_l</math> is the number of hidden units in the <math>l</math>-th layer.
Moreover, this quantity acts as a Lipschitz constant for the gradients. The gradient is an important measure of feature influence, especially considering that the derivative with respect to the input layer gives the direction normal to the decision boundary.
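
A minimal sketch of the aggregated weights, assuming the definition takes the form <math>z^{(l)} = |w^y|^T |W^{(L)}| \cdots |W^{(l+1)}|</math>, where absolute values are taken element-wise, <math>w^y</math> is the output weight vector and <math>L</math> is the number of hidden layers. The function name, the nested-list layout and the example weights are assumptions for illustration only.

```python
def aggregated_weights(hidden_weights, output_weights, layer):
    """Compute z^(layer) = |w_y|^T |W^(L)| ... |W^(layer+1)|.

    hidden_weights[k] holds W^(k+1) as nested lists with shape (p_{k+1}, p_k),
    output_weights holds w_y with length p_L, and layer is 1-based.
    """
    z = [abs(w) for w in output_weights]
    # Walk backwards from the last hidden layer down to layer + 1.
    for W in reversed(hidden_weights[layer:]):
        n_rows, n_cols = len(W), len(W[0])
        # Row vector z (length n_rows) times |W| gives a vector of length n_cols.
        z = [sum(z[r] * abs(W[r][c]) for r in range(n_rows)) for c in range(n_cols)]
    return z


# Hypothetical network: 3 inputs, two hidden layers with 2 units each.
W1 = [[1.0, 0.5, 0.0], [0.25, 1.0, 0.5]]
W2 = [[1.0, -2.0], [0.5, 0.0]]
w_y = [2.0, -1.0]

print(aggregated_weights([W1, W2], w_y, layer=1))  # [2.5, 4.0]
print(aggregated_weights([W1, W2], w_y, layer=2))  # [2.0, 1.0]
```

Each entry of the result belongs to one hidden unit of the chosen layer, so the first call returns one value per first-layer unit.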

==Quantifying influence==
For hidden unit <math>i</math> in the first hidden layer, which is the layer closest to the input layer, we define the strength of an interaction as,

[[File:measure1.PNG|900px|center]]

The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer.

For the function <math>\mu</math>, any positive real-valued function such as max, min or average is a candidate. The effects of these candidates will be tested later.
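
The following sketch shows how the choice of <math>\mu</math> changes the strength assigned to one candidate interaction at one hidden unit. It assumes the strength is the unit's aggregated weight multiplied by <math>\mu</math> applied to the absolute first-layer weights of the features in the interaction. The function names, the 0-based feature indices and the numbers are illustrative.

```python
def average(values):
    """Arithmetic mean, one possible choice for mu."""
    return sum(values) / len(values)


def interaction_strength(z_i, first_layer_row, interaction, mu=min):
    """Strength of `interaction` at one first-layer hidden unit.

    z_i is the unit's aggregated weight, first_layer_row its input weights,
    and interaction a collection of 0-based feature indices.
    """
    magnitudes = [abs(first_layer_row[j]) for j in interaction]
    return z_i * mu(magnitudes)


row = [1.0, -0.25, 0.5]  # hypothetical input weights of one hidden unit
z_i = 2.0                # hypothetical aggregated weight of that unit

for name, mu in [("min", min), ("max", max), ("average", average)]:
    print(name, interaction_strength(z_i, row, (0, 2), mu))
# min 1.0
# max 2.0
# average 1.5
```

With min, a single weak weight pulls the whole interaction down, whereas max only needs one strong weight; min is used as the default here purely for illustration.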

Based on the specifications above, the authors propose the following algorithm for finding influential interactions between input features:

[[File:algorithm1.PNG|850px|center]]

It was pointed out that restricting the search to the first hidden layer might miss some important feature interactions. However, the authors state that it is not straightforward to incorporate hidden units at intermediate layers in a way that improves interaction detection performance.
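
The greedy ranking can be sketched in Python as follows. This is a reading of the algorithm under stated assumptions, not the authors' code: for each first-layer hidden unit, the features are sorted by weight magnitude, the top 2, 3, ..., p features form the candidate interactions, and the strength of each candidate is accumulated across hidden units. The function name, the 0-based feature indices and the example weights are hypothetical.

```python
def rank_interactions(first_layer, z1, mu=min):
    """Rank candidate interactions by accumulated strength, strongest first.

    first_layer[i] holds the input weights of hidden unit i and z1[i] its
    aggregated weight. Returns a list of (feature_tuple, strength) pairs.
    """
    strengths = {}
    for row, z_i in zip(first_layer, z1):
        # Feature indices ordered by decreasing weight magnitude for this unit.
        order = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        for size in range(2, len(row) + 1):
            candidate = tuple(sorted(order[:size]))
            score = z_i * mu([abs(row[j]) for j in candidate])
            strengths[candidate] = strengths.get(candidate, 0.0) + score
    return sorted(strengths.items(), key=lambda item: item[1], reverse=True)


# Hypothetical first layer with 2 hidden units and 3 input features.
first_layer = [[1.0, 0.5, 0.0], [0.25, 1.0, 0.5]]
z1 = [2.0, 8.0]

for features, strength in rank_interactions(first_layer, z1):
    print(features, strength)
# (1, 2) 4.0
# (0, 1, 2) 2.0
# (0, 1) 1.0
```

Only p-1 candidates are scored per hidden unit, instead of every subset of its inputs, which is what keeps the search cheap.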

=Cut-off Model=
Using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to decide which of the ranked interactions are true interactions, the authors build a cut-off model, which is a generalized additive model (GAM), as below,

<center><math>
c_K(\mathbf{x}) = \sum_{i=1}^{p}g_i(x_i) + \sum_{i=1}^{K}g_i^\prime(\mathbf{x}_{\chi_i})
</math></center>

In the above model, <math>\chi_i</math> denotes the <math>i</math>-th ranked interaction, and each <math>g_i</math> and <math>g_i^\prime</math> is a feed-forward neural network. Interactions are added until the performance reaches a plateau.
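
The structure of the cut-off model and the plateau rule can be sketched as below. Plain Python callables stand in for the small feed-forward networks, and the plateau test (stop once adding one more interaction improves the validation error by less than a tolerance) is an assumed stopping rule, since the summary only says that interactions are added until performance plateaus. All names and numbers are hypothetical.

```python
def cutoff_model(x, univariate, interaction_nets, ranked, K):
    """Evaluate c_K(x): one univariate term per feature plus the top-K interaction terms."""
    main_effects = sum(g(x[i]) for i, g in enumerate(univariate))
    interactions = sum(
        interaction_nets[k]([x[j] for j in ranked[k]]) for k in range(K)
    )
    return main_effects + interactions


def choose_cutoff(validation_errors, tolerance):
    """validation_errors[K] is the error of c_K for K = 0, 1, 2, ...

    Return the smallest K at which adding one more interaction improves the
    error by less than `tolerance` (or the last K if no plateau is reached).
    """
    for k in range(len(validation_errors) - 1):
        if validation_errors[k] - validation_errors[k + 1] < tolerance:
            return k
    return len(validation_errors) - 1


# Stand-ins for trained networks (assumptions for illustration only).
univariate = [lambda v: v, lambda v: 2 * v, lambda v: 0.0]
interaction_nets = [lambda vs: vs[0] * vs[1]]
ranked = [(0, 1)]  # 0-based feature indices of the top-ranked interaction

x = [1.0, 2.0, 3.0]
print(cutoff_model(x, univariate, interaction_nets, ranked, K=0))  # 5.0
print(cutoff_model(x, univariate, interaction_nets, ranked, K=1))  # 7.0

errors = [0.50, 0.30, 0.12, 0.119, 0.121]  # made-up validation errors
print(choose_cutoff(errors, tolerance=0.01))  # 2
```

In this made-up example the error stops improving after the second interaction, so the top two ranked interactions would be kept.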

=Experiment=
For the experiments, the authors compare three neural network models with traditional statistical interaction detection algorithms. The first neural network model is an MLP; the second is MLP-M, which is an MLP with an additional univariate network at the output; the last is the cut-off model defined above, denoted MLP-cutoff. The MLP-M model is shown below.

[[File:output11.PNG|300px|center]]

The interaction detection framework is studied on both simulated and real-world data. For the simulated experiments, the method is tested on 10 synthetic functions, as shown in Table I.

[[File:synthetic.PNG|900px|center]]

The authors use four real-world datasets, of which two are regression datasets and the other two are binary classification datasets. The datasets are a mixture of common prediction tasks in the cal housing and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of very-high order interaction detection in the letter dataset.

The authors also report comparisons between the models. The neural network based models perform better on average. Compared to traditional methods such as ANOVA, the MLP and MLP-M methods show a 20% increase in performance.

[[File:performance_mlpm.PNG|900px|center]]

[[File:performance2_mlpm.PNG|900px|center]]

The above results show that MLP-M almost perfectly captures the most influential pairwise interactions.

=Limitations=
Even though the MLP methods showed superior performance in the synthetic experiments above, the method still has some limitations. For example, for a function like <math>x_1x_2 + x_2x_3 + x_1x_3</math>, the neural network fails to distinguish the interlinked pairwise interactions from a single higher-order interaction. Moreover, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation is a problem for most interaction detection algorithms.

Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the dataset is too small or too noisy, which often happens in practical settings.

=Conclusion=
Here we presented a method of detecting interactions using an MLP. Compared to other state-of-the-art methods like Additive Groves (AG), its performance is competitive while requiring far less computation. Therefore, the method should be extremely useful for practitioners with comparatively little computational power. Moreover, the NID algorithm successfully reduces the size of the computation. Most importantly, the algorithm lets users of neural networks bring interpretability to their models, which should greatly improve usability for practitioners outside of machine learning and deep learning.

For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks. It was also pointed out that the method depends heavily on L1-regularized neural network training, and that a group lasso penalty may work better.

=Critique=
1. The authors should run large-scale experiments, rather than only experiments on some synthetic datasets with small feature dimensionality, to make their claims stronger.

2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.

=Reference=

[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013.

[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.

[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016.

[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018.

[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.

[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006.

[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.

[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.

[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.

[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.

[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS

2018-11-30T00:24:07Z

J385chen:

=Introduction=

It has been commonly believed that one major advantage of neural networks is their capability of modelling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.

With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. Within several areas, like eg: computation social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.

Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].

In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix.

Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.

=Related Work=

1. Interaction Detection approaches:
* Conduct individual tests for all features' combination such as ANOVA and Additive Groves.
* Define all interaction forms of interest, then later finds the important ones.
- The paper's goal is to detect interactions without compromising the functional forms. Our method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.

2. Interpretability: A lot of work has also been done in this particular area and it can be divided it the following broad categories:
* Feature Importance through Decomposition: Methods like Input Gradient(Sundararajan et al., 2017) learns the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al(2017), Murdoch(2017) and Murdoch(2018) study interpretability of LSTMs by looking at phrase and word level importance scores. Bach et al. 2015 and Shrikumar et al. 2016 (DeepLift) study pixel importance in CNNs.
* Studying Visualizations in Models - Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. (Yosinski et al., 2015 studies feature map visualizations.
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. Looking at the results of these type of model an indirect sense of interpretability can be gauged.

The approach in this paper is to extract non-additive interactions between variables from the neural network weights.

=Notations=
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.

1. Vector: Vectors are defined with bold-lowercases, '''v, w'''

2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''

3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}

=Interaction=
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below.

[[File:def_interaction.PNG|900px|center]]

From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.

Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.

One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.

The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:

[[File:prop2.PNG|900px|center]]

Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.

Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space. Moreover, the authors realized a fast interaction detection by limiting the search complexity of the task by only quantifying interactions created at the first hidden layer.
[[File:network1.PNG|500px|center]]

==Measuring influence in hidden layers==
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as,

[[File:def3.PNG|900px|center]]

Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.

==Quantifying influence==
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as,

[[File:measure1.PNG|900px|center]]

The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer.

For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.

Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:

It was pointed out that restricting to the first hidden layer might miss some important feature interactions, however, the author state that it is not straightforward how to incorporate the idea of hidden units at intermediate layers to get better interaction detection performance.

[[File:algorithm1.PNG|850px|center]]

=Cut-off Model=
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut-off model which is a generalized additive model (GAM) as below,

<center><math>
c_K('''x''') = \sum_{i=1}^{p}g_i(x_i) + \sum_{i=1}^{K}{g_i}^\prime(x_\chi)
</math></center>

From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.

=Experiment=
For the experiment, the authors have compared three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.

[[File:output11.PNG|300px|center]]

For the experiment, We study our interaction detection framework on both simulated and real-world experiments. For simulated experiments, we are going to test on 10 synthetic functions as shown in table I.

[[File:synthetic.PNG|900px|center]]

Four real-world datasets are used, of which two are regression datasets and the other two are binary classification datasets. The datasets are a mixture of common prediction tasks in the cal housing and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of very-high order interaction detection in the letter dataset.

The authors also report comparisons between the models. On average, the neural-network-based models perform better: compared with traditional methods like ANOVA, the MLP and MLP-M methods show a 20% increase in performance.

[[File:performance_mlpm.PNG|900px|center]]

[[File:performance2_mlpm.PNG|900px|center]]

The above result shows that MLP-M almost perfectly captures the most influential pairwise interactions.

=Limitations=
Even though the MLP methods showed superior performance in the synthetic experiments above, the method still has some limitations. For example, for a function like <math>x_1x_2 + x_2x_3 + x_1x_3</math>, the neural network fails to distinguish interlinked pairwise interactions from a single higher-order interaction. Moreover, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are present in most interaction detection algorithms.

Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under- or overfitting), the resulting interactions will be flawed. This can occur if the dataset is too small or too noisy, which often happens in practical settings.

=Conclusion=
This summary presented a method for detecting interactions using an MLP. Compared with other state-of-the-art methods like Additive Groves (AG), its performance is competitive, yet the computational power required is far less. The method should therefore be useful for practitioners with comparatively little computational power. Moreover, the NID algorithm successfully reduces the size of the computation. The most important aspect of this algorithm is that users of neural networks can now bring interpretability to their models, which makes neural networks considerably more usable for practitioners outside of machine learning and deep learning.

For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks.

=Critique=
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.

2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.

=Reference=

[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013.

[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.

[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016.

[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018.

[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.

[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006.

[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.

[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.

[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.

[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.

[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS

2018-11-30T00:14:16Z

J385chen:

=Introduction=

It has been commonly believed that one major advantage of neural networks is their capability of modelling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.

With the growth in available computational power, neural networks have been able to solve many complex tasks in a wide variety of fields, mainly due to their ability to model complex and non-linear interactions. However, neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. In several areas, such as computational social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models, such as linear regression and decision trees, which are much more interpretable. This paper presents one way of adding interpretability to a neural network.

Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].

In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix.

Note that this paper considers only one specific type of neural network, the feed-forward neural network. The authors suggest that the methodology discussed here can serve as a basis for interpreting other types of networks as well.

=Related Work=

1. Interaction detection approaches:
* Conduct individual tests for all combinations of features, as in ANOVA and Additive Groves.
* Define all interaction forms of interest in advance, then find the important ones.

The paper's goal is to detect interactions without pre-specifying their functional forms. The proposed method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive (false discovery) rate.

2. Interpretability: A lot of work has also been done in this area, and it can be divided into the following broad categories:
* Feature importance through decomposition: Methods like Input Gradient (Sundararajan et al., 2017) learn the importance of features through a gradient-based approach similar to backpropagation. Li et al. (2017), Murdoch (2017) and Murdoch (2018) study the interpretability of LSTMs by looking at phrase-level and word-level importance scores. Bach et al. (2015) and Shrikumar et al. (2016) (DeepLift) study pixel importance in CNNs.
* Studying visualizations in models: Karpathy et al. (2015) worked with character-generating LSTMs and studied the activation and firing of certain hidden units for meaningful attributes. Yosinski et al. (2015) study feature map visualizations.
* Attention-based models: Bahdanau et al. (2014). These are a different class of models that use attention modules (with different architectures) to help the neural network decide which parts of the input it should look at more closely or give more importance to. The results of this type of model give an indirect sense of interpretability.

The approach in this paper is to extract non-additive interactions between variables from the neural network weights.

=Notations=
Before diving into the methodology, we define a few notations. Most of them are standard.

1. Vector: Vectors are denoted by bold lowercase letters, '''v, w'''

2. Matrix: Matrices are denoted by bold uppercase letters, '''V, W'''

3. Integer set: For some integer <math>p \in \mathbb{Z}</math>, we define <math>[p] := \{1,2,3,...,p\}</math>

=Interaction=
To explain the model, we first need to be able to explain the interactions and their effects on the output. We therefore define an 'interaction' between variables as below.

[[File:def_interaction.PNG|900px|center]]

From the definition above, a function like <math>x_1x_2 + \sin(x_3 + x_4 + x_5)</math> has the interactions <math>\{x_1, x_2\}</math> and <math>\{x_3, x_4, x_5\}</math>. The latter is called a 3-way interaction.

Note that from the definition above, we can deduce that a d-way interaction can exist only if all of its (d-1)-way interactions exist. For example, the 3-way interaction above implies the 2-way interactions <math>\{3,4\}</math>, <math>\{4,5\}</math> and <math>\{3,5\}</math>.
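
The hierarchy between a d-way interaction and its (d-1)-way sub-interactions can be illustrated with a short sketch. The helper name `lower_order_interactions` is hypothetical and introduced only for illustration; feature indices follow the example above.

```python
from itertools import combinations

def lower_order_interactions(interaction):
    """Return every (d-1)-way subset of a d-way interaction (hypothetical helper)."""
    features = sorted(interaction)
    return [frozenset(c) for c in combinations(features, len(features) - 1)]

# The 3-way interaction {3, 4, 5} from the example above.
three_way = {3, 4, 5}
pairs = lower_order_interactions(three_way)
print(sorted(sorted(p) for p in pairs))  # [[3, 4], [3, 5], [4, 5]]
```

Each returned pair must itself be an interaction for the 3-way interaction to exist.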

One thing to keep in mind is that in models like neural networks, most interactions happen within hidden layers. This means that we need a proper way of measuring interaction strength.

The key observation is that for any kind of interaction, there is some hidden unit in some hidden layer that has all of the interacting features as ancestors. In graph-theoretical language, the interaction map can be viewed as an associated directed graph, and for any interaction <math>\Gamma \subseteq [p]</math>, there exists at least one vertex that has all of the features of <math>\Gamma</math> as ancestors. The statement can be made rigorous as follows:

[[File:prop2.PNG|900px|center]]

The above statement guarantees that we can measure interaction strengths at any hidden layer. For example, if we want to study interactions at some specific hidden layer, we now know that there exist corresponding vertices between that hidden layer and the output layer. Therefore, all we need to do is find an appropriate measure that summarizes the information between those two layers.

Before doing so, consider a neural network with a single hidden layer. For any one hidden unit, there are possibly <math>2^{\|\mathbf{W}_{i,:}\|_0}</math> interactions, where <math>\|\cdot\|_0</math> counts the nonzero weights. This means that the search space might be too large for multi-layered networks, so we need a reasonable way of approximating it. The authors achieve fast interaction detection by limiting the search complexity of the task: they quantify only the interactions created at the first hidden layer.
[[File:network1.PNG|500px|center]]

==Measuring influence in hidden layers==
As discussed above, to consider interactions between units in any layer, we need to think about their outgoing paths. However, for a fully connected multi-layer neural network, the search space might be too large to compare all paths. Therefore, we use a gradient upper bound to summarize the information about outgoing paths. To represent the influence of the outgoing paths at the <math>l</math>-th hidden layer, we define the cumulative impact of the weights between the output layer and layer <math>l+1</math>. The aggregated weights are defined as,

[[File:def3.PNG|900px|center]]

Note that <math>z^{(l)} \in \mathbb{R}^{p_l}</math>, where <math>p_l</math> is the number of hidden units in layer <math>l</math>.
Moreover, <math>z^{(l)}</math> acts as an upper bound on the gradient magnitude at layer <math>l</math>, under a Lipschitz assumption on the activation functions. The gradient is an important measure of feature influence, especially considering that the derivative with respect to the input layer gives the direction normal to the decision boundary.
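
Since the definition above is only given as an image, the following NumPy sketch assumes the aggregated weights take the form <math>z^{(l)} = |w^y|^\top |W^{(L)}| \cdots |W^{(l+1)}|</math>, that is, a product of element-wise absolute weight matrices from the output back to layer <math>l+1</math>. The function name, argument layout, and example weights are illustrative assumptions, not the authors' code.

```python
import numpy as np

def aggregated_weights(hidden_weights, output_weights):
    """Aggregated weights z^(1), ..., z^(L) (assumed form, see lead-in).

    hidden_weights: list [W1, ..., WL], where W_l has shape (p_l, p_{l-1}).
    output_weights: vector w_y of shape (p_L,).
    Returns a list whose entry l-1 is z^(l), a vector of length p_l.
    """
    z = np.abs(np.asarray(output_weights, dtype=float))  # z^(L) = |w_y|
    per_layer = [z]
    # Walk backwards through W^(L), ..., W^(2); W^(1) is not needed here.
    for W in reversed(hidden_weights[1:]):
        z = z @ np.abs(W)
        per_layer.append(z)
    return per_layer[::-1]

# Toy network: 4 inputs, hidden layers of size 3 and 2, one output.
W1 = np.array([[0.5, -2.0, 1.0, 0.1],
               [1.5, 0.2, -0.3, 0.0],
               [0.0, 1.0, 1.0, -1.0]])
W2 = np.array([[1.0, -2.0, 0.0],
               [0.5, 0.0, -1.0]])
w_y = np.array([2.0, -1.0])

z_layers = aggregated_weights([W1, W2], w_y)
print(z_layers[0])  # z^(1): one aggregated weight per first-layer hidden unit
```

Here `z_layers[0]` has one entry per first-layer hidden unit, which is the quantity used when the search is restricted to the first hidden layer.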

==Quantifying influence==
For the <math>i</math>-th hidden unit in the first hidden layer, which is the layer closest to the input layer, we define the influence strength of an interaction as,

[[File:measure1.PNG|900px|center]]

The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer.

The function <math>\mu</math> can be any positive real-valued aggregation function, such as max, min, or average. The effects of these candidates are tested later.
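
As a rough illustration of the description above, the strength of an interaction at one first-layer hidden unit can be computed as the aggregated weight of that unit times <math>\mu</math> applied to the absolute incoming weights of the interacting features. The exact formula is in the image above; the function name and the toy numbers below are assumptions made for this sketch.

```python
import numpy as np

def interaction_strength(W1, z1, unit, interaction, mu=np.min):
    """Strength of `interaction` at first-layer hidden unit `unit` (sketch).

    W1: first-layer weight matrix, shape (p_1, p).
    z1: aggregated weights of the first hidden layer, shape (p_1,).
    interaction: iterable of input feature indices.
    mu: aggregation function such as np.min, np.max, or np.mean.
    """
    weights = np.abs(W1[unit, list(interaction)])
    return float(z1[unit] * mu(weights))

W1 = np.array([[0.5, -2.0, 1.0, 0.1]])  # one hidden unit, four input features
z1 = np.array([3.0])

strength_min = interaction_strength(W1, z1, unit=0, interaction=(0, 1))
strength_max = interaction_strength(W1, z1, unit=0, interaction=(0, 1), mu=np.max)
print(strength_min, strength_max)  # 1.5 6.0
```

With the minimum, a single weak incoming weight pulls the whole interaction down, whereas the maximum lets one strong feature dominate.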

Based on the specifications above, the authors propose the following algorithm for searching for influential interactions between input-layer units:

[[File:algorithm1.PNG|850px|center]]

It was pointed out that restricting the search to the first hidden layer might miss some important feature interactions. However, the authors state that it is not straightforward to incorporate hidden units at intermediate layers to improve interaction detection performance.
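
The greedy search can be sketched as follows. The algorithm itself is only shown as an image, so this sketch is one plausible reading and not the authors' implementation: it assumes <math>\mu = \min</math>, considers for each first-layer hidden unit only the top-<math>j</math> features by absolute weight as candidates, and sums candidate strengths across hidden units. All names are hypothetical.

```python
import numpy as np
from collections import defaultdict

def greedy_ranking(W1, z1):
    """Rank interaction candidates by accumulated strength (assumes mu = min)."""
    strengths = defaultdict(float)
    abs_W1 = np.abs(W1)
    num_units, num_features = abs_W1.shape
    for i in range(num_units):
        # Input features sorted by decreasing weight magnitude for unit i.
        order = np.argsort(-abs_W1[i])
        for j in range(2, num_features + 1):
            candidate = tuple(sorted(int(k) for k in order[:j]))
            # With mu = min, the weakest weight in the top-j set is the j-th largest.
            strengths[candidate] += z1[i] * abs_W1[i, order[j - 1]]
    return sorted(strengths.items(), key=lambda kv: kv[1], reverse=True)

W1 = np.array([[3.0, 2.0, 0.1],
               [0.1, 2.0, 3.0]])
z1 = np.array([1.0, 1.0])

ranking = greedy_ranking(W1, z1)
for candidate, strength in ranking:
    print(candidate, round(strength, 3))
```

Under these assumptions each hidden unit contributes at most <math>p - 1</math> candidates, so the search avoids enumerating all <math>2^p</math> feature subsets.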

=Cut off Model=
Using the greedy algorithm defined above, we can rank the interactions by their strength. However, to assess which of the ranked interactions are true interactions, the authors build a cut-off model, which is a generalized additive model (GAM) with interaction terms:

<center><math>
c_K(\mathbf{x}) = \sum_{i=1}^{p}g_i(x_i) + \sum_{i=1}^{K}g_i^\prime(\mathbf{x}_{\mathcal{I}_i})
</math></center>

Here <math>\mathcal{I}_i</math> denotes the <math>i</math>-th ranked interaction, and each <math>g_i</math> and <math>g_i^\prime</math> is a feed-forward neural network. Interactions are added in rank order until the performance reaches a plateau.

=Experiment=
For the experiment, the authors compare three neural network models with traditional statistical interaction detection algorithms. The first neural network model is MLP; the second is MLP-M, which is MLP with additional univariate networks at the output; the last is the cut-off model defined above, denoted MLP-cutoff. The MLP-M model is shown graphically below.

[[File:output11.PNG|300px|center]]

The interaction detection framework is studied on both simulated and real-world experiments. The simulated experiments use 10 synthetic functions, shown in Table I.

[[File:synthetic.PNG|900px|center]]

Four real-world datasets are used, of which two are regression datasets and the other two are binary classification datasets. The datasets are a mixture of common prediction tasks in the cal housing and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of very-high order interaction detection in the letter dataset.

The authors also report comparisons between the models. On average, the neural-network-based models perform better: compared with traditional methods like ANOVA, the MLP and MLP-M methods show a 20% increase in performance.

[[File:performance_mlpm.PNG|900px|center]]

[[File:performance2_mlpm.PNG|900px|center]]

The above result shows that MLP-M almost perfectly captures the most influential pairwise interactions.

=Limitations=
Even though the MLP methods showed superior performance in the synthetic experiments above, the method still has some limitations. For example, for a function like <math>x_1x_2 + x_2x_3 + x_1x_3</math>, the neural network fails to distinguish interlinked pairwise interactions from a single higher-order interaction. Moreover, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are present in most interaction detection algorithms.

Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under- or overfitting), the resulting interactions will be flawed. This can occur if the dataset is too small or too noisy, which often happens in practical settings.

=Conclusion=
This summary presented a method for detecting interactions using an MLP. Compared with other state-of-the-art methods like Additive Groves (AG), its performance is competitive, yet the computational power required is far less. The method should therefore be useful for practitioners with comparatively little computational power. Moreover, the NID algorithm successfully reduces the size of the computation. The most important aspect of this algorithm is that users of neural networks can now bring interpretability to their models, which makes neural networks considerably more usable for practitioners outside of machine learning and deep learning.

For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks.

=Critique=
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.

2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.

=Reference=

[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013.

[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.

[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016.

[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018.

[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.

[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006.

[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.

[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.

[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.

[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.

[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

a neural representation of sketch drawings

2018-11-30T00:03:58Z

J385chen:

== Introduction ==
In this paper, the authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.

Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do.

The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).

=== Terminology ===
Pixel images, also referred to as raster or bitmap images, are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, and .bmp.

Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images.

For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images.

For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation.

== Related Work ==
Earlier work has used a similar approach to generate images, such as portrait drawing by Paul the Robot [26, 28] and reinforcement learning approaches [28] that discover a set of paint brush strokes that best represent a given input photograph. These methods work mostly by mimicking digitized photographs. There are also some neural-network-based approaches, but those mostly deal with pixel images, and little work has been done on vector image generation. There are models that use Hidden Markov Models [25] or Mixture Density Networks [2] to generate human sketches, continuous data points (modelling Chinese characters as a sequence of pen stroke actions), or vectorized Kanji characters [9, 29].

The model also allows us to explore the latent space representation of vector images. There are previous works that achieved similar functions as well, such as combining Sequence-to-Sequence models with Variational Autoencoder to model sentences into latent space and using probabilistic program induction to model Omniglot dataset.

The dataset used in the paper contains 50 million vector sketches. Earlier datasets include the Sketch dataset with 20k vector sketches, the Sketchy dataset with 70k vector sketches along with pixel images, and the ShadowDraw system, which used 30k raster images along with extracted vectorized features. They are all comparatively small.

== Major Contributions ==
This paper makes the following major contributions:
* The authors outline a framework for both unconditional and conditional generation of vector images composed of a sequence of lines. The recurrent neural network-based generative model is capable of producing sketches of common objects in a vector format.
* The paper develops a training procedure unique to vector images to make the training more robust.
* The paper makes available a large dataset of hand-drawn vector images to encourage further development of generative modelling for vector images, and releases an implementation of the model as an open source project.

== Methodology ==
=== Dataset ===
QuickDraw is a dataset with 50 million vector drawings collected by an online game [https://quickdraw.withgoogle.com/# Quick Draw!], where the players are required to draw objects belonging to a particular object class in less than 20 seconds. It contains hundreds of classes; each class has 70k training samples, 2.5k validation samples, and 2.5k test samples.

Each sample is represented as a sequence of pen stroke actions. The origin is the initial coordinate of the drawing, and a sketch is a list of points. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>, where <math>\Delta x</math> and <math>\Delta y</math> are the offsets in the x and y directions from the previous point. The parameters <math>p_{1}, p_{2}, p_{3}</math> represent three possible states in binary one-hot representation, where <math>p_{1}</math> indicates the pen is touching the paper, <math>p_{2}</math> indicates the pen will be lifted from here, and <math>p_{3}</math> indicates the drawing has ended.
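
To make the 5-element format concrete, the sketch below converts strokes given as absolute coordinates into rows of <math>(\Delta x, \Delta y, p_1, p_2, p_3)</math>. The input layout (a list of strokes, each a list of absolute points), the handling of the end-of-drawing row, and the function name `to_stroke5` are assumptions for illustration and may differ from the released dataset tooling.

```python
import numpy as np

def to_stroke5(strokes):
    """Convert strokes of absolute (x, y) points into 5-element rows (sketch).

    strokes: list of strokes; each stroke is a list of absolute (x, y) points.
    Returns an array of shape (num_points + 1, 5); the last row marks the end.
    """
    rows = []
    prev_x, prev_y = strokes[0][0]  # the origin is the first point of the drawing
    for stroke in strokes:
        for k, (x, y) in enumerate(stroke):
            pen_lifted = (k == len(stroke) - 1)  # last point of a stroke
            p1, p2 = (0, 1) if pen_lifted else (1, 0)
            rows.append([x - prev_x, y - prev_y, p1, p2, 0])
            prev_x, prev_y = x, y
    rows.append([0, 0, 0, 0, 1])  # p3 = 1: the drawing has ended
    return np.array(rows, dtype=float)

# Two strokes: an L-shape, then a short vertical line further to the right.
strokes = [[(0, 0), (1, 0), (1, 1)], [(3, 1), (3, 2)]]
stroke5 = to_stroke5(strokes)
print(stroke5)
```

Because only offsets are stored, the absolute positions can be recovered with a cumulative sum over the first two columns.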

=== Sketch-RNN ===
[[File:sketchfig2.png|700px|center]]

The model is a Sequence-to-Sequence Variational Autoencoder (VAE).

==== Encoder ====
The encoder is a bidirectional RNN. The input is a sketch sequence denoted by <math>S =\{S_0, S_1, ... S_{N_{s}}\}</math> and a reversed sketch sequence denoted by <math>S_{reverse} = \{S_{N_{s}},S_{N_{s}-1}, ... S_0\}</math>. The final hidden layer representations of the two encoded sequences <math>(h_{ \rightarrow}, h_{ \leftarrow})</math> are concatenated to form a single vector <math>h</math>,

\begin{split}
&h_{ \rightarrow} = encode_{ \rightarrow }(S), \\
&h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), \\
&h = [h_{\rightarrow}; h_{\leftarrow}].
\end{split}

Then the authors project <math>h</math> into two vectors <math>\mu</math> and <math>\hat{\sigma}</math>, each of size <math>N_{z}</math>. The projection is performed using a fully connected layer. These two vectors are the parameters of the latent space Gaussian distribution that will estimate the distribution of the input data. Because standard deviations cannot be negative, an exponential function is used to convert <math>\hat{\sigma}</math> into a positive standard deviation <math>\sigma</math>. Next, a random variable with mean <math>\mu</math> and standard deviation <math>\sigma</math> is constructed by scaling a normalized IID Gaussian, <math>\mathcal{N}(0,I)</math>,

\begin{split}
& \mu = W_\mu h + b_\mu, \\
& \hat \sigma = W_\sigma h + b_\sigma, \\
& \sigma = exp( \frac{\hat \sigma}{2}), \\
& z = \mu + \sigma \odot \mathcal{N}(0,I).
\end{split}

Note that <math>z</math> is not deterministic but a random vector that can be conditioned on an input sketch sequence.
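
The sampling step in the equations above can be sketched in NumPy as follows. The dimensions, the random stand-in parameters, and the function name `sample_latent` are assumptions for illustration only; in the actual model the projection weights are learned.

```python
import numpy as np

def sample_latent(h, W_mu, b_mu, W_sigma, b_sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), following the equations above."""
    mu = W_mu @ h + b_mu
    sigma_hat = W_sigma @ h + b_sigma
    sigma = np.exp(sigma_hat / 2.0)       # always positive
    eps = rng.standard_normal(mu.shape)   # IID standard Gaussian noise
    return mu + sigma * eps, mu, sigma

rng = np.random.default_rng(0)
N_h, N_z = 8, 4                           # assumed sizes of h and z
h = rng.standard_normal(N_h)              # stand-in for the encoder output
W_mu, b_mu = rng.standard_normal((N_z, N_h)), np.zeros(N_z)
W_sigma, b_sigma = rng.standard_normal((N_z, N_h)), np.zeros(N_z)

z, mu, sigma = sample_latent(h, W_mu, b_mu, W_sigma, b_sigma, rng)
print(z.shape)
```

Calling `sample_latent` twice with the same `h` gives different `z`, which is the sense in which the latent vector is random but conditioned on the input sketch.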

==== Decoder ====
The decoder is an autoregressive RNN. The initial hidden and cell states are generated using <math>[h_0;c_0] = \tanh(W_z z + b_z)</math>. Here, <math>c_0</math> is utilized if applicable (e.g. if an LSTM decoder is used). <math>S_0</math> is defined as <math>(0,0,1,0,0)</math> (the pen is touching the paper at location 0, 0).

For each step <math>i</math> in the decoder, the input <math>x_i</math> is the concatenation of the previous point <math>S_{i-1}</math> and the latent vector <math>z</math>. The outputs of the RNN decoder <math>y_i</math> are parameters for a probability distribution that will generate the next point <math>S_i</math>.

The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model the ground truth data <math>(p_1, p_2, p_3)</math> as a categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2\ \text{and}\ q_3</math> sum up to 1,

\begin{align*}
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi_j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), \quad \text{where } \sum_{j=1}^{M}\Pi_j = 1
\end{align*}

Where <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is a bi-variate Normal Distribution, with parameters means <math>\mu_x, \mu_y</math>, standard deviations <math>\sigma_x, \sigma_y</math> and correlation parameter <math>\rho_{xy}</math>. There are <math>M</math> such distributions. <math>\Pi</math> is a categorical distribution vector of length <math>M</math>. Collectively these form the mixture weights of the Gaussian Mixture model.

The output vector <math>y_i</math> is obtained by applying a fully connected layer to the hidden state of the RNN.

\begin{split}
&x_i = [S_{i-1}; z], \\
&[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), \\
&y_i = W_y h_i + b_y, \\
&y_i \in \mathbb{R}^{6M+3}. \\
\end{split}

The output consists of the parameters of the probability distribution of the next data point.

\begin{align*}
[(\hat\Pi_1\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_1\ (\hat\Pi_2\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_2\ ...\ (\hat\Pi_M\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_M\ (\hat{q_1}\ \hat{q_2}\ \hat{q_3})] = y_i
\end{align*}

<math>\exp</math> and <math>\tanh</math> operations are applied to ensure that the standard deviations are non-negative and the correlation value is between -1 and 1.

\begin{align*}
\sigma_x = \exp (\hat \sigma_x),\
\sigma_y = \exp (\hat \sigma_y),\
\rho_{xy} = \tanh(\hat \rho_{xy}).
\end{align*}

The categorical probabilities <math>(q_1, q_2, q_3)</math> that model <math>(p_1, p_2, p_3)</math>, as well as the mixture weights <math>\Pi_k</math>, are obtained with a softmax:

\begin{align*}
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},
k \in \left\{1,2,3\right\},
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},
k \in \left\{1,...,M\right\}.
\end{align*}
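
The transformations above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names <code>softmax</code> and <code>parse_output</code> are hypothetical, and the layout of <math>y_i</math> (six values per mixture component followed by three pen logits) is assumed from the output-vector equation above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def parse_output(y, M):
    """Split a raw output vector y of length 6M + 3 into mixture weights,
    per-component Gaussian parameters, and pen-state probabilities.
    Assumed layout: (pi_hat, mu_x, mu_y, sigma_x_hat, sigma_y_hat, rho_hat)
    for each of the M components, then (q1_hat, q2_hat, q3_hat)."""
    assert len(y) == 6 * M + 3
    pi_hat, components = [], []
    for j in range(M):
        p_hat, mu_x, mu_y, sx_hat, sy_hat, rho_hat = y[6 * j: 6 * j + 6]
        pi_hat.append(p_hat)
        components.append({
            "mu_x": mu_x,
            "mu_y": mu_y,
            "sigma_x": math.exp(sx_hat),  # exp keeps the standard deviation positive
            "sigma_y": math.exp(sy_hat),
            "rho": math.tanh(rho_hat),    # tanh keeps the correlation in (-1, 1)
        })
    pi = softmax(pi_hat)      # mixture weights sum to 1
    q = softmax(y[6 * M:])    # pen-state probabilities sum to 1
    return pi, components, q

# Example with M = 2 components and all-zero logits.
y = [0.0, 1.0, -1.0, 0.0, 0.0, 0.0,
     0.0, 2.0, 0.5, 0.0, 0.0, 0.0,
     0.0, 0.0, 0.0]
pi, comps, q = parse_output(y, 2)
```

With all logits equal, both softmax outputs are uniform, which is a quick sanity check that the slicing lines up with the <math>6M+3</math> layout.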

It is hard for the model to decide when to stop drawing because the probabilities of the three events <math>(p_1, p_2, p_3)</math> are very unbalanced. Researchers in the past have used different weights for each pen event probability, but the authors found this approach inelegant and inadequate. Instead, they define a hyperparameter <math>N_{max}</math>, the length of the longest sketch in the training set, and set <math>S_i</math> to <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>, where <math>N_s</math> is the length of the sketch.

At each time step of the sampling process, an outcome sample <math>S_i^{'}</math> is generated and fed as input to the next time step. The process stops when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but a random sequence conditioned on the latent vector <math>z</math>. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.

\begin{align*}
\hat q_k \rightarrow \frac{\hat q_k}{\tau},
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau},
\sigma_x^2 \rightarrow \sigma_x^2\tau,
\sigma_y^2 \rightarrow \sigma_y^2\tau.
\end{align*}

The value of <math>\tau</math> is typically set between 0 and 1. In the limit <math>\tau \rightarrow 0</math> the output becomes deterministic, as each sample consists of the point at the peak of the probability density function.
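
As a rough illustration of the temperature adjustment, the sketch below rescales the logits and standard deviations before sampling. The function name <code>apply_temperature</code> and its argument layout are assumptions made for this example only; note that scaling the variance by <math>\tau</math> is the same as scaling the standard deviation by <math>\sqrt{\tau}</math>.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(q_hat, pi_hat, sigma_x, sigma_y, tau):
    """Return temperature-adjusted pen probabilities, mixture weights,
    and standard deviations for one mixture component (hypothetical helper)."""
    if tau <= 0.0:
        raise ValueError("tau must be positive; tau -> 0 is only a limit")
    q = softmax([v / tau for v in q_hat])    # sharper pen-state distribution for small tau
    pi = softmax([v / tau for v in pi_hat])  # sharper mixture weights for small tau
    scale = math.sqrt(tau)                   # variance * tau == (std * sqrt(tau)) ** 2
    return q, pi, sigma_x * scale, sigma_y * scale

q, pi, sx, sy = apply_temperature([1.0, 0.0, 0.0], [0.0, 0.0], 2.0, 2.0, tau=0.25)
```

Lowering <math>\tau</math> concentrates probability on the most likely pen state and mixture component and shrinks the spread of each Gaussian, which is why samples look less random.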

=== Unconditional Generation ===
In a special case, only the decoder RNN module is trained. The decoder RNN then works as a standalone autoregressive model without latent variables. In this case, the initial states are 0 and the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. Figure 3 shows sketches generated unconditionally, with the temperature parameter varying from <math>\tau = 0.2</math> at the top in blue to <math>\tau = 0.9</math> at the bottom in red.

[[File:sketchfig3.png|700px|center]]

=== Training ===
The training process is the same as for a Variational Autoencoder. The loss function is the sum of the Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> is computed from the distribution parameters generated by the decoder and the training data <math>S</math>. It is the sum of <math>L_s</math> and <math>L_p</math>, which are the log losses of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>, respectively.

\begin{align*}
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log\left(\sum_{j = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x_i,\Delta y_i | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})\right),
\end{align*}
\begin{align*}
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}),
L_R = L_s + L_p.
\end{align*}

Both terms are normalized by <math>N_{max}</math>.
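
The reconstruction loss can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the authors' code: the function names, the tuple layout of each mixture component, and the small <code>eps</code> added inside the logarithms for numerical safety are all choices made for this example.

```python
import math

def bivariate_normal_pdf(dx, dy, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Density of a bivariate normal with correlation rho at (dx, dy)."""
    zx = (dx - mu_x) / sigma_x
    zy = (dy - mu_y) / sigma_y
    z = zx * zx - 2.0 * rho * zx * zy + zy * zy
    one_minus_rho2 = 1.0 - rho * rho
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2)
    return math.exp(-z / (2.0 * one_minus_rho2)) / norm

def reconstruction_loss(targets, predictions, n_s, n_max, eps=1e-12):
    """targets: n_max points (dx, dy, p1, p2, p3), padded after the first n_s.
    predictions: n_max tuples (pi, components, q), where components is a list
    of (mu_x, mu_y, sigma_x, sigma_y, rho) tuples (assumed layout)."""
    l_s = 0.0
    l_p = 0.0
    for i in range(n_max):
        dx, dy, *p = targets[i]
        pi, comps, q = predictions[i]
        if i < n_s:
            # Offset term: only the first n_s (real) points contribute.
            density = sum(w * bivariate_normal_pdf(dx, dy, *c)
                          for w, c in zip(pi, comps))
            l_s -= math.log(density + eps)
        # Pen term: cross-entropy over all n_max points, padding included.
        l_p -= sum(pk * math.log(qk + eps) for pk, qk in zip(p, q))
    return (l_s + l_p) / n_max

# One real point followed by one padded end-of-sketch point.
targets = [(0.0, 0.0, 1, 0, 0), (0.0, 0.0, 0, 0, 1)]
unit = [(0.0, 0.0, 1.0, 1.0, 0.0)]
predictions = [([1.0], unit, (1.0, 0.0, 0.0)), ([1.0], unit, (0.0, 0.0, 1.0))]
loss = reconstruction_loss(targets, predictions, n_s=1, n_max=2)
```

The different summation limits matter: the offset term ignores padded points, while the pen term is evaluated on them, so the model is still trained to predict the end-of-sketch state.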

<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an i.i.d. Gaussian vector with zero mean and unit variance.

\begin{align*}
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))
\end{align*}

The overall loss is weighted as:

\begin{align*}
Loss = L_R + w_{KL} L_{KL}
\end{align*}

As <math>w_{KL}</math> approaches 0, the <math>L_{KL} </math> term vanishes and only <math>L_{R} </math> is optimized. The model then approaches a pure autoencoder: it sacrifices the ability to enforce a prior over the latent space and gains better reconstruction loss metrics.

While the aforementioned loss function could be used, it was found that annealing the KL term (as shown below) in the loss function produces better results.

<center><math>
\eta_{step} = 1 - (1 - \eta_{min})R^{step}
</math></center>

<center><math>
Loss_{train} = L_R + w_{KL} \eta_{step} \max(L_{KL}, KL_{min})
</math></center>
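
A small sketch of the annealed loss may help. With <math>0 < R < 1</math>, the formula gives <math>\eta_{step} = \eta_{min}</math> at step 0 and <math>\eta_{step} \rightarrow 1</math> as training proceeds, and the <math>\max</math> keeps the KL term from contributing below the floor <math>KL_{min}</math>. The function names and the default values of <code>eta_min</code>, <code>R</code>, and <code>kl_min</code> below are illustrative assumptions, not values taken from this summary.

```python
def kl_weight_schedule(step, eta_min=0.01, R=0.9999):
    """Annealing factor: equals eta_min at step 0 and tends to 1 for 0 < R < 1."""
    return 1.0 - (1.0 - eta_min) * R ** step

def training_loss(l_r, l_kl, step, w_kl=1.0, kl_min=0.2, eta_min=0.01, R=0.9999):
    """Annealed training loss with a floor kl_min on the KL term."""
    eta = kl_weight_schedule(step, eta_min, R)
    return l_r + w_kl * eta * max(l_kl, kl_min)

early = kl_weight_schedule(0)        # close to eta_min
late = kl_weight_schedule(200000)    # close to 1
```

Early in training the optimizer therefore focuses almost entirely on reconstruction, and the KL penalty is phased in gradually.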

As shown in Figure 4, the <math>L_{R} </math> metric for the standalone decoder model is an upper bound on the <math>L_{R} </math> of the models that use a latent vector. The reason is that the unconditional model does not have access to the entire sketch it needs to generate.

[[File:s.png|600px|thumb|center|Figure 4. Tradeoff between <math>L_{R} </math> and <math>L_{KL} </math>, for two models trained on single class datasets (left).
Validation Loss Graph for models trained on the Yoga dataset using various <math>w_{KL} </math>. (right)]]

== Experiments ==
The authors experiment with the sketch-rnn model under different settings and record both losses. They use a Long Short-Term Memory (LSTM) model as the encoder and a HyperLSTM as the decoder. HyperLSTM is a type of RNN cell that excels at sequence generation tasks. Its ability to spontaneously augment its own weights enables it to adapt to many different regimes in a large, diverse dataset. They also run experiments on multi-class datasets. The results are as follows.

[[File:sketchtable1.png|700px|center]]

The trade-off between <math>L_R</math> and <math>L_{KL}</math> is clearly visible in this table. Furthermore, <math>L_R</math> decreases as <math>w_{KL} </math> is halved.

=== Conditional Reconstruction ===
The authors assess the reconstructions of a given sketch at different <math>\tau</math> values. With high <math>\tau</math> values, on the right, the reconstructed sketches are more random.

[[File:sketchfig5.png|700px|center]]

They also experiment on inputting a sketch from a different class. The output will still keep some features from the class that the model is trained on.

=== Latent Space Interpolation ===
The authors visualize the reconstruction sketches while interpolating between latent vectors using different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are more coherently interpolated.

[[File:sketchfig6.png|700px|center]]

=== Sketch Drawing Analogies ===
Since the latent vector <math>z</math> encodes conceptual features of a sketch, those features can also be used to augment other sketches that do not have them. This is possible when models are trained to have low <math>L_{KL}</math> values. The authors perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches based on the resulting latent vectors.

=== Predicting Different Endings of Incomplete Sketches ===
The model can complete an unfinished sketch by using the decoder to encode the incomplete sketch into a hidden state <math>h</math>, and then using <math>h</math> as the initial hidden state to generate the rest of the sketch. The authors train decoder-only models on individual classes and set <math>\tau = 0.8</math> to complete samples. Figure 7 shows the results.

[[File:sketchfig7.png|700px|center]]

== Limitations ==

Although sketch-rnn can model a large variety of sketch drawings, there are several limitations in the current approach. For most single-class datasets, sketch-rnn is capable of modelling sketches of up to around 300 data points. The model becomes increasingly difficult to train beyond this length. For the authors' dataset, the Ramer-Douglas-Peucker algorithm is used to simplify the strokes of the sketch data to fewer than 200 data points.

For more complicated classes of images, such as mermaids or lobsters, the reconstruction loss metrics are not as good compared to simpler classes such as ants, faces or firetrucks. The models trained on these more challenging image classes tend to draw smoother, more circular line segments that do not resemble individual sketches, but rather resemble an averaging of many sketches in the training set. This smoothness may be analogous to the blurriness effect produced by a Variational Autoencoder that is trained on pixel images. Depending on the use case of the model, smooth circular lines can be viewed as aesthetically pleasing and a desirable property.

While both conditional and unconditional models are capable of training on datasets of several classes, sketch-rnn is ineffective at modelling a large number of classes simultaneously. The generated samples will be incoherent, with features of different classes appearing in the same sketch.

== Applications and Future Work ==
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them find interesting intersections between different drawings or objects, or generating many similar but different designs. In the simplest use, pattern designers can apply sketch-rnn to generate a large number of similar but unique designs for textile or wallpaper prints. Creative designers can also come up with abstract designs that resonate more with their target audience.

This model may also find its place in teaching students how to draw. Even with the simple sketches in QuickDraw, the authors of this work have become much more proficient at drawing animals, insects, and various sea creatures after conducting these experiments.
When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetically pleasing one. Latent vector augmentation could also help to create a better drawing by using user-rating data during training.

The authors conclude by providing the following future directions to this work:
# Investigate using user-rating data to augment the latent vector in the direction that maximizes the aesthetics of the drawing.
# Look into combining variations of sequence-generation models with unsupervised, cross-domain pixel image generation models.

Combining this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches would be an exciting result.

The authors also mention the opposite direction as an even more interesting problem: converting a photograph of an object into an unrealistic but similar looking sketch of the object, composed of a minimal number of lines.

Moreover, it would be interesting to see how varying loss will be represented as a drawing. Some exotic form of loss function may change the way that the network behaves, which can lead to various applications.

== Conclusion ==
The paper presents a methodology for modelling sketch drawings using recurrent neural networks. It introduces the sketch-rnn model, which can encode and decode sketches, generate new sketches, and complete unfinished ones. In addition, the authors demonstrate how to interpolate between latent vectors of sketches from different classes, and how to use the latent space to augment sketches or generate similar looking sketches. Furthermore, they show the importance of enforcing a prior distribution on the latent vector for coherent interpolation between sketches. Finally, a large dataset of sketch drawings is made available for future research.

== Critique ==
This paper presents both a novel large dataset of sketches and a new RNN architecture to generate new sketches. It is very exciting to read, but there are still some aspects to improve.

* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing the generated sketches. This is clear and straightforward, but not very efficient. It would be great if the authors could present a metric to evaluate how well the sketches are generated, rather than printing them out and relying on human judgment. The authors also did not present an evaluation of the algorithms. They provided <math>L_R</math> and <math>L_{KL}</math> for reference, but a lower loss does not necessarily mean better performance. Training loss alone likely does not capture the quality of a sketch.

* The algorithm lacks a comparison to the prior state of the art on standard metrics, which makes the novelty unclear. Using strokes as inputs is a novel and innovative move, but the paper does not provide a baseline or any comparison with other methods or algorithms. Some other studies using similar, smaller datasets are mentioned in the paper. It would be great if the authors could use some basic or existing methods as a baseline and compare them with the new algorithm.

* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.

* The authors proposed a few future applications for the model, but the current output does not seem very close to their descriptions. Still, this is a very good beginning, and the release of the sketch dataset is likely to attract more researchers to work on and improve it.

== References ==
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. https://quickdraw.withgoogle.com/, 2016.
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.
# Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.
# T. White. Sampling Generative Networks. [https://arxiv.org/abs/1609.04468 ArXiv e-prints], September 2016.
# Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.

a neural representation of sketch drawings

2018-11-30T00:03:25Z

J385chen:

== Introduction ==
In this paper, the authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.

Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do.

The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).

=== Terminology ===
Pixel images, also referred to as raster or bitmap images, are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, and .bmp.

Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images.

For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images.

For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation.

== Related Work ==
Earlier work has generated drawings in a similar way, such as Portrait Drawing by Paul the Robot [26, 28] and reinforcement learning approaches [28] that discover a set of paint brush strokes that best represent a given input photograph. These methods mostly mimic digitized photographs. There are also neural network based approaches, but they mostly deal with pixel images; little work has been done on vector image generation. There are models that use Hidden Markov Models [25] or Mixture Density Networks [2] to generate human sketches, continuous data points (modelling Chinese characters as a sequence of pen stroke actions) or vectorized Kanji characters [9,29].

The model also allows us to explore the latent space representation of vector images. There are previous works that achieved similar functions as well, such as combining Sequence-to-Sequence models with Variational Autoencoder to model sentences into latent space and using probabilistic program induction to model Omniglot dataset.

The dataset they use contains 50 million vector sketches. Before this paper, the available datasets were the Sketch dataset with 20k vector sketches, the Sketchy dataset with 70k vector sketches along with pixel images, and the ShadowDraw system, which used 30k raster images along with extracted vectorized features. All of them are comparatively small.

== Major Contributions ==
This paper makes the following major contributions. The authors outline a framework for both unconditional and conditional generation of vector images composed of a sequence of lines. The recurrent neural network-based generative model is capable of producing sketches of common objects in a vector format. The paper develops a training procedure unique to vector images to make the training more robust. The authors also make available a large dataset of hand-drawn vector images to encourage further development of generative modelling for vector images, and release an implementation of their model as an open source project.

== Methodology ==
=== Dataset ===
QuickDraw is a dataset with 50 million vector drawings collected by an online game [https://quickdraw.withgoogle.com/# Quick Draw!], where the players are required to draw objects belonging to a particular object class in less than 20 seconds. It contains hundreds of classes; each class has 70k training samples, 2.5k validation samples and 2.5k test samples.

Each sample is represented as a sequence of pen stroke actions. The origin is the initial coordinate of the drawing, and the sketch is a list of points. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>, where <math>\Delta x</math> and <math>\Delta y</math> are the offset distances in the x and y directions from the previous point. The parameters <math>p_{1}, p_{2}, p_{3}</math> represent three possible states in binary one-hot representation, where <math>p_{1}</math> indicates the pen is touching the paper, <math>p_{2}</math> indicates the pen will be lifted from here, and <math>p_{3}</math> indicates the drawing has ended.
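
To make the format concrete, the sketch below converts a drawing given as strokes of absolute coordinates into the five-element representation. The function name <code>to_stroke5</code>, the input layout, and the choice to append a single <math>(0, 0, 0, 0, 1)</math> end marker are assumptions made for this illustration; the dataset's own tooling may differ.

```python
def to_stroke5(strokes):
    """Convert strokes (lists of absolute (x, y) points) into a list of
    (dx, dy, p1, p2, p3) tuples, taking the first point as the origin."""
    points = []
    prev_x, prev_y = strokes[0][0]  # origin: offsets are measured from here
    for stroke in strokes:
        for idx, (x, y) in enumerate(stroke):
            if idx == len(stroke) - 1:
                pen = (0, 1, 0)  # pen will be lifted after this point
            else:
                pen = (1, 0, 0)  # pen stays on the paper
            points.append((x - prev_x, y - prev_y) + pen)
            prev_x, prev_y = x, y
    points.append((0, 0, 0, 0, 1))  # end-of-drawing marker
    return points

# Two short strokes: a line from (0, 0) to (3, 4), then one from (5, 5) to (6, 5).
sketch = to_stroke5([[(0, 0), (3, 4)], [(5, 5), (6, 5)]])
```

Storing offsets rather than absolute positions makes the representation independent of where on the canvas the drawing starts.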

=== Sketch-RNN ===
[[File:sketchfig2.png|700px|center]]

The model is a Sequence-to-Sequence Variational Autoencoder (VAE).

==== Encoder ====
The encoder is a bidirectional RNN. The input is a sketch sequence denoted by <math>S =\{S_0, S_1, ... S_{N_{s}}\}</math> and a reversed sketch sequence denoted by <math>S_{reverse} = \{S_{N_{s}},S_{N_{s}-1}, ... S_0\}</math>. The final hidden layer representations of the two encoded sequences <math>(h_{ \rightarrow}, h_{ \leftarrow})</math> are concatenated to form the final hidden state <math>h</math>,

\begin{split}
&h_{ \rightarrow} = encode_{ \rightarrow }(S), \\
&h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), \\
&h = [h_{\rightarrow}; h_{\leftarrow}].
\end{split}

Then the authors project <math>h</math> into two vectors <math>\mu</math> and <math>\hat{\sigma}</math> of size <math>N_{z}</math>. The projection is performed using a fully connected layer. These two vectors are the parameters of the latent space Gaussian distribution that will estimate the distribution of the input data. Because standard deviations cannot be negative, an exponential function is used to convert it to all positive values. Next, a random variable with mean <math>\mu</math> and standard deviation <math>\sigma</math> is constructed by scaling a normalized IID Gaussian, <math>\mathcal{N}(0,I)</math>,

\begin{split}
& \mu = W_\mu h + b_\mu, \\
& \hat \sigma = W_\sigma h + b_\sigma, \\
& \sigma = \exp\left( \frac{\hat \sigma}{2}\right), \\
& z = \mu + \sigma \odot \mathcal{N}(0,I).
\end{split}
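
The sampling step can be sketched as below. This minimal example covers only the last two equations and takes <math>\mu</math> and <math>\hat\sigma</math> as given, so the fully connected projection is omitted; the function name <code>sample_latent</code> and the optional <code>rng</code> argument are hypothetical.

```python
import math
import random

def sample_latent(mu, sigma_hat, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(sigma_hat / 2)."""
    sigma = [math.exp(s / 2.0) for s in sigma_hat]  # exp guarantees sigma > 0
    eps = [rng.gauss(0.0, 1.0) for _ in mu]         # one standard normal draw per dimension
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# With a very negative sigma_hat the noise vanishes and z is essentially mu.
z = sample_latent([0.0, 1.0], [-100.0, -100.0], random.Random(0))
```

Writing the sample as a deterministic function of <math>\mu</math>, <math>\hat\sigma</math> and independent noise is what lets gradients flow back to the encoder during training.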

Note that <math>z</math> is not deterministic but a random vector that can be conditioned on an input sketch sequence.

==== Decoder ====
The decoder is an autoregressive RNN. The initial hidden and cell states are generated using <math>[h_0;c_0] = \tanh(W_z z + b_z)</math>. Here, <math>c_0</math> is utilized if applicable (e.g. if an LSTM decoder is used). <math>S_0</math> is defined as <math>(0,0,1,0,0)</math> (the pen is touching the paper at location 0, 0).

For each step <math>i</math> in the decoder, the input <math>x_i</math> is the concatenation of the previous point <math>S_{i-1}</math> and the latent vector <math>z</math>. The outputs of the RNN decoder <math>y_i</math> are parameters for a probability distribution that will generate the next point <math>S_i</math>.

The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model the ground truth data <math>(p_1, p_2, p_3)</math> as a categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2\ \text{and}\ q_3</math> sum up to 1,

\begin{align*}
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi_j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), \quad \text{where } \sum_{j=1}^{M}\Pi_j = 1.
\end{align*}

Where <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is a bi-variate Normal Distribution, with parameters means <math>\mu_x, \mu_y</math>, standard deviations <math>\sigma_x, \sigma_y</math> and correlation parameter <math>\rho_{xy}</math>. There are <math>M</math> such distributions. <math>\Pi</math> is a categorical distribution vector of length <math>M</math>. Collectively these form the mixture weights of the Gaussian Mixture model.
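
For reference, the density of each mixture component can be written out explicitly. This is the standard bivariate normal density stated in the notation used above (component index <math>j</math> dropped for readability); the shorthand <math>Z</math> is introduced here only for compactness.

\begin{align*}
\mathcal{N}(\Delta x,\Delta y | \mu_x, \mu_y, \sigma_x, \sigma_y, \rho_{xy}) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho_{xy}^2}} \exp\left(-\frac{Z}{2(1-\rho_{xy}^2)}\right), \quad Z = \frac{(\Delta x-\mu_x)^2}{\sigma_x^2} + \frac{(\Delta y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho_{xy}(\Delta x-\mu_x)(\Delta y-\mu_y)}{\sigma_x\sigma_y}
\end{align*}

When <math>\rho_{xy} = 0</math> the expression factorizes into the product of two independent one-dimensional normal densities.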

The output vector <math>y_i</math> is obtained by applying a fully connected layer to the hidden state of the RNN.

\begin{split}
&x_i = [S_{i-1}; z], \\
&[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), \\
&y_i = W_y h_i + b_y, \\
&y_i \in \mathbb{R}^{6M+3}. \\
\end{split}

The output consists of the parameters of the probability distribution of the next data point.

\begin{align*}
[(\hat\Pi_1\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_1\ (\hat\Pi_2\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_2\ ...\ (\hat\Pi_M\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_M\ (\hat{q_1}\ \hat{q_2}\ \hat{q_3})] = y_i
\end{align*}

<math>\exp</math> and <math>\tanh</math> operations are applied to ensure that the standard deviations are non-negative and the correlation value is between -1 and 1.

\begin{align*}
\sigma_x = \exp (\hat \sigma_x),\
\sigma_y = \exp (\hat \sigma_y),\
\rho_{xy} = \tanh(\hat \rho_{xy}).
\end{align*}

The categorical probabilities <math>(q_1, q_2, q_3)</math> for the pen states <math>(p_1, p_2, p_3)</math> and the mixture weights <math>\Pi</math> are obtained by applying a softmax to the corresponding outputs:

\begin{align*}
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},
k \in \left\{1,2,3\right\},
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},
k \in \left\{1,...,M\right\}.
\end{align*}

It is hard for the model to decide when to stop drawing because the probabilities of the three events <math>(p_1, p_2, p_3)</math> are very unbalanced. Researchers in the past have used different weights for each pen event probability, but the authors found this approach inelegant and inadequate. Instead, they define a hyperparameter <math>N_{max}</math>, the length of the longest sketch in the training set, and set <math>S_i</math> to <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>, where <math>N_s</math> is the length of the sketch.

At each time step of the sampling process, an outcome sample <math>S_i^{'}</math> is generated and fed as input to the next time step. The process stops when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but a random sequence conditioned on the latent vector <math>z</math>. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.

\begin{align*}
\hat q_k \rightarrow \frac{\hat q_k}{\tau},
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau},
\sigma_x^2 \rightarrow \sigma_x^2\tau,
\sigma_y^2 \rightarrow \sigma_y^2\tau.
\end{align*}

The value of <math>\tau</math> is typically set between 0 and 1. In the limit <math>\tau \rightarrow 0</math> the output becomes deterministic, as each sample consists of the point at the peak of the probability density function.

=== Unconditional Generation ===
In a special case, only the decoder RNN module is trained. The decoder RNN then works as a standalone autoregressive model without latent variables. In this case, the initial states are 0 and the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. Figure 3 shows sketches generated unconditionally, with the temperature parameter varying from <math>\tau = 0.2</math> at the top in blue to <math>\tau = 0.9</math> at the bottom in red.

[[File:sketchfig3.png|700px|center]]

=== Training ===
The training process is the same as for a Variational Autoencoder. The loss function is the sum of the Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> is computed from the distribution parameters generated by the decoder and the training data <math>S</math>. It is the sum of <math>L_s</math> and <math>L_p</math>, which are the log losses of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>, respectively.

\begin{align*}
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log\left(\sum_{j = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x_i,\Delta y_i | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})\right),
\end{align*}
\begin{align*}
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}),
L_R = L_s + L_p.
\end{align*}

Both terms are normalized by <math>N_{max}</math>.

<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an i.i.d. Gaussian vector with zero mean and unit variance.

\begin{align*}
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))
\end{align*}
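
The KL term is simple to compute once <math>\mu</math> and <math>\hat\sigma</math> are available. The sketch below reads the formula as a sum over the <math>N_z</math> latent dimensions, which is an interpretation on our part since the equation is written without an explicit sum; it also relies on <math>\hat\sigma</math> being the log-variance, consistent with <math>\sigma = \exp(\hat\sigma / 2)</math> in the encoder. The function name <code>kl_loss</code> is hypothetical.

```python
import math

def kl_loss(mu, sigma_hat):
    """KL divergence between N(mu, exp(sigma_hat)) and N(0, I),
    averaged over the N_z latent dimensions."""
    n_z = len(mu)
    total = sum(1.0 + s - m * m - math.exp(s) for m, s in zip(mu, sigma_hat))
    return -total / (2.0 * n_z)

at_prior = kl_loss([0.0, 0.0], [0.0, 0.0])  # latent distribution equals the prior
shifted = kl_loss([1.0], [0.0])             # unit variance, mean shifted by 1
```

The loss is zero exactly when the encoder outputs the prior (zero mean and unit variance) and grows as the latent distribution moves away from it.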

The overall loss is weighted as:

\begin{align*}
Loss = L_R + w_{KL} L_{KL}
\end{align*}

As <math>w_{KL}</math> approaches 0, the <math>L_{KL} </math> term vanishes and only <math>L_{R} </math> is optimized. The model then approaches a pure autoencoder: it sacrifices the ability to enforce a prior over the latent space and gains better reconstruction loss metrics.

While the aforementioned loss function could be used, it was found that annealing the KL term (as shown below) in the loss function produces better results.

<center><math>
\eta_{step} = 1 - (1 - \eta_{min})R^{step}
</math></center>

<center><math>
Loss_{train} = L_R + w_{KL} \eta_{step} \max(L_{KL}, KL_{min})
</math></center>

As shown in Figure 4, the <math>L_{R} </math> metric for the standalone decoder model is an upper bound on the <math>L_{R} </math> of the models that use a latent vector. The reason is that the unconditional model does not have access to the entire sketch it needs to generate.

[[File:s.png|600px|thumb|center|Figure 4. Tradeoff between <math>L_{R} </math> and <math>L_{KL} </math>, for two models trained on single class datasets (left).
Validation Loss Graph for models trained on the Yoga dataset using various <math>w_{KL} </math>. (right)]]

== Experiments ==
The authors experiment with the sketch-rnn model under different settings and record both losses. They use a Long Short-Term Memory (LSTM) model as the encoder and a HyperLSTM as the decoder. HyperLSTM is a type of RNN cell that excels at sequence generation tasks; its ability to spontaneously augment its own weights enables it to adapt to many different regimes in a large, diverse dataset. They also run experiments on multi-class datasets. The results are as follows.

[[File:sketchtable1.png|700px|center]]

The table clearly shows the trade-off between <math>L_R</math> and <math>L_{KL}</math>. Furthermore, <math>L_R</math> decreases each time <math>w_{KL}</math> is halved.

=== Conditional Reconstruction ===
The authors compare reconstructions of a given input sketch at different values of the temperature <math>\tau</math>. With the high <math>\tau</math> values on the right, the reconstructed sketches are more random.

[[File:sketchfig5.png|700px|center]]
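This section does not spell out how <math>\tau</math> enters the sampling step. A common scheme, assumed here, is to divide the categorical logits (pen states and mixture weights) by <math>\tau</math> before the softmax and to multiply the Gaussian variances by <math>\tau</math>. The function names below are hypothetical.

```python
import math

def softmax_with_temperature(logits, tau):
    """Softmax over logits / tau; smaller tau gives a sharper distribution."""
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def scale_std(sigma, tau):
    """Assumed rule: variance is multiplied by tau, so std by sqrt(tau)."""
    return sigma * math.sqrt(tau)
```

Under this assumption, <math>\tau</math> close to 0 makes sampling nearly deterministic (the most likely pen state and mixture component dominate and the offsets stay near the component means), while larger <math>\tau</math> spreads probability mass and widens the Gaussians, which matches the increased randomness seen on the right of the figure.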

They also experiment with feeding the model a sketch from a different class. The output still retains some features of the class on which the model was trained.

=== Latent Space Interpolation ===
The authors visualize the reconstructed sketches obtained while interpolating between latent vectors, using models trained with different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the interpolated sketches are more coherent.

[[File:sketchfig6.png|700px|center]]
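The interpolation scheme is not stated in this section. Spherical interpolation is a common choice for latent vectors with a Gaussian prior (see the White reference on sampling generative networks), so the sketch below assumes it; the function name and the linear fallback for nearly parallel vectors are choices made here.

```python
import math

def slerp(z0, z1, t):
    """Spherical interpolation between non-zero latent vectors z0 and z1, t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Vectors are (anti)parallel: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]
```

Each interpolated vector would then be passed to the decoder to produce one sketch in the interpolation sequence.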

=== Sketch Drawing Analogies ===
Since the latent vector <math>z</math> encodes conceptual features of a sketch, those features can also be used to augment other sketches that do not have them. This is possible when models are trained to reach low <math>L_{KL}</math> values. The authors perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches based on the resulting latent vectors.
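The vector arithmetic can be illustrated with a small helper. This is only a sketch of the idea: the function name is hypothetical, and the latent vectors are assumed to be plain lists produced by encoding sketches.

```python
def add_feature(z_target, z_with, z_without):
    """Return z_target + (z_with - z_without), element by element.
    The difference z_with - z_without is treated as the latent direction of a feature."""
    return [t + (a - b) for t, a, b in zip(z_target, z_with, z_without)]
```

For example, if <code>z_with</code> and <code>z_without</code> encode two sketches that differ only in one feature, decoding <code>add_feature(z_target, z_with, z_without)</code> would be expected to add that feature to the target sketch.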

=== Predicting Different Endings of Incomplete Sketches ===
The model is able to complete an unfinished sketch by first encoding the incomplete sketch into a hidden state <math>h</math> using the decoder, and then using <math>h</math> as the initial hidden state to generate the remainder of the sketch. The authors train decoder-only models on individual classes and set <math>\tau = 0.8</math> when sampling the completions. Figure 7 shows the results.

[[File:sketchfig7.png|700px|center]]

== Limitations ==

Although sketch-rnn can model a large variety of sketch drawings, there are several limitations in the current approach. For most single-class datasets, sketch-rnn is capable of modelling sketches of up to around 300 data points. The model becomes increasingly difficult to train beyond this length. For the authors' dataset, the Ramer-Douglas-Peucker algorithm is used to simplify the strokes of the sketch data to fewer than 200 data points.
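For reference, a minimal recursive version of the Ramer-Douglas-Peucker algorithm is sketched below. It is a textbook formulation, not the authors' preprocessing code; the function names and the tolerance parameter <code>epsilon</code> are chosen here for illustration.

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        # a and b coincide: use the distance to that single point.
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length

def rdp(points, epsilon):
    """Simplify a polyline, dropping points closer than epsilon to the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax > epsilon:
        # Keep the farthest point and simplify both halves recursively.
        left = rdp(points[: index + 1], epsilon)
        right = rdp(points[index:], epsilon)
        return left[:-1] + right
    return [first, last]
```

A larger <code>epsilon</code> removes more points, so in practice the tolerance would be tuned until each stroke sequence fits within the desired length.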

For more complicated classes of images, such as mermaids or lobsters, the reconstruction loss metrics are not as good compared to simpler classes such as ants, faces or firetrucks. The models trained on these more challenging image classes tend to draw smoother, more circular line segments that do not resemble individual sketches, but rather resemble an averaging of many sketches in the training set. This smoothness may be analogous to the blurriness effect produced by a Variational Autoencoder that is trained on pixel images. Depending on the use case of the model, smooth circular lines can be viewed as aesthetically pleasing and a desirable property.

While both conditional and unconditional models are capable of training on datasets of several classes, sketch-rnn is ineffective at modelling a large number of classes simultaneously: the samples generated in that setting tend to be incoherent.

== Applications and Future Work ==
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them find interesting intersections between different drawings or objects, or generating many similar but distinct designs. In the simplest use case, pattern designers can apply sketch-rnn to generate a large number of similar but unique designs for textile or wallpaper prints. Creative designers can also come up with abstract designs that resonate more with their target audience.

This model may also find a place in teaching students how to draw. Even with the simple sketches in QuickDraw, the authors of this work report having become much more proficient at drawing animals, insects, and various sea creatures after conducting these experiments.
When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help turn a poor sketch into a more aesthetically pleasing one. Latent vector augmentation could also help create better drawings by incorporating user-rating data during training.

The authors conclude by providing the following future directions to this work:
# Investigate using user-rating data to augment the latent vector in the direction that maximizes the aesthetics of the drawing.
# Look into combining variations of sequence-generation models with unsupervised, cross-domain pixel image generation models.

Combining this model with unsupervised, cross-domain pixel image generation models is an exciting prospect, since it could make it possible to create photorealistic images from sketches.

The authors also mention the opposite direction, which they consider an even more interesting problem: converting a photograph of an object into an unrealistic but similar-looking sketch of the object composed of a minimal number of lines.

Moreover, it would be interesting to see how varying the loss function affects the generated drawings. A more exotic loss function may change the way the network behaves, which could lead to various applications.

== Conclusion ==
The paper presents a methodology for modelling sketch drawings using recurrent neural networks. It introduces sketch-rnn, a model that can encode and decode sketches, generate new sketches, and complete unfinished ones. In addition, the authors demonstrate how to interpolate between latent vectors of sketches from different classes, and how to use latent vectors to augment sketches or generate similar-looking ones. Furthermore, they show the importance of enforcing a prior distribution on the latent vector for generating coherent sketches during interpolation. Finally, they release a large dataset of sketch drawings for future research.

== Critique ==
This paper presents both a novel large dataset of sketches and a new RNN architecture for generating new sketches. It is very exciting to read, but there are still some aspects to improve.

* The performance of the decoder model is hard to evaluate. The authors present the performance of the decoder by showing generated sketches, which is clear and straightforward but not very efficient. It would be great if the authors could propose a metric for evaluating how well the sketches are generated, rather than printing them out and relying on human judgment. The authors do not present a quantitative evaluation of the algorithm either. They provide <math>L_R</math> and <math>L_{KL}</math> for reference, but a lower loss does not necessarily mean better performance: training loss alone likely does not capture the quality of a sketch.

* The algorithm is not compared with the prior state of the art on standard metrics, which makes its novelty unclear. Using strokes as inputs is a novel and innovative move, but the paper does not provide a baseline or any comparison with other methods. The paper mentions other studies that use similar, smaller datasets. It would be great if the authors could use some basic or existing methods as a baseline and compare them with the new algorithm.

* Besides a comparison with other algorithms, it would also be great if the authors could remove or replace individual components of the model to show whether each part is necessary, or explain what led them to include a specific component.

* The authors propose a few future applications for the model, but the current output does not yet seem very close to their descriptions. Still, this is a very good beginning, and the release of the sketch dataset should attract more researchers to build on and improve this work.

== References ==
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. 2016. URL https://quickdraw.withgoogle.com/.
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.
# Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.
# T. White. Sampling Generative Networks. [https://arxiv.org/abs/1609.04468 ArXiv e-prints], September 2016.
# Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.