
Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin

2018-12-09T04:52:41Z

H454chen: /* Deep Learning in Bioinformatics */

This page contains a summary of the paper [https://arxiv.org/abs/1708.00339 "Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin"] by Ritambhara Singh et al. It was published in Advances in Neural Information Processing Systems (NIPS) in 2017. The code for this paper is shared [https://qdata.github.io/deep4biomed-web/ here].

= Background =

Gene regulation is the process of controlling which genes in a cell's DNA are turned 'on' (expressed) or 'off' (not expressed). When a gene is expressed, a functional product such as a protein is created. Even though all the cells of a multicellular organism (e.g., a human) contain the same DNA, different types of cells in that organism may express very different sets of genes. As a result, each cell type has distinct functionality. In other words, how a cell operates depends on the genes expressed in that cell. Many factors, including ‘chromatin modification marks’, influence which genes are expressed in a cell.

The function of chromatin is to efficiently wrap DNA around bead-like structures of histones into a condensed volume that fits into the nucleus of a cell, and to protect the DNA structure and sequence during cell division and replication. Different chemical modifications in the histones of the chromatin, known as histone marks, change the spatial arrangement of the condensed DNA structure, which in turn affects the expression of genes in the region neighboring the histone mark. Histone marks can promote (or obstruct) the expression of a gene by making the gene region accessible (or restricted). The section of the DNA where histone marks can potentially have an impact is known as the DNA flanking region or ‘gene region’, which is considered to cover 10k base pairs centered at the transcription start site (TSS), i.e., 5k base pairs in each direction. Unlike genetic mutations, histone modifications are reversible [1]. Therefore, understanding the influence of histone marks on gene regulation can assist in developing drugs for genetic diseases.

= Introduction =

The revolution in genomic technologies now enables us to profile genome-wide chromatin mark signals. Biologists can therefore measure gene expression and chromatin signals of the ‘gene region’ for different cell types across the whole human genome. The Roadmap Epigenome Project (REMC, publicly available) [2] recently released 2,804 genome-wide datasets of 100 separate “normal” (not diseased) human cells/tissues, among which 166 datasets are gene expression reads and the rest are signal reads of various histone marks. The goal is to understand which histone marks are the most important and how they interact in gene regulation for each cell type.

Signal reads for histone marks are high-dimensional and spatially structured. The influence of a histone modification mark can occur anywhere in the gene region (covering 10k base pairs centered around the transcription start site of each gene). It is therefore important to understand how the impact of a mark on gene expression varies over the gene region, in other words, how histone signals across the gene region affect gene expression. Different types of histone marks in human chromatin can influence gene regulation. Researchers have found five standard histone proteins. These five histone proteins can be altered in different combinations with different chemical modifications, resulting in a large number of distinct histone modification marks. Different histone modification marks can act as a module, interacting with each other to influence gene expression.

This paper proposes an attention-based deep learning model, AttentiveChrome [3], to find how these chromatin factors (histone modification marks) contribute to the gene expression of a particular cell. AttentiveChrome uses a hierarchy of multiple LSTMs to discover interactions between the signals of each histone mark and to learn dependencies among the marks in expressing a gene. The authors included two levels of soft attention: (1) one that attends to the most relevant signals of a histone mark, and (2) one that attends to the important marks and their interactions. In this context, ''attention'' refers to weighting the importance of different items differently.

== Main Contributions ==
The contributions of this work can be summarized as follows:

* More accurate predictions than the state-of-the-art baselines, measured using datasets from REMC on 56 different cell types.
* Better interpretation than the state-of-the-art methods for visualizing deep learning models. The authors compute the correlation of the model's attention scores with the mark signal from REMC.
* As in previous applications of attention models, which indirectly hint at the parts of the input that the model deemed important, AttentiveChrome can explain its decisions by hinting at “what” and “where” it has focused.
* According to the authors, this is the first time that an attention-based deep learning approach is applied to a problem in molecular biology.
* Ability to deal with highly modular inputs.

= Previous Works =

Machine learning algorithms to classify gene expression from histone modification signals have been surveyed by [15]. These algorithms range from linear regression, support vector machines, and random forests to rule-based learning and CNNs. To accommodate the spatially structured, high-dimensional input data (histone modification signals), these studies applied different feature selection strategies. The authors' preceding study, DeepChrome [4], incorporated the best-position selection strategy, in which the positions that are most highly correlated with the gene expression are considered the best positions. This model can learn the relationship between the histone marks, and the CNN-based DeepChrome outperforms all the previous works. However, these approaches either (1) failed to model the spatial dependencies among the marks, or (2) required additional feature analysis. Only AttentiveChrome is reported to satisfy all eight desirable properties of a model (see Table 1).

= AttentiveChrome: Model Formulation =

The authors proposed an end-to-end architecture that can simultaneously attend and predict. The method uses recurrent neural networks (RNNs) composed of LSTM units to model the sequential spatial dependencies of the gene region and to predict the gene expression level from the histone modification signals. The embedding vector <math> h_t </math> output by an LSTM module encodes the learned representation of the feature dependencies from time step 0 to <math> t </math>. For this task, each bin position of the gene region is considered a time step.

The proposed AttentiveChrome framework contains the following five modules:

* Bin-level LSTM encoder encoding the bin positions of the gene region (one for each HM mark)
* Bin-level <math> \alpha </math>-Attention across all bin positions (one for each HM mark)
* HM-level LSTM encoder (one encoder encoding all HM marks)
* HM-level <math> \beta </math>-Attention among all HM marks (one)
* The final classification module

Figure 1 (Supplementary Figure 2) presents an overview of the proposed AttentiveChrome framework.

[[File:supplemntary_figure_2.png|thumb|center| 800px |Figure 1: Overview of all five modules of the proposed AttentiveChrome framework]]

== Input and Output ==

Each dataset contains the gene expression labels and the histone signal reads for one specific cell type. The authors evaluated AttentiveChrome on 56 different cell types. For each histone mark, there is a feature (input) vector containing the signal reads surrounding the gene’s TSS position (the gene region). The label of this input denotes the expression of the specific gene. This study uses binary labels, where <math> +1 </math> denotes that the gene is expressed (on) and <math> -1 </math> denotes that the gene is not expressed (off). Each histone mark has one feature vector for each gene. The authors reuse the feature inputs and outputs of their previous work DeepChrome [4]. The input feature is represented by a matrix <math> \textbf{X} </math> of size <math> M \times T </math>, where <math> M </math> is the number of HM marks considered in the input and <math> T </math> is the number of bin positions used to represent the gene region. The <math> j^{th} </math> row of the matrix <math> \textbf{X} </math>, <math> x^j </math>, represents the sequentially structured signals from the <math> j^{th} </math> HM mark, where <math> j\in \{1, \cdots, M\} </math>. Therefore, <math> x^j_t </math> in the matrix <math> \textbf{X} </math> represents the value from the <math> t^{th}</math> bin belonging to the <math> j^{th} </math> HM mark, where <math> t\in \{1, \cdots, T\} </math>. If the training set contains <math>N_{tr} </math> labeled pairs, the <math> n^{th} </math> pair is specified as <math>( X^n, y^n)</math>, where <math> X^n </math> is a matrix of size <math> M \times T </math>, <math> y^n \in \{ -1, +1 \} </math> is the binary label, and <math> n \in \{ 1, \cdots, N_{tr} \} </math>.

Figure 2 (see also Figure 1(a) and 1(b)) shows the input feature and the output of AttentiveChrome for a particular gene (one sample).

[[File:input-output-attentivechrome.png|center|thumb| 700px | Figure 2: Input and Output of the AttentiveChrome model]]

== Bin-Level Encoder (one LSTM for each HM) ==
The sequentially ordered elements (each element is a bin position) of the gene region of the <math> n^{th} </math> gene are represented by the <math> j^{th} </math> row vector <math> x^j </math>. The authors treat each bin position as a time step for the LSTM. This study uses a bidirectional LSTM to model the overall dependencies among all <math> T </math> bin positions in the gene region. The bidirectional LSTM contains two LSTMs:
* A forward LSTM, <math> \overrightarrow{LSTM_j} </math>, to model <math> x^j </math> from <math> x_1^j </math> to <math> x_T^j </math>, which outputs the embedding vector <math> \overrightarrow{h^j_t} </math>, of size <math> d </math> for each bin <math> t </math>
* A reverse LSTM, <math> \overleftarrow{LSTM_j} </math>, to model <math> x^j </math> from <math> x_T^j </math> to <math> x_1^j </math>, which outputs the embedding vector <math> \overleftarrow{h^j_t} </math>, of size <math> d </math> for each bin <math> t </math>

The final output of this layer, the embedding vector at the <math> t^{th} </math> bin for the <math> j^{th} </math> HM, <math> h^j_t </math>, of size <math> 2d </math>, is obtained by concatenating the two vectors from both directions: <math> h^j_t = [ \overrightarrow{h^j_t}, \overleftarrow{h^j_t}]</math>. Because these LSTM-based HM encoders are trained together with the final classifier, they learn to embed each HM mark by drawing out the dependencies among its bins. Figure 1 (c) illustrates the module for <math> j=2 </math>.
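The bidirectional encoding described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: a simple tanh recurrence stands in for the LSTM cell, the weights are fixed hypothetical values instead of learned parameters, and the names <code>simple_rnn</code> and <code>bidirectional_encode</code> are made up for this sketch. What it does show is how the forward and reverse passes are aligned per bin and concatenated.

```python
import math

def simple_rnn(signal, w_in, w_rec):
    """Run a tanh recurrence over a 1-D signal and return one state per bin.

    Stand-in for an LSTM cell (assumption): h_t = tanh(w_in * x_t + W_rec h_{t-1}).
    """
    d = len(w_in)
    h = [0.0] * d
    states = []
    for x in signal:
        # The comprehension reads the previous h before h is rebound.
        h = [math.tanh(w_in[k] * x + sum(w_rec[k][i] * h[i] for i in range(d)))
             for k in range(d)]
        states.append(h)
    return states

def bidirectional_encode(signal, fwd_params, bwd_params):
    """Concatenate forward and reverse states so each bin gets a vector of size 2d."""
    forward = simple_rnn(signal, *fwd_params)
    # Encode the reversed signal, then flip back so index t refers to bin t again.
    backward = simple_rnn(signal[::-1], *bwd_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]

# Hypothetical weights with d = 2 and a toy signal with T = 4 bins.
fwd = ([0.5, -0.3], [[0.1, 0.0], [0.0, 0.1]])
bwd = ([0.2, 0.4], [[0.0, 0.1], [0.1, 0.0]])
x_j = [1.0, 0.0, 3.0, 2.0]
h_j = bidirectional_encode(x_j, fwd, bwd)
```

Each entry of <code>h_j</code> has length 4, i.e. twice the per-direction size, which matches the concatenation in the text.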

== Bin-Level <math> \alpha</math>-attention ==

Each bin contributes differently to the encoding of the entire <math> j^{th} </math> mark. To automatically and adaptively highlight the most important bins for prediction, a soft attention weight vector <math> \alpha^j </math> of size <math> T </math> is learned for each <math> j </math>. To calculate the soft weight <math> \alpha^j_t </math> for each <math> t </math>, the embedding vectors <math> \{h^j_1, \cdots, h^j_T \} </math> of all the bins are used. The following equation is used:

<center><math> \alpha^j_t = \frac{exp(\textbf{W}_b h^j_t)}{\sum_{i=1}^T{exp(\textbf{W}_b h^j_i)}} </math></center>

<math> \alpha^j_t</math> is a scalar and is computed from the embedding vectors <math>h^j</math> of all bins. The parameter <math> \textbf{W}_b </math> is initialized randomly and learned jointly with the other model parameters during training. Once the importance weight of each bin position is available, the <math> j^{th} </math> HM mark can be represented by <math> m^j = \sum_{t=1}^T{\alpha^j_t \times h^j_t}</math>. Here, <math> h^j_t</math> is the embedding vector and <math> \alpha^j_t </math> is the importance weight of the <math> t^{th} </math> bin in the representation of the <math> j^{th} </math> HM mark. Intuitively, <math> \textbf{W}_b </math> learns the context of the task, such as the cell type. Figure 1(d) shows this module for <math> HM_2 </math>.
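A minimal sketch of the bin-level attention for a single HM mark is given below. It assumes that <math> \textbf{W}_b </math> acts as a vector so that <math> \textbf{W}_b h^j_t </math> is a scalar score, and it uses toy numbers in place of learned embeddings; the function names are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bin_attention(h_j, w_b):
    """Return (alpha_j, m_j) for one HM mark.

    h_j : list of T embedding vectors, one per bin
    w_b : context vector with the same length as each embedding (assumption)
    """
    # One scalar score per bin: W_b h^j_t
    scores = [sum(w * h for w, h in zip(w_b, h_t)) for h_t in h_j]
    alpha = softmax(scores)
    dim = len(h_j[0])
    # m^j is the attention-weighted sum of the bin embeddings.
    m = [sum(alpha[t] * h_j[t][k] for t in range(len(h_j))) for k in range(dim)]
    return alpha, m

h_j = [[0.1, 0.2], [0.9, 0.7], [0.3, 0.1]]  # toy embeddings, T = 3
w_b = [1.0, 2.0]                             # hypothetical context vector
alpha_j, m_j = bin_attention(h_j, w_b)
```

In this toy case the middle bin has the largest score, so it receives the largest weight, and the weights sum to one by construction.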

== HM-level Encoder (one LSTM) ==

Studies have observed that HMs work cooperatively to provoke or subdue gene expression [5]. The HM-level encoder (not shown in Figure 1) uses one bidirectional LSTM to capture this relationship between the HMs. Because a recurrent encoder needs a sequence, the HMs are arranged in an arbitrary order; the authors did not find that any specific ordering of the HMs had an influence. The representation <math> m^j </math> of the <math> j^{th} </math> HM, <math> HM_j </math>, which is calculated by the bin-level attention layer, is the input of this step. This set-based encoder outputs an embedding vector <math> s^j </math> of size <math> d' </math>, which is the encoding for the <math> j^{th} </math> HM.

<math> s^j = [ \overrightarrow{LSTM_s}(m^j), \overleftarrow{LSTM_s}(m^j) ] </math>

The dependencies between <math> j^{th} </math> HM and the other HM marks are encoded in <math> s^j </math>, whereas <math> m^j </math> from the previous step encodes the bin dependencies of the <math> j^{th} </math> HM.

[[File:table1.png|center|thumb| 700px | Table 1: Comparison of previous studies for the task of quantifying gene expression using histone modification marks. AttentiveChrome is the only model that exhibits all 8 desirable properties.]]

== HM-Level <math> \beta</math>-attention ==
This second soft attention level (Figure 1(e)) finds the important HM marks for classifying a gene’s expression by learning the importance weights <math> \beta^j </math> for each <math> HM_j </math>, where <math> j \in \{ 1, \cdots, M \} </math>. The equation is

<math> \beta^j = \frac{exp(\textbf{W}_s s^j)}{\sum_{i=1}^M{exp(\textbf{W}_s s^i)}} </math>

The HM-level context parameter <math> \textbf{W}_s </math> is trained jointly with the rest of the model. Intuitively, <math> \textbf{W}_s </math> learns how significant each HM is for a cell type. Finally, the entire gene region is encoded in a hidden representation <math> \textbf{v} </math>, using the weighted sum of the embeddings of all HM marks.

<math> \textbf{v} = \sum_{j=1}^{M}{\beta^j \times s^j}</math>

== End-to-end training ==

The embedding vector <math> \textbf{v} </math> is fed to a simple classification module, <math> f(\textbf{v}) = </math>softmax<math> (\textbf{W}_c\textbf{v}+b_c) </math>, where <math> \textbf{W}_c </math>, and <math> b_c </math> are learnable parameters. The output is the probability of gene expression being high (expressed) or low (suppressed).
The whole model, including the attention modules, is differentiable, so it can be trained end-to-end with backpropagation. Training minimizes the negative log-likelihood loss.
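The last two stages (HM-level attention followed by the softmax classifier and its loss) can be sketched as follows. This is a hedged illustration, not the released code: the encodings <math> s^j </math> are toy values, <math> \textbf{W}_s </math> is treated as a vector, the parameter values are hypothetical, and mapping the labels <math> -1 </math> and <math> +1 </math> to output indices 0 and 1 is an assumption of this sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward(s, w_s, W_c, b_c):
    """HM-level attention, then the softmax classifier.

    s   : list of M encodings s^j
    w_s : HM-level context vector (assumed to be a vector)
    W_c : 2 x dim weight matrix, b_c : 2 biases (row 0: OFF, row 1: ON)
    """
    beta = softmax([dot(w_s, s_j) for s_j in s])
    # v is the beta-weighted sum of the HM encodings.
    v = [sum(beta[j] * s[j][k] for j in range(len(s))) for k in range(len(s[0]))]
    probs = softmax([dot(row, v) + b for row, b in zip(W_c, b_c)])
    return beta, v, probs

def nll(probs, label_index):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label_index])

s = [[0.2, -0.1], [0.8, 0.5], [-0.3, 0.4]]   # toy encodings, M = 3
w_s = [1.0, 1.0]                              # hypothetical context vector
W_c = [[-1.0, -1.0], [1.0, 1.0]]              # hypothetical classifier weights
b_c = [0.0, 0.0]
beta, v, probs = forward(s, w_s, W_c, b_c)
loss_if_on = nll(probs, 1)
```

Because every step is a differentiable function of the parameters, a framework with automatic differentiation could minimize this loss over all parameters jointly.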

= Experimental Settings =

This work makes use of the REMC dataset. AttentiveChrome is evaluated on 56 different cell types. Similar to DeepChrome, this study considers the following five core HM marks (<math> M=5 </math>), because these selected marks are uniformly profiled across all 56 cell types in the REMC study.

[[File:HM.png|center|thumb| 700px | Table 1: Five core HM marks and their attributes considered in this paper]]

For each gene, a region of 10k base pairs centred at the TSS (5k bp in each direction) is taken into account. These 10k base pairs are divided into <math> T=100 </math> bins, each consisting of 100 contiguous bp. Therefore, for each gene in a particular cell type, the input matrix is of size <math> 5 \times 100 </math>. The gene expression labels are normalized and discretized into binary labels. The dataset is divided into three equal-sized folds for training, validation, and testing.
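The binning arithmetic can be made concrete with a short sketch. It assumes that each bin holds the summed per-base read count, which is an assumption of this example rather than a detail stated above, and it uses synthetic coverage in place of real REMC reads; <code>bin_signal</code> is a hypothetical helper.

```python
def bin_signal(per_base_counts, num_bins=100):
    """Sum per-base read counts into equal-width bins."""
    bin_width = len(per_base_counts) // num_bins
    if bin_width * num_bins != len(per_base_counts):
        raise ValueError("region length must be a multiple of num_bins")
    return [sum(per_base_counts[i * bin_width:(i + 1) * bin_width])
            for i in range(num_bins)]

REGION_BP = 10_000   # 5k bp on each side of the TSS
NUM_MARKS = 5        # M, the five core HM marks

# Synthetic coverage: one read at every 50th base of each mark.
toy_reads = [[1 if pos % 50 == 0 else 0 for pos in range(REGION_BP)]
             for _ in range(NUM_MARKS)]
X = [bin_signal(track) for track in toy_reads]  # M x T input matrix
```

With 10,000 bp and 100 bins, each bin spans 100 bp, so <code>X</code> has 5 rows and 100 columns, and every bin in this synthetic example contains 2 reads.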

== Model Variations and Two Baselines ==
To evaluate the performance of the proposed model, the authors considered an RNN method (a direct LSTM without any attention) and their prior work DeepChrome as baselines. The results obtained from multiple variations of the AttentiveChrome model are compared with the baselines. The authors considered five variants of AttentiveChrome during performance evaluation. The variants are:

* LSTM-Attn: one LSTM with attention on the input matrix (does not consider the modular nature of HM marks)
* CNN-Attn: DeepChrome [4] with one attention mechanism incorporated.
* LSTM-<math>\alpha , \beta</math>: the proposed architecture.
* CNN-<math>\alpha , \beta</math>: the LSTM module of the proposed architecture replaced with a CNN. This variation includes two attention mechanisms. The first places one <math>\alpha</math>-attention on top of a CNN module per HM mark, and the second, the <math>\beta</math>-attention, is used to combine the HMs.
* LSTM-<math>\alpha</math>: one LSTM and <math>\alpha</math>-attention per HM mark.

== Hyperparameters ==

For all the variants of AttentiveChrome, the bin-level LSTM embedding size <math> d</math> is set to 32 and the HM-level LSTM embedding size <math>d'</math> is set to 16. Because the LSTMs are bidirectional, the sizes of the embedding vectors <math> h_t</math> and <math>m_j</math> are 64 and 32, respectively. The sizes of the context vectors are set accordingly.

= Performance Evaluation =

== AUC Scores ==

This study summarizes AUC scores across all 56 cell types on the test set to compare the methods.

[[File:AUC.JPG|center|thumb| 700px | Table 2: AUC score performances for different variations of AttentiveChrome and baselines]]

Overall, the LSTM-attention models perform better than the DeepChrome (CNN-based) and LSTM baselines. The authors argue that the proposed AttentiveChrome model is a good choice because of its interpretability, even though the performance improvement over DeepChrome is insignificant.
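For readers unfamiliar with the metric, the sketch below computes an AUC score from binary labels and predicted scores using the pairwise-ranking definition. The labels and scores are made-up toy values, not results from the paper, and <code>auc_score</code> is an illustrative helper rather than the evaluation code the authors used.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random ON gene outranks a random OFF gene.

    labels : +1 (expressed) or -1 (not expressed)
    scores : predicted probability of being expressed
    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one gene of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, -1, -1, 1, -1]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.1]
auc = auc_score(labels, scores)
```

In this toy example, 8 of the 9 (ON, OFF) pairs are ranked correctly, so <code>auc</code> equals 8/9. The quadratic loop is fine for illustration; a rank-based formula would be preferable for many genes.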

== Evaluation of Attention Scores for Interpretation ==

To understand whether the model is focusing on the right regions, the authors make use of additional study results from the REMC database. To validate the bin-level attention, signal data of a new histone mark, H3K27ac, referred to as <math>H_{active}</math> in this article, is taken from the REMC database. This histone mark is known to mark active regions when the gene is expressed (ON). Genome-wide reads of this HM mark are available for three important cell types: stem cell (H1-hESC), blood cell (GM12878), and leukemia cell (K562). This HM mark is used only to analyze the visualization results and is not used in the learning phase. The authors discuss the performance of both attention mechanisms in this section.

=== Correlation of Importance Weight of <math>H_{prom}</math> with <math>H_{active}</math> ===

The average read count of <math>H_{active}</math> across all 100 bins for all the active genes (ON, or labeled <math>+1</math>) in the three selected cell types is calculated. The proposed AttentiveChrome and LSTM-<math>\alpha</math> methods are compared with two widely used visualization techniques, (1) class-based optimization and (2) saliency maps, applied to the baseline DeepChrome model (the CNN-based prior work). Using these visualization methods, the authors calculate the importance weights for <math>H_{prom}</math> (the promoter HM mark used in training) across the 100 bins. The Pearson correlation between these importance weights and the read counts of <math>H_{active}</math> (the HM mark used for validation) across the same 100 bins is then computed. The <math>H_{active}</math> read counts indicate the actual active regions of those cells.
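The correlation step can be sketched as below. The two vectors are short toy stand-ins for the 100-bin importance weights and the averaged <math>H_{active}</math> read counts; they are invented for illustration and are not values from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

attention = [0.05, 0.10, 0.40, 0.30, 0.15]   # toy importance weights per bin
h_active = [2.0, 5.0, 20.0, 14.0, 6.0]       # toy average read counts per bin
r = pearson(attention, h_active)
```

A value of <code>r</code> close to 1 means the bins that receive high importance weights are also the bins with high <math>H_{active}</math> signal, which is the agreement the authors look for.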

[[File: pc.JPG|center|thumb| 700px | Figure 4: Pearson Correlation between a known active HM mark]]

The results indicate that the proposed models consistently achieve the highest correlation with <math>H_{active}</math> for all three cell types. Thus, the proposed method succeeds in capturing the important signals.

=== Visualization of Attention Weight of bins for each HM of a specific cell type GM12878===

To visualize bin level attention weights, the authors plotted the average bin-level attention weights for each HM for a specific cell type GM12878 (blood cell) for expressed (ON) genes and suppressed (OFF) genes separately.

[[File: figure2.png|center|thumb| 700px |]]

For the “ON” genes, the attention profiles are well defined for the HM marks <math>H_{prom}</math>, <math>H_{enhc}</math>, and <math>H_{struct}</math>. On the other hand, the weights are low for <math>H_{reprA}</math> and <math>H_{reprB}</math>. The average trend reverses for the “OFF” genes, where the repressor HM marks have more influence than <math>H_{prom}</math>, <math>H_{enhc}</math>, and <math>H_{struct}</math>. This observation agrees with the biological finding that the <math>H_{prom}</math>, <math>H_{enhc}</math>, and <math>H_{struct}</math> marks stimulate gene activation, whereas the <math>H_{reprA}</math> and <math>H_{reprB}</math> marks restrain the genes.

=== Attention Weight of bins with <math>H_{active}</math>===

The average read counts of <math>H_{active}</math> for the same 100 bins across all the active (ON) genes for the cell type GM12878 are plotted (Figure 2(b)). In addition, for AttentiveChrome, the bin-level attention weights averaged over all the genes that are predicted ON for GM12878 are also plotted. The plots show that the <math>H_{prom}</math> profile is similar to <math>H_{active}</math>.

=== Visualization of HM-level Attention Weight for Gene PAX5 ===

To visualize the HM-level attention weights, the authors produce a heatmap for a differentially regulated gene, PAX5, for the three aforementioned cell types. The heatmap is presented in Figure 2(c). PAX5 plays a significant role in gene regulation when stem cells convert to blood cells. This gene is OFF in stem cells (H1-hESC), but it becomes activated when the stem cell is transformed into a blood cell (GM12878). The <math>\beta_j</math> weight for <math>H_{repr}</math> is high when the gene is OFF in H1-hESC, and the weight decreases when the gene is ON in GM12878. In contrast, for the <math>H_{prom}</math> mark the <math>\beta_j</math> weight increases from H1-hESC to GM12878 as the gene becomes activated. This information extracted by the deep learning model is also supported by the biological literature [16].

= Related Works/Studies =

In the last few years, deep learning models have obtained unprecedented success in diverse research fields. Though not as rapidly as in other fields, deep learning based algorithms are gaining popularity among bioinformaticians.

== Attention-based Deep Models ==

The idea of attention in deep learning is adapted from the human visual perception system. Humans tend to focus on some parts more than others while perceiving a scene. This mechanism, combined with deep neural networks, has achieved excellent results in several research topics, such as machine translation. Various types of attention models, e.g., soft [6], location-aware [7], or hard [8, 9] attention, have been proposed in the literature. In the soft attention model, a soft weight vector is calculated over the feature vectors. The magnitude of each weight is correlated with the degree of importance of the feature in the prediction. In practice, RNNs are often used to help implement such models.

== Visualization and Apprehension of Deep Models ==

Prior studies mostly focused on interpreting convolutional neural networks (CNN) for image classification. Deconvolution approaches [10] attempt to map hidden layer representations back to the input space. Saliency maps [11, 12] use a Taylor expansion to approximate the network and identify the most relevant input features. Class optimization [12] based visualization techniques attempt to find the best example member of each class. Some recent research works [13, 14] tried to understand recurrent neural networks (RNN) for text-based problems. By looking into the features the model attends to, we can interpret the output of a deep model.

== Deep Learning in Bioinformatics ==
Deep learning is also becoming popular in bioinformatics because it can extract meaningful representations from datasets. Researchers use deep learning to model protein sequences and DNA sequences, to predict gene expression, and to make sense of the effects of non-coding variants.

== Previous model for gene expression predictions ==
Multiple machine learning models have been used to predict gene expression from histone modification data (surveyed in [19]), such as linear regression [21], random forests [18], rule-based learning [19], CNNs [22], and support vector machines [17]. These studies designed different feature selection strategies to accommodate a large number of histone modification signals as input. The strategies included averaging the signal across all relevant positions, selecting input signals at the positions that were highly correlated with the target gene expression, and using a CNN (called DeepChrome [22]) to learn combinatorial interactions among histone modification marks. DeepChrome outperformed all previous methods (see Supplementary) on this task and used a class optimization-based technique for visualizing the learned model. However, this class-level visualization lacks the necessary granularity to understand the signals from multiple chromatin marks at the individual gene level.

= Conclusion =

The paper introduces an attention-based approach called "AttentiveChrome" that handles both understanding and prediction, with several advantages over previous architectures: higher accuracy than state-of-the-art baselines, and clearer interpretation than saliency maps and class optimization, which allows the authors to view what the model ‘sees’ during prediction. Another advantage of this approach is that it can model modular feature inputs that are sequentially structured. According to the authors, this is the first implementation of deep attention to understand gene regulation and the first attention-based model applied to a molecular biology dataset. The authors expect that, through this deep attention mechanism, biologists can gain a better understanding of epigenomic data. The model supports both understanding and prediction of hard-to-interpret biological data, as it grants insight into its predictions by locating ‘what’ and ‘where’ AttentiveChrome has focused.

= Critiques =

This paper does not make a considerable algorithmic contribution; the authors have only applied existing methods to this application. The deep learning based method is shown to perform better than simple machine learning models like linear regression and SVMs, but it is considerably harder to implement and has many more hyperparameters to tune. The training time is considerably higher, especially because all the parameters are learned together. The dataset considered in the application also seems to have only a limited number of samples for a study of such high complexity. The model hyperparameters appear to have been chosen without any explanation of the intuition behind them, and the authors have not cited any relevant literature to explain where these numbers came from.

The discussion of attention scores for interpretation does not provide a clear definition or mention previous literature that uses them. References about H3K27ac, and about how its read counts represent the active regions of a cell, should be included. No reasoning is given for why only one specific cell type is used to visualize the bin-level attention weights. Examples of other real-world problems where this model can be useful should be provided.

Moreover, this paper relies heavily on intuition. Because of the complicated structure of the model, it is likely challenging to provide algorithmic or theoretical justifications. As a result, there is no proper guidance on how the hyperparameters should be chosen or on what kind of treatment would be needed to apply the method to other datasets.

= Additional Resources =

# [https://qdata.github.io/deep4biomed-web/ Official DeepChrome Website]
# [http://papers.nips.cc/paper/7255-attend-and-predict-understanding-gene-regulation-by-selective-attention-on-chromatin-supplemental.zip Supplemental Resources]
# [https://github.com/QData/AttentiveChrome/blob/master/NIPS%20poster.pdf Poster]
# [https://www.youtube.com/watch?v=tfgmXvSgsQE&feature=youtu.be Video Presentation]

= Reference =

[1] Andrew J Bannister and Tony Kouzarides. Regulation of chromatin by histone modifications. Cell Research, 21(3):381–395, 2011.

[2] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.

[3] Singh, Ritambhara, et al. "Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin." Advances in Neural Information Processing Systems. 2017.

[4] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. Deepchrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.

[5] Joanna Boros, Nausica Arnoult, Vincent Stroobant, Jean-François Collet, and Anabelle Decottignies. Polycomb repressive complex 2 and h3k27me3 cooperate with h3k9 methylation to maintain heterochromatin protein 1α at chromatin. Molecular and cellular biology, 34(19):3662–3674, 2014.

[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[7] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.

[8] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[9] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.

[10] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.

[11] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. volume 11, pages 1803–1831, 2010.

[12] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. 2013.

[13] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and understanding recurrent networks. 2015.

[14] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in nlp. 2015.

[15] Xianjun Dong and Zhiping Weng. The correlation between histone modifications and gene expression. Epigenomics, 5(2):113–116, 2013.

[16] Shane McManus, Anja Ebert, Giorgia Salvagiotto, Jasna Medvedovic, Qiong Sun, Ido Tamir, Markus Jaritz, Hiromi Tagoh, and Meinrad Busslinger. The transcription factor pax5 regulates its target genes by recruiting chromatin-modifying proteins in committed b cells. The EMBO journal, 30(12):2388–2404, 2011.

[17] Chao Cheng, Koon-Kiu Yan, Kevin Y Yip, Joel Rozowsky, Roger Alexander, Chong Shou, Mark Gerstein, et al. A statistical framework for modeling gene expression using chromatin features and application to modencode datasets. Genome Biol, 12(2):R15, 2011.

[18] XianjunDong,MelissaCGreven,AnshulKundaje,SarahDjebali,JamesBBrown,ChaoCheng,ThomasR Gingeras, Mark Gerstein, Roderic Guigó, Ewan Birney, et al. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol, 13(9):R53, 2012.

[19] Xianjun Dong and Zhiping Weng. The correlation between histone modifications and gene expression. Epigenomics, 5(2):113–116, 2013.

[20] Bich Hai Ho, Rania Mohammed Kotb Hassen, and Ngoc Tu Le. Combinatorial roles of dna methylation and histone modifications on gene expression. In Some Current Advanced Researches on Information and Computer Science in Vietnam, pages 123–135. Springer, 2015.

[21] Rosa Karlić, Ho-Ryun Chung, Julia Lasserre, Kristian Vlahoviček, and Martin Vingron. Histone modification levels are predictive for gene expression. Proceedings of the National Academy of Sciences, 107(7):2926–2931, 2010.

[22] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. Deepchrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-12-09T04:48:32Z

H454chen: /* Experiments */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series with multivariate and noisy signals, especially financial data. Financial time series are challenging to predict because of their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is often obtained asynchronously from different sources (e.g. financial news, an investment bank, a financial analyst), and each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than one with fewer clients, which means the significance of each past observation may depend on other factors that change over time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance motivates combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The proposed model is also inspired by the gating mechanism that has been successful in RNNs and Highway Networks.

Time series forecasting is focused on modeling the predictors of future values of a time series given its past. Since in many cases the relationship between past and future observations is not deterministic, this amounts to expressing the conditional probability distribution as a function of the past observations:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
This forecasting problem has been approached almost independently by econometrics and machine learning communities. In this paper, the authors focus on modeling the predictors of future values of time series given their past values.

Financial time series are particularly challenging for the following reasons:
* Low signal-to-noise ratio and heavy-tailed distributions.
* They are observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments in time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial datasets remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that the proposed model, which combines linear models with deep learning, can perform better than purely deep learning models such as CNNs, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
Recent proceedings of the main machine learning venues (ICML, NIPS, AISTATS, UAI) show that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Although the econometrics and machine learning approaches are still largely independent, combined models have started to appear, for example the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to obtain such a combined model.

Although deep neural networks have been applied in many fields with satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate. The LSTM architecture has three "gates": the input gate, the forget gate, and the output gate. It performs well in practice because it allows the RNN architecture to take into account events that happened a long time ago. Traditionally, RNN architectures are heavily influenced by recent events, but LSTM mitigates this through its gates, which learn what to write to, keep in, and read from the memory cell.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, a household electricity consumption dataset, and real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a model of order p (relating the current value to the p previous values), the equation is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters (coefficients), <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.
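
As a minimal sketch of the AR(p) equation above, the following Python code simulates a series and computes the one-step-ahead conditional mean. The function names, the Gaussian noise, and the zero initial history are illustrative assumptions and do not come from the paper.

```python
import random


def simulate_ar(phi, c=0.0, noise_std=1.0, n_steps=200, seed=0):
    """Simulate X_t = c + sum_i phi[i-1] * X_{t-i} + eps_t.

    Assumptions for illustration: Gaussian noise and a zero initial history.
    """
    rng = random.Random(seed)
    p = len(phi)
    x = [0.0] * p  # arbitrary starting values
    for _ in range(n_steps):
        past = x[-1:-p - 1:-1]  # X_{t-1}, ..., X_{t-p}
        mean = c + sum(a * v for a, v in zip(phi, past))
        x.append(mean + rng.gauss(0.0, noise_std))
    return x[p:]


def ar_predict(phi, c, history):
    """One-step-ahead conditional mean given the latest len(phi) values."""
    p = len(phi)
    past = history[-1:-p - 1:-1]  # most recent value first
    return c + sum(a * v for a, v in zip(phi, past))


series = simulate_ar([0.5, 0.25], c=1.0, n_steps=50)
next_value = ar_predict([0.5, 0.25], 1.0, series)
```

With the noise switched off (<code>noise_std=0.0</code>) the simulation reduces to the deterministic recursion, which is a convenient sanity check.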

===Gating and weighting mechanisms===
Gating mechanisms for neural networks can help overcome the problem of vanishing gradients, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid non-linearity that controls the amount of output passed to the next layer. Different compositions of functions of this type have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks. It is most closely related to the softmax gating used in MuFuRu (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRu) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.
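
The two gating forms above can be sketched in a few lines of plain Python. This is only an illustration: the helper names are hypothetical, vectors are plain lists, and the softmax is assumed to be taken over the <math>L</math> candidates separately for each coordinate.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_gate(candidate, gate_logits):
    """f(x) = c(x) (element-wise product) sigma(x)."""
    return [c * sigmoid(g) for c, g in zip(candidate, gate_logits)]


def softmax_gate(candidates, logits):
    """f(x) = sum_l p^l(x) * f^l(x), with p = softmax over the L candidates.

    candidates and logits are lists of L vectors of equal length.
    Assumption: the softmax is applied per coordinate across candidates.
    """
    n_candidates = len(candidates)
    dim = len(candidates[0])
    out = []
    for j in range(dim):
        p = softmax([logits[l][j] for l in range(n_candidates)])
        out.append(sum(p[l] * candidates[l][j] for l in range(n_candidates)))
    return out
```

With equal logits the softmax gate returns the plain average of the candidates, while a very large logit selects a single candidate.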

This idea is also successfully used in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. One difference is that these functions are modelled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
The authors state five main motivations in the paper:
#Forecasting has been approached almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics focuses more on explaining variables than on improving out-of-sample prediction power. In practice, econometric models tend to 'over-fit' financial time series: their parameters are unstable and their out-of-sample performance is poor.
#It is difficult for learning algorithms to deal with time series whose observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that can handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For AR series of higher order with more past observations, the expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve complicated functions that in general do not admit a closed-form expression.
#In practice, the dimensions of a multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents the separate dimensions as a single series with dimension and duration indicators as additional features (Figure 2(b)).
#Consider a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that an LSTM could memorize the input values at each step and weight them at the output according to the durations, but this approach may lead to an imbalance between the need for memory and the need for nonlinearity. The weights assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while the past observations themselves might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there is a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional future values of a subset of elements of <math>x_n</math>:
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m},</math></div>
That is, the estimate is the sum of the columns of the matrix in brackets. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network, i.e. it is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.

#<math>\sigma</math> is a normalized activation function applied independently to each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as the Hadamard product).
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix A.

Note that summing the columns of a matrix is the same as multiplying it by the all-ones vector, e.g. <math>\sum_{m=1}^M W_{\cdot,m}=W\cdot(1,1,...,1)^T</math>. Substituting the definition of <math>F</math> into the column sum, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S_{\cdot,m}(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network (SOCNN); <math>\text{off}</math> and <math>S</math> in the equation correspond to "Offset" and "Significance" in the name, respectively.
Figure 3 shows the scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> enforces a separation of three components: the temporal dependence (captured by the weights <math>W</math>), the local significance of observations (<math>S</math>, which as a convolutional network is determined by filters that capture local dependencies and are independent of the relative position in time), and the predictors <math>\text{off}(x_{n-m})</math>, which are completely independent of position in time. Through the offset network, each past observation provides an adjusted single regressor for the target variable. This adjustment is important because, under asynchronous sampling, consecutive values of x come from different signals and might be heterogeneous. The significance network then provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.
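
The weighting step of the estimator can be sketched as follows. This is not the authors' implementation: the outputs of the offset and significance networks are passed in as precomputed <math>d_I \times M</math> nested lists, the function name is hypothetical, and <math>\sigma</math> is assumed to be a row-wise softmax, which is one activation satisfying the normalization constraint above.

```python
import math


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def socnn_combine(W, offsets, x_target, S):
    """y_hat = sum_m W[:, m] * (off(x_{n-m}) + x^I_{n-m}) * sigma(S)[:, m].

    All arguments are d_I x M nested lists (rows: output features,
    columns: lags m = 1..M). offsets and S stand in for the outputs of
    the offset and significance networks.
    """
    y_hat = []
    for w_row, off_row, x_row, s_row in zip(W, offsets, x_target, S):
        weights = softmax(s_row)  # non-negative, sums to 1 over the M lags
        y_hat.append(sum(w * (o + x) * p
                         for w, o, x, p in zip(w_row, off_row, x_row, weights)))
    return y_hat


# One output feature, two lags, equal significance: the plain average.
prediction = socnn_combine([[1.0, 1.0]], [[0.0, 0.0]], [[2.0, 4.0]], [[0.0, 0.0]])
```

When one significance score dominates, the prediction collapses to the corresponding adjusted regressor, which is the data-dependent weighting described above.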

===Relation to asynchronous data===
One common problem with time series is that the durations between consecutive observations vary. The paper states two ways to address this problem:
#Data preprocessing: aligning the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as noted in the figure, this approach tends to lose information and to enlarge the dataset and the model complexity.
#Additional features: treating the duration or time of the observations as additional features. This is the core of SOCNN and is shown in Figure 2(b).
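
The second option can be illustrated with a short sketch that turns timestamped observations from several sources into a single value series with indicator and duration features, in the spirit of Figure 2(b). The function name, the tuple layout, and the choice of a zero duration for the first observation are assumptions made for this example.

```python
def to_async_representation(events, n_sources):
    """Build rows [value, indicator_0, ..., indicator_{K-1}, duration].

    events: list of (time, source_index, value) tuples in any order.
    The duration is the time elapsed since the previous observation;
    it is set to 0.0 for the first one (an assumption).
    """
    rows = []
    prev_time = None
    for time, source, value in sorted(events):
        indicators = [1.0 if k == source else 0.0 for k in range(n_sources)]
        duration = 0.0 if prev_time is None else time - prev_time
        rows.append([value] + indicators + [duration])
        prev_time = time
    return rows


quotes = [(2.5, 0, 11.0), (0.0, 1, 10.0), (3.0, 1, 10.5)]
rows = to_async_representation(quotes, n_sources=2)
```

Each row has K + 2 entries for K sources: one value, K source indicators, and one duration.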

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason, an auxiliary loss function is used, equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.
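
The loss terms can be written out as a small Python sketch. The function names and the data layout (one <math>d_I</math>-vector per lag) are assumptions for illustration; the offset outputs are taken as given rather than computed by a network.

```python
def squared_error(y, y_ref):
    """L^2(y, y') = ||y - y'||^2 for two vectors given as lists."""
    return sum((a - b) ** 2 for a, b in zip(y, y_ref))


def auxiliary_loss(offsets, x_target, y):
    """(1/M) * sum_m ||off(x_{n-m}) + x^I_{n-m} - y_n||^2.

    offsets and x_target are lists of M vectors (one per lag), each of
    length d_I; y is the target vector of length d_I.
    """
    n_lags = len(offsets)
    total = 0.0
    for off_m, x_m in zip(offsets, x_target):
        intermediate = [o + x for o, x in zip(off_m, x_m)]
        total += squared_error(intermediate, y)
    return total / n_lags


def total_loss(y_hat, y, offsets, x_target, alpha=0.1):
    """L^tot = L^2(y_hat, y) + alpha * L^aux."""
    return squared_error(y_hat, y) + alpha * auxiliary_loss(offsets, x_target, y)


loss = total_loss([2.0], [1.0], [[0.0], [1.0]], [[1.0], [2.0]], alpha=0.5)
```

Setting <code>alpha=0</code> recovers the plain L2 objective, which is one of the values compared in the experiments.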

=Experiments=
The paper evaluated the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance was compared with a simple CNN, single-layer and multi-layer LSTMs, Phased LSTM and a 25-layer ResNet. Apart from evaluating the SOCNN architecture, the paper also discussed the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>, so there is a synchronous and an asynchronous series for each value of K. Note that a series with K sources is (K + 1)-dimensional in the synchronous case and (K + 2)-dimensional in the asynchronous case. The base series in all processes was a stationary AR(10) series. Although that series has a true order of 10, in the experimental setting the input data included the past 60 observations. The rationale is twofold: the data is observed at irregular random times, and in real-life problems the order of the model is unknown.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features are global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources, i.e. market participants such as analysts and banks. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and the duration since the previous quote. For each source and direction, the goal is to predict the next quoted price from that source and direction given the last 60 quotes.

[[File:async.png | 520px|center|]]

==Training details==
They applied grid search over some hyperparameters in order to assess the significance of the model's components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they used 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
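
For reference, the activation above is a one-line function. The sketch below is an illustration only; the name <code>leaky_relu</code> is not taken from the authors' code.

```python
def leaky_relu(x, slope=0.1):
    """Return x for x >= 0 and slope * x otherwise (slope 0.1 as in the formula above)."""
    return x if x >= 0 else slope * x


activations = [leaky_relu(v) for v in (-2.0, 0.0, 3.0)]
```

Unlike the plain ReLU, negative inputs keep a small non-zero slope, so their gradient does not vanish entirely.
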
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.

Table 1 presents the configuration of network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with a ratio of 3 to 1. The remaining 20% of the data was used as a test set.
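
One possible reading of this split is sketched below: the last 20% of timesteps is held out for testing, and the first 80% is divided at random into training and validation indices with a 3:1 ratio. The function name and the seed are hypothetical, and the exact sampling procedure used by the authors may differ.

```python
import random


def split_indices(n_timesteps, seed=0):
    """Chronological test hold-out plus a random 3:1 train/validation split."""
    cutoff = (4 * n_timesteps) // 5          # first 80% of timesteps
    head = list(range(cutoff))
    random.Random(seed).shuffle(head)        # random assignment within the head
    n_train = (3 * cutoff) // 4              # 3:1 ratio of train to validation
    train = sorted(head[:n_train])
    validation = sorted(head[n_train:])
    test = list(range(cutoff, n_timesteps))  # remaining 20%, strictly later in time
    return train, validation, test


train_idx, val_idx, test_idx = split_indices(100)
```

Keeping the test set strictly after the training and validation period avoids evaluating on timesteps that precede the training data.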

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]
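
The normalized uniform (Glorot) procedure draws each weight from a uniform distribution on <math>[-l, l]</math> with <math>l = \sqrt{6/(n_{in}+n_{out})}</math>. A minimal sketch, with a hypothetical function name and a plain nested list standing in for a weight matrix:

```python
import math
import random


def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


weights = glorot_uniform(3, 5)
```

The bound shrinks as the layer gets wider, which keeps the variance of activations and gradients roughly constant across layers.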

The authors carried out the experiments with Tensorflow and Keras and used different hardware to optimize the model for different datasets. The models for the artificial and electricity data were optimized using one NVIDIA K20 GPU, while the quotes data used only an Intel Core i7-6700 CPU.

==Results==
Table 2 shows the results for all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other networks on the asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results as LSTM. Phased LSTM and ResNet performed very poorly on the artificial asynchronous dataset and the quotes dataset, respectively. Notice that having more than one layer in the offset network has a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on an asynchronous dataset (see Table 3). However, for the other datasets its impact was negligible. This makes it hard to justify the introduction of the auxiliary loss function <math>L^{aux}</math>.

Also, relying on artificial datasets for the experimental results is not good practice here. This is essentially an application paper, and such datasets make the results hard to reproduce and cannot support the performance claims made for the model.

[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has a significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, to test the robustness of the proposed SOCNN model, noise terms were added to the Asynchronous 16 dataset to check how the networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
In Figure 6, the purple and green lines seem to stay at the same position in the training and testing process. SOCNN and single-layer LSTM are the most robust and the least prone to overfitting compared to the other networks.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. This architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

The SOCNN can be extended further by adding intermediate weighting layers of the same type to the network structure. Another possible extension, which needs further empirical study, is to consider kernels other than <math>1 \times 1</math> convolutions in the offset sub-network. This architecture might also be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic process) regression.

=Critiques=
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.
#The quote data cannot be accessed because it is proprietary. Also, only two datasets are available.
#The 'Significance' network was described as critical to the model in the paper, but the authors did not show how the performance of SOCNN depends on the significance network.
#The transform of the original data to asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that train and test data were split is unclear. This could be important in the case of the financial data set.
#Although the auxiliary loss function was mentioned as an important part, its advantages were not made clear in the paper. It would be better if the paper described its effectiveness in more detail. It did help achieve a more stable test error throughout training in many cases.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing only against neural network models such as CNNs, which are known to work badly on time series data, it would be much better if the authors compared against well-known econometric time series models such as GARCH and VAR.
#The paper does not specify in detail how the training and test sets are separated, which is quite important in time-series problems. Moreover, a rolling or online learning scheme should be used in the comparison, since these are standard in time-series prediction tasks.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances In Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

ShakeDrop Regularization

2018-12-09T04:38:41Z

H454chen: /* Existing Methods */

=Introduction=
Current state-of-the-art techniques for object classification are deep neural networks based on the residual block, first published by He et al. (2016). This technique has been the foundation of several improved networks, including Wide ResNet (Zagoruyko & Komodakis, 2016), PyramidNet (Han et al., 2017) and ResNeXt (Xie et al., 2017). They have been further improved by regularization methods such as Stochastic Depth (ResDrop) (Huang et al., 2016) and Shake-Shake (Gastaldi, 2017), which can avoid problems such as vanishing gradients. Shake-Shake applied to ResNeXt has achieved one of the lowest error rates on the CIFAR-10 and CIFAR-100 datasets. However, it is only applicable to multi-branch architectures and is not memory efficient, since it requires two branches of residual blocks. Note that the authors of Shake-Shake reject the claim of memory inefficiency. They argue that having <math>2\times</math> branches does not mean Shake-Shake needs <math>2\times</math> memory, as it can use less memory to achieve the same performance.

To address this problem, the paper proposes ShakeDrop regularization, which can realize a disturbance similar to Shake-Shake on a single residual block. ShakeDrop disturbs learning more strongly by multiplying the output of a convolutional layer by a factor that can even be negative in the forward training pass. In addition, a factor different from the one in the forward pass is applied in the backward training pass. As a byproduct, however, the learning process becomes unstable, so the authors use ResDrop to stabilize it. This paper seeks to formulate a general expansion of Shake-Shake that can be applied to any network based on residual blocks.

=Existing Methods=

'''Deep Approaches'''

'''ResNet''' was the first network to use residual blocks, a foundational feature of many modern state-of-the-art convolutional neural networks. A residual block can be formulated as <math>G(x) = x + F(x)</math>, where <math>x</math> and <math>G(x)</math> are the input and output of the residual block, and <math>F(x)</math> is the output of the residual branch of the block. A residual block typically performs a convolution operation and then passes the result plus its input on to the next block.
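
The formula <math>G(x) = x + F(x)</math> can be sketched with plain Python lists. The residual branch here is an arbitrary callable standing in for the convolutional layers, and the function names are hypothetical.

```python
def residual_block(x, residual_branch):
    """Return G(x) = x + F(x), where F is the residual branch."""
    fx = residual_branch(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    """A residual branch whose output is zero everywhere."""
    return [0.0 for _ in x]


def scaled_branch(x):
    """A toy residual branch standing in for learned layers."""
    return [0.5 * v for v in x]


identity_out = residual_block([1.0, 2.0], zero_branch)
stacked_out = residual_block(residual_block([1.0, 2.0], scaled_branch), scaled_branch)
```

When the residual branch outputs zero, the block reduces to the identity mapping, so stacking more blocks does not force the network to change its input.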

The intuition behind residual blocks:
If the identity mapping is optimal, it is easier to push the residual to zero (<math>F(x) = 0</math>) than to fit an identity mapping (output equal to the input <math>x</math>) with a stack of non-linear layers. In simple language, it is much easier for a stack of non-linear CNN layers to arrive at a solution like <math>F(x) = 0</math> than at <math>F(x) = x</math>. This function <math>F(x)</math> is what the authors call the residual function ([https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624 Reference]).

Residual blocks are used for two main reasons. First, as networks become “deeper” and more flexible, gradients must be propagated back through many more layers during backpropagation. This greatly increases the risk of vanishing gradients, particularly with state-of-the-art structures. To counter this, residual blocks pass their input, via the identity function, further down the network. Intuitively, this gives higher gradient values. Secondly, this gives the network another path to work with. If a forced non-linearity is not an optimal choice, the network can bypass it through the identity path of these residual blocks. In combination, residual blocks facilitate the training of deep neural networks.

[[File:ResidualBlock.png|580px|centre|thumb|An example of a simple residual block from Deep Residual Learning for Image Recognition by He et al., 2016]]

ResNet is constructed out of a large number of these residual blocks sequentially stacked. It is interesting to note that having too many layers can cause overfitting, as pointed out by He et al. (2016) with the high error rates for the 1,202-layer ResNet on CIFAR datasets. Another paper (Veit et al., 2016) empirically showed that the cause of the high error rates can be mostly attributed to specific residual blocks whose channels increase greatly.

'''PyramidNet''' is an important iteration that built on ResNet and WideResNet by gradually increasing channels on each residual block. The residual block is similar to those used in ResNet. It has been used to generate some of the first successful convolution neural networks with very large depth, at 272 layers. Amongst unmodified residual network architectures, it performs the best on the CIFAR datasets.

[[File:ResidualBlockComparison.png|980px|centre|thumb|A simple illustration of different residual blocks from Deep Pyramidal Residual Networks by Han et al., 2017. The width of a block reflects the number of channels used in that layer.]]

'''Non-Deep Approaches'''

'''Wide ResNet''' modified ResNet by increasing channels in each layer, having a wider and shallower structure. Similarly to PyramidNet, this architecture avoids some of the pitfalls in the original formulation of ResNet.

'''ResNeXt''' achieved performance beyond that of Wide ResNet with only a small increase in the number of parameters. It can be formulated as <math>G(x) = x + F_1(x)+F_2(x)</math>. In this case, <math>F_1(x)</math> and <math>F_2(x)</math> are the outputs of two paired convolution operations in a single residual block. The number of branches is not limited to 2, and will control the result of this network.

[[File:SimplifiedResNeXt.png|600px|centre|thumb|Simplified ResNeXt Convolution Block. Yamada et al., 2018]]

'''Regularization Methods For Residual Blocks'''

'''Stochastic Depth''' works by randomly dropping paths in the residual blocks. On the <math>l^{th}</math> residual block the Stochastic Depth process is given as <math>G(x)=x+b_lF(x)</math> where <math>b_l \in \{0,1\}</math> is a Bernoulli random variable with <math>P(b_l = 1) = p_l</math>. Unlike sequential networks, there are many paths from the input to the output in these networks. By dropping some of the connections, the network is forced to route information through different paths to get the final deep layer representation. In a way it is similar to dropout, but for paths in multi-path networks. Using a constant value for <math>p_l</math> didn't work well, so instead a linear decay rule <math>p_l = 1 - \frac{l}{L}(1-p_L)</math> was used. In this equation, <math>L</math> is the number of layers, and <math>p_L</math> is the initial parameter (the value of <math>p_l</math> at <math>l = L</math>). Essentially, the probability that a residual branch is dropped increases linearly with its depth in the network.
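
The linear decay rule and the training-time forward pass can be sketched as follows. This is a minimal illustration, assuming <math>b_l = 1</math> (the branch is kept) with probability <math>p_l</math>; the function names and the example values <math>L = 4</math>, <math>p_L = 0.5</math> are hypothetical.

```python
import random


def survival_probability(l: int, L: int, p_L: float) -> float:
    """Linear decay rule: p_l = 1 - (l / L) * (1 - p_L)."""
    return 1.0 - (l / L) * (1.0 - p_L)


def stochastic_depth_forward(x, fx, p_l, rng):
    """Training-time G(x) = x + b_l * F(x), with b_l = 1 with probability p_l."""
    b_l = 1 if rng.random() < p_l else 0
    return [xi + b_l * fi for xi, fi in zip(x, fx)]


rng = random.Random(0)
for l in range(1, 5):
    p_l = survival_probability(l, L=4, p_L=0.5)
    out = stochastic_depth_forward([1.0, 1.0], [0.5, -0.5], p_l, rng)
    print(l, p_l, out)
```

Blocks near the input are almost always kept, while the last block is kept with probability <math>p_L</math>.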

'''Shake-Shake''' is a regularization method that specifically improves the ResNeXt (multiple residual connections) architecture. It is given as <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math>, where <math>\alpha \in [0,1]</math> is a random coefficient. Essentially, the two parallel residual branches are combined with random convex weights in the forward direction, so that one branch may be partially or entirely suppressed. This is similar to stochastic depth regularization, but a residual path always exists.
Moreover, on the backward pass a similar random variable <math>\beta</math> is used to independently reweight the paths for gradient flow. This has the effect of adding noise to the gradient update process and improves performance over the vanilla ResNeXt network.
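
A minimal sketch of the forward and backward scaling follows. It is an illustration under stated assumptions rather than Gastaldi's implementation: the branch outputs are given as plain lists, the function names are hypothetical, and only the gradients that reach the branch outputs and the identity path are shown (the gradient flowing back through the branches themselves is omitted).

```python
import random


def shake_shake_forward(x, f1, f2, alpha):
    """G(x) = x + alpha * F_1(x) + (1 - alpha) * F_2(x)."""
    return [xi + alpha * a + (1.0 - alpha) * b for xi, a, b in zip(x, f1, f2)]


def shake_shake_backward(grad_out, beta):
    """Scale the gradients reaching each branch output with an independent beta."""
    grad_f1 = [beta * g for g in grad_out]
    grad_f2 = [(1.0 - beta) * g for g in grad_out]
    grad_identity = list(grad_out)  # the identity path is left untouched
    return grad_identity, grad_f1, grad_f2


rng = random.Random(0)
alpha, beta = rng.random(), rng.random()  # drawn independently during training
out = shake_shake_forward([0.0], [2.0], [4.0], alpha)
grads = shake_shake_backward([1.0], beta)
print(alpha, out, beta, grads)
```

With <math>\alpha = 0.5</math> the forward pass reduces to an equally weighted combination of the two branches.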

[[File:Paper 32.jpg|600px|centre|thumb| Shake-Shake (ResNeXt + Shake-Shake) (Gastaldi, 2017), in which some processing layers omitted for conciseness.]]

=Proposed Method=
The authors give an intuitive interpretation of the forward pass of Shake-Shake regularization. To the best of their knowledge, none had been given before, whereas the behaviour of the backward pass was experimentally investigated by Gastaldi (2017). In the forward pass, Shake-Shake interpolates the outputs of two residual branches with a random variable <math>\alpha</math> that controls the degree of interpolation. Since DeVries & Taylor (2017a) demonstrated that interpolating two data points in feature space can synthesize reasonable augmented data, the interpolation of the two residual branches of Shake-Shake in the forward pass can be interpreted as synthesizing data. Using a random variable <math>\alpha</math> generates many different augmented data points. In the backward pass, on the other hand, a different random variable <math>\beta</math> is used to disturb learning so that the network remains trainable for a long time. Gastaldi (2017) demonstrated how the difference between <math>\alpha</math> and <math>\beta</math> affects performance.

The regularization mechanism of Shake-Shake relies on two or more residual branches, so it can be applied only to multi-branch network architectures. In addition, 2-branch network architectures consume more memory than 1-branch network architectures. One might think that the number of learnable parameters of ResNeXt can be kept the same across 1-branch and 2-branch architectures by controlling the cardinality and the number of channels (filters). For example, a 1-branch network (e.g., ResNeXt 1-64d) and its corresponding 2-branch network (e.g., ResNeXt 2-40d) have almost the same number of learnable parameters. Even so, the 2-branch network consumes more memory because of overhead such as storing the inputs of the residual blocks. Comparing ResNeXt 1-64d and 2-40d, the latter requires 8% more memory in theory (for one layer) and 11% more in measured values (for 152 layers).

This paper seeks to generalize the method proposed in Shake-Shake so that it can be applied to any residual network structure. The initial formulation, 1-branch shake, is <math>G(x) = x + \alpha F(x)</math>. In this case, <math>\alpha</math> is a coefficient that disturbs the forward pass, but it is not necessarily constrained to <math>[0,1]</math>. Another corresponding coefficient <math>\beta</math> is used in the backward pass. Applying this simple adaptation of Shake-Shake to a 110-layer version of PyramidNet with <math>\alpha \in [0,1]</math> and <math>\beta \in [0,1]</math> performs very poorly, with an error rate of 77.99%.

This failure is a result of the setup causing too much perturbation. A trick is needed to promote learning under large perturbations while preserving the regularization effect. The authors' idea is to borrow from ResDrop and combine it with Shake-Shake. This works by randomly deciding whether to apply 1-branch shake, which in effect creates two networks: the original network without a regularization component, and a regularized network. By mixing the two networks, the authors expected the following effects: when the non-regularized network is selected, learning is promoted; when the perturbed network is selected, learning is disturbed. Achieving good performance requires a balance between the two.

'''ShakeDrop''' is given as

<div align="center">
<math>G(x) = x + (b_l + \alpha - b_l \alpha)F(x)</math>,
</div>

where <math>b_l</math> is a Bernoulli random variable following the linear decay rule used in Stochastic Depth. An alternative presentation is

<div align="center">
<math>
G(x) = \begin{cases}
x + F(x) ~~ \text{if } b_l = 1 \\
x + \alpha F(x) ~~ \text{otherwise}
\end{cases}
</math>
</div>

If <math>b_l = 1</math> then ShakeDrop is equivalent to the original network, otherwise it is the network + 1-branch Shake. The authors also found that the linear decay rule of ResDrop works well, compared with the uniform rule. Regardless of the value of <math>\beta</math> on the backwards pass, network weights will be updated.
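
The two presentations of ShakeDrop can be checked with a short sketch. It is an illustration under stated assumptions, not the authors' code: <math>b_l = 1</math> is drawn with probability <math>p_l</math>, the coefficients are drawn uniformly from hypothetical default ranges, and the backward coefficient is assumed to enter through the same form with <math>\beta</math> in place of <math>\alpha</math>.

```python
import random


def shakedrop_coefficient(b_l: int, factor: float) -> float:
    """Return b_l + factor - b_l * factor: 1 when b_l == 1, else factor."""
    return b_l + factor - b_l * factor


def shakedrop_forward(x, fx, p_l, rng, alpha_range=(-1.0, 1.0)):
    """Training-time G(x) = x + (b_l + alpha - b_l * alpha) * F(x)."""
    b_l = 1 if rng.random() < p_l else 0  # assumed: P(b_l = 1) = p_l
    alpha = rng.uniform(*alpha_range)
    coeff = shakedrop_coefficient(b_l, alpha)
    return [xi + coeff * fi for xi, fi in zip(x, fx)], b_l


def shakedrop_backward(grad_out, b_l, rng, beta_range=(0.0, 1.0)):
    """Gradient reaching the branch output, scaled with an independent beta.

    Assumption: beta replaces alpha in the same coefficient form.
    """
    beta = rng.uniform(*beta_range)
    coeff = shakedrop_coefficient(b_l, beta)
    return [coeff * g for g in grad_out]


rng = random.Random(0)
out, b_l = shakedrop_forward([1.0, 2.0], [0.5, -0.5], p_l=0.75, rng=rng)
grad_f = shakedrop_backward([1.0, 1.0], b_l, rng)
print(b_l, out, grad_f)
```

When <math>b_l = 1</math> the coefficient collapses to 1 regardless of the sampled factor, which is the "original network" case of the piecewise form.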

=Experiments=

'''Parameter Search'''

The authors' experiments began with a hyperparameter search using ShakeDrop on pyramidal networks. The PyramidNet used had a total of 110 layers, including a convolutional layer and a final fully connected layer. It had 54 additive pyramidal residual blocks, and the final residual block had 286 channels. The results of this search are presented below.

[[File:ShakeDropHyperParameterSearch.png|600px|centre|thumb|Average Top-1 errors (%) of “PyramidNet + ShakeDrop” with several ranges of parameters of 4 runs at the final (300th) epoch on CIFAR-100 dataset in the “Batch” level. In some settings, it is equivalent to PyramidNet and PyramidDrop. Borrowed from ShakeDrop Regularization by Yamada et al., 2018.]]

The settings used throughout the rest of the experiments are then <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>. Cases H and F outperform PyramidNet, suggesting that the strong perturbations imposed by ShakeDrop are functioning as intended. However, fully applying the perturbations in the backward pass appears to destabilize the network, resulting in performance that is worse than standard PyramidNet.

[[File:ParameterUpdateShakeDrop.png|400px|centre]]

Following this initial parameter decision, the authors tested 4 different levels at which the scaling coefficients are drawn: "Batch" (same coefficients for all images in a minibatch for each residual block), "Image" (same scaling coefficients for each image for each residual block), "Channel" (same scaling coefficients for each channel for each residual block), and "Pixel" (same scaling coefficients for each element for each residual block). While Pixel was the best in terms of error rate, it is not very memory efficient, so Image was selected as it had the second best performance without the memory drawback.
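
The memory trade-off between the four levels can be made concrete by counting how many coefficients must be drawn per residual block. The sketch below assumes a feature map laid out as (images, channels, height, width) and one coefficient per channel at the "Channel" level; the function name and the example sizes are hypothetical.

```python
from math import prod

LEVELS = ("Batch", "Image", "Channel", "Pixel")


def coefficient_shape(level, n, c, h, w):
    """Shape of the coefficient tensor that broadcasts against (n, c, h, w)."""
    shapes = {
        "Batch": (1, 1, 1, 1),    # one coefficient for the whole minibatch
        "Image": (n, 1, 1, 1),    # one coefficient per image
        "Channel": (n, c, 1, 1),  # one coefficient per image and channel
        "Pixel": (n, c, h, w),    # one coefficient per element
    }
    return shapes[level]


for level in LEVELS:
    shape = coefficient_shape(level, n=128, c=16, h=32, w=32)
    print(level, shape, prod(shape))
```

At the "Pixel" level the coefficient tensor is as large as the feature map itself, which is the memory cost noted above.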

'''Comparison with Regularization Methods'''

For these experiments, a few modifications were made to assist with training. For ResNeXt, the EraseReLU formulation is used so that each residual block ends in batch normalization. Wide ResNet is likewise compared with and without batch normalization. Batch normalization keeps the outputs of residual blocks in a certain range; otherwise <math>\alpha</math> and <math>\beta</math> could cause perturbations that are too large, causing learning to diverge. There is also a comparison between ResDrop/ShakeDrop Type A (where the regularization unit is inserted before the add unit for a residual branch) and the variant in which the regularization unit is inserted after the add unit.

These experiments are performed on the CIFAR-100 dataset.

[[File:ShakeDropArchitectureComparison1.png|800px|centre|thumb|]]

[[File:ShakeDropArchitectureComparison2.png|800px|centre|thumb|]]

[[File:ShakeDropArchitectureComparison3.png|800px|centre|thumb|]]

For a final round of testing, the training setup was modified to incorporate other techniques used in state-of-the-art methods. For most of the tests, the learning rate for the 300-epoch version started at 0.1 and was decayed by a factor of 0.1 at 1/2 and 3/4 of the way through training. The alternative was cosine annealing, based on the presentation by Loshchilov and Hutter in their paper SGDR: Stochastic Gradient Descent with Warm Restarts. This is indicated in the Cos column, with a check indicating cosine annealing.
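
The two learning rate schedules can be sketched as below. This is an illustration, not the authors' training script: the function names are hypothetical, and the cosine variant is assumed to be a single annealing cycle from the base rate down to a minimum rate of 0 without restarts.

```python
import math


def step_lr(epoch, total_epochs=300, base_lr=0.1, gamma=0.1):
    """Multiply the rate by gamma at 1/2 and again at 3/4 of training."""
    lr = base_lr
    if epoch >= total_epochs // 2:
        lr *= gamma
    if epoch >= (3 * total_epochs) // 4:
        lr *= gamma
    return lr


def cosine_lr(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """One cosine cycle from base_lr to min_lr (assumed: no warm restarts)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


for epoch in (0, 149, 150, 225, 299):
    print(epoch, step_lr(epoch), cosine_lr(epoch, total_epochs=300))
```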

[[File:CosineAnnealing.png|400px|centre|thumb|]]

The Reg column indicates the regularization method used, either none, ResDrop (RD), Shake-Shake (SS), or ShakeDrop (SD). Finally, the Fil column indicates the type of data augmentation used, either none, cutout (CO) (DeVries & Taylor, 2017b), or Random Erasing (RE) (Zhong et al., 2017).

[[File:ShakeDropComparison.png|800px|centre|thumb|Top-1 Errors (%) at final epoch on CIFAR-10/100 datasets]]

'''State-of-the-Art Comparisons'''

A direct comparison with state of the art methods is favorable for this new method.

# Fair comparison of ResNeXt + Shake-Shake with PyramidNet + ShakeDrop gives an improvement of 0.19% on CIFAR-10 and 1.86% on CIFAR-100. Under these conditions, the final error rate is then 2.67% for CIFAR-10 and 13.99% for CIFAR-100.
# Fair comparison of ResNeXt + Shake-Shake + Cutout with PyramidNet + ShakeDrop + Random Erasing gives an improvement of 0.25% on CIFAR-10 and 3.01% on CIFAR-100. Under these conditions, the final error rate is then 2.31% for CIFAR-10 and 12.19% for CIFAR-100.
# In comparison with the state of the art, PyramidNet + ShakeDrop gives an improvement of 0.25% on CIFAR-10 over ResNeXt + Shake-Shake + Cutout, and an improvement of 2.85% on CIFAR-100 over Coupled Ensemble.

=Implementation details=

'''CIFAR-10/100 datasets'''

All the images in these datasets were color normalized and then horizontally flipped with a probability of 50%. All of them were then zero padded to a size of 40 by 40 pixels.
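
The flipping and padding steps can be sketched for a single-channel image stored as a list of rows. This is a simplified illustration: color normalization is omitted, the function names are hypothetical, and a padding of 4 pixels per side is assumed because CIFAR images are 32 by 32 pixels.

```python
import random


def horizontal_flip(image):
    """Mirror every row of the image left to right."""
    return [row[::-1] for row in image]


def zero_pad(image, pad):
    """Surround the image with a border of `pad` zero pixels on each side."""
    width = len(image[0])
    blank = [0.0] * (width + 2 * pad)
    padded = [blank[:] for _ in range(pad)]
    for row in image:
        padded.append([0.0] * pad + list(row) + [0.0] * pad)
    padded.extend(blank[:] for _ in range(pad))
    return padded


def augment(image, rng, pad=4):
    """Flip with probability 0.5, then zero pad (32x32 becomes 40x40)."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return zero_pad(image, pad)


image = [[float(c) for c in range(32)] for _ in range(32)]
padded = augment(image, random.Random(0))
print(len(padded), len(padded[0]))  # 40 40
```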

=Conclusion=
The paper proposes a new form of regularization that is an extension of "Shake-Shake" regularization [Gastaldi, 2017]. The original "Shake-Shake" proposes using two residual paths that add to the same output and, during training, considering different randomly selected convex combinations of the two paths (while using an equally weighted combination at test time). This paper contends that this requires additional memory, and attempts to achieve similar regularization with a single path. To do so, the authors train a network with a single residual path, where the residual is included without attenuation with some fixed probability, and is otherwise attenuated randomly (or even inverted). The paper contends that this achieves better performance than simply choosing a random attenuation for every sample (although this can be seen as choosing an attenuation under a distribution with some fixed probability mass).

Their stochastic regularization method, ShakeDrop, outperforms previous state-of-the-art methods while maintaining similar memory efficiency. It demonstrates that heavily perturbing a network can help to overcome issues with overfitting. It is also an effective way to regularize residual networks for image classification. The method was tested on the CIFAR-10/100 and Tiny ImageNet datasets and showed strong performance.

=Critique=

The novelty of this paper is low, as pointed out by the reviewers. It is also unclear whether the results can be replicated, as <math>\alpha</math> and <math>\beta</math> are chosen randomly. The proposed ShakeDrop regularization is essentially a combination of the PyramidDrop and Shake-Shake regularization. The most surprising part is that the forward weight can be negative, thus inverting the output of a convolution. The mathematical justification for ShakeDrop regularization is limited, relying on intuition and empirical evidence instead.

One downside of this method (as was also identified in the presentation) is that training the cosine annealing variation of the model takes 1800 epochs, which is time intensive compared to the baseline methods. This can limit practical use of the algorithm.

As pointed out above, the method relies heavily on intuition. This means that the performance of the algorithm is not guaranteed to extend beyond the CIFAR datasets and may vary considerably depending on the characteristics of the data set at hand. However, the performance is still impressive since it is better than that of known algorithms. It is not clear how the proposed technique would work with a non-residual architecture.
The paper lacks conclusive proof that ShakeDrop is a generically useful regularization technique. For one, the method is evaluated only on small toy datasets: CIFAR-10 and CIFAR-100. Evaluation on ImageNet would have been valuable, and SVHN is another dataset that would have been good to try. Overall, the impact of this method beyond CIFAR is unclear.

=References=
[Yamada et al., 2018] Yamada Y, Iwamura M, Kise K. ShakeDrop regularization. arXiv preprint arXiv:1802.02375. 2018 Feb 7.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

[Zagoruyko & Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proc. BMVC, 2016.

[Han et al., 2017] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proc. CVPR, 2017.

[Xie et al., 2017] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 2017.

[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382v3, 2016.

[Gastaldi, 2017] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485v2, 2017.

[Loshchilov & Hutter, 2016] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[DeVries & Taylor, 2017b] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.

[Zhong et al., 2017] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.

[Dutt et al., 2017] Anuvabh Dutt, Denis Pellerin, and Georges Quénot. Coupled ensembles of neural networks. arXiv preprint arXiv:1709.06053v1, 2017.

[Veit et al., 2016] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems 29, 2016.

CapsuleNets

2018-12-09T04:28:27Z

H454chen: /* Motivation */

The paper "Dynamic Routing Between Capsules" was written by three researchers at Google Brain: Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. This paper was published and presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach, California. The same three researchers recently published a highly related paper "[https://openreview.net/pdf?id=HJWLfGWRb Matrix Capsules with EM Routing]" for ICLR 2018.

=Motivation=

Ever since AlexNet eclipsed the performance of competing architectures in the 2012 ImageNet challenge, convolutional neural networks have maintained their dominance in computer vision applications. Despite the recent successes and innovations brought about by convolutional neural networks, some assumptions made in these networks are perhaps unwarranted and deficient. Using a novel neural network architecture, the authors create CapsuleNets, a network that they claim is able to learn image representations in a more robust, human-like manner. With only a 3 layer capsule network, they achieved near state-of-the-art results on MNIST.

The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. This paper explores an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity. The length of the vector output of a capsule cannot exceed 1 because of an application of a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.

The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for other parents. This increases the contribution that the capsule makes to that parent, thus further increasing the scalar product of the capsule’s prediction with the parent’s output. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. The authors demonstrate that their dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.

==Adversarial Examples==

First described by Christian Szegedy et al. in late 2013, adversarial examples have been heavily discussed by the deep learning community as a potential security threat to AI systems. Adversarial examples are defined as inputs that an attacker creates intentionally to fool a machine learning model. An adversarial example is shown below:

[[File:adversarial_img_1.png ‎|center]]

To human eyes, the image appears to be a panda both before and after noise is injected, whereas the trained ConvNet model classifies the noisy image as a gibbon with almost 100% certainty. The fact that the network is unable to classify the image as a panda after the epsilon perturbation points to many potential security risks in AI-dependent systems such as self-driving vehicles. Although various methods have been suggested to combat adversarial examples, robust defenses are hard to construct due to the inherent difficulty of building theoretical models of the adversarial example crafting process. However, beyond the fact that these examples may serve as a security threat, they emphasize that convolutional neural networks do not learn image classification/object detection patterns the same way that a human would. Rather than identifying the core features of a panda such as its eyes, mouth, nose, and the gradient changes in its black/white fur, the convolutional neural network seems to be learning image representations in a completely different manner. Deep learning researchers often attempt to model neural networks after human learning, and it is clear that further steps must be taken to make ConvNets robust against targeted noise perturbations.

==Drawbacks of CNNs==
Hinton claims that the key fault with traditional CNNs lies within the pooling function. Although pooling builds translational invariance into the network, it fails to preserve spatial relationships between objects. When we pool, we effectively reduce a <math>k \times k</math> window of convolved cells to a single scalar. This results in a desired local invariance without inhibiting the network's ability to detect features, but causes valuable spatial information to be lost.
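
The loss of spatial information can be seen in a small sketch of 2 by 2 max pooling with stride 2 (an assumed, common configuration; the function name and the toy grids are hypothetical). The two inputs place their activations at different positions inside each window, yet the pooled outputs are identical.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 on a grid with even height and width."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


a = [[9, 0, 0, 0],
     [0, 0, 0, 7],
     [0, 0, 0, 0],
     [0, 5, 3, 0]]

b = [[0, 0, 7, 0],
     [0, 9, 0, 0],
     [5, 0, 0, 0],
     [0, 0, 0, 3]]

print(max_pool_2x2(a))  # [[9, 7], [5, 3]]
print(max_pool_2x2(b))  # [[9, 7], [5, 3]]
```

After pooling, the network can no longer tell where inside each window a feature was detected.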

Also, in CNNs, higher-level features combine lower-level features as a weighted sum: activations of a previous layer multiplied by the current layer's weight, then passed to another activation function. In this process, pose relationship between simpler features is not part of the higher-level feature.

In the example below, the network is able to detect similar features (eyes, mouth, nose, etc.) within both images, but fails to recognize that one image is a human face while the other is a Picasso-esque rearrangement of the same features, because the CNN is unable to encode spatial relationships after multiple pooling layers.
In deep learning, the activation level of a neuron is often interpreted as the likelihood of detecting a specific feature. CNNs are good at detecting features but less effective at exploring the spatial relationships among features (perspective, size, orientation).

[[File:Equivariance Face.png ‎|center]]

Here, the CNN could wrongly activate the neuron for face detection. Because it does not register the mismatch in spatial orientation and size, the activation for face detection will be too high.

Conversely, we hope that a CNN can recognize that both of the following pictures contain a kitten. Unfortunately, when we feed the two images into a ResNet50 architecture, only the first image is correctly classified, while the second image is predicted to be a guinea pig.

[[File:kitten.jpeg ‎|center]]

[[File:kitten-rotated-180.jpg ‎|center]]

For a more in-depth discussion on the problems with ConvNets, please listen to Geoffrey Hinton's talk "What is wrong with convolutional neural nets?" given at MIT during the Brain & Cognitive Sciences - Fall Colloquium Series (December 4, 2014).

==Intuition for Capsules==
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Hinton argues that our brains reason about visual information by deconstructing it into a hierarchical representation, which we then match to familiar patterns and relationships from memory. The key difference between this understanding and the functionality of CNNs is that recognition of an object should not depend on the angle from which it is viewed.

To enforce rotational and translational equivariance, Capsule Networks store and preserve hierarchical pose relationships between objects. The core idea behind capsule theory is the explicit numerical representation of relative relationships between different objects within an image. With these relationships built into the model, a Capsule Network is able to recognize a newly seen object as a rotated view of a previously seen object. For example, the image below shows the Statue of Liberty from five different angles. Even if a person had only seen the Statue of Liberty from one angle, they would be able to ascertain that all five pictures below contain the same object (just from a different angle).

[[File:Rotational Invariance.jpeg ‎|center]]

Building on this idea of hierarchical representation of spatial relationships between key entities within an image, the authors introduce Capsule Networks. Unlike traditional CNNs, Capsule Networks are better equipped to classify objects correctly when they are rotated. Furthermore, the authors managed to achieve state-of-the-art results on MNIST using a fraction of the training samples that alternative state-of-the-art networks require.

=Background, Notation, and Definitions=

==What is a Capsule==
"Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting, and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold."

In essence, capsules store object properties in a vector form; probability of detection is encoded as the vector's length, while spatial properties are encoded as the individual vector components. Thus, when a feature is present but the image captures it under a different angle, the probability of detection remains unchanged.

A brief overview/understanding of capsules can be found in other papers from the author. To quote from [https://openreview.net/pdf?id=HJWLfGWRb this paper]:

<blockquote>
A capsule network consists of several layers of capsules. The set of capsules in layer <math>L</math> is denoted
as <math>\Omega_L</math>. Each capsule has a 4x4 pose matrix, <math>M</math>, and an activation probability, <math>a</math>. These are like the
activities in a standard neural net: they depend on the current input and are not stored. In between
each capsule <math>i</math> in layer <math>L</math> and each capsule <math>j</math> in layer <math>L + 1</math> is a 4x4 trainable transformation matrix,
<math>W_{ij}</math> . These <math>W_{ij}</math>'s (and two learned biases per capsule) are the only stored parameters and they
are learned discriminatively. The pose matrix of capsule <math>i</math> is transformed by <math>W_{ij}</math> to cast a vote
<math>V_{ij} = M_iW_{ij}</math> for the pose matrix of capsule <math>j</math>. The poses and activations of all the capsules in layer
<math>L + 1</math> are calculated by using a non-linear routing procedure which gets as input <math>V_{ij}</math> and <math>a_i</math> for all
<math>i \in \Omega_L, j \in \Omega_{L+1}</math>
</blockquote>

==Notation==

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The paper applies a non-linear squashing operation to ensure that the vector length falls between 0 and 1, with shorter vectors (entities less likely to exist) being shrunk towards 0.

\begin{align} \mathbf{v}_j &= \frac{||\mathbf{s}_j||^2}{1+ ||\mathbf{s}_j||^2} \frac{\mathbf{s}_j}{||\mathbf{s}_j||} \end{align}

where <math>\mathbf{v}_j</math> is the vector output of capsule <math>j</math> and <math>\mathbf{s}_j</math> is its total input.
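
The squashing operation can be sketched directly from the formula. The function name is hypothetical, vectors are plain lists of floats, and the small constant <code>eps</code> is an added assumption to avoid division by zero for a zero-length input.

```python
import math


def squash(s, eps=1e-9):
    """v = (|s|^2 / (1 + |s|^2)) * (s / |s|): keeps direction, maps length into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)  # eps guards the zero vector
    return [scale * x for x in s]


def length(v):
    return math.sqrt(sum(x * x for x in v))


for s in ([0.1, 0.0], [3.0, 4.0], [30.0, 40.0]):
    print(s, length(squash(s)))
```

Short inputs are shrunk towards 0 and long inputs approach, but never reach, length 1.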

For all but the first layer of capsules, the total input to a capsule <math>\mathbf{s}_j</math> is a weighted sum over all “prediction vectors” <math>\hat{\mathbf{u}}_{j|i}</math> from the capsules in the layer below. Each prediction vector is produced by multiplying the output <math>\mathbf{u}_i</math> of a capsule in the layer below by a weight matrix <math>\mathbf{W}_{ij}</math>:

\begin{align}
\mathbf{s}_j = \sum_i c_{ij}\hat{\mathbf{u}}_{j|i}, ~\hspace{0.5em} \hat{\mathbf{u}}_{j|i}= \mathbf{W}_{ij}\mathbf{u}_i
\end{align}
where the <math>c_{ij}</math> are coupling coefficients that are determined by the iterative dynamic routing process.
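
The two equations above can be sketched for a single parent capsule <math>j</math>. The function names and the toy weights are hypothetical; matrices are lists of rows, and the coupling coefficients are fixed by hand rather than produced by routing.

```python
def prediction_vector(W_ij, u_i):
    """u_hat_{j|i} = W_ij u_i (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, u_i)) for row in W_ij]


def total_input(c_j, u_hat_j):
    """s_j = sum_i c_ij * u_hat_{j|i}."""
    dim = len(u_hat_j[0])
    return [sum(c * u_hat[d] for c, u_hat in zip(c_j, u_hat_j)) for d in range(dim)]


# Two lower-level capsules with 2-D outputs feeding one 3-D parent capsule.
u = [[1.0, 0.0], [0.0, 2.0]]
W = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # W_1j
    [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]],  # W_2j
]
c_j = [0.5, 0.5]  # coupling coefficients c_1j and c_2j, chosen by hand here

u_hat = [prediction_vector(W_ij, u_i) for W_ij, u_i in zip(W, u)]
s_j = total_input(c_j, u_hat)
print(u_hat)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
print(s_j)    # [0.5, 0.5, 0.5]
```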

The coupling coefficients between capsule <math>i</math> and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits <math>b_{ij}</math> are the log prior probabilities that capsule <math>i</math> should be coupled to capsule <math>j</math>.

\begin{align}
c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}
\end{align}
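As a rough illustration of the three equations above, the following NumPy sketch computes the prediction vectors, the coupling coefficients and the total inputs for randomly initialized capsules. The layer sizes and variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_lower, n_upper, d_in, d_out = 6, 3, 8, 16   # illustrative sizes, not from the paper

u = rng.normal(size=(n_lower, d_in))                   # outputs u_i of the lower layer
W = rng.normal(size=(n_lower, n_upper, d_out, d_in))   # one weight matrix W_ij per pair
b = np.zeros((n_lower, n_upper))                       # routing logits b_ij

# Prediction vectors: u_hat[i, j] = W_ij @ u_i
u_hat = np.einsum('ijkl,il->ijk', W, u)

# Routing softmax over the upper-layer index j, so each row of c sums to 1
e = np.exp(b - b.max(axis=1, keepdims=True))
c = e / e.sum(axis=1, keepdims=True)

# Total input to each upper capsule: s_j = sum_i c_ij * u_hat[i, j]
s = np.einsum('ij,ijk->jk', c, u_hat)
```

With all logits initialized to zero, every lower capsule distributes its output evenly over the upper capsules.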

=Network Training and Dynamic Routing=

==Understanding Capsules==
The notation can get somewhat confusing, so I will provide intuition behind the computational steps within a capsule. The following image is taken from naturomic's talk on Capsule Networks.

[[File:CapsuleNets.jpeg|center|800px]]

The above image illustrates the key mathematical operations happening within a capsule (and compares them to the structure of a neuron). Although the operations are rather straightforward, it's crucial to note that the capsule applies an affine transformation to each input vector. The length of each input vector <math>\mathbf{u}_{i}</math> represents the probability of entity <math>i</math> existing in a lower level. This vector is then reoriented with an affine transform using <math>\mathbf{W}_{ij}</math> matrices that encode spatial relationships between the lower-level entity <math>\mathbf{u}_{i}</math> and the higher-level features.

We illustrate the intuition behind vector-to-vector matrix multiplication within capsules using the following example: if vectors <math>\mathbf{u}_{1}</math>, <math>\mathbf{u}_{2}</math>, and <math>\mathbf{u}_{3}</math> represent detection of eyes, nose, and mouth respectively, then after multiplication with trained weight matrices <math>\mathbf{W}_{ij}</math> (where j denotes existence of a face), we should get a rough idea of the location of the higher-level feature (face), similar to the image below.

[[File:Predictions.jpeg ‎|center]]

==Dynamic Routing==
A capsule <math>i</math> in a lower-level layer needs to decide how to send its output vector to higher-level capsules <math>j</math>. This decision is made with probability proportional to <math>c_{ij}</math>. If there are <math>M</math> capsules in the level that capsule <math>i</math> routes to, then we know the following properties about <math>c_{ij}</math>: <math>\sum_{j=1}^M c_{ij} = 1, c_{ij} \geq 0</math>

In essence, the <math>\{c_{ij}\}_{j=1}^M</math> denotes a discrete probability distribution with respect to capsule <math>i</math>'s output location. Lower level capsules decide which higher level capsules to send vectors into by adjusting the corresponding routing weights <math>\{c_{ij}\}_{j=1}^M</math>. After a few iterations in training, numerous vectors will have already been sent to all higher level capsules. Based on the similarity between the current vector being routed and all vectors already sent into the higher level capsules, we decide which capsule to send the current vector into.
[[File:Dynamic Routing.png|center|900px]]

From the image above, we notice that a cluster of points similar to the current vector has already been routed into capsule K, while most points in capsule J are highly dissimilar. It thus makes more sense to route the current observations into capsule K; we adjust the corresponding weights upward during training.

These weights are determined through the dynamic routing procedure:

[[File:Routing Algo.png‎|900px]]

Note that the convergence of this routing procedure has been questioned. Although it is empirically shown that this procedure converges, the convergence has not been proven.
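The routing procedure referenced above can be sketched as follows. This is a minimal NumPy sketch of routing by agreement as it is commonly described (zero-initialized logits, softmax, weighted sum, squash, agreement update). The function names and the toy input are assumptions and do not reproduce the authors' implementation.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink vectors along the last axis to a length in [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat has shape (n_lower, n_upper, d_out). Returns outputs v and couplings c."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                # logits b_ij start at zero
    for _ in range(n_iters):
        c = softmax(b, axis=1)                      # coupling coefficients c_ij
        s = np.einsum('ij,ijk->jk', c, u_hat)       # weighted sum of predictions
        v = squash(s)                               # output of each upper capsule
        b = b + np.einsum('ijk,jk->ij', u_hat, v)   # agreement: u_hat_{j|i} . v_j
    return v, c

# Toy input: all 6 lower capsules agree on upper capsule 0,
# while their predictions for upper capsule 1 cancel each other out.
u_hat = np.zeros((6, 2, 16))
u_hat[:, 0, :] = 1.0
u_hat[:, 1, :] = np.array([1, -1, 1, -1, 1, -1])[:, None]
v, c = dynamic_routing(u_hat)
```

In this toy example the coupling coefficients move towards upper capsule 0, because the predictions sent to it agree with each other, while the contradictory predictions for capsule 1 produce no agreement.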

Although dynamic routing is not the only manner in which we can encode relationships between capsules, the premise of the paper is to demonstrate the capabilities of capsules under a simple implementation. Since the paper was released in 2017, numerous alternative routing implementations have been released including an EM matrix routing algorithm by the same authors (ICLR 2018).

=Architecture=
The capsule network architecture given by the authors has 11.36 million trainable parameters. The paper itself is not very detailed on exact implementation of each architectural layer, and hence it leaves some degree of ambiguity on coding various aspects of the original network. The capsule network has 6 overall layers, with the first three layers denoting components of the encoder, and the last 3 denoting components of the decoder.

==Loss Function==
[[File:Loss Function.png‎|900px]]

The cost function looks complicated, but it can be broken down into intuitive components. Before diving into the equation, remember that the length of a capsule's output vector denotes the probability that the corresponding object exists. The loss is computed separately for each digit capsule. The left term of the equation is active when the digit is actually present in the image and becomes zero otherwise: we subtract the vector norm from a fixed quantity <math>m^+ := 0.9</math>, so the capsule is penalized when its length falls below 0.9. The right term is active when the digit is absent: we subtract <math>m^- := 0.1</math> from the vector norm, so the network is penalized according to its confidence in a label that is not present.

A graphical representation of loss function values under varying vector norms is given below.
[[File:Loss function chart.png|900px]]
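The margin loss described above can be sketched as follows. The squared hinge terms and the down-weighting factor <code>lam = 0.5</code> for absent classes follow the formula as I recall it from the original paper; since the equation appears only as an image here, treat both as assumptions.

```python
import numpy as np

def margin_loss(v_norm, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_norm: capsule lengths per class; target: 1 if the class is present, else 0.

    lam down-weights the loss for absent classes (assumed value, see lead-in).
    """
    present = target * np.maximum(0.0, m_plus - v_norm) ** 2
    absent = lam * (1.0 - target) * np.maximum(0.0, v_norm - m_minus) ** 2
    return np.sum(present + absent)

target = np.zeros(10)
target[3] = 1.0                                   # the true digit is 3

confident_right = np.full(10, 0.05)
confident_right[3] = 0.95                         # long vector only for digit 3

confident_wrong = np.full(10, 0.05)
confident_wrong[7] = 0.95                         # long vector for the wrong digit 7

loss_right = margin_loss(confident_right, target)
loss_wrong = margin_loss(confident_wrong, target)
```

A confident correct prediction incurs no loss, whereas a confident wrong prediction is penalized both for the short vector of the true class and for the long vector of the wrong class.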

==Encoder Layers==
The main experiments in this paper were conducted on the MNIST dataset, and thus the architecture is built to classify MNIST digits. For more complex datasets, the results were less promising.

[[File:Architecture.png|center|900px]]

The encoder layer takes in a 28x28 MNIST image and learns a 16 dimensional representation of instantiation parameters.

'''Layer 1: Convolution''':
This layer is a standard convolution layer. Using kernels with size 9x9x1, a stride of 1, and a ReLU activation function, we detect the 2D features within the input image.

'''Layer 2: PrimaryCaps''':
We represent the low-level features detected during convolution as 32 primary capsules. Each capsule applies eight convolutional kernels with stride 2 to the output of the convolution layer and feeds the corresponding transformed tensors into the DigitCaps layer.

'''Layer 3: DigitCaps''':
This layer contains 10 digit capsules, one for each digit. As explained in the dynamic routing procedure, each input vector from the PrimaryCaps layer has its own corresponding weight matrix <math>W_{ij}</math>. Using the routing coefficients <math>c_{ij}</math> and temporary coefficients <math>b_{ij}</math>, we train the DigitCaps layer to output ten 16-dimensional vectors. The length of the <math>i^{th}</math> vector in this layer corresponds to the probability of detection of digit <math>i</math>.
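To make the tensor sizes in the encoder concrete, the short calculation below tracks the spatial size through the first two layers. It assumes unpadded ("valid") convolutions and a 9x9 kernel in the PrimaryCaps layer as well; neither detail is stated in this summary, so both are assumptions.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution."""
    return (size - kernel) // stride + 1

conv1 = conv_out(28, kernel=9, stride=1)         # Layer 1: 28x28 input
primary = conv_out(conv1, kernel=9, stride=2)    # Layer 2: assumed 9x9 kernel, stride 2

# 32 capsule types, each replicated over the primary x primary grid, 8 values per capsule
n_primary_capsules = 32 * primary * primary

# Each primary capsule needs one 8 -> 16 weight matrix per digit capsule
n_routing_weights = n_primary_capsules * 10 * 8 * 16
```

Under these assumptions the image shrinks from 28x28 to 20x20 and then to 6x6, so the DigitCaps layer receives 1152 eight-dimensional vectors.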

==Decoder Layers==
The decoder aims to train the capsules to extract meaningful features for image detection/classification. During training, it takes the 16-dimensional instantiation vector of the correct (not predicted) digit capsule in the DigitCaps layer, and attempts to recreate the 28x28 MNIST image as well as possible. Setting the loss function as reconstruction error (Euclidean distance between the reconstructed image and original image), we tune the capsules to encode features that are meaningful within the actual image.

[[File:Decoder.png|center|900px]]

The decoder consists of three fully connected layers, and transforms a 16x1 vector from the encoder into a 28x28 image.

In addition to the DigitCaps loss function, we add reconstruction error as a form of regularization. During training, everything but the activity vector of the correct digit capsule is masked, and then this activity vector is used to reconstruct the input image. We minimize the sum of squared differences between the outputs of the logistic units (the reconstructed image) and the pixel intensities of the original image. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated below, reconstructions from the 16D output of the CapsNet are robust while keeping only important details.

[[File:Reconstruction.png|center|900px]]
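The masking step and the scaled reconstruction term can be sketched as below. The helper names are hypothetical, and the decoder itself (three fully connected layers) is omitted; only the masking and the combination of the two loss terms are shown.

```python
import numpy as np

def mask_correct_capsule(digit_caps, label):
    """Zero out every digit capsule except the one for the true label.

    digit_caps has shape (10, 16); the result is the flattened decoder input.
    """
    mask = np.zeros((digit_caps.shape[0], 1))
    mask[label] = 1.0
    return (digit_caps * mask).reshape(-1)

def total_loss(margin, image, reconstruction, recon_weight=0.0005):
    """Margin loss plus the down-weighted sum of squared pixel differences."""
    recon = np.sum((image - reconstruction) ** 2)
    return margin + recon_weight * recon

decoder_input = mask_correct_capsule(np.ones((10, 16)), label=3)
example_loss = total_loss(1.0, np.zeros(784), np.ones(784))
```

Even in the worst case shown here, where every one of the 784 pixels is wrong by 1, the reconstruction term contributes less than 0.4, which keeps it from dominating the margin loss.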

=MNIST Experimental Results=

==Accuracy==
The paper tests on the MNIST dataset with 60K training examples and 10K test examples. Wan et al. [2013] achieve 0.21% test error with ensembling and with data augmentation by rotation and scaling, and 0.39% without them. As shown in Table 1, the authors manage to achieve 0.25% test error with only a 3 layer network; the previous state of the art only beat this number with very deep networks. This result shows the importance of routing and of the reconstruction regularizer, both of which boost performance. Moreover, while the accuracy is very high, the number of parameters is much smaller than in the baseline model.

[[File:Accuracies.png|center|900px]]

==What Capsules Represent for MNIST==
The following figure shows the digit representation under capsules. Each row shows the reconstruction when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 in the range [−0.25, 0.25]. By tweaking the values, we notice how the reconstruction changes, and thus get a sense for what each dimension is representing. The authors found that some dimensions represent global properties of the digits, while others represent localized properties.
[[File:CapsuleReps.png|center|900px]]

One example the authors provide is: different dimensions are used for the length of the ascender of a 6 and the size of the loop. The variations include stroke thickness, skew and width, as well as digit-specific variations. The authors are able to show dimension representations using a decoder network by feeding a perturbed vector.

==Robustness of CapsNet==
The authors conclude that DigitCaps capsules learn more robust representations for each digit class than traditional CNNs. The trained CapsNet becomes moderately robust to small affine transformations in the test data.

To compare the robustness of CapsNet to affine transformations against traditional CNNs, both models (CapsNet and a traditional CNN with MaxPooling and DropOut) were trained on a padded and translated MNIST training set, in which each example is an MNIST digit placed randomly on a black background of 40 × 40 pixels. The networks were then tested on the [http://www.cs.toronto.edu/~tijmen/affNIST/ affNIST] dataset (MNIST digits with random affine transformation). An under-trained CapsNet which achieved 99.23% accuracy on the MNIST test set achieved 79% accuracy on the affNIST test set. A traditional CNN achieved similar accuracy (99.22%) on the MNIST test set, but only 66% on the affNIST test set.

=MultiMNIST & Other Experiments=

==MultiMNIST==
To evaluate the performance of the model on highly overlapping digits, the authors generate a 'MultiMNIST' dataset. In MultiMNIST, each image consists of two overlaid MNIST digits from the same set (train or test) but from different classes. The results indicate a classification error rate of 5%. Additionally, CapsNet can be used to segment the image into the two digits that compose it. Moreover, the model is able to deal with the overlaps and reconstruct digits correctly since each digit capsule can learn the style from the votes of the PrimaryCapsules layer (Figure 5).

There are some additional steps to generating the MultiMNIST dataset.

1. Both images are shifted by up to 4 pixels in each direction resulting in a 36 × 36 image. Bounding boxes of digits in MNIST overlap by approximately 80%, so this is used to make both digits identifiable (since there is no RGB difference learnable by the network to separate the digits)

2. The label becomes a vector of two numbers, representing the original digit and the randomly generated (and overlaid) digit.
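The two generation steps above can be sketched as follows. The function names are hypothetical, the digits are assumed to be 28x28 arrays with values in [0, 1], and clipping the summed pixel values at 1 follows the note in the Figure 5 caption.

```python
import numpy as np

def shift_digit(digit, dx, dy, canvas=36):
    """Place a 28x28 digit on a canvas x canvas background, offset from the centre."""
    out = np.zeros((canvas, canvas), dtype=float)
    top = (canvas - 28) // 2 + dy
    left = (canvas - 28) // 2 + dx
    out[top:top + 28, left:left + 28] = digit
    return out

def make_multimnist(digit_a, label_a, digit_b, label_b, rng, max_shift=4):
    """Overlay two independently shifted digits and return the image and both labels."""
    dx_a, dy_a, dx_b, dy_b = rng.integers(-max_shift, max_shift + 1, size=4)
    image = shift_digit(digit_a, dx_a, dy_a) + shift_digit(digit_b, dx_b, dy_b)
    return np.clip(image, 0.0, 1.0), (label_a, label_b)

rng = np.random.default_rng(0)
image, labels = make_multimnist(np.ones((28, 28)), 5, np.ones((28, 28)), 0, rng)
```

In practice the two digits would be drawn from different classes of the same split, as described in the text; the all-ones arrays here only serve to exercise the shapes.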

[[File:CapsuleNets MultiMNIST.PNG|600px|thumb|center|Figure 5: Sample reconstructions of a CapsNet with 3 routing iterations on MultiMNIST test dataset.
The two reconstructed digits are overlayed in green and red as the lower image. The upper image
shows the input image. L:(l1; l2) represents the label for the two digits in the image and R:(r1; r2)
represents the two digits used for reconstruction. The two right most columns show two examples
with wrong classification reconstructed from the label and from the prediction (P). In the (2; 8)
example the model confuses 8 with a 7 and in (4; 9) it confuses 9 with 0. The other columns have
correct classifications and show that the model accounts for all the pixels while being able to assign
one pixel to two digits in extremely difficult scenarios (column 1 − 4). Note that in dataset generation
the pixel values are clipped at 1. The two columns with the (*) mark show reconstructions from a
digit that is neither the label nor the prediction. These columns suggest that the model is not just
finding the best fit for all the digits in the image including the ones that do not exist. Therefore in case
of (5; 0) it cannot reconstruct a 7 because it knows that there is a 5 and 0 that fit best and account for
all the pixels. Also, in the case of (8; 1) the loop of 8 has not triggered 0 because it is already accounted
for by 8. Therefore it will not assign one pixel to two digits if one of them does not have any other
support.]]

==Other datasets==
The authors also tested the proposed capsule model on the CIFAR10 dataset and achieved an error rate of 10.6%. The model tested was an ensemble of 7 models. Each of the models in the ensemble had the same architecture as the model used for MNIST (apart from using three color channels and 64 different types of primary capsules). These 7 models were trained with 3 routing iterations on 24x24 patches of the training images. During experimentation, the authors also found that adding an additional none-of-the-above category helped improve the overall performance. The error rate achieved is comparable to the error rate achieved by a standard CNN model. According to the authors, one of the reasons for the low performance is that the backgrounds in CIFAR-10 images are too varied to be adequately modeled by a reasonably sized capsule net.

The proposed model was also evaluated using a small subset of the SVHN dataset. The network was much smaller and was trained using only 73257 training images. It still managed to achieve an error rate of 4.3% on the test set.

=Critique=
Although the network performs very favorably in the authors' experiments, it has a long way to go on more complex datasets. On CIFAR 10, the network achieved subpar results, and the experimental results seem to get worse as the problem becomes more complex. This is to be expected, since these networks are still at an early stage; further innovations may come in the following years or decades. It would also be wise to apply the model to other, larger datasets to make its claims more convincing. MNIST has simple patterns, and even if the model were to be presented on only one dataset, MNIST is not the best choice: the focus is on human-like visual recognition, and digits this regular are rarely encountered in real life.

Hinton talks about CapsuleNets revolutionizing areas such as self-driving, but such groundbreaking innovations are far away from CIFAR10, and even further from MNIST. Only time can tell if CapsNets will live up to their hype.

Moreover, there is no underlying intuition provided on the main point of the paper which is that capsule nets preserve relations between extracted features from the proposed architecture. An explanation on the intuition behind this idea will go a long way in arguing against CNN networks.

Capsules inherently segment images and learn a lower dimensional embedding in a new manner, which makes them likely to perform well on segmentation and computer vision tasks once further research is done.

Additionally, these networks are more interpretable than CNNs, and have strong theoretical reasoning for why they could work. Naturally, it would be hard for a new architecture to beat the heavily researched/modified CNNs.

* ([https://openreview.net/forum?id=HJWLfGWRb]) It is not fully clear how effectively the approach can be performed or how scalable it is. Evaluation is performed on a small dataset for shape recognition. The approach will need to be tested on larger, more challenging datasets.

=Future Work=
The same authors [N. F. Geoffrey E Hinton, Sara Sabour] presented another paper "MATRIX CAPSULES WITH EM ROUTING" in ICLR 2018, which achieved better results than the work presented in this paper. They presented a new multi-layered capsule network architecture, implemented an EM routing procedure, and introduced "Coordinate Addition". This new type of capsule network reduced the number of errors by 45%, and performed better than a standard CNN on white box adversarial attacks. Capsule architectures are gaining interest because of their ability to achieve equivariance of parts, and because they employ a new form of pooling called "routing" (as opposed to max pooling), which groups parts that make similar predictions of the whole to which they belong, rather than relying on spatial co-locality.
Moreover, the authors hint at changing the curvature and the sensitivities to various factors by introducing a new form of loss function. This may improve the performance of the model on more complicated datasets, which is one of the model's drawbacks.

Moreover, as mentioned in the critique, good future work for this group would be to make the model more robust across datasets and to achieve acceptable performance on datasets with images that are more commonly seen in real life.

=References=
#N. F. Geoffrey E Hinton, Sara Sabour. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
#S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. arXiv preprint arXiv:1710.09829v2, 2017.
#G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. 2011.
#Geoffrey Hinton's talk: What is wrong with convolutional neural nets? Talk given at MIT. Brain & Cognitive Sciences - Fall Colloquium Series. [https://www.youtube.com/watch?v=rTawFwUvnLE]
#Understanding Hinton’s Capsule Networks. Max Pechyonkin's series. [https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b]
#Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
#Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
#Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583, 2015.
#Dan C Cireşan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
#Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.

a neural representation of sketch drawings

2018-12-04T05:51:41Z

H454chen: /* Related Work */

== Introduction ==
In this paper, the authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.

Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do.

The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).

=== Terminology ===
Pixel images, also referred to as raster or bitmap images are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, .bmp.

Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images.

For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images.

For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation.

== Related Work ==
Earlier work used similar approaches to generate images, such as portrait drawing by Paul the Robot [26, 28] and reinforcement learning approaches [28] that discover a set of paint brush strokes that best represent a given input photograph. These methods work mostly by mimicking digitized photographs. There are also neural network based approaches, but they mostly deal with pixel images, and little work has been done on vector image generation. Some models use Hidden Markov Models [25] or Mixture Density Networks [2] to generate human sketches, continuous data points (modelling Chinese characters as a sequence of pen stroke actions) or vectorized Kanji characters [9,29].

Neural network based approaches are able to generate a latent space representation of vector images that follows a Gaussian distribution. These networks are trained so that the representation matches the Gaussian distribution by minimizing a given loss function. Using this idea, previous works used a sequence-to-sequence model with a Variational Autoencoder to model sentences in a latent space, and probabilistic program induction to model the Omniglot dataset.

The dataset used in this paper contains 50 million vector sketches. Earlier datasets include the Sketch dataset with 20k vector sketches, the Sketchy dataset with 70k vector sketches paired with pixel images, and the ShadowDraw system, which used 30k raster images along with extracted vectorized features. They are all comparatively small.

== Major Contributions ==
This paper makes the following major contributions:
* The authors outline a framework for both unconditional and conditional generation of vector images composed of a sequence of lines. The recurrent neural network-based generative model is capable of producing sketches of common objects in a vector format.
* The paper develops a training procedure unique to vector images to make the training more robust.
* The paper makes available a large dataset of hand-drawn vector images to encourage further development of generative modelling for vector images, and releases an implementation of the model as an open source project.

== Methodology ==
=== Dataset ===
QuickDraw is a dataset of 50 million vector drawings collected by the online game [https://quickdraw.withgoogle.com/# Quick Draw!], in which players are required to draw objects belonging to a particular object class in less than 20 seconds. It contains hundreds of classes; each class has 70k training samples, 2.5k validation samples and 2.5k test samples.

Each sample is represented as a sequence of pen stroke actions. The origin is the initial coordinate of the drawing, and the sketch is a list of points. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>, where <math>\Delta x</math> and <math>\Delta y</math> are the offsets in the x and y directions from the previous point. The parameters <math>p_{1}, p_{2}, p_{3}</math> represent three possible states in a binary one-hot representation: <math>p_{1}</math> indicates that the pen is touching the paper, <math>p_{2}</math> indicates that the pen will be lifted from here, and <math>p_{3}</math> indicates that the drawing has ended.
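As an illustration of this five-element format, the sketch below converts a drawing given as absolute coordinates into rows of <math>(\Delta x, \Delta y, p_1, p_2, p_3)</math>. The input layout (a list of strokes, each a list of points) and the choice to mark the end of the drawing with a separate final row are assumptions made for this example.

```python
import numpy as np

def to_stroke5(strokes):
    """Convert strokes of absolute (x, y) points into (dx, dy, p1, p2, p3) rows."""
    rows = []
    prev = strokes[0][0]                  # origin: first point of the drawing
    for stroke in strokes:
        for k, (x, y) in enumerate(stroke):
            last_in_stroke = (k == len(stroke) - 1)
            # p2 = 1 when the pen is lifted after this point, otherwise p1 = 1
            pen = [0, 1, 0] if last_in_stroke else [1, 0, 0]
            rows.append([x - prev[0], y - prev[1]] + pen)
            prev = (x, y)
    rows.append([0, 0, 0, 0, 1])          # p3 = 1 marks the end of the drawing
    return np.array(rows, dtype=float)

# Two short strokes: (0,0)->(3,4), then pen up, then (5,5)->(6,5)
sketch = to_stroke5([[(0, 0), (3, 4)], [(5, 5), (6, 5)]])
```

The first row of the result is <code>(0, 0, 1, 0, 0)</code>, and every row has exactly one of the three pen states set.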

=== Sketch-RNN ===
[[File:sketchfig2.png|700px|center]]

The model is a Sequence-to-Sequence Variational Autoencoder (VAE).

==== Encoder ====
The encoder is a bidirectional RNN. The input is a sketch sequence denoted by <math>S =\{S_0, S_1, ... S_{N_{s}}\}</math> and a reversed sketch sequence denoted by <math>S_{reverse} = \{S_{N_{s}},S_{N_{s}-1}, ... S_0\}</math>. The final hidden layer representations of the two encoded sequences <math>(h_{ \rightarrow}, h_{ \leftarrow})</math> are concatenated to form the final hidden state <math>h</math>,

\begin{split}
&h_{ \rightarrow} = encode_{ \rightarrow }(S), \\
&h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), \\
&h = [h_{\rightarrow}; h_{\leftarrow}].
\end{split}

Then the authors project <math>h</math> into two vectors <math>\mu</math> and <math>\hat{\sigma}</math> of size <math>N_{z}</math>. The projection is performed using a fully connected layer. These two vectors are the parameters of the latent space Gaussian distribution that will estimate the distribution of the input data. Because standard deviations cannot be negative, an exponential function is used to convert it to all positive values. Next, a random variable with mean <math>\mu</math> and standard deviation <math>\sigma</math> is constructed by scaling a normalized IID Gaussian, <math>\mathcal{N}(0,I)</math>,

\begin{split}
& \mu = W_\mu h + b_\mu, \\
& \hat \sigma = W_\sigma h + b_\sigma, \\
& \sigma = \exp\left( \frac{\hat \sigma}{2}\right), \\
& z = \mu + \sigma \odot \mathcal{N}(0,I).
\end{split}

Note that <math>z</math> is not deterministic but a random vector that can be conditioned on an input sketch sequence.
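The projection and sampling steps above can be sketched in NumPy as follows. The function name and the toy weight shapes are assumptions; the four lines inside the function mirror the four equations.

```python
import numpy as np

def sample_latent(h, W_mu, b_mu, W_sigma, b_sigma, rng):
    """Project h to (mu, sigma_hat) and draw z = mu + sigma * N(0, I)."""
    mu = W_mu @ h + b_mu
    sigma_hat = W_sigma @ h + b_sigma
    sigma = np.exp(sigma_hat / 2.0)          # exponential keeps sigma positive
    noise = rng.standard_normal(mu.shape)    # sample from N(0, I)
    return mu + sigma * noise, mu, sigma_hat

rng = np.random.default_rng(0)
h = np.ones(4)                               # toy hidden state
W = np.zeros((3, 4))                         # toy projection weights, N_z = 3
z, mu, sigma_hat = sample_latent(h, W, np.full(3, 2.0), W, np.full(3, -100.0), rng)
```

With a very negative <math>\hat{\sigma}</math>, as in this toy call, the noise is almost entirely suppressed and <math>z</math> collapses onto <math>\mu</math>.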

==== Decoder ====
The decoder is an autoregressive RNN. The initial hidden and cell states are generated using <math>[h_0;c_0] = \tanh(W_z z + b_z)</math>. Here, <math>c_0</math> is utilized if applicable (e.g., if an LSTM decoder is used). <math>S_0</math> is defined as <math>(0,0,1,0,0)</math> (the pen is touching the paper at location 0, 0).

For each step <math>i</math> in the decoder, the input <math>x_i</math> is the concatenation of the previous point <math>S_{i-1}</math> and the latent vector <math>z</math>. The outputs of the RNN decoder <math>y_i</math> are parameters for a probability distribution that will generate the next point <math>S_i</math>.

The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model the ground truth data <math>(p_1, p_2, p_3)</math> as a categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2\ \text{and}\ q_3</math> sum up to 1,

\begin{align*}
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi_j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), \quad \text{where } \sum_{j=1}^{M}\Pi_j = 1
\end{align*}

Here <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is a bivariate normal distribution with means <math>\mu_x, \mu_y</math>, standard deviations <math>\sigma_x, \sigma_y</math> and correlation parameter <math>\rho_{xy}</math>. There are <math>M</math> such distributions. <math>\Pi</math> is a categorical distribution vector of length <math>M</math> whose entries are the mixture weights of the Gaussian mixture model.

The output vector <math>y_i</math> is generated by applying a fully connected layer to the hidden state of the RNN.

\begin{split}
&x_i = [S_{i-1}; z], \\
&[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), \\
&y_i = W_y h_i + b_y, \\
&y_i \in \mathbb{R}^{6M+3}. \\
\end{split}

The output consists of the parameters of the probability distribution of the next data point.

\begin{align*}
[(\hat\Pi_1\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_1\ (\hat\Pi_2\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_2\ ...\ (\hat\Pi_M\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_M\ (\hat{q_1}\ \hat{q_2}\ \hat{q_3})] = y_i
\end{align*}

<math>\exp</math> and <math>\tanh</math> operations are applied to ensure that the standard deviations are non-negative and the correlation value is between -1 and 1.

\begin{align*}
\sigma_x = \exp (\hat \sigma_x),\
\sigma_y = \exp (\hat \sigma_y),\
\rho_{xy} = \tanh(\hat \rho_{xy}).
\end{align*}

The categorical probabilities <math>(q_1, q_2, q_3)</math> for the pen states <math>(p_1, p_2, p_3)</math> and the mixture weights <math>\Pi_k</math> are obtained by applying a softmax to the corresponding outputs:

\begin{align*}
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},
k \in \left\{1,2,3\right\},
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},
k \in \left\{1,...,M\right\}.
\end{align*}
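The transformations above can be collected into one helper that splits a raw output vector <math>y_i</math> of length <math>6M + 3</math> into usable distribution parameters. The function name, the dictionary keys and the assumption that the six values of each mixture component are stored next to each other (as in the bracketed layout above) are choices made for this sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def split_decoder_output(y, M):
    """Turn the raw output y (length 6M + 3) into GMM and pen-state parameters."""
    gmm = y[:6 * M].reshape(M, 6)            # one row of six values per component
    pi_hat, mu_x, mu_y, sx_hat, sy_hat, rho_hat = gmm.T
    return {
        "pi": softmax(pi_hat),               # mixture weights sum to 1
        "mu_x": mu_x,
        "mu_y": mu_y,
        "sigma_x": np.exp(sx_hat),           # exp keeps standard deviations positive
        "sigma_y": np.exp(sy_hat),
        "rho": np.tanh(rho_hat),             # tanh keeps the correlation in (-1, 1)
        "q": softmax(y[6 * M:]),             # pen-state probabilities sum to 1
    }

M = 20
params = split_decoder_output(np.random.default_rng(0).standard_normal(6 * M + 3), M)
```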

It is hard for the model to decide when to stop drawing because the probabilities of the three events <math>(p_1, p_2, p_3)</math> are very unbalanced. Researchers in the past have used different weights for each pen event probability, but the authors found this approach inelegant and inadequate. Instead, they define a hyperparameter <math>N_{max}</math>, the length of the longest sketch in the training set, and set <math>S_i</math> to <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>.

During sampling, an outcome <math>S_i^{'}</math> is generated at each time step and fed as input to the next time step. The process stops when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but a random sequence conditioned on the latent vector. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.

\begin{align*}
\hat q_k \rightarrow \frac{\hat q_k}{\tau},
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau},
\sigma_x^2 \rightarrow \sigma_x^2\tau,
\sigma_y^2 \rightarrow \sigma_y^2\tau.
\end{align*}

The temperature <math>\tau</math> typically ranges from 0 to 1. In the limiting case <math>\tau \rightarrow 0</math>, the output becomes deterministic, as the sample consists of the points at the peak of the probability density function.
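The effect of the temperature can be illustrated with a single sampling step. This is a sketch under the assumption that <math>\tau > 0</math>; the function name and argument layout are hypothetical. Scaling the variance by <math>\tau</math> corresponds to scaling the standard deviation by <math>\sqrt{\tau}</math>.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_point(pi_hat, q_hat, mu_x, mu_y, sigma_x, sigma_y, rho, tau, rng):
    """Sample one (dx, dy, p1, p2, p3) point at temperature tau > 0."""
    pi = softmax(pi_hat / tau)                    # sharpen the mixture weights
    q = softmax(q_hat / tau)                      # sharpen the pen-state probabilities
    j = rng.choice(len(pi), p=pi)                 # pick a mixture component
    sx = sigma_x[j] * np.sqrt(tau)                # sigma^2 -> sigma^2 * tau
    sy = sigma_y[j] * np.sqrt(tau)
    cov = [[sx * sx, rho[j] * sx * sy],
           [rho[j] * sx * sy, sy * sy]]
    dx, dy = rng.multivariate_normal([mu_x[j], mu_y[j]], cov)
    pen = np.zeros(3)
    pen[rng.choice(3, p=q)] = 1.0                 # one-hot pen state
    return np.array([dx, dy, pen[0], pen[1], pen[2]])

rng = np.random.default_rng(0)
point = sample_point(
    pi_hat=np.array([0.0, 5.0]), q_hat=np.array([5.0, 0.0, 0.0]),
    mu_x=np.array([-3.0, 3.0]), mu_y=np.array([0.0, 1.0]),
    sigma_x=np.ones(2), sigma_y=np.ones(2), rho=np.zeros(2),
    tau=0.01, rng=rng,
)
```

At the low temperature used here, the sample lands very close to the mean of the most likely mixture component and the most likely pen state is chosen, which matches the near-deterministic behaviour described in the text.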

=== Unconditional Generation ===
A special case is the one in which only the decoder RNN module is trained. The decoder RNN then works as a standalone autoregressive model without latent variables. In this case, the initial states are 0 and the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. Figure 3 shows sketches generated unconditionally, with the temperature parameter varying from <math>\tau = 0.2</math> at the top (in blue) to <math>\tau = 0.9</math> at the bottom (in red).

[[File:sketchfig3.png|700px|center]]

=== Training ===
The training process is the same as a Variational Autoencoder. The loss function is the sum of Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> can be obtained with generated parameters of pdf and training data <math>S</math>. It is the sum of the <math>L_s</math> and <math>L_p</math>, which are the log loss of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>.

\begin{align*}
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log\left(\sum_{j = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x_i,\Delta y_i | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})\right),
\end{align*}
\begin{align*}
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), \\
L_R = L_s + L_p.
\end{align*}

Both terms are normalized by <math>N_{max}</math>.
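The reconstruction loss can be sketched as below. The array layout (one row per time step, one column per mixture component), the function names and the small constant <code>eps</code> inside the logarithms are assumptions made for this illustration. The density used is the standard bivariate normal pdf.

```python
import numpy as np

def bivariate_normal_pdf(dx, dy, mu_x, mu_y, sx, sy, rho):
    """Density of a bivariate normal with correlation rho."""
    zx = (dx - mu_x) / sx
    zy = (dy - mu_y) / sy
    z = zx ** 2 + zy ** 2 - 2.0 * rho * zx * zy
    denom = 2.0 * np.pi * sx * sy * np.sqrt(1.0 - rho ** 2)
    return np.exp(-z / (2.0 * (1.0 - rho ** 2))) / denom

def reconstruction_loss(target, pi, mu_x, mu_y, sx, sy, rho, q, n_s, eps=1e-9):
    """target: (N_max, 5); GMM parameters: (N_max, M); q: (N_max, 3)."""
    n_max = target.shape[0]
    pdf = bivariate_normal_pdf(target[:, 0:1], target[:, 1:2], mu_x, mu_y, sx, sy, rho)
    log_mix = np.log(np.sum(pi * pdf, axis=1) + eps)
    L_s = -np.sum(log_mix[:n_s]) / n_max                     # offsets: first N_s steps only
    L_p = -np.sum(target[:, 2:] * np.log(q + eps)) / n_max   # pen states: all N_max steps
    return L_s + L_p

# Toy case: N_max = 2, one mixture component, perfect pen-state predictions
target = np.array([[0.0, 0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 1.0]])
ones, zeros = np.ones((2, 1)), np.zeros((2, 1))
q = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
L_R = reconstruction_loss(target, ones, zeros, zeros, ones, ones, zeros, q, n_s=1)
```

Note that the offset term only sums over the first <math>N_s</math> points, while the pen-state term covers all <math>N_{max}</math> steps, including the padded end-of-sketch points.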

<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an i.i.d. Gaussian vector with zero mean and unit variance.

\begin{align*}
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))
\end{align*}

The overall loss is weighted as:

\begin{align*}
Loss = L_R + w_{KL} L_{KL}
\end{align*}

When <math>w_{KL} = 0</math>, there is no <math>L_{KL} </math> term and only <math>L_{R} </math> is optimized. Specifically, by removing the <math>L_{KL} </math> term the model approaches a pure autoencoder, meaning it sacrifices the ability to enforce a prior over the latent space and gains better reconstruction loss metrics.

While the aforementioned loss function could be used, it was found that annealing the KL term (as shown below) in the loss function produces better results.

<center><math>
\eta_{step} = 1 - (1 - \eta_{min})R^{step}
</math></center>

<center><math>
Loss_{train} = L_R + w_{KL} \eta_{step} \max(L_{KL}, KL_{min})
</math></center>
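A small Python sketch of the KL term and the annealed training loss is given below. It reads the <math>L_{KL}</math> formula as a sum over the <math>N_z</math> latent dimensions. The default values of <code>eta_min</code>, <code>rate</code> (the <math>R</math> above) and <code>kl_min</code> are placeholders chosen for illustration, not values taken from this summary.

```python
import math

def kl_loss(mu, sigma_hat):
    """L_KL, summed over the N_z latent dimensions and divided by 2 * N_z."""
    n_z = len(mu)
    total = sum(1.0 + s - m * m - math.exp(s) for m, s in zip(mu, sigma_hat))
    return -total / (2.0 * n_z)

def annealed_loss(l_r, l_kl, step, w_kl=1.0, eta_min=0.01, rate=0.9999, kl_min=0.2):
    """Loss_train = L_R + w_KL * eta_step * max(L_KL, KL_min).

    eta_min, rate and kl_min are hypothetical defaults.
    """
    eta_step = 1.0 - (1.0 - eta_min) * rate ** step   # grows from eta_min towards 1
    return l_r + w_kl * eta_step * max(l_kl, kl_min)

loss_at_start = annealed_loss(l_r=1.0, l_kl=0.5, step=0)
loss_late = annealed_loss(l_r=1.0, l_kl=0.5, step=10**6)
```

Early in training the KL term is almost switched off, so the optimizer can focus on reconstruction. The floor <code>kl_min</code> stops the optimizer from pushing <math>L_{KL}</math> below a chosen level.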

As shown in Figure 4, the <math>L_{R} </math> metric for the standalone decoder model is actually an upper bound for the different models that use a latent vector. The reason is that the unconditional model does not have access to the entire sketch it needs to generate.

[[File:s.png|600px|thumb|center|Figure 4. Tradeoff between <math>L_{R} </math> and <math>L_{KL} </math>, for two models trained on single class datasets (left).
Validation Loss Graph for models trained on the Yoga dataset using various <math>w_{KL} </math>. (right)]]

== Experiments ==
The authors experimented with the sketch-rnn model under different settings and recorded both losses. They used a Long Short-Term Memory (LSTM) model as the encoder and a HyperLSTM as the decoder. HyperLSTM is a type of RNN cell that excels at sequence generation tasks. Its ability to spontaneously augment its own weights enables it to adapt to many different regimes in a large, diverse dataset. They also ran experiments on multi-class datasets. The results are as follows.

[[File:sketchtable1.png|700px|center]]

The trade-off between <math>L_R</math> and <math>L_{KL}</math> is clearly visible in this table. Furthermore, <math>L_R</math> decreases as <math>w_{KL} </math> is halved.

=== Conditional Reconstruction ===
The authors assess how a given sketch is reconstructed at different <math>\tau</math> values. With higher <math>\tau</math> values (on the right), the reconstructed sketches are more random.

[[File:sketchfig5.png|700px|center]]

They also experiment with inputting a sketch from a different class. The output still keeps some features of the class that the model was trained on.

=== Latent Space Interpolation ===
The authors visualize the reconstructed sketches while interpolating between latent vectors, using models trained with different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are interpolated more coherently.

[[File:sketchfig6.png|700px|center]]
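This summary does not state how the interpolation between latent vectors is computed. One common choice, discussed in White's "Sampling Generative Networks" from the reference list, is spherical interpolation. The sketch below assumes that choice; the function name <code>slerp</code> is illustrative.

```python
import math

def slerp(z0, z1, t):
    """Spherical interpolation between latent vectors z0 and z1 for t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)                 # angle between the two vectors
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    s = math.sin(omega)
    c0 = math.sin((1.0 - t) * omega) / s
    c1 = math.sin(t * omega) / s
    return [c0 * a + c1 * b for a, b in zip(z0, z1)]

midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Decoding <code>slerp(z0, z1, t)</code> for several values of <code>t</code> would give a sequence of sketches that morph from one input to the other.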

=== Sketch Drawing Analogies ===
Since the latent vector <math>z</math> encodes conceptual features of a sketch, those features can also be used to augment other sketches that do not have them. This is possible when models are trained with low <math>L_{KL}</math> values. The authors perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches based on the resulting latent vectors.

=== Predicting Different Endings of Incomplete Sketches ===
This model is able to complete an unfinished sketch by encoding the sketch into a hidden state <math>h</math> using the decoder and then using <math>h</math> as the initial hidden state to generate the rest of the sketch. The authors train decoder-only models on individual classes and set <math>\tau = 0.8</math> to complete samples. Figure 7 shows the results.

[[File:sketchfig7.png|700px|center]]

== Limitations ==

Although sketch-rnn can model a large variety of sketch drawings, there are several limitations in the current approach. For most single-class datasets, sketch-rnn is capable of modelling sequences of around 300 data points. The model becomes increasingly difficult to train beyond this length. For the authors' dataset, the Ramer-Douglas-Peucker algorithm is used to simplify the strokes of the sketch data to fewer than 200 data points.

For more complicated classes of images, such as mermaids or lobsters, the reconstruction loss metrics are not as good compared to simpler classes such as ants, faces or firetrucks. The models trained on these more challenging image classes tend to draw smoother, more circular line segments that do not resemble individual sketches, but rather resemble an averaging of many sketches in the training set. This smoothness may be analogous to the blurriness effect produced by a Variational Autoencoder that is trained on pixel images. Depending on the use case of the model, smooth circular lines can be viewed as aesthetically pleasing and a desirable property.

While both conditional and unconditional models are capable of training on datasets of several classes, sketch-rnn is ineffective at modelling a large number of classes simultaneously. The generated samples will be incoherent, with different classes shown in the same sketch.

== Applications and Future Work ==
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them find interesting intersections between different drawings or objects, or generating many similar but different designs. In the simplest use, pattern designers can apply sketch-rnn to generate a large number of similar but unique designs for textile or wallpaper prints. Creative designers can also come up with abstract designs that resonate more with their target audience.

This model may also find a place in teaching students how to draw. Even with the simple sketches in QuickDraw, the authors of this work became much more proficient at drawing animals, insects, and various sea creatures after conducting these experiments.
When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetically pleasing one. Latent vector augmentation could also help to create a better drawing by incorporating user-rating data during training.

The authors conclude by providing the following future directions to this work:
# Investigate using user-rating data to augment the latent vector in the direction that maximizes the aesthetics of the drawing.
# Look into combining variations of sequence-generation models with unsupervised, cross-domain pixel image generation models.

It's exciting that they manage to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches.

The authors also mention the opposite direction as a more interesting problem: converting a photograph of an object into an unrealistic but similar-looking sketch of the object composed of a minimal number of lines.

Moreover, it would be interesting to see how varying loss will be represented as a drawing. Some exotic form of loss function may change the way that the network behaves, which can lead to various applications.

== Conclusion ==
The paper presents a methodology for modelling sketch drawings using recurrent neural networks. It introduces the sketch-rnn model, which can encode and decode sketches, generate new sketches, and complete unfinished ones. In addition, the authors demonstrate how to interpolate between latent spaces from different classes, and how to use this to augment sketches or generate similar-looking sketches. Furthermore, they show the importance of enforcing a prior distribution on the latent vector for generating coherent sketches during interpolation. Finally, they create a large dataset of sketch drawings for future research.

== Critique ==
This paper presents both a novel large dataset of sketches and a new RNN architecture to generate new sketches. It is very exciting to read, but there are still some aspects to improve.

* The performance of the decoder model can hardly be evaluated. The authors present the decoder's performance by showing generated sketches, which is clear and straightforward but not very efficient. It would be great if the authors could present a metric for evaluating how well the sketches are generated, rather than printing them out and relying on human judgment. The authors did not present an evaluation of the algorithms either. They provided <math>L_R</math> and <math>L_{KL}</math> for reference, but a lower loss does not necessarily mean better performance: training loss alone likely does not capture the quality of a sketch.

* The algorithm lacks a comparison with the prior state of the art on standard metrics, which makes its novelty unclear. Using strokes as inputs is a novel and innovative move, but the paper does not provide a baseline or any comparison with other methods or algorithms. The paper mentions other studies that use similar, smaller datasets. It would be great if the authors could use some basic or existing methods as baselines and compare them with the new algorithm.

* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.

* The authors did not present a complexity analysis or a deeper mathematical analysis of the algorithms in the paper. The paper also does not include a comparison with previous results on more standard metrics. It therefore lacks some algorithmic contribution and would benefit from more formal analysis on the algorithmic side.

* The authors proposed a few future applications for the model, but the current output does not seem very close to their descriptions. Still, this is a very good beginning, and with the release of the sketch dataset it should attract more researchers to build on and improve it.

* As the authors acknowledge, the model becomes increasingly difficult to train as the length of the sketch sequences increases.

== References ==
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment, 2016. URL https://quickdraw.withgoogle.com/.
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.
# T. White. Sampling Generative Networks. [https://arxiv.org/abs/1609.04468 ArXiv e-prints], September 2016.
# Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.

DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS

2018-12-04T00:14:57Z

H454chen: /* Related Work */

=Introduction=

It has been commonly believed that one major advantage of neural networks is their capability of modelling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.

With the growth in available computational power, neural networks have been able to solve many complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. In several areas, such as computational social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models, such as linear regression and decision trees, which are much more interpretable. This summary presents one way of bringing interpretability to a neural network.

Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].

In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix. This approach is efficient because it avoids searching over an exponential solution space of interaction candidates by making an approximation of hidden unit importance at the first hidden layer via all weights above and doing a 2D traversal of the input weight matrix.

Note that this paper considers only one specific type of neural network, the feedforward neural network. The authors suggest that interpretation methods for other types of networks can be built on the methodology discussed here.

=Related Work=

1. Interaction Detection approaches:
* Conduct individual tests for every combination of features, as in ANOVA and Additive Groves. Two-way ANOVA has been a standard method for pairwise interaction detection; it involves conducting a hypothesis test for each interaction candidate using F-statistics (Wonnacott & Wonnacott, 1972). Additive Groves is another method that conducts individual tests for interactions and hence faces the same computational difficulties; however, it is special because the interactions it detects are not constrained to any functional form.
* Define all interaction forms of interest in advance, then find the important ones.

The paper's goal is to detect interactions without restricting their functional form. The proposed method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.

2. Interpretability: A lot of work has also been done in this area, and it can be divided into the following broad categories:
* Feature Importance through Decomposition: Methods like Input Gradient (Sundararajan et al., 2017) learn the importance of features through a gradient-based approach similar to backpropagation. Works such as Li et al. (2017), Murdoch (2017), and Murdoch (2018) study the interpretability of LSTMs by looking at phrase-level and word-level importance scores. Bach et al. (2015) and Shrikumar et al. (2016) (DeepLift) study pixel importance in CNNs.
* Studying Visualizations in Models: Karpathy et al. (2015) worked with character-generating LSTMs and studied the activation and firing of certain hidden units for meaningful attributes. Yosinski et al. (2015) study feature map visualizations, providing one tool for visualizing live activations on each layer of a trained CNN and another for visualizing "Regularized Optimization".
* Attention-Based Models: Bahdanau et al. (2014). This is a different class of models, which use attention modules (in various architectures) to help the neural network decide which parts of the input it should look at more closely or give more importance to. Looking at the results of this type of model gives an indirect sense of interpretability.
* Sum-Product Networks (Hoifung Poon and Pedro Domingos, 2011): This is a deep architecture that provides clear semantics. At its core, it is a probabilistic model with two types of nodes: sum nodes and product nodes. The sum nodes model mixtures of distributions and the product nodes model joint distributions. It can be trained using gradient descent as well as other methods. The main advantage of the Sum-Product Network is that it has clear semantics, so people can interpret exactly how the network makes decisions. It therefore has better interpretability than most current deep architectures.

The approach in this paper is to extract non-additive interactions between variables from the neural network weights.

=Notations=
Before we dive into the methodology, we define some notation. Most of it is standard.

1. Vector: Vectors are written in bold lowercase, '''v, w'''

2. Matrix: Matrices are written in bold uppercase, '''V, W'''

3. Integer Set: For an integer <math>p \in \mathbb{Z}</math>, we define <math>[p] := \{1,2,3,...,p\}</math>

=Interaction=
First of all, in order to explain the model, we need to be able to explain the interactions and their effects on the output. Therefore, we define an 'interaction' between variables as below.

[[File:def_interaction.PNG|900px|center]]

From the definition above, a function like <math>x_1x_2 + \sin(x_3 + x_4 + x_5)</math> has the interactions <math>\{x_1, x_2\}</math> and <math>\{x_3, x_4, x_5\}</math>. We call the latter a 3-way interaction.

Note that, from the definition above, we can deduce that a d-way interaction can exist only if all of its (d-1)-way interactions exist. For example, the 3-way interaction above implies that the 2-way interactions <math>\{3,4\}</math>, <math>\{4,5\}</math> and <math>\{3,5\}</math> exist.
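For pairwise interactions, non-additivity can be probed numerically with a mixed finite difference: if the function can be written as a sum of a part that does not depend on <math>x_i</math> and a part that does not depend on <math>x_j</math>, the mixed difference is zero everywhere. The sketch below applies this check to the example function. The helper name <code>mixed_difference</code>, the test point and the step size are illustrative choices, and this is not the detection method proposed in the paper.

```python
import math

def f(x):
    # x[0..4] stand for x_1..x_5 in the example function.
    return x[0] * x[1] + math.sin(x[2] + x[3] + x[4])

def mixed_difference(func, x, i, j, h=0.5):
    """f(x+h e_i+h e_j) - f(x+h e_i) - f(x+h e_j) + f(x)."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return func(y)
    return shifted(h, h) - shifted(h, 0.0) - shifted(0.0, h) + shifted(0.0, 0.0)

point = [0.3, -1.2, 0.7, 0.1, 0.4]
d_12 = mixed_difference(f, point, 0, 1)   # x_1 and x_2 interact (equals h * h)
d_13 = mixed_difference(f, point, 0, 2)   # x_1 and x_3 do not interact
d_34 = mixed_difference(f, point, 2, 3)   # x_3 and x_4 interact through sin
```

A nonzero value at any point shows that the pair interacts. A zero value at a single point is only evidence of additivity, not proof, so in practice the check would be repeated at many points.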

One thing to keep in mind is that, for models like neural networks, most interactions happen within the hidden layers. This means that we need a proper way of measuring interaction strength.

The key observation is that, for any kind of interaction, there is some hidden unit at some hidden layer that has all of the interacting features as ancestors. In graph-theoretical language, the network can be viewed as a directed graph, and for any interaction <math>\Gamma \subseteq [p]</math> there exists at least one vertex that has all of the features of <math>\Gamma</math> as ancestors. The statement is made rigorous as follows:

[[File:prop2.PNG|900px|center]]

The above statement guarantees that we can measure interaction strengths at any hidden layer. For example, if we want to study interactions at some specific hidden layer, we now know that there exist corresponding vertices between that hidden layer and the output layer. Therefore, all we need is an appropriate measure that summarizes the information between those two layers.

Before doing so, consider a neural network with a single hidden layer. Any one hidden unit <math>i</math> can have up to <math>2^{\|\mathbf{W}_{i,:}\|_0}</math> interactions, where the exponent counts the nonzero input weights of that unit. This means that the search space may be far too large for multi-layered networks, so we need a reasonable way to approximate it. The authors achieve fast interaction detection by limiting the search to interactions created at the first hidden layer. The figure below illustrates an interaction within a fully connected feedforward neural network, where the box contains the later layers of the network.

[[File:network1.PNG|500px|center]]

==Measuring influence in hidden layers==
As discussed above, to consider interactions between units in any layer, we need to think about their outgoing paths. However, for a fully connected multi-layer neural network, the search space may be too large to examine exhaustively. Therefore, we summarize the outgoing paths with a gradient upper bound. To represent the influence of the outgoing paths at the <math>l</math>-th hidden layer, we define the cumulative impact of the weights between the output layer and layer <math>l+1</math>. The aggregated weights are defined as,

[[File:def3.PNG|900px|center]]

Note that <math>z^{(l)} \in \mathbb{R}^{p_l}</math>, where <math>p_l</math> is the number of hidden units in the <math>l</math>-th layer.
Moreover, this quantity acts as a Lipschitz-type upper bound on the gradients. The gradient has been an important quantity for measuring the influence of features, especially considering that the derivative at the input layer gives the direction normal to the decision boundary.
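The definition itself is only shown as an image here. Assuming it takes the form <math>z^{(l)} = |w^y|^T |W^{(L)}| \cdots |W^{(l+1)}|</math>, as in the original paper, the aggregated weights at the first hidden layer can be sketched in plain Python as follows. The function name and the list-of-rows weight layout (one row per unit of the layer) are illustrative assumptions.

```python
def aggregated_weights(hidden_weights, output_weights):
    """Return z^(1), one value per first-layer hidden unit.

    hidden_weights: [W2, ..., WL], where each W has one row per unit of its
    own layer and one column per unit of the previous layer (assumed layout).
    output_weights: w^y, one weight per unit of the last hidden layer.
    """
    z = [abs(w) for w in output_weights]           # start from |w^y|
    for W in reversed(hidden_weights):             # multiply by |W^(L)|, ..., |W^(2)|
        n_prev = len(W[0])
        z = [sum(z[r] * abs(W[r][c]) for r in range(len(W)))
             for c in range(n_prev)]
    return z

W2 = [[1.0, -2.0],
      [0.0, 3.0]]
w_y = [1.0, -1.0]
z1 = aggregated_weights([W2], w_y)
```

Taking absolute values means that paths with opposite signs cannot cancel, which is what makes the result an upper bound rather than an exact gradient.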

==Quantifying influence==
For the <math>i</math>-th hidden unit at the first hidden layer, which is the layer closest to the input layer, we define the influence strength of an interaction as,

[[File:measure1.PNG|900px|center]]

The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer.

For the function <math>\mu</math>, any positive real-valued function, such as the maximum, minimum or average, is a candidate. The effects of these candidates are tested later.

Based on the specifications above, the authors propose the following algorithm for finding influential interactions between input layer units:

[[File:algorithm1.PNG|850px|center]]

It was pointed out that restricting the search to the first hidden layer might miss some important feature interactions. However, the authors state that it is not straightforward to incorporate hidden units at intermediate layers to obtain better interaction detection performance.
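One reading of the greedy ranking can be sketched in plain Python as follows. For each first-layer hidden unit, the input features are sorted by weight magnitude, and every top-<math>j</math> prefix (<math>j \ge 2</math>) is treated as a candidate interaction whose strength is the unit's aggregated weight times <math>\mu</math> of the selected weight magnitudes. The function name, the data layout and the default <math>\mu = \min</math> are assumptions made for illustration.

```python
def rank_interactions(first_layer, z, mu=min):
    """Rank candidate interactions by accumulated strength.

    first_layer[i][k]: weight from input feature k to first-layer unit i.
    z[i]: aggregated weight of unit i (its influence on the output).
    mu: function mapping a list of weight magnitudes to one number.
    """
    strengths = {}
    for row, z_i in zip(first_layer, z):
        magnitudes = [abs(w) for w in row]
        order = sorted(range(len(row)), key=lambda k: magnitudes[k], reverse=True)
        for size in range(2, len(order) + 1):
            candidate = tuple(sorted(order[:size]))        # top-`size` features
            score = z_i * mu([magnitudes[k] for k in candidate])
            strengths[candidate] = strengths.get(candidate, 0.0) + score
    return sorted(strengths.items(), key=lambda item: item[1], reverse=True)

W1 = [[3.0, -2.0, 0.1],
      [0.0, 1.0, 1.5]]
ranking = rank_interactions(W1, z=[1.0, 2.5])
```

Each hidden unit contributes at most <math>p - 1</math> candidates, so the search is linear in the number of features per unit rather than exponential.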

=Cut-off Model=
Using the greedy algorithm defined above, we can rank the interactions by their strength. However, to decide where in this ranking the true interactions end, we build a cut-off model, which is a generalized additive model (GAM) of the form

<center><math>
c_K(\mathbf{x}) = \sum_{i=1}^{p}g_i(x_i) + \sum_{i=1}^{K}{g_i}^\prime(\mathbf{x}_{\chi_i})
</math></center>

In the above model, each <math>g_i</math> and <math>g_i'</math> is a feedforward neural network, and <math>\chi_i</math> denotes the <math>i</math>-th interaction in the ranking. <math>g_i(\cdot)</math> captures the main effects, while <math>g_i'(\cdot)</math> captures the interactions. We keep adding interactions until the performance plateaus.
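The structure of the cut-off model, and one possible way to pick the cut-off, can be sketched as follows. Plain Python callables stand in for the small feedforward networks, and the stopping rule in <code>choose_cutoff</code> (stop when one more interaction improves the validation error by less than a tolerance) is an assumption, since the exact criterion is not given here.

```python
def cutoff_model(x, main_effects, interaction_effects):
    """Evaluate c_K(x): main effects plus the top-K interaction effects.

    main_effects[i]: callable applied to feature x[i].
    interaction_effects: list of (feature_indices, callable) pairs, one per
    ranked interaction; the callable receives the selected features as a list.
    """
    total = sum(g(x[i]) for i, g in enumerate(main_effects))
    for indices, g_prime in interaction_effects:
        total += g_prime([x[k] for k in indices])
    return total

def choose_cutoff(validation_errors, tolerance=1e-3):
    """validation_errors[K] is the error of c_K; return the first K at which
    adding one more interaction no longer helps by at least `tolerance`."""
    for k in range(len(validation_errors) - 1):
        if validation_errors[k] - validation_errors[k + 1] < tolerance:
            return k
    return len(validation_errors) - 1

main = [lambda v: v, lambda v: 2.0 * v, lambda v: 0.0]
pairwise = [((0, 1), lambda vs: vs[0] * vs[1])]
value_k0 = cutoff_model([1.0, 2.0, 3.0], main, [])
value_k1 = cutoff_model([1.0, 2.0, 3.0], main, pairwise)
best_k = choose_cutoff([1.0, 0.6, 0.3, 0.2995, 0.2994])
```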

=Experiment=
In the experiments, the authors compared three neural network models with traditional statistical interaction detection algorithms. The first model is MLP; the second is MLP-M, which is an MLP with additional univariate networks at the output; and the last is the cut-off model defined above, denoted MLP-cutoff. All the networks that modelled feature interactions consisted of four hidden layers containing 140, 100, 60, and 20 units respectively, whereas all the individual univariate networks contained three hidden layers of 10 units each. All of these networks used ReLU activations and were trained with backpropagation. The MLP-M model is represented graphically below.

[[File:output11.PNG|300px|center]]

The authors study their interaction detection framework on both simulated and real-world data. For the simulated experiments, they test on 10 synthetic functions, as shown in Table I.

[[File:synthetic.PNG|900px|center]]

The authors use four real-world datasets, of which two are regression datasets and the other two are binary classification datasets. The datasets are a mixture of common prediction tasks in the cal housing and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of very-high order interaction detection in the letter dataset.

The authors also report comparisons between the models. Neural network based models perform better on average. Compared to traditional methods like ANOVA, the MLP and MLP-M methods show a 20% increase in performance.

[[File:performance_mlpm.PNG|900px|center]]

[[File:performance2_mlpm.PNG|900px|center]]

The above result shows that MLP-M almost perfectly captures the most influential pairwise interactions.

=Higher-order interaction detection=
The authors use their greedy interaction ranking algorithm to perform higher-order interaction detection without an exponential search over interaction candidates.
[[File:higher-order_interaction_detection.png|700px|center]]

=Limitations=
Even though the MLP methods showed superior performance in the synthetic experiments above, the method still has some limitations. For example, for a function like <math>x_1x_2 + x_2x_3 + x_1x_3</math>, the neural network fails to distinguish interlinked pairwise interactions from a single higher-order interaction. Moreover, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are present in most interaction detection algorithms.

Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets are too small or too noisy, which often happens in practical settings.

=Conclusion=
Here we presented a method for detecting interactions using an MLP. Compared to other state-of-the-art methods like Additive Groves (AG), its performance is competitive, yet it requires far less computational power. The method should therefore be very useful for practitioners with comparatively limited computational resources. Moreover, the NID algorithm successfully reduces the size of the computation. Above all, the most important aspect of this algorithm is that users of neural networks can now bring interpretability into their use of the model, which substantially improves usability for practitioners outside machine learning and deep learning.

For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks. It was also pointed out that the method relies heavily on L1-regularized neural network training, and that a group lasso penalty may work better.

=Critique=
1. The authors need to run large-scale experiments, instead of only conducting experiments on synthetic datasets with small feature dimensionality, to make their claims stronger.

2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.

3. A greedy algorithm is implemented, but the paper does not mention the algorithm's speed, which is definitely not fast. This has the potential to be a weak point of the study.

=Reference=

[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013.

[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.

[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016.

[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018.

[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.

[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006

[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.

[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.

[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.

[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.

[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-12-03T23:51:52Z

H454chen: /* CRITIQUE */

Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size '''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is common practice not to use a single fixed learning rate throughout the training of neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition is that when we are far from a minimum, it is beneficial to take large steps towards it, since fewer steps are then needed to converge; as we approach the minimum, the step size should decrease, otherwise we may just keep oscillating around it. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during training while keeping the learning rate constant. In addition, the paper argues that this approach reduces the number of parameter updates required to reach a minimum, leading to greater parallelism and shorter training times. The authors present experimental evidence that the empirical benefits of a decaying learning rate can be obtained by increasing the batch size instead.

== INTRODUCTION ==
Stochastic gradient descent (SGD) is the most widely used optimization technique for training deep learning models. One reason is that the minima found by this process generalize well (Zhang et al., 2016; Wilson et al., 2017). However, the optimization is slow and time consuming, since each parameter update corresponds to only a small step towards the goal. This has motivated researchers (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a) to try to speed up the optimization by taking bigger steps, and hence reduce the number of parameter updates needed to train a model. This can be achieved by large batch training, which can be divided across many machines.

However, increasing the batch size tends to decrease test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) argued that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the number of training samples, and <math>B</math> is the batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and an optimal fluctuation scale <math>g</math> at constant learning rate which maximizes test set accuracy. This proportionality was observed empirically by Goyal et al. (2017), who used it to train a ResNet-50 in under an hour with 76.3% validation accuracy on the ImageNet dataset.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate. They show that this approach achieves almost equivalent model performance on the test set with the same number of training epochs but with remarkably fewer parameter updates. The strategy of increasing the batch size during training in effect decreases the scale of random fluctuations. Moreover, a further reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math>, and a still larger reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math>, although the latter decreases the test accuracy. This has been demonstrated by several experiments on CIFAR-10 using a wide ResNet and on ImageNet using the Inception-ResNet-V2 and ResNet-50 architectures.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum.

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that the learning rate decays fast enough, allowing us to reach the minimum rather than bouncing around it due to noise.

These equations indicate that the learning rate must decay during training, and the second equation is valid only when the batch size is constant. To allow the batch size to change, Smith and Le (2017) proposed to interpret SGD as integrating the stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where <math>C</math> represents the cost function, <math>w</math> represents the parameters, and <math>\eta</math> represents Gaussian random noise. Furthermore, they showed that the noise scale <math>g</math> controls the magnitude of random fluctuations in the training dynamics through the formula <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, <math>N</math> is the training set size and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can write <math> g \approx \epsilon \frac{N}{B} </math>. This explains why, when the learning rate decreases, the noise scale <math>g</math> decreases, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decay at a constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, after which the conventional approach of decaying the learning rate is followed.
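
As a rough illustration of the noise scale formula above, the following Python sketch compares decaying the learning rate by a factor of 5 with increasing the batch size by a factor of 5. The function names and the starting values (learning rate 0.1, batch size 128) are chosen here for illustration and are not taken from the paper's code. Under the approximation <math> g \approx \epsilon \frac{N}{B} </math> the two changes give the same noise scale.

```python
def noise_scale(eps, n_train, batch_size):
    """Exact SGD noise scale g = eps * (N / B - 1)."""
    return eps * (n_train / batch_size - 1)


def noise_scale_approx(eps, n_train, batch_size):
    """Approximation g ~ eps * N / B, valid when B << N."""
    return eps * n_train / batch_size


N = 50_000          # e.g. the CIFAR-10 training set size
eps, B = 0.1, 128   # illustrative starting values (assumption)

g_decay_lr = noise_scale_approx(eps / 5, N, B)      # decay the learning rate
g_grow_batch = noise_scale_approx(eps, N, B * 5)    # increase the batch size
print(g_decay_lr, g_grow_batch)
```

With the exact formula the two values differ slightly, and the difference grows as <math>B</math> approaches <math>N</math>, which is one way to see why the scaling rule is only trusted for <math> B \ll N </math>.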

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Decaying learning rates are empirically successful. To understand this, the authors note that introducing random fluctuations whose scale falls during training is also a well-established technique in non-convex optimization: simulated annealing. The initial noisy optimization phase allows exploring a larger fraction of the parameter space without becoming trapped in local minima. Once a promising region of parameter space is located, the noise is reduced to fine-tune the parameters.

For more info: Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities). For problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time, simulated annealing may be preferable to alternatives such as gradient descent. [https://en.wikipedia.org/wiki/Simulated_annealing [Reference]]
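
To make the idea concrete, here is a minimal sketch of a generic simulated-annealing loop in Python. It is a textbook-style illustration on a toy one-dimensional cost, not the procedure used in the paper; the cost function, the geometric cooling factor, and all parameter values are assumptions made for this example.

```python
import math
import random


def simulated_annealing(cost, x0, temp0=1.0, cooling=0.95, steps=500, seed=0):
    """Minimise `cost` over the reals with a basic simulated-annealing loop."""
    rng = random.Random(seed)
    x, best = x0, x0
    temp = temp0
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 1.0)   # random local move
        delta = cost(candidate) - cost(x)
        # Always accept improvements; accept worse moves with prob. exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
        if cost(x) < cost(best):
            best = x                          # remember the best point seen so far
        temp *= cooling                       # the noise scale falls over time
    return best


def bumpy(x):
    """Non-convex toy cost with several local minima (hypothetical example)."""
    return 0.1 * x * x + math.sin(3.0 * x)


x_best = simulated_annealing(bumpy, x0=4.0)
print(x_best, bumpy(x_best))
```

Early on, the high temperature lets the search accept uphill moves and cross between basins; as the temperature falls, the search settles into whichever basin it currently occupies.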

'''Generalization Gap:''' Models trained with small batches generalize better to the test set than models trained with large batches.

Smith and Le (2017) found that there is an optimal batch size which corresponds to an optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> corresponds to maximum test set accuracy. This means that gradient noise is helpful, as it lets SGD escape sharp minima, which do not generalize well.

Simulated annealing is a well-known technique in non-convex optimization. Starting with noise in the training process helps us explore a wide range of parameters; once we are near the optimum, the noise is reduced to fine-tune the final parameters. However, researchers increasingly favour sharp decay schedules such as cosine decay or step-function drops. In the physical sciences, slowly annealing (or decaying) the temperature (which corresponds to the noise scale here) helps the system converge to the global minimum, which may be sharp. Decaying the temperature in discrete steps, by contrast, can trap the system in a local minimum with higher cost and lower curvature. The authors suggest that a similar intuition holds in deep learning.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) extended the vanilla SGD noise scale defined above to include momentum: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which reduces to the previous equation as m goes to 0. They found that increasing the learning rate and momentum coefficient while scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

The accumulation variable A, which is initially set to 0, grows exponentially towards its steady-state value over roughly <math> \frac{B}{N(1-m)} </math> training epochs. During this period <math> \Delta w </math> is suppressed, which can reduce the rate of convergence. Moreover, at high momentum, we have four challenges:

'''1.''' Additional epochs are needed to catch up with the accumulation.

'''2.''' Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

'''3.''' After this time, however, the accumulation cannot adapt to changes in the loss landscape.

'''4.''' In the early stage, a large batch size can lead to instabilities.

It is thus recommended to keep a reduced learning rate for the first few epochs of training.
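
The warm-up of the accumulation variable can be illustrated with a small Python sketch that iterates the update <math> \Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} </math> from <math>A = 0</math>. For simplicity it assumes a constant gradient of 1, which is an idealisation introduced here and not part of the paper; the function names are likewise hypothetical.

```python
def accumulation_trace(m, grad=1.0, n_updates=200):
    """Iterate dA = -(1 - m) * A + grad from A = 0 with a constant gradient."""
    a, trace = 0.0, []
    for _ in range(n_updates):
        a += -(1.0 - m) * a + grad
        trace.append(a)
    return trace


def epochs_to_warm_up(m, n_train, batch_size):
    """Timescale B / (N * (1 - m)) in epochs, i.e. 1 / (1 - m) updates."""
    return batch_size / (n_train * (1.0 - m))


for m in (0.9, 0.98):
    trace = accumulation_trace(m)
    steady = 1.0 / (1.0 - m)      # steady-state value of A for grad = 1
    k = round(1.0 / (1.0 - m))    # characteristic number of updates
    print(m, trace[k - 1] / steady, epochs_to_warm_up(m, 50_000, 3200))
```

Under this idealisation, A reaches about two thirds of its steady-state value after <math> \frac{1}{1-m} </math> updates, so raising the momentum from 0.9 to 0.98 makes the warm-up (and the time needed to forget old gradients) about five times longer in updates.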

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Schedules''' (shown in the figure below): These demonstrate the equivalence between decreasing the learning rate and increasing the batch size.

- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant

- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: At the beginning, the learning rate is constant and the batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This schedule is useful when hardware imposes a maximum batch size.

If the learning rate itself must decay during training, then these schedules should show different learning curves (as a function of the number of training epochs) and reach different final test set accuracies. Meanwhile, if it is the noise scale which should decay, all three schedules should be indistinguishable.
[[File:Paper_40_Fig_1.png | 800px|center]]
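
The three schedules described above can be sketched in Python as follows. The initial learning rate (0.1), initial batch size (128), and number of steps (3) are illustrative assumptions, and `make_schedules` is a hypothetical helper, not code from the paper. The sketch prints the approximate noise scale <math> \epsilon \frac{N}{B} </math> and the number of parameter updates per epoch in each phase.

```python
def make_schedules(eps0=0.1, b0=128, factor=5, n_steps=3, hybrid_switch=1):
    """Return {name: [(learning_rate, batch_size), ...]} for n_steps + 1 phases."""
    schedules = {"decaying_lr": [], "increasing_batch": [], "hybrid": []}
    for k in range(n_steps + 1):
        schedules["decaying_lr"].append((eps0 / factor**k, b0))
        schedules["increasing_batch"].append((eps0, b0 * factor**k))
        # Hybrid: grow the batch for the first `hybrid_switch` steps, then decay the rate.
        grow = min(k, hybrid_switch)
        schedules["hybrid"].append((eps0 / factor**(k - grow), b0 * factor**grow))
    return schedules


N = 50_000  # CIFAR-10 training set size
for name, phases in make_schedules().items():
    noise = [round(eps * N / b, 4) for eps, b in phases]   # g ~ eps * N / B
    updates_per_epoch = [N // b for _, b in phases]
    print(name, noise, updates_per_epoch)
```

All three schedules produce the same sequence of noise scales, while the schedules that grow the batch size need far fewer updates per epoch in the later phases.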

As shown in the figure below, the training-set learning curves of the three schedules are almost identical (figure 2a), while figure 2b shows that increasing the batch size greatly reduces the number of parameter updates. This suggests that it is the noise scale, not the learning rate itself, that needs to be decayed.
[[File:Paper_40_Fig_2.png | 800px|center]]

Figure 3 checks that the same holds on the test set: the three learning curves are again almost identical for SGD with momentum and for Nesterov momentum.
[[File:Paper_40_Fig_3.png | 800px|center]]

To check other optimizers as well, the figure below repeats the experiment of figure 3 (the three test-set learning curves) for vanilla SGD and Adam.
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

Here, the focus is on minimizing the number of parameter updates required to train a model. As shown above, the first step is to replace decaying learning rates by increasing batch sizes. The authors then show that the effective learning rate <math>\epsilon_{eff} = \epsilon/(1 - m) </math> can also be increased at the start of training, while scaling the initial batch size <math>B \propto \epsilon_{eff} </math>. All experiments are conducted using SGD with momentum. There are 50,000 images in the CIFAR-10 training set, and since the scaling rules only hold when <math>B \ll N </math>, the authors set a maximum batch size <math>B_{max} = 5120</math>.

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

The authors consider four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. Follows the implementation of Zagoruyko & Komodakis (2016).

Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increases during training.

Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increases during training, and an increased momentum coefficient of 0.98.
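
As a quick arithmetic check, the Python sketch below computes the effective learning rate and the approximate initial noise scale <math> g \approx \frac{\epsilon N}{B(1-m)} </math> for the four schedules listed above. The momentum of 0.9 for the "increased initial learning rate" schedule is an assumption, since it is not stated explicitly above, and the function name is hypothetical.

```python
def noise_scale_momentum(eps, m, n_train, batch_size):
    """Approximate noise scale g ~ eps * N / (B * (1 - m)) for SGD with momentum."""
    return eps * n_train / (batch_size * (1.0 - m))


N = 50_000  # CIFAR-10 training set size
initial_settings = {
    # name: (learning rate, momentum, initial batch size)
    "original": (0.1, 0.9, 128),
    "increasing batch size": (0.1, 0.9, 128),
    "increased initial learning rate": (0.5, 0.9, 640),   # momentum 0.9 assumed
    "increased momentum coefficient": (0.5, 0.98, 3200),
}
for name, (eps, m, b) in initial_settings.items():
    eps_eff = eps / (1.0 - m)
    g = noise_scale_momentum(eps, m, N, b)
    print(f"{name}: eps_eff = {eps_eff:.1f}, g = {g:.1f}")
```

With these numbers all four schedules start from the same noise scale, which is consistent with the scaling rule <math>B \propto \epsilon_{eff} </math>: the effective learning rate grows from 1 to 5 to 25 while the initial batch size grows from 128 to 640 to 3200.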

The results of all training schedules, which are presented in the below figure, are documented in the following table:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where batch size is fixed and the learning rate is decayed

“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.
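
The order of magnitude of these update counts can be reproduced with a short Python calculation. It assumes that the "increasing batch size" schedule uses a batch size of 8192 for the first 30 epochs and 81920 for the remaining 60, and it uses the rounded training set size of 1.28 million images; both are simplifying assumptions for illustration, and `count_updates` is a hypothetical helper.

```python
import math


def count_updates(n_train, phases):
    """Total parameter updates for phases given as (n_epochs, batch_size) pairs."""
    return sum(n_epochs * math.ceil(n_train / batch) for n_epochs, batch in phases)


N = 1_280_000  # approximate ImageNet training set size
fixed_batch = count_updates(N, [(90, 8192)])
growing_batch = count_updates(N, [(30, 8192), (60, 81920)])
print(fixed_batch, growing_batch)
```

The results are close to the roughly 14,000 and 6,000 updates quoted above; almost all of the saving comes from the 60 epochs run at the larger batch size.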

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The below table shows the number of parameter updates and accuracy for different sets of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while varying the number of TPU devices, the batch size, and the learning rate, and then measured the time to complete 90 epochs and the resulting accuracy. The experiments are summarized below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted stochastic gradient descent as a stochastic differential equation; the paper builds on this idea to cover decaying learning rates.

- Mandt et al. (2017) analyzed how to modify SGD for the task of Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once the training is started.

- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017), who used it to successfully train ResNet-50 on ImageNet in one hour.

- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which applies different learning rates to different layers, and used it to train on ImageNet in 14 minutes with 74.9% accuracy.

- Wilson et al. (2017) argued that adaptive optimization methods tend to generalize less well than SGD and SGD with momentum (although they did not include K-FAC in their study), while the authors' work reduces the gap in convergence speed.

- Finally, another strategy, asynchronous SGD (Recht et al., 2011; Dean et al., 2012), allows the use of multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing the batch size during training has the same benefits as decaying the learning rate, while also reducing the number of parameter updates, which corresponds to faster training. Experiments were performed on different image datasets and various optimizers with different training schedules to support this result. The paper also proposed increasing the learning rate and momentum parameter <math>m</math> while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates but slightly lower test set accuracy, as detailed in the experiments section. In summary, on the ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation accuracy on TPU in less than 30 minutes. A notable aspect of this paper is that all the methods use hyper-parameters taken directly from previous works in the literature, and no additional hyper-parameter tuning was performed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- Had several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if the learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results do not seem to be too surprising in comparison with other recent papers in the subject.

- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.

- The paper presents interesting ideas. However, it lacks mathematical and theoretical analysis beyond the basic idea. Since the experiments are primarily on image datasets and little supporting theory is provided, the paper's applicability to other types of data is limited.

- Also, each experimental setting uses only a single training run from one random initialization. It would be better to take the best of many runs or to show confidence intervals.

- It is proposed that we should compare learning rate decay with batch-size increase under the setting that total budget / number of training samples is fixed.

- While the paper demonstrated that the proposed solution can decrease training time, it is not an entirely fair comparison because computations were distributed on a TPU pod. If computing resources were held fixed, the proposed method might train more slowly.

== REFERENCES ==
#Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.
#Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.
#Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.
#Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
#Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
#Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.
#Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
#James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.
#Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.
#Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.
#Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.
#Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.
#Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
#Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.
#Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.
#Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.
#Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.
#Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Visual Reinforcement Learning with Imagined Goals

2018-11-27T04:58:26Z

H454chen: /* Related Work */

Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and are thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent could also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

The algorithm proposed by the authors is summarised below. A Variational Auto Encoder (VAE) (left) is trained to learn a latent representation of images gathered during training time (center). These latent variables can then be used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).

[[File: WF_Sec_11Nov25_01.png | 800px]]

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some previous works such as Levine et al. [11] proposed time-varying models which require episodic setups. Other works such as Pinto et al. [12] proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training, but do not learn goal-conditioned skills. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment; appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state; see [https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly].) There are currently no examples of model-free reinforcement learning being used to learn policies on real-world robotic systems without ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting replay goals to states seen along the sampled trajectory as was done in earlier works. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is given to the policy as a substitute for the state. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math>, that when given a state <math>s_t</math> and goal <math>g</math>, can dictate the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is defined as an action which maximizes the expected return denoted by <math>R_t</math> and defined as <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> and <math>\gamma</math> is a discount factor. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.
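As a small worked illustration of the return defined above, the following sketch computes the discounted sum for a single sampled reward sequence (the expectation in the definition is taken over many such sequences). The function name and the example rewards are hypothetical and are not taken from the paper.

```python
def discounted_return(rewards, gamma, t=0):
    """Sum of gamma^(i-t) * r_i for i = t..T over one sampled trajectory."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)

# Hypothetical rewards r_0..r_3 and discount factor (illustration only).
rewards = [-3.0, -2.0, -1.0, 0.0]
R0 = discounted_return(rewards, gamma=0.5)        # -3 - 1 - 0.25 + 0
R2 = discounted_return(rewards, gamma=0.5, t=2)   # -1 + 0
```

Rewards far in the future are weighted less, so a policy that maximizes this quantity prefers reaching high-reward states sooner.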

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization

<div align="center">
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math>
</div>

which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
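Choosing the best action according to a goal-conditioned Q-function can be sketched as follows. This is a minimal illustration that assumes a small tabular Q-function with discrete states, actions and goals; the paper itself works with continuous actions and learned function approximators, and all names below are hypothetical.

```python
import numpy as np

def greedy_policy(Q, s, g):
    """Return the action index with the highest Q-value for state s and goal g."""
    return int(np.argmax(Q[s, :, g]))

# Hypothetical tabular Q-function: 2 states, 3 actions, 2 goals.
Q = np.zeros((2, 3, 2))
Q[0, 2, 1] = 1.0   # in state 0 with goal 1, action 2 looks best
Q[0, 0, 0] = 0.5   # in state 0 with goal 0, action 0 looks best

action_for_goal_1 = greedy_policy(Q, s=0, g=1)
action_for_goal_0 = greedy_policy(Q, s=0, g=0)
```

The same state yields different actions for different goals, which is what makes a single Q-function usable for many tasks.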

Q-learning is popular because it can be trained in an off-policy manner. Therefore, the only things a Q-function needs are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:

<div align="center">
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math>
</div>

where <math>\overline{w}</math> is treated as some constant.
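The Bellman error above can be evaluated for a single tuple as in the sketch below. It assumes tabular Q-functions stored as NumPy arrays indexed by (state, action, goal), with a separate array standing in for the constant parameters <math>\overline{w}</math>; these names and values are hypothetical.

```python
import numpy as np

def bellman_error(Q_w, Q_target, s, a, s_next, g, r, gamma):
    """0.5 * (Q_w(s,a,g) - (r + gamma * max_a' Q_target(s',a',g)))^2."""
    target = r + gamma * np.max(Q_target[s_next, :, g])
    return 0.5 * (Q_w[s, a, g] - target) ** 2

# Hypothetical tables: 2 states, 2 actions, 1 goal.
Q_w = np.zeros((2, 2, 1))
Q_target = np.zeros((2, 2, 1))   # plays the role of the constant parameters
Q_w[0, 1, 0] = 1.0
Q_target[1, 0, 0] = 2.0

# target = -1 + 0.5 * 2 = 0, so the error is 0.5 * (1 - 0)^2
err = bellman_error(Q_w, Q_target, s=0, a=1, s_next=1, g=0, r=-1.0, gamma=0.5)
```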

The main drawback of this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data were available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually used to get state-action-next-state data <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated from a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified like so:

[[File:qlr.png|center|600px]]
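The relabelling step can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: states and goals are plain numbers, the reward is assumed to be the negative distance between the next state and the goal, and the goal sampler is a hypothetical stand-in for <math>p(g)</math>.

```python
import random

def relabel(transition, sample_goal, reward_fn):
    """Replace the goal of a stored (s, a, s', g, r) tuple and recompute its reward."""
    s, a, s_next, _, _ = transition
    g_new = sample_goal()
    return (s, a, s_next, g_new, reward_fn(s_next, g_new))

# Assumed reward: negative distance between next state and goal.
reward_fn = lambda s_next, g: -abs(s_next - g)

# Hypothetical goal distribution over three candidate goals.
rng = random.Random(0)
sample_goal = lambda: rng.choice([0.0, 1.0, 2.0])

original = (0.0, +1, 1.0, 2.0, -1.0)   # (s, a, s', g, r)
augmented = [relabel(original, sample_goal, reward_fn) for _ in range(4)]
```

Each relabelled tuple reuses the same environment interaction, so the replay buffer grows without any additional data collection.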

This goal resampling makes it possible to simultaneously learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. Images are noisy, and a large amount of the information in an image may not be related to the object being analyzed. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent could autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder=
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and uses distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.
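A latent-distance reward of this kind can be sketched as below. The encoder here is a hypothetical fixed linear map standing in for a trained VAE encoder, and the Euclidean norm is an assumed choice of distance; the example is only meant to show that pixels the encoder ignores do not affect the reward.

```python
import numpy as np

def latent_reward(encode, image, goal_image):
    """Negative Euclidean distance between the encodings of image and goal image."""
    z, z_g = encode(image), encode(goal_image)
    return -float(np.linalg.norm(z - z_g))

# Hypothetical stand-in for a trained encoder: a fixed linear map from a
# flattened 4-pixel image to a 2-D latent vector (it ignores the last two pixels).
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
encode = lambda x: W @ x

x = np.array([0.0, 0.0, 0.3, 0.9])
x_g = np.array([3.0, 4.0, 0.8, 0.1])
r = latent_reward(encode, x, x_g)   # latent codes are (0, 0) and (3, 4)
```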

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons. First, it provides a mechanism for an agent to set its own goals: the agent simply samples a value for the latent variable from the generative model and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE is trained on real images, meaningful latent goals can be sampled from the latent variable prior. This helps the agent set its own goals and practice towards them when no goal is provided.

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]
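Sampling imagined goals from the prior can be sketched as follows, assuming the usual VAE choice of a standard normal prior over the latent space. The latent dimension and function name are hypothetical.

```python
import numpy as np

def sample_latent_goals(n_goals, latent_dim, rng):
    """Draw latent goals z_g from an assumed standard normal prior N(0, I)."""
    return rng.standard_normal((n_goals, latent_dim))

rng = np.random.default_rng(0)
z_goals = sample_latent_goals(n_goals=5, latent_dim=4, rng=rng)
```

Each row is one imagined goal that can be handed to the goal-conditioned policy or used to relabel a stored transition.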

The authors summarize the purpose of the latent variable representation of images as follows: it (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

The data is first collected via a simple exploration policy. The proposed model allows alternative exploration policies to be used, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and fine-tune it over the course of training. The VAE latent space is used to condition the policy on a goal sampled from the latent model. The VAE model is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample goals from the prior and compute rewards in the latent space.

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

The figure below shows the performance of different algorithms on these tasks. The experiments used a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocity. The plots show the distance to the goal state as a function of simulation steps. The oracle, as a baseline, was given true object location information, as opposed to visual pixel information.

[[File:WF_Sec_11Nov_25_02.png|1000px]]

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies are also varied while fixing other components of the algorithm to study the effect of relabeling strategy. In this experiment, the RIG, which is an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.
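An equal mixture of the two relabelling strategies can be sketched as below. The sketch assumes that "Future" means choosing a goal among states reached later in the same trajectory and that "VAE" means drawing a goal from the prior; the string placeholders stand in for latent vectors and all names are hypothetical.

```python
import random

def sample_replay_goal(future_latents, sample_prior, rng, p_prior=0.5):
    """Pick a replay goal: from the prior with probability p_prior, else a future state."""
    if not future_latents or rng.random() < p_prior:
        return sample_prior()
    return rng.choice(future_latents)

rng = random.Random(0)
future_latents = ["z_t+1", "z_t+2", "z_t+3"]   # placeholders for encoded future states
sample_prior = lambda: "z_prior"               # placeholder for a draw from the VAE prior

goals = [sample_replay_goal(future_latents, sample_prior, rng) for _ in range(1000)]
share_prior = goals.count("z_prior") / len(goals)
```

With `p_prior=0.5` roughly half of the replay goals come from each strategy; setting it to 0 or 1 recovers the two pure strategies.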

Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true object position during training, they report test episode returns computed with the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. with a procedure that is not only goal-oriented but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path towards general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A recent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them; the approach is abbreviated ARC, for actionable representations for control.

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the position of the robotic arm is not stable during training and test time. It is likely that the encoder compresses the image features too much, so that the images reconstructed from the latent space are too blurry to be used as goal images. It would be better if this were investigated in future work. It would also be worth investigating a method with multiple data sources, in which the agent is trained to choose the source that has more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with their relative positions in the goal images. The same situation is also observed in the variable-object experiment. One may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

Unsupervised Neural Machine Translation

2018-11-27T04:52:31Z

H454chen: /* Critique */

This paper was published at ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. An open-source implementation of this paper is available [https://github.com/artetxem/undreamt here].

= Introduction =
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses monolingual corpora (single language texts) only. This contrasts with the usual supervised NMT approach, which relies on parallel corpora (aligned text) from the source and target languages being available for training. This problem is important because parallel corpora do not exist for the majority of language pairs, e.g. German-Russian.

Other authors have recently tried to address this problem using semi-supervised approaches (with a small set of parallel corpora). However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information altogether and relies solely on monolingual data. It builds upon recent work on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.
# Align the 2 sets of word embeddings into a single cross lingual (language independent) embedding.
Then iteratively perform:
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and different decoders for each language. The encoder uses cross lingual word embedding.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus to vector embeddings. These embeddings have been shown to contain contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.
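One common way to fit such a linear transformation from a bilingual dictionary is to restrict it to an orthogonal matrix and solve the resulting orthogonal Procrustes problem with an SVD. The sketch below illustrates that idea on toy 2-D embeddings; it is an assumed formulation for illustration and is not claimed to be the exact procedure of any of the cited methods.

```python
import numpy as np

def fit_orthogonal_map(X, Y):
    """Orthogonal W minimising ||X W - Y||_F (orthogonal Procrustes via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Hypothetical dictionary: row i of X (source) and row i of Y (target) are the
# embeddings of a word pair that translate to each other.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
R = np.array([[0.0, 1.0],
              [-1.0, 0.0]])   # toy ground truth: target space is a rotated copy
Y = X @ R

W = fit_orthogonal_map(X, Y)
mapped = X @ W                # source embeddings mapped into the target space
```

Once the map is fitted, a source word can be translated word-by-word by looking up the nearest target embedding to its mapped vector.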

The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner and without parallel data. Without going into the details, the general approach of that method is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and iteratively learn the mapping between the 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. This is in contrast to earlier work, which used dictionaries of a few thousand words.

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work on statistical decipherment techniques for developing a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). Decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts. These techniques treat the source language as ciphertext, i.e. encoded information that is unreadable without the proper cipher for decoding, and model the generation of the ciphertext as a two-stage process: the generation of the original English sequence, followed by the probabilistic replacement of the words in it. This approach takes advantage of the incorporation of syntactic knowledge of the languages. The use of word embeddings has also shown improvements in statistical decipherment.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do NMT. Some use a third intermediate language that is well connected to the source and target languages independently. For example, to translate German into Russian, one can use English as an intermediate language (German-English and then English-Russian), since there are plenty of resources connecting English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well for language pairs for which no parallel data was used during training. Firat et al. (2016) and Chen et al. (2017) showed that more advanced models, such as a teacher-student framework, can improve over the baseline of translating through a third intermediate language.

Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]. A synthetic parallel corpus is created by translating monolingual sentences in the target language into the source language, and the resulting sentence pairs are used as additional training data.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start (about 1.2 million sentences), while this paper does not use parallel data.

= Methodology =

The corpora are first preprocessed in a standard way to tokenize and truecase the words. The authors also experiment with an alternative way of tokenizing words using Byte-Pair Encoding (BPE) [Sennrich, 2016]. Byte pair encoding, or digram coding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. BPE has been shown to improve embeddings of rare words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).
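A single merge step of byte pair encoding, as described above, can be sketched on a toy character sequence. This is a simplified illustration with hypothetical function names: instead of substituting an unused byte, the merged pair is kept as a new multi-character symbol, which is closer to how BPE is used for subword tokenization.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Most common pair of adjacent symbols in the sequence."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("aaabdaaabac")
pair = most_frequent_pair(symbols)    # ('a', 'a') occurs most often
step1 = merge_pair(symbols, pair)     # 'aa' becomes a single symbol
```

Repeating the two functions on `step1` would merge the next most frequent pair, and so on until the desired vocabulary size is reached.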

The tokens are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target languages, while the decoder is different for each language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT systems are usually built for a single translation direction, English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps them fixed throughout training, so the encoder receives language-independent representations of the words. This approach ensures that the encoder only learns how to compose the language-independent representations to build representations of larger phrases. This requires existing unsupervised methods to create embeddings using monolingual corpora, as discussed in the background. In the proposed method, even though the embeddings used are cross-lingual, the vocabulary used for each language is different. This way, a word which occurs in two different languages but has a different meaning in each would get a different vector in each of these languages despite being in the same vector space.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===
Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>. In the proposed model, <math>C(x)</math> is constructed by randomly swapping contiguous words. If the length of the input sequence <math>x</math> is <math>N</math>, then a total of <math>\frac{N}{2}</math> such swaps are made.
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function performs <math>N/2</math> random swaps of contiguous words, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce the reliance of the model on the order of words in a sentence, which may be different in the source and target languages. The system also needs to learn the structure of a language correctly in order to decode the sentence into the correct word order.
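The swap noise described in this section can be sketched as follows. The sketch assumes that <math>N/2</math> is rounded down for sentences with an odd number of words and that each swap position is drawn uniformly at random; the function name is hypothetical.

```python
import random

def add_swap_noise(tokens, rng):
    """Return a copy of tokens after len(tokens) // 2 random swaps of adjacent words."""
    noisy = list(tokens)
    for _ in range(len(noisy) // 2):
        i = rng.randrange(len(noisy) - 1)   # swap the words at positions i and i + 1
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

rng = random.Random(0)
tokens = "the cat sat on the mat".split()
noisy = add_swap_noise(tokens, rng)   # same words, locally shuffled order
```

Because only neighbouring words are exchanged, the noisy sentence keeps all of its words and most of its rough order, which is what the decoder has to repair.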

===Back-Translation===

With only denoising, the system has no objective that improves the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding it again with the source decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L2,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
Greedy decoding was used at training time for back-translation, while inference at test time was done using beam search with a beam size of 12.
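The alternation of objectives within one training iteration can be sketched as a simple schedule. The step functions below are hypothetical stand-ins that only record which update would be performed, and a fixed number of iterations is used in place of a convergence check.

```python
def run_training(n_iterations, denoise_step, backtranslate_step):
    """Alternate the two objectives, one mini-batch per language or direction."""
    for _ in range(n_iterations):
        denoise_step("L1")
        denoise_step("L2")
        backtranslate_step("L1", "L2")
        backtranslate_step("L2", "L1")

# Hypothetical stand-ins for the real parameter updates.
log = []
run_training(
    n_iterations=2,
    denoise_step=lambda lang: log.append(("denoise", lang)),
    backtranslate_step=lambda src, tgt: log.append(("backtranslate", src, tgt)),
)
```

In a real system each stand-in would draw a mini-batch of monolingual sentences and take one gradient step on the corresponding cross entropy loss.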

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.

The paper trains the translation model under 3 different settings to compare performance (Table 1). All training and testing data came from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

The results show that backtranslation is essential for the proposed system to work properly. The denoising technique alone is below the baseline while big improvements appear when introducing backtranslation.

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise during evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements between 40% and 140%. Results also show that back-translation is essential. Denoising alone doesn't show a big improvement; however, it is required for back-translation, because otherwise back-translation would translate nonsensical sentences. The addition of back-translation does show a large improvement in all tested cases.

For the BPE experiment, results show it helps in some language pairs but detracts in others. This is because while BPE helped to translate some rare words, it increased the error rates for other words. It also did not perform well when translating infrequent named entities.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. Surprisingly, the semi-supervised system in row 6 outperforms the supervised system in row 7. One possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set covers corpora from all domains.

===Supervised===

This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs, plus the UN and Gigaword corpora for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments for comparison with the semi-supervised scenario.

The comparable NMT system was trained using the same proposed model, except that it does not use monolingual corpora and consequently was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT system in row 10, which suggests that the additional constraints that enable unsupervised learning also limit the potential performance. To improve these results, the authors suggest using larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations. They show that the proposed system can produce high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able to go beyond a literal word-by-word substitution and also model structural differences between the languages (e.g., it correctly translates "l’aéroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty with odd sentence structures, which shows that the proposed system has limitations. In particular, the authors point out that the proposed model has difficulty preserving some concrete details from the source sentences. The results also show that the proposed model's translation quality often lags behind that of a standard supervised NMT system, and there are some cases with both fluency and adequacy problems that severely hinder understanding of the original message, suggesting that there is still room for improvement and possible future work.

=Conclusions and Future Work=

The paper presented an unsupervised model that performs translation using only monolingual corpora, built on an attention-based encoder-decoder system and trained with denoising and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character-level information into the model, which might help address some of the adequacy issues observed in the authors' manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

As pointed out, in order to critically analyze the effect of the algorithm, we need to formulate the algorithm in terms of mathematics.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and suggests that the techniques used create a limitation and a ceiling on performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages (English and French), and the model does much worse for English-German, even though English and German are also closely related (but less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis only checks whether the proposed unsupervised NMT system generates a sensible translation. It is limited, and a more detailed analysis of the characteristics and properties of the translations generated by unsupervised NMT is needed.

* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW]) Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"

Unsupervised Neural Machine Translation

2018-11-27T04:52:07Z

H454chen: Small editorial edits

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. An open-source implementation of this paper is available [https://github.com/artetxem/undreamt here].

= Introduction =
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses monolingual corpora (single language texts) only. This contrasts with the usual supervised NMT approach, which relies on parallel corpora (aligned text) from the source and target languages being available for training. This problem is important because parallel corpora do not exist for the majority of language pairs, e.g. German-Russian.

Other authors have recently tried to address this problem using semi-supervised approaches (with a small set of parallel corpora). However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information altogether and relies solely on monolingual data. It builds upon recent work on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.
# Align the 2 sets of word embeddings into a single cross-lingual (language-independent) embedding space.
Then iteratively perform:
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and a different decoder for each language. The encoder uses the cross-lingual word embeddings.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus to vector embeddings. These embeddings have been shown to capture contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner and without parallel data. Without going into the details, the general approach is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and iteratively learn the mapping between the 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. This is in contrast to earlier work which used dictionaries of a few thousand words.
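
As a rough sketch of what learning the mapping involves, the mapping step is commonly posed as a least-squares problem. The notation is assumed here for illustration: <math>X</math> and <math>Z</math> are the embedding matrices of the two languages (one word per row), and <math>D</math> is a binary dictionary matrix with <math>D_{ij} = 1</math> if word <math>i</math> in L1 is paired with word <math>j</math> in L2.

<math>W^* = \arg\min_W \sum_i \sum_j D_{ij} \, \| X_{i*} W - Z_{j*} \|^2</math>

If <math>W</math> is constrained to be orthogonal, this problem has the closed-form solution <math>W^* = U V^T</math>, where <math>U \Sigma V^T</math> is the singular value decomposition of <math>X^T D Z</math>. The dictionary step then re-pairs each L1 word with its nearest neighbour among the L2 embeddings after mapping, and the two steps are repeated.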

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering techniques (decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts) to develop a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext (encrypted or encoded information because it contains a form of the original plaintext that is unreadable by a human or computer without the proper cipher for decoding) and model the generation process of the ciphertext as a two-stage process, which includes the generation of the original English sequence and the probabilistic replacement of the words in it. This approach takes advantage of the incorporation of syntactic knowledge of the languages. The use of word embeddings has also shown improvements in statistical decipherment.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do NMT. Some use a third intermediate language that is well connected to the source and target languages independently. For example, if we want to translate German into Russian, we can use English as an intermediate language (German-English and then English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well for language pairs for which no parallel data was used during training. Firat et al. (2016) and Chen et al. (2017) showed that advanced models such as a teacher-student framework can improve over the baseline of translating through a third intermediate language.

Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]. First, a synthetic parallel corpus is created by translating monolingual sentences in the target language back into the source language with an existing model. The resulting sentence pairs are then used as additional training data.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start (about 1.2 million sentences), while this paper does not use parallel data.

= Methodology =

The corpora are first preprocessed in a standard way to tokenize and truecase the words. The authors also experiment with an alternative way of tokenizing words by using Byte-Pair Encoding (BPE) [Sennrich, 2016]. (Byte pair encoding, or digram coding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data.) BPE has been shown to improve embeddings of rare words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).
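
To make the BPE idea concrete, here is a minimal sketch of learning merge operations on characters rather than bytes, in the spirit of [Sennrich, 2016]. The function name, the <code>&lt;/w&gt;</code> end-of-word marker, the tie-breaking rule, and the toy word counts are assumptions for illustration, not the authors' setup.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge operations from a {word: frequency} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, num_merges=3)
print(merges)
```

On this toy input the frequent suffix "est" ends up as a single symbol, which illustrates how rare words such as "widest" can be represented with subword units shared with more frequent words.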

The tokens are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different for each language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT systems are usually built for a single translation direction, English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This approach ensures that the encoder only learns how to compose the language-independent representations to build representations of the larger phrases. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background. In the proposed method, even though the embeddings used are cross-lingual, a separate vocabulary is used for each language. This way, a word which occurs in two different languages but has a different meaning in each would get a different vector in each language, despite both vectors being in the same vector space.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===
Random noise is added to the input sentences so that the model learns some of the structure of the languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, which are then decoded by the language-dependent decoder.

Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>. In the proposed model, <math>C(x)</math> is constructed by randomly swapping contiguous words. If the length of the input sequence <math>x</math> is <math>N</math>, then a total of <math>\frac{N}{2}</math> such swaps are made.
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function performs <math>N/2</math> random swaps of contiguous words, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce the model's reliance on the order of words in a sentence, which may differ between the source and target languages. The system also needs to learn the structure of a language correctly in order to decode the sentence into the correct order.
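
The noise function described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and rounding <math>N/2</math> down for odd <math>N</math> is an assumption.

```python
import random


def add_swap_noise(tokens, rng=None):
    """Return a noisy copy of `tokens` after N // 2 random swaps of adjacent words."""
    if rng is None:
        rng = random.Random()
    noisy = list(tokens)          # copy, so the original sentence is left untouched
    n = len(noisy)
    if n < 2:
        return noisy              # nothing to swap
    for _ in range(n // 2):
        i = rng.randrange(n - 1)  # pick a position and swap it with its right neighbour
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy


sentence = "the cat sat on the mat".split()
noisy_sentence = add_swap_noise(sentence, random.Random(0))
print(noisy_sentence)
```

The output contains exactly the same words as the input, only in a locally perturbed order, so the decoder has to recover the correct word order from the encoder's representation.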

===Back-Translation===

With only denoising, the system has no objective that improves the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding it with the source-language decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L2,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.
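
The three steps above can be sketched as the construction of one synthetic training pair. Everything here is a stand-in: <code>translate</code> is a hypothetical callable representing the shared encoder plus the L2 decoder in inference mode, <code>noise</code> is a hypothetical callable for <math>C(\cdot)</math>, and the toy lexicon exists only to make the example runnable.

```python
def back_translation_pair(x, translate, noise):
    """Build one (model_input, target) pair for training the L2 -> L1 direction.

    x         : sentence in L1, as a list of tokens
    translate : hypothetical stand-in for shared encoder + L2 decoder (inference only)
    noise     : hypothetical stand-in for the noise function C(.)
    """
    y = translate(noise(x))    # step 1: pseudo-translation y in L2 from C(x)
    return noise(y), list(x)   # steps 2-3: the model is trained to map C(y) back to x


# Toy stand-ins so the sketch runs end to end.
toy_lexicon = {"the": "le", "cat": "chat", "sleeps": "dort"}


def toy_translate(tokens):
    return [toy_lexicon.get(t, t) for t in tokens]


def identity_noise(tokens):
    return list(tokens)


model_input, target = back_translation_pair(
    ["the", "cat", "sleeps"], toy_translate, identity_noise
)
print(model_input, target)
```

In the real system the pair would be regenerated for every mini-batch with the current model, which is what distinguishes this on-the-fly scheme from back-translating the whole corpus once.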

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.
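
The difference between greedy decoding and beam search can be illustrated with a toy next-token model. This is a sketch under assumptions: <code>next_log_probs</code> is a hypothetical callable standing in for the decoder, the probabilities are made up, and a beam size of 2 is used instead of 12 to keep the example small.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Return (sequence, log-probability) of the best hypothesis found.

    next_log_probs: hypothetical callable, prefix tuple -> {token: log-probability}
    A beam size of 1 is equivalent to greedy decoding.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top `beam_size`; hypotheses that end in EOS are set aside.
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])


def toy_model(prefix):
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy = beam_search(toy_model, beam_size=1, max_len=3)
beam = beam_search(toy_model, beam_size=2, max_len=3)
print(greedy[0], beam[0])
```

Greedy decoding commits to "a" because it is the most likely first token, while the wider beam keeps "b" alive and finds a sequence with higher overall probability. Greedy decoding is cheaper, which is a plausible reason for using it when generating back-translations during training.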

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.
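
As an illustration of what BLEU measures, here is a simplified single-reference, sentence-level version with uniform weights over 1- to 4-grams and no smoothing. Reported scores are normally computed at the corpus level with standard tooling, so this sketch only conveys the idea; the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU in [0, 1] for one candidate and one reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped counts: a candidate n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalised.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
perfect = simple_bleu(reference, reference)
partial = simple_bleu("the cat is on a mat".split(), reference)
unrelated = simple_bleu("completely different words here".split(), reference)
print(perfect, partial, unrelated)
```

An exact match scores 1, a translation with no shared words scores 0, and a partially correct translation falls in between. BLEU scores are usually reported multiplied by 100.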

The paper trains the translation model under 3 different settings to compare performance (Table 1). All training and testing data came from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

The results show that backtranslation is essential for the proposed system to work properly. The denoising technique alone is below the baseline while big improvements appear when introducing backtranslation.

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise during evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements between 40% and 140%. Results also show that back-translation is essential. Denoising alone doesn't show a big improvement; however, it is required for back-translation, because otherwise back-translation would translate nonsensical sentences. The addition of back-translation does show a large improvement in all tested cases.

For the BPE experiment, results show it helps in some language pairs but detracts in others. This is because while BPE helped to translate some rare words, it increased the error rates for other words. It also did not perform well when translating infrequent named entities.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. Surprisingly, the semi-supervised system in row 6 outperforms the supervised system in row 7. One possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set covers corpora from all domains.

===Supervised===

This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs, plus the UN and Gigaword corpora for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments for comparison with the semi-supervised scenario.

The comparable NMT system was trained using the same proposed model, except that it does not use monolingual corpora and consequently was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT system in row 10, which suggests that the additional constraints that enable unsupervised learning also limit the potential performance. To improve these results, the authors suggest using larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations. They show that the proposed system can produce high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able to go beyond a literal word-by-word substitution and also model structural differences between the languages (e.g., it correctly translates "l’aéroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty with odd sentence structures, which shows that the proposed system has limitations. In particular, the authors point out that the proposed model has difficulty preserving some concrete details from the source sentences. The results also show that the proposed model's translation quality often lags behind that of a standard supervised NMT system, and there are some cases with both fluency and adequacy problems that severely hinder understanding of the original message, suggesting that there is still room for improvement and possible future work.

=Conclusions and Future Work=

The paper presented an unsupervised model that performs translation using only monolingual corpora, built on an attention-based encoder-decoder system and trained with denoising and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character-level information into the model, which might help address some of the adequacy issues observed in the authors' manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

As pointed out, in order to critically analyze the effect of the algorithm, we need to formulate the algorithm in terms of mathematics.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and suggests that the techniques used create a limitation and a ceiling on performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages (English and French), and the model does much worse for English-German, even though English and German are also closely related (but less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis only checks whether the proposed unsupervised NMT system generates a sensible translation. It is limited, and a more detailed analysis of the characteristics and properties of the translations generated by unsupervised NMT is needed.

* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW]) Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"

Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data

2018-11-27T04:48:46Z

H454chen: /* Critique */

=Introduction=

In highly populated cities with many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.

The motivation for this problem arises in the context of 911 calls, for example locating victims trapped in a tall building who need immediate medical attention, locating emergency personnel such as firefighters or paramedics, or handling a call from a minor on behalf of an incapacitated adult.

In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.

In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]
[[File:19floor.png|250px]]</div>

In this work, there are two major contributions. The first is that they trained an LSTM to classify whether a smartphone was inside or outside a building using GPS, RSSI, and magnetometer sensor readings. The model is compared with baseline models such as feed-forward neural networks, logistic regression, SVM, HMM, and Random Forests. The second contribution is an algorithm that uses the output of the trained LSTM to predict the change in the smartphone's barometric pressure between when it first entered the building and its current location within the building. In the final part of their algorithm, they predict the floor level by clustering the height measurements.

The model does not rely on external sensors placed inside the building, prior knowledge of the building, or user movement behaviour. The only input it looks at is the GPS and the barometric signal from the phone. Finally, they also discuss the application of this algorithm in a variety of other real-world situations.

All code and data related to this article are available here [https://github.com/williamFalcon/Predicting-floor-level-for-911-Calls-with-Neural-Networks-and-Smartphone-Sensor-Data].

=Related Work=

In general, previous work falls into two categories. The first category consists of classification methods based on the user's activity: they predict the floor from the offset in the user's movement [2], where the activities include running, walking, and moving in an elevator.
The second category relies on a barometer, which measures atmospheric pressure and can therefore capture changes in altitude.

Avinash Parnandi and his coauthors used multiple classifiers to predict the floor level [2]. The steps in their algorithmic process are:
<ol>
<li> Classifier to predict whether the user is indoors or outdoors</li>
<li> Classifier to identify the activity of the user, e.g. walking, standing still, etc. </li>
<li> Classifier to measure the displacement</li>
</ol>

One downside of this work is that achieving high accuracy requires the user's step size, so the method relies heavily on pre-training for specific users. In a real-world application, this would not be practical.

Song and his colleagues model the cause of ascent, that is, whether the ascent was a result of taking the elevator, stairs, or escalator [3]. Then, by using infrastructure support from the buildings as well as additional tuning, they are able to predict the floor level.
This method also suffers from relying on data specific to the building.

Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.

=Method=

In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].

To collect data, the authors developed an iOS application (called Sensory) that runs on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength, GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second, and each datum contained the sensor measurements mentioned earlier along with environment contexts like building floors, environment activity, city name, country name, and magnetic strength.

The data collection procedure for indoor-outdoor classifier was described as follows:
<ol>
<li> Start outside a building. </li>
<li> Turn Sensory on, set indoors to 0. </li>
<li> Start recording. </li>
<li> Walk into and out of buildings over the next n seconds. </li>
<li> As soon as we enter the building (cross the outermost door) set indoors to 1. </li>
<li> As soon as we exit, set indoors to 0. </li>
<li> Stop recording. </li>
<li> Save data as CSV for analysis. </li>
</ol>
This procedure can start either outside or inside a building without loss of generality.

The following procedure generates data used to predict a floor change from the entrance floor to the end floor:
<ol>
<li> Start outside a building. </li>
<li> Turn Sensory on, set indoors to 0. </li>
<li> Start recording. </li>
<li> Walk into and out of buildings over the next n seconds. </li>
<li> As soon as we enter the building (cross the outermost door) set indoors to 1. </li>
<li> Finally, enter a building and ascend/descend to any story. </li>
<li> Ascend through any method desired: stairs, elevator, escalator, etc. </li>
<li> Once at the floor, stop recording. </li>
<li> Save data as CSV for analysis. </li>
</ol>

Their algorithm for predicting the floor level is a 3-part process:

<ol>
<li> Classifying whether smartphone is indoor or outdoor </li>
<li> Indoor/Outdoor Transition detector</li>
<li> Estimating vertical height and resolving to absolute floor level </li>
</ol>

==1) Classifying Indoor/Outdoor ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div>

Following [5], they use 6 features which were selected through forests-of-trees feature reduction. The features are the smartphone's barometric pressure (<math>P</math>), GPS vertical accuracy (<math>GV</math>), GPS horizontal accuracy (<math>GH</math>), GPS speed (<math>S</math>), device RSSI level (<math>rssi</math>), and magnetometer total reading (<math>M</math>).

The magnetometer total reading was calculated from the given 3-dimensional reading <math>x, y, z </math>:

<div style="text-align: center;">Total Magnetic field strength <math>= M = \sqrt{x^{2} + y^{2} + z^{2}}</math></div>

They used a 3-layer LSTM where the inputs are <math> d </math> consecutive time steps. The output is <math> y = 1 </math> if the smartphone is indoors and <math> y = 0 </math> if it is outdoors.

In their design they set <math> d = 3</math> by random search [6]. The point is that they wanted the network to learn the relationship given a little bit of information from both the past and the future.

For the overall signal sequence <math> \{x_1, x_2, \ldots, x_j, \ldots, x_n\}</math>, the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, \ldots, x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math>, as noted above.
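The windowing of the signal into groups of <math> d </math> consecutive readings can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `magnet_total` and `make_windows`, the overlapping stride of 1, and the toy values are all assumptions.

```python
import numpy as np

def magnet_total(x, y, z):
    """Total magnetic field strength M = sqrt(x^2 + y^2 + z^2)."""
    return np.sqrt(x**2 + y**2 + z**2)

def make_windows(readings, d=3):
    """Split an (n, k) array of sensor readings into overlapping windows
    of d consecutive time steps, giving an array of shape (n - d + 1, d, k).
    The stride of 1 is an assumption made for this sketch."""
    readings = np.asarray(readings, dtype=float)
    n = readings.shape[0]
    if n < d:
        raise ValueError("need at least d readings")
    return np.stack([readings[i:i + d] for i in range(n - d + 1)])

# Toy signal: 5 time steps with 6 features each (P, GV, GH, S, rssi, M).
signal = np.arange(30, dtype=float).reshape(5, 6)
windows = make_windows(signal, d=3)
```

Each entry of `windows` plays the role of one <math> X_i </math> and would be paired with a single indoor/outdoor label.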

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|800px]] </div>

This is a critical part of their system and they only focus on the predictions in the subspace of being indoors.

They trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math> and the indoor state <math> \hat{y} </math> predicted by the LSTM for that example.

The cost function is shown below:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div>
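As a rough numerical illustration of this loss, the sketch below computes the standard mean binary cross entropy. It assumes the usual textbook form of the loss; the function name and the clipping constant `eps` are choices made here, not details taken from the paper.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy between labels in {0, 1} and predicted
    probabilities. Predictions are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(losses))

# Three examples: indoor, outdoor, indoor.
loss = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7])
```

Confident correct predictions give a loss near zero, while a prediction of 0.5 for every example gives a loss of ln 2.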

The final output of the LSTM is a time-series <math> T = \{t_1, t_2, \ldots, t_i, \ldots, t_n\} </math> where <math> t_i = 0 </math> if the point is outside and <math> t_i = 1 </math> if it is inside.

==2) Transition Detector ==

Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div>
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions <math>T</math> and pick a subset <math>s_i </math> such that the Jaccard distance (defined below) is <math> \geq 0.4 </math>.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div>
Jaccard Distance:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div>

After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.
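The detection step can be sketched roughly as follows. The filters and the exact score definition appear only in the figures, so the patterns `V_IN` and `V_OUT`, the (position, label) form of the Jaccard score, and the rule of keeping the best-scoring window in each run are all assumptions made for illustration, not the authors' implementation.

```python
def jaccard(a, b):
    """Jaccard index of two equal-length label sequences, treating each
    sequence as the set of its (position, label) pairs (an assumed form)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / (2 * len(a) - matches)

def find_transitions(T, pattern, threshold=0.4):
    """Slide `pattern` over T. Within each run of consecutive windows that
    score >= threshold, keep the best-scoring window and report the index
    at which the label flips inside it."""
    w = len(pattern)
    half = w // 2
    transitions = []
    best = None  # (score, flip_index) of the best window in the current run
    for start in range(len(T) - w + 1):
        score = jaccard(T[start:start + w], pattern)
        if score >= threshold:
            if best is None or score > best[0]:
                best = (score, start + half)
        elif best is not None:
            transitions.append(best[1])
            best = None
    if best is not None:
        transitions.append(best[1])
    return transitions

# Hypothetical filters: outdoor-to-indoor and indoor-to-outdoor patterns.
V_IN = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
V_OUT = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Toy predictions: outside for 8 steps, inside for 12, outside for 8.
T = [0] * 8 + [1] * 12 + [0] * 8
entries = find_transitions(T, V_IN)
exits = find_transitions(T, V_OUT)
```

Here `entries` and `exits` correspond to the transition indices <math> b_i </math> for entering and leaving a building.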

==3) Vertical height and floor level ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div>

In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.

The second parameter is <math> p_1 </math>, which is the current pressure reading. These two pressure readings are used to generate the relative change in height <math> m_\Delta</math>.

After plugging <math> p_0 </math> and <math> p_1 </math> into the barometric formula, we are left with a scalar value which represents the height displacement between the entrance and the smartphone's current location in the building [7].
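The formula itself is not reproduced in this summary. As an illustration only, the sketch below assumes the standard international barometric formula (the one behind Android's `SensorManager.getAltitude`); the constants 44330 and 5.255 come from that formula and may differ from what the authors used. The function name and the pressure values are hypothetical.

```python
def height_change_m(p0_hpa, p1_hpa):
    """Height in metres of a point with pressure p1 relative to a reference
    point with pressure p0, using the international barometric formula
    (assumed here, not confirmed by the summary)."""
    return 44330.0 * (1.0 - (p1_hpa / p0_hpa) ** (1.0 / 5.255))

# Reference pressure at the entrance and a slightly lower current pressure.
m_delta = height_change_m(1013.25, 1012.05)
```

A drop of roughly 1.2 hPa near sea level corresponds to a climb of about 10 metres, which is why barometer readings are sensitive enough to separate floors.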

In order to resolve to an absolute floor level, they use the index number of the clusters of <math> m_\Delta</math>'s. As seen above, <math> 5.1 </math> falls in the third cluster, implying floor number 3.
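One simple way to turn a set of height offsets into floor indices is to cluster the sorted values by gaps. The sketch below is an assumption about how such a step could look, not the authors' clustering procedure; the `gap` threshold, the nearest-mean lookup, and the sample heights are all hypothetical.

```python
def cluster_heights(heights, gap=1.5):
    """Group 1-D height offsets: after sorting, a new cluster starts whenever
    the jump to the next value exceeds `gap` metres (an assumed threshold)."""
    clusters = []
    for h in sorted(heights):
        if clusters and h - clusters[-1][-1] <= gap:
            clusters[-1].append(h)
        else:
            clusters.append([h])
    return clusters

def floor_index(m_delta, clusters):
    """1-based index of the cluster whose mean is closest to m_delta."""
    means = [sum(c) / len(c) for c in clusters]
    nearest = min(range(len(means)), key=lambda i: abs(means[i] - m_delta))
    return nearest + 1

# Hypothetical height offsets (metres) observed in one building.
observed = [0.2, 2.8, 5.0, 0.1, 3.0, 5.3, 2.9]
clusters = cluster_heights(observed)
floor = floor_index(5.1, clusters)
```

With these toy values the offsets fall into three clusters, and an offset of 5.1 resolves to the third one.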

=Experiments and Results=

==Dataset==

In this paper, an iOS app called Sensory was developed and used to collect data on an iPhone 6s. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.

As soon as the user enters or exits a building, the indoor-outdoor label has to be entered manually. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning was being used, the actual floor level was recorded manually for validation purposes only.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div>

All of these classifiers were trained and validated on a total of 5082 data points, split into 80% training and 20% validation.
The LSTM was trained for 24 epochs with a batch size of 128, using the Adam optimizer with a learning rate of 0.006.
Although the baselines performed considerably well, the objective here was to show that an LSTM could be used in the future to model the entire system.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|650px]] </div>

The above chart shows the success that their system is able to achieve in the floor level prediction.

The performance was measured in terms of how many floors were travelled rather than the absolute floor number, because different buildings might number their floors differently. They used different m values in 2 tests: one applied the same m value across all buildings, and the other applied building-specific m values. The results showed that using building-specific m values hugely increased the accuracy.

=Future Work=
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate the floor level predictions solely from sensor reading data.

=Critique=

In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which had not been done before. Previous work relied heavily on pre-training and on prior information about the building or users. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit of their system is that it does not need any additional infrastructure support in advance, making it a practical solution for deployment.

A weakness is that the claimed 100% accuracy holds only if the floor-to-ceiling height is known, and their accuracy relies on this key piece of information. Otherwise, when a single generic height is used for all buildings, their accuracy drops by 35 percentage points, to 65%. Also, the article's ideas are sometimes out of order and are repeated in cycles.

It is also not clear that the LSTM is the best approach, especially since a simple feedforward network achieved the same accuracy in their experiments.

They also go against their claim stated at the beginning of the paper, where they say their method "..does not require the use of beacons, prior knowledge of the building infrastructure...", since in their clustering step they are in a way using prior knowledge from previous visits [4].

The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned or explored in the paper, but a person on a balcony (e.g. one attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm is implemented; otherwise, the algorithm may provide misleading data to rescue crews.

Overall, this paper is not very novel, as the authors don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out-of-the-box solutions. There is no clear intuition for why the proposed method works well. This application could be solved using simpler methods like having an emergency push button on each floor. Moreover, the authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.

The proposed model could introduce privacy risks such as illegal surveillance of mobile phone users and private facilities.

=References=

[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.

[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.

[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018

[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).

[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.

[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India

Annotating Object Instances with a Polygon RNN

2018-11-27T04:32:50Z

H454chen: /* Critique */

Summary of the CVPR '17 best [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf ''paper'']

A presentation video of the paper is available here [https://www.youtube.com/watch?v=S1UUR4FlJ84].

= Background =

If a snapshot of an image is given to a human, how will he/she describe a scene? He/she might identify that there is a car parked near the curb, or that the car is parked right beside a street light. This ability to decompose objects in scenes into separate entities is key to understanding what is around us and it helps to reason about the behavior of objects in the scene.

Automating this process is a classic computer vision problem and is often termed "object detection". There are four distinct levels of detection (refer to Figure 1 for a visual cue):

1. Classification + Localization: This is the most basic method that detects whether '''an''' object is either present or absent in the image and then identifies the position of the object within the image in the form of a bounding box overlaid on the image.

2. Object Detection: The classic definition of object detection points to the detection and localization of '''multiple''' objects of interest in the image. The output of the detection is still a bounding box overlaid on the image at the position corresponding to the location of the objects in the image.

3. Semantic Segmentation: This is a pixel level approach, i.e., each pixel in the image is assigned to a category label. Here, there is no difference between instances; this is to say that there are objects present from three distinct categories in the image, without tracking or reporting the number of appearances of each instance within a category.

4. Instance Segmentation (''This paper performs this''): The goal is not only to assign pixel-level categorical labels, but also to identify each entity separately as sheep 1, sheep 2, sheep 3, grass, and so on.

[[File:Figure_1.jpeg | 450px|thumb|center|Figure 1: Different levels of detection in an image.]]

== Motivation ==

Semantic segmentation helps us achieve a deeper understanding of images than image classification or object detection. Over and above this, instance segmentation is crucial in applications where multiple objects of the same category are to be tracked, especially in autonomous driving, mobile robotics, and medical image processing. This paper deals with a novel method to tackle the instance segmentation problem pertaining specifically to the field of autonomous driving, but shown to generalize well in other fields such as medical image processing.
A polygon is a natural form of annotation. Current instance segmentation annotations made by humans use polygons because a polygon represents an object with a small number of vertices instead of many pixels and makes it easy to incorporate user modifications.

[[File:polygon.png|600px|center]]

== Goal ==

Most of the recent approaches to instance segmentation are based on deep neural networks and have demonstrated impressive performance. Given that these approaches require a lot of computational resources and that their performance depends on the amount of accessible training data, there has been an increase in the demand to label/annotate large-scale datasets. This is both expensive and time-consuming.

{| class=wikitable width=700 align=center
|Thus, the '''main goal''' of the paper is to enable '''semi-automatic''' annotation of object instances.
|}

Figure 2 shows what the interface looks like.

Most of the datasets available pass through a stage where annotators manually outline the objects with a closed polygon. Polygons allow annotation of objects with a small number of clicks (30 - 40) compared to other methods. This approach works as the silhouette of an object is typically connected without holes.

{| class=wikitable width=900 align=center
|Thus, the authors suggest adopting this same technique to annotate images using polygons, except they plan to automate the method and replace/reduce manual labeling. The '''intuition''' behind the success of this method is the '''sparse''' nature of these polygons, which allows an object to be annotated through a cluster of pixels rather than through classification at the pixel level.
|}

[[File:Annotating Object Instances Example.png | 450px|thumb|center|Figure 2: Given a bounding box, a polygon outlining the object instance inside the box is predicted. This approach is designed to facilitate annotation, and easily incorporates user corrections of points to improve the overall object’s polygon. ]]

= Related Works =

Some of the techniques used in semi-automatic annotation are as follows:

1. '''GrabCut''': In general, GrabCut is a method to separate the foreground and background of an image with minimal user interaction. Specifically, the user need only create a rectangular bounding box containing the foreground, and the algorithm will extract the object in the foreground. A major contribution of the paper is that labelling (of the object in the foreground) was not required, as the algorithm was able to identify where significant changes in colour pattern occurred. In this sense, it mimics automatic segmentation when combined with a Region Proposal Network.

[[File:GrabCut_Example.png | 450px|thumb|center|Figure 3: Illustration of GrabCut.]]

2. '''GrabCut + CNN''': Scribbles have also been used to train CNNs for semantic image segmentation.

3. '''Superpixels''': Superpixels in the form of small polygons where the color intensity within each superpixel is similar, to a certain threshold, have been used to provide a sparse representation of the large number of pixels in an image. However, the performance of this technique depends on the scale of the superpixels and hence sometimes merges small objects.

[[File:Superpixel_idea.jpg | 450px|thumb|center|Figure 4: Illustration of the superpixel idea.]]

= Model =

As an '''input''' to the model, an annotator or perhaps another neural network provides a bounding box containing an object of interest and the model auto-generates a polygon outlining the object instance using a Recurrent Neural Network which they call: Polygon-RNN.

The RNN model predicts the vertices of the polygon at each time step given a CNN representation of the image, the vertices from the last two time steps, and the first vertex location. The first vertex is predicted differently, as described below. The information regarding the previous two time steps helps the RNN create a polygon in a specific direction, and the first vertex provides a cue for loop closure of the polygon edges.

The polygon is parametrized as a sequence of 2D vertices and it is assumed that the polygon is closed. In addition, the polygon generation is fixed to follow a clockwise orientation, since there are multiple ways to create a polygon given that it is a cyclic structure. However, the starting point of the sequence can be any of the vertices of the polygon.

== Architecture ==

There are two primary networks at play: 1. CNN with skip connections, and 2. One-to-many type RNN.

[[File:Figure_2_Neel.JPG | 800px|thumb|center|Figure 5: Model architecture for Polygon-RNN depicting a CNN with skip connections feeding into a 2 layer ConvLSTM (One-to-many type) ('''Note''': A possible point of confusion - the authors have only shown the layers of VGG16 architecture here that have the skip connections introduced).]]

1. '''CNN with skip connections''':

The authors have adopted the VGG16 feature extractor architecture with a few modifications pertaining to the preservation of features fused together in a tensor that can feed into the RNN (refer to Figure 5). Namely, the last max-pooling layer (''pool5'') present in the VGG16 CNN has been removed. The image fed into the CNN is pre-shrunk to a 224x224x3 tensor (3 being the Red, Green, and Blue channels). The image passes through 2 pooling layers and 2 convolutional layers. Since the features extracted after each operation are to be preserved and fused later on, the idea is to have, at each of these four steps, a tensor with a common width of 512; so the output tensor at pool2 is convolved with 4 3x3x128 filters and the output tensor at pool3 is convolved with 2 3x3x256 filters. The skip connections from the four layers allow the CNN to extract low-level edge and corner features (which help to follow the object's boundaries) as well as boundary/semantic information about the instances (which helps to identify the object). Finally, a 3x3 convolution applied along with a ReLU non-linearity results in a 28x28x128 tensor that contains semantic information pertinent to the image frame and is taken as an input by the RNN.

2. '''RNN - 2 Layer ConvLSTM'''

The RNN is employed to capture information about the previous vertices in the time-series. Specifically, a Convolutional LSTM is used as a decoder. The ConvLSTM preserves the 2D spatial information received from the CNN and reduces the number of parameters compared to a fully connected RNN. The polygon is modeled with a kernel size of 3x3 and 16 channels, outputting a vertex at each time step. At time step t, the ConvLSTM takes as input a tensor that concatenates 4 features: the CNN feature representation of the image, the one-hot encodings of the previous predicted vertex and of the vertex predicted two time steps ago, and the one-hot encoding of the first predicted vertex.

The Convolutional LSTM computes the hidden state <math display = "inline">h_t</math> given the input <math display = "inline">x_t</math> based on the following equations:
<center>
<math display="block">
\begin{pmatrix}
i_t \\
f_t \\
o_t \\
g_t \\
\end{pmatrix}
= W_h * h_{t-1} + W_x * x_t + b
</math>

<math display="block">
c_t = \sigma(f_t) \bigodot c_{t-1} + \sigma(i_t) \bigodot \tanh(g_t)
</math>

<math display="block">
h_t = \sigma(o_t) \bigodot \tanh(c_t)
</math>
</center>
where <math display = "inline">i, f, o</math> denote the input, forget, and output gate, <math display = "inline">h</math> is the hidden state and <math display = "inline">c</math> is the cell state. Also, <math display = "inline">\sigma</math> denotes the sigmoid function, <math display = "inline">\bigodot</math> indicates an element-wise product and <math display = "inline">*</math> a convolution. <math display = "inline">W_h</math> denotes the hidden-to-state convolution kernel and <math display = "inline">W_x</math> the input-to-state convolution kernel.
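The three equations above can be traced in a small NumPy sketch of a single ConvLSTM step. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the zero padding, the toy channel counts, and the random weights are all hypothetical, and the convolution is implemented as the cross-correlation commonly used in deep learning libraries.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w):
    """Zero-padded 'same' 2-D cross-correlation.
    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k -> (C_out, H, W)."""
    k = w.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    patches = sliding_window_view(xp, (k, k), axis=(1, 2))  # (C_in, H, W, k, k)
    return np.einsum("chwij,ocij->ohw", patches, w)

def convlstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One ConvLSTM step following the equations above.
    W_x: (4*C, C_in, k, k), W_h: (4*C, C, k, k), b: (4*C,)."""
    gates = conv2d_same(h_prev, W_h) + conv2d_same(x_t, W_x) + b[:, None, None]
    i_t, f_t, o_t, g_t = np.split(gates, 4, axis=0)
    c_t = sigmoid(f_t) * c_prev + sigmoid(i_t) * np.tanh(g_t)
    h_t = sigmoid(o_t) * np.tanh(c_t)
    return h_t, c_t

# Toy sizes (assumed): 3 input channels, 16 hidden channels, 28x28 grid.
rng = np.random.default_rng(0)
C_in, C, H, W, k = 3, 16, 28, 28, 3
x_t = rng.normal(size=(C_in, H, W))
h_prev = np.zeros((C, H, W))
c_prev = np.zeros((C, H, W))
W_x = 0.1 * rng.normal(size=(4 * C, C_in, k, k))
W_h = 0.1 * rng.normal(size=(4 * C, C, k, k))
b = np.zeros(4 * C)
h_t, c_t = convlstm_step(x_t, h_prev, c_prev, W_x, W_h, b)
```

Because the gates are computed with convolutions, the hidden and cell states keep the 2D grid layout, which is the property the text attributes to the ConvLSTM.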

The authors treat vertex prediction as a classification task: the location of each vertex is given by a one-hot representation of dimension DxD + 1 (D chosen to be 28 by the authors in tests). The one additional dimension is the end-of-sequence cue that signals loop closure for the polygon. Because the one-hot representations of the two previously predicted vertices and of the first vertex are taken as input, a clockwise direction (or, for that matter, any fixed direction) can be enforced for the creation of the polygon. Coming back to the prediction of the first vertex: since a polygon is a closed cycle, any vertex of the polygon can be used as a starting point. The authors therefore treat the starting point as special, and this is done through a further modification of the CNN that adds two DxD layers, with one branch predicting object instance boundaries while the other takes in this output as well as the image features to predict the vertices of the polygon. Boundary and vertex prediction are each treated as a binary classification problem in each cell of the output grid. This CNN is trained separately. Here, <math display = "inline">y_t</math> denotes the one-hot encoding of the vertex and is the output at time step t.
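The DxD + 1 one-hot representation can be illustrated with a short sketch. The row-major layout of the grid and the function names are assumptions made here for illustration; the summary does not specify how grid cells are ordered.

```python
import numpy as np

D = 28  # grid resolution used in the authors' tests

def encode_vertex(row, col, d=D):
    """One-hot vector of length d*d + 1 for a vertex at grid cell (row, col),
    using an assumed row-major cell ordering."""
    v = np.zeros(d * d + 1)
    v[row * d + col] = 1.0
    return v

def encode_end_of_polygon(d=D):
    """One-hot vector whose extra last entry marks loop closure."""
    v = np.zeros(d * d + 1)
    v[d * d] = 1.0
    return v

def decode(one_hot, d=D):
    """Return (row, col) for a vertex, or None for the end-of-polygon token."""
    idx = int(np.argmax(one_hot))
    if idx == d * d:
        return None
    return divmod(idx, d)

vertex = encode_vertex(3, 5)
```

With D = 28 each vector has 785 entries: 784 grid cells plus the single end-of-polygon class.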

== Training ==

The training of the model is done as follows:

1. Cross-entropy is used for the RNN loss function. To avoid over-penalizing mispredictions, non-zero probability mass is assigned to locations which are within a distance of 2 in the D × D output grid.

2. Instead of Stochastic Gradient Descent, Adam is used for optimization: batch size = 8, learning rate = 1e-4 (the learning rate decays after 10 epochs by a factor of 10).

3. For the first vertex prediction, the modified CNN mentioned previously is trained using a multi-task cost function.
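The target smoothing in step 1 can be sketched as below. The summary does not say how the probability mass is distributed or which distance is meant, so this sketch assumes a uniform spread over cells within chessboard distance 2 and ignores the extra end-of-polygon class; treat it as one possible reading rather than the authors' scheme.

```python
import numpy as np

def smoothed_target(row, col, d=28, radius=2):
    """Target distribution over the d x d grid that spreads probability mass
    uniformly over cells within `radius` of the ground-truth cell
    (chessboard distance and uniform weights are assumptions)."""
    rows, cols = np.indices((d, d))
    near = np.maximum(np.abs(rows - row), np.abs(cols - col)) <= radius
    target = near.astype(float)
    return target / target.sum()

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return float(-np.sum(target * np.log(probs + eps)))

target = smoothed_target(10, 10)
uniform = np.full((28, 28), 1.0 / (28 * 28))
loss = cross_entropy(target, uniform)
```

A prediction that lands one or two cells away from the ground truth then still overlaps with the target mass, so it is penalized less than a prediction far from the true vertex.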

The reported time for training is one day on a Nvidia Titan-X GPU.

The resolution of the polygon is 28 x 28, based on the downsampling factor and ConvLSTM resolution. They simplified the polygons by removing vertices that lie on a line and duplicate vertices that fall in the same grid cell. For data augmentation, they also randomly flipped images, enlarged the original bounding boxes, and randomly selected the starting vertex of the polygon annotation.
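A polygon simplification of this kind might look like the sketch below. It is a hypothetical reading of the step, assuming integer grid coordinates, removal of consecutive duplicates, and removal of vertices that are collinear with their two neighbours; the function name and example polygon are made up for illustration.

```python
def simplify_polygon(vertices):
    """Simplify a closed polygon given as a list of (x, y) grid positions.
    Step 1 drops consecutive vertices that fall in the same grid cell.
    Step 2 drops vertices that are collinear with their two neighbours."""
    deduped = []
    for v in vertices:
        if not deduped or v != deduped[-1]:
            deduped.append(v)
    # The polygon is cyclic, so the last vertex may duplicate the first.
    if len(deduped) > 1 and deduped[0] == deduped[-1]:
        deduped.pop()
    n = len(deduped)
    if n < 3:
        return deduped
    kept = []
    for i, (x, y) in enumerate(deduped):
        px, py = deduped[i - 1]
        nx, ny = deduped[(i + 1) % n]
        # A zero cross product means previous, current and next are collinear.
        cross = (x - px) * (ny - y) - (y - py) * (nx - x)
        if cross != 0:
            kept.append((x, y))
    return kept

# A square with a repeated corner and two redundant edge midpoints.
square = [(0, 0), (2, 0), (4, 0), (4, 0), (4, 4), (0, 4), (0, 2)]
simplified = simplify_polygon(square)
```

The simplified square keeps only its four corners, which shortens the target sequence the RNN has to predict.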

== Importance of Human Annotator in the Loop ==

The model allows for the prediction at a given time step to be corrected and this corrected vertex is then fed into the next time step of the RNN, effectively rejecting the network predicted vertex. This has the simple effect of putting the model "back on the right track". Note that this is only possible due to the adoption of the RNN architecture i.e. the inherent nature of the RNN to accept previous outputs allows incorporation of the user's judgement. The typical inference time as quoted by the paper is 250ms per object.

= Results =

== Evaluation Metrics ==

The evaluation of the model performance was conducted based on the Cityscapes and KITTI Datasets. There are two metrics used for evaluation:

1. '''IoU''': The standard Intersection over Union (IoU) measure is used for comparison. The calculation for IoU takes both the predicted and ground-truth object boundaries. The intersection (area contained in both boundaries at once) is divided by the union (the area contained by at least one of the boundaries). A low score on this metric means that there is little overlap between the boundaries, or large areas of non-overlap, and a score of 1.0 indicates that the two boundaries contain the same area.

2. '''Number of Clicks''': To evaluate the speed-up factor, the chessboard distance is used to measure the distance between the ground truth (GT) and the output of the Polygon RNN. A set of distance thresholds <math display = "inline">T \in \{1,2,3,4\}</math> is used; if the distance exceeds the particular threshold, a correction is made by an annotator to match the GT, and the '''Number of Clicks''' is used to evaluate the speed-up factor.
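Both metrics can be illustrated with a short sketch. The IoU function works on binary masks, and the click count uses the chessboard (Chebyshev) distance between vertices. The pairwise vertex comparison in `count_corrections` is a simplification assumed here: it does not re-run the model after each correction as a real annotation session would.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as identical (a convention)
    return float(np.logical_and(a, b).sum() / union)

def chessboard_distance(p, q):
    """Chebyshev distance between two grid points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def count_corrections(predicted, ground_truth, threshold):
    """Number of vertices whose prediction lies farther than `threshold`
    from the matching ground-truth vertex (simplified pairwise comparison)."""
    return sum(
        1 for p, g in zip(predicted, ground_truth)
        if chessboard_distance(p, g) > threshold
    )

# Two 4x4 masks, each covering two rows, overlapping in one row.
a = np.zeros((4, 4), dtype=bool)
b = np.zeros((4, 4), dtype=bool)
a[0:2, :] = True
b[1:3, :] = True
score = iou(a, b)
clicks = count_corrections([(0, 0), (5, 5)], [(0, 1), (9, 5)], threshold=2)
```

In this toy case the masks share 4 of 12 covered pixels, and only the second vertex is far enough from the ground truth to need a click.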

== Baseline Techniques ==

1. '''SharpMask''': a 50 layer ResNet considered as the state of the art annotation method.

2. '''DeepMask''': a build-up on the 50 layer ResNet with an addition of another CNN.

3. '''Dilation10''': another simple technique using purely convolutional operations.

4. '''SquareBox''': a simple technique where an entire bounding box is labeled as an object

== Quantitative Results ==

The IoU metric is reported in Table 1. The Polygon RNN method outperforms the baselines in 6 out of the 8 categories and has a mean IoU greater than all of the baselines. In particular, in the car, person, and rider categories, it achieves 12%, 7%, and 6% higher performance than SharpMask.

[[File:Table_1_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]

In addition, with the help of the annotator, the speedup factor was 7.3 times with under 5 clicks which the authors claim is the main advantage of this method.

[[File:Table_0_Neel.JPG | 800px|thumb|center|Table 2: IoU performance on Cityscapes data with annotator intervention.]]

The method also works well with other datasets such as KITTI:

[[File:Table_2_Neel.JPG | 800px|thumb|center|Table 3: IoU performance on KITTI data.]]

== Effect of object size ==
Fig. 4 shows how the model performs relative to the baselines on different instance sizes. For small instances, the model performs significantly better than the baselines. For larger objects, the baselines have an advantage due to their larger output resolution.

[[File:IoU_vs_size_of_instance.PNG | 500px|thumb|center|Fig 4: IoU_vs_size_of_instance.]]

== Qualitative Results ==

Most of the comparisons with human annotators show that the method is on par with human-level annotation.

<gallery widths=500px heights=500px perrow=2 mode="packed">
File:Figure_3_Neel.JPG|Figure 6: Qualitative results: comparison with human annotator.|alt=alt language
File:Figure_4_Neel.JPG|Figure 7: Qualitative results: comparison with human annotator.|alt=alt language
</gallery>

=Conclusion=

The important conclusions from this paper are:

1. The paper presented a powerful generic annotation tool for modelling complex annotations as a simple polygon that works on different unseen datasets.

2. Significant improvement in annotation time can be achieved with the Polygon-RNN method itself (speed-up factor of 4.74).

3. However, the flexibility of having inputs from a human annotator helps increase the IoU for a certain range of clicks.

4. The model architecture has a down-sampling factor of 16, and the final output resolution and accuracy are sensitive to object size.

5. Another downside of the model architecture is that training time is increased due to the training of the CNN for the first vertex.

=Critique=

1. With the human annotator in the loop, the model speeds up the process of annotation by over 7 times which is perhaps a big cost and time cutting improvement for companies.

2. Given that this model uses the VGG16 architecture compared to the 50 layer ResNet in SharpMask, this method is quite efficient.

3. This paper requires training of an entire CNN for the first vertex and is inefficient in that sense as it introduces additional parameters adding to the computation time and resource demand.

4. The baseline methods have the upper hand over this model when it comes to larger objects, because of the down-scaled structure adopted by this model.

5. In terms of future work, elimination of the additional CNN for the first vertex as well as an enhanced architecture to remain insensitive to the size of the object to be annotated should be implemented.

6. Compared to other models, the model was shown to not perform as well for larger objects (see table 3). This is likely due to the fact that vertex location determination is done in a highly compressed (28x28) representation compared to the input image (224x224). For larger objects, bounding boxes are larger, so each vertex represents many pixels. When up-converted back to the input image/bounding box size, this may lead to errors, especially since a very precise evaluation metric (intersection over union) is used. Potentially, the results can be improved by considering a higher resolution for the internal representation or one that scales with the size of the bounding box.

7. While the model outperforms the baseline for certain categories of object, it is surprising that it underperforms in categories such as 'bus' and 'train'. With human annotators in the loop, one would expect the model to outperform in all categories.
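
The resolution argument in item 6 can be made concrete with a small calculation. The sketch below assumes a square bounding box that is mapped onto the 28x28 vertex grid, so that one grid cell covers <code>box_side_px / 28</code> pixels of the original image; the function name is hypothetical.

```python
def cell_size_in_pixels(box_side_px, grid_side=28):
    """Side length, in original-image pixels, of one cell of the vertex grid."""
    return box_side_px / grid_side


# Assumed example box sizes (in pixels) for a small, medium and large object.
sizes = {side: cell_size_in_pixels(side) for side in (56, 224, 896)}
# sizes == {56: 2.0, 224: 8.0, 896: 32.0}
```

Under these assumptions a vertex on a 56-pixel object is localized to within a 2-pixel cell, while on an 896-pixel object the same grid cell spans 32 pixels, which is consistent with the lower IoU observed for large instances.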

=Code=
# [https://github.com/AlexMa011/pytorch-polygon-rnn] (unofficial)
# Code for an updated version of the model is available at [https://github.com/fidler-lab/polyrnn-pp] (official)

Zero-Shot Visual Imitation

2018-11-27T03:48:58Z

H454chen: /* Learning the Goal-Conditioned Skill Policy (GSP) */

This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach the desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time, which the agent will form into observation-action pairs and then distill into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable, especially given that each new task needs a new set of demonstrations for the robot to learn from. In this paper, an alternative paradigm is pursued wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss.
Videos, models, and more details are available at [https://pathak22.github.io/zeroshot-imitation/].

===Paper Overview===
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images, instead of observation-action pairs. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.

[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Multi-step GSP with previous action history; (c) Multi-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Multi-step GSP with forward consistency loss proposed in this work.]]

This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'', and is learned in a self-supervised way by re-labeling states that the agent visited as goals and the actions the agent took as prediction targets. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.

A major challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'', which essentially says that reaching the goal is more important than how it is reached. First, a forward model that predicts the next observation from the given action and current observation is learned. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss does not inadvertently penalize actions that are ''consistent'' with the ground-truth action, even though the actions are not exactly the same (but lead to the same next state).

As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.

Of course, when introducing something like the forward-consistent loss, the number of steps needed to reach a certain goal becomes an issue, since different goals require different numbers of steps. To address this, the paper jointly optimizes the GSP with a goal recognizer that determines whether the goal has been satisfied with respect to some metric. Figure 1 shows various GSPs, with diagram (d) showing the forward-consistent loss proposed in this paper.

The paper refers to this method as zero-shot, as the agent never has access to expert actions regardless of being in the training or task demonstration phase. This is different from one-shot imitation learning, where agents have full knowledge of actions and expert demos during the training phase. The agent learns to imitate instead of learning by imitation. The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation, and a series of navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.

===Related Work===
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.

Imitation Learning: The two main threads are behavioral cloning and inverse reinforcement learning. Recent work in imitation learning requires access to expert actions, whereas the method in this paper does not.

Visual Demonstration: Several papers have focused on relaxing this supervision to visual observations alone, and end-to-end learning has improved results.

Forward/Inverse Dynamics and Consistency: Forward dynamics models have previously been learned for planning actions, but without a consistency objective between the forward and inverse dynamics.

Goal Conditioning: In this paper, systems work from high-dimensional visual inputs instead of knowledge of the true states and do not use a task reward during training.

==Learning to Imitate Without Expert Supervision==

In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described.

Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment. This exploration data is used to learn the GSP policy.

<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div>

The learned GSP policy (<math display="inline">π</math>) takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs a sequence of actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> starting from the current observation <math display="inline">x_i</math>. The states (observations) <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and need not be consecutive. Given the start and stop states, the number of actions <math display="inline">K</math> is also known. <math display="inline">π</math> can be thought of as a deep network with parameters <math display="inline">θ_π</math>.
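
The re-labeling of exploration data into training samples can be sketched as below. This is an illustrative assumption about how pairs <math display="inline">(x_i, x_g)</math> could be enumerated from one trajectory, not the authors' data pipeline; the function name <code>relabel_trajectory</code> and the cap <code>max_steps</code> on <math display="inline">K</math> are hypothetical.

```python
def relabel_trajectory(observations, actions, max_steps):
    """Turn one exploration trajectory into (x_i, x_g, action sequence) samples.

    observations: [x_1, ..., x_T]; actions: [a_1, ..., a_{T-1}], where a_t
    takes the agent from x_t to x_{t+1}. Every later observation within
    max_steps of x_i is treated as a goal, so 1 <= K <= max_steps.
    """
    if len(actions) != len(observations) - 1:
        raise ValueError("expected one action between each pair of observations")
    samples = []
    for i in range(len(observations) - 1):
        last_goal = min(i + max_steps, len(observations) - 1)
        for g in range(i + 1, last_goal + 1):
            samples.append((observations[i], observations[g], actions[i:g]))
    return samples


samples = relabel_trajectory(["x1", "x2", "x3", "x4"], ["a1", "a2", "a3"], max_steps=2)
# samples[1] == ("x1", "x3", ["a1", "a2"]); len(samples) == 5
```

Because the start and goal indices are known, the length of each target action sequence (the <math display="inline">K</math> above) comes for free from the slice.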

At test time, the expert demonstrates a task from which the agent captures a sequence of observations. This set of images is denoted by <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math>. The sequence needs to have at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses its learned policy to start from initial state <math display="inline">x_0</math> and generate actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to follow the observations in <math display="inline">D</math>.

The agent does not have access to the sequence of actions performed by the expert. Hence, it must use the observations to determine if it has reached the goal. A separate ''goal recognizer'' network is needed to ascertain if the current observation is close to the current goal or not. This is because multiple actions might be required to reach close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and if not, the agent executes <math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. Then after reaching sufficiently close to <math display="inline">x_1^d</math>, the agent sets <math display="inline">x_2^d</math> as the goal and executes actions. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached.
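
The test-time procedure described above can be sketched as a simple loop. The callables <code>policy</code>, <code>goal_reached</code> and <code>step</code> are hypothetical stand-ins for the learned GSP, the goal recognizer and the environment, and the per-goal step limit is an assumed parameter; the toy 1D world at the end only exists to make the sketch runnable.

```python
def imitate(policy, goal_reached, step, x0, demo, max_steps_per_goal=10):
    """Follow a demonstration that is given only as goal observations.

    policy(x, goal) -> action, goal_reached(x, goal) -> bool and
    step(x, action) -> next observation are hypothetical callables.
    Returns the final observation and, per goal, whether it was reached.
    """
    x = x0
    reached = []
    for goal in demo:
        for _ in range(max_steps_per_goal):
            if goal_reached(x, goal):
                break
            x = step(x, policy(x, goal))
        reached.append(goal_reached(x, goal))
    return x, reached


# Toy 1D world (assumed): observations are integers, actions are +1 or -1.
final_x, reached = imitate(
    policy=lambda x, goal: 1 if goal > x else -1,
    goal_reached=lambda x, goal: x == goal,
    step=lambda x, action: x + action,
    x0=0,
    demo=[3, 1, 4],
)
# final_x == 4 and reached == [True, True, True]
```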

===Learning the Goal-Conditioned Skill Policy (GSP)===

In this section, the one-step version of the GSP policy is described first. It is then extended to the multi-step version.

A one-step trajectory can be described as <math display="inline">(x_t, a_t, x_{t+1})</math>. Given <math display="inline">(x_t, x_{t+1})</math> the GSP policy estimates an action, <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math>. During training, cross-entropy loss is used to learn GSP parameters <math display="inline">θ_π</math>:

<div style="text-align: center;"><math>L(a_t, \hat{a}_t) = -p(a_t|x_t, x_{t+1}) \log( \hat{a}_t)</math></div>

<math display="inline">a_t</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted actions respectively. The conditional distribution <math display="inline">p</math> is not readily available so it needs to be empirically approximated using the data. In a standard deep learning problem it is common to assume <math display="inline">p</math> as a delta function at <math display="inline">a_t</math>; given a specific input, the network outputs a single output. However, in this problem multiple actions can lead to the same output. Multiple outputs given a single input can be modeled using a variational auto-encoder. However, the authors use a different approach, explained in Sections 2.2-2.4 of the paper and summarized in the following sections.
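
To illustrate why a delta-function target is problematic here, the sketch below evaluates the standard cross-entropy (with its usual negative sign) for a toy policy over three discrete actions. The scenario, in which actions 0 and 1 are assumed to lead to the same next observation, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy -sum_a p(a) log q(a), skipping zero-probability targets."""
    return -sum(
        p * math.log(q) for p, q in zip(target_probs, predicted_probs) if p > 0
    )


predicted = softmax([2.0, 2.0, 0.0])  # policy splits its mass over actions 0 and 1
delta_target = [1.0, 0.0, 0.0]        # only the recorded action a_t = 0 counts
loss = cross_entropy(delta_target, predicted)
```

Even if action 1 would have produced exactly the same next observation, the probability mass placed on it is penalized under the delta target, which is the multi-modality issue the forward consistency loss is meant to address.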

===Forward Consistency Loss===

To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (the observation from executing the action predicted by GSP <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients (for actions that result in the same next observation) and aid the learning process. This is what is denoted as ''forward consistency loss''.

To operationalize the forward consistency loss, we need a differentiable "forward dynamics" model that can reliably predict the result of an action. The forward dynamics <math display="inline">f</math> are learned from the data by another model. Given an observation and the action performed, <math display="inline">f</math> predicts the next observation, <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is learned rather than analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math>, so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters <math display="inline">θ_f</math> are learned by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground truth transitions while the second term ensures consistency with the GSP network. In summary, the loss function is given below:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f}{\min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div>
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div>
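
The three terms of this objective can be evaluated on a toy transition as in the sketch below. It is only meant to show how the terms combine, under loud assumptions: the forward model is a hand-written lookup table rather than a learned network, <math display="inline">\hat{a}_t</math> is taken as the most likely action, and no gradients are computed. All names are hypothetical.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def one_step_objective(x_t, a_t, x_next, action_probs, forward_model, lam):
    """Return (dynamics, consistency, action, total) for one transition.

    forward_model(x, a) is a hypothetical stand-in for f(x, a; theta_f).
    action_probs is the GSP's predicted distribution over discrete actions.
    Assumption: a_hat is the argmax action; this only evaluates the
    objective and does not show how gradients reach the policy.
    """
    a_hat = max(range(len(action_probs)), key=lambda a: action_probs[a])
    dynamics = squared_distance(x_next, forward_model(x_t, a_t))
    consistency = squared_distance(x_next, forward_model(x_t, a_hat))
    action = -math.log(action_probs[a_t])
    return dynamics, consistency, action, dynamics + lam * consistency + action


# Toy world (assumed): actions 0 and 1 have the same effect, action 2 differs.
EFFECTS = {0: (1.0, 0.0), 1: (1.0, 0.0), 2: (-1.0, 0.0)}


def toy_forward_model(x, a):
    return tuple(xi + ei for xi, ei in zip(x, EFFECTS[a]))


same_effect = one_step_objective(
    (0.0, 0.0), 0, (1.0, 0.0), [0.2, 0.7, 0.1], toy_forward_model, 0.1
)
different_effect = one_step_objective(
    (0.0, 0.0), 0, (1.0, 0.0), [0.2, 0.1, 0.7], toy_forward_model, 0.1
)
```

In the first call the policy prefers action 1, which differs from the recorded action 0 but reaches the same next state, so the consistency term is zero. In the second call it prefers action 2, which ends up elsewhere, and the consistency term becomes positive.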

Past works have shown that learning forward dynamics in the feature space as opposed to raw observation space is more robust. This paper incorporates this by making the GSP predict in terms of the feature representations <math>\phi(x_t), \phi(x_{t+1})</math> rather than the raw observations.

Learning the two models <math>θ_π,θ_f</math> simultaneously from scratch can cause noisier gradient updates. This is addressed by pre-training the forward model with the first term and GSP separately by blocking gradient flow. Fine-tuning is then done with <math>θ_π,θ_f</math> jointly.

The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{\min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div>

<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div>

The forward consistency loss is computed at each time step, t, and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network that takes as inputs the current state, the goal state, the action at the previous time step and the internal hidden representation <math> h_{t-1}</math>, and outputs the actions to take.

===Goal Recognizer===

The goal recognizer network was introduced to figure out if the current goal is reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition was taken as a binary classification problem; given an observation and the goal, is the observation close to the goal or not. Additionally, a maximum number of iterations is also used to prevent the sequence of actions from getting too long.

The goal recognizer was trained on data from the agent's random exploration. Pseudo-goal states were sampled from the visited states, and all observations within a few timesteps of these were considered as positive results (close to the goal). The goal classifier was trained using the standard cross-entropy loss.
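
The construction of training labels for the goal recognizer can be sketched as follows. The window size and the choice to label every other observation in the trajectory as a negative are assumptions made for this illustration; the function name is hypothetical.

```python
def goal_recognizer_pairs(trajectory, goal_index, window):
    """Label each observation in a trajectory relative to one pseudo-goal.

    Returns (observation, goal, label) triples. The label is 1 when the
    observation was visited within `window` timesteps of the pseudo-goal
    and 0 otherwise (treating the rest as negatives is an assumption).
    """
    goal = trajectory[goal_index]
    return [
        (obs, goal, 1 if abs(t - goal_index) <= window else 0)
        for t, obs in enumerate(trajectory)
    ]


pairs = goal_recognizer_pairs(["x0", "x1", "x2", "x3", "x4", "x5"], goal_index=3, window=1)
labels = [label for _, _, label in pairs]
# labels == [0, 0, 1, 1, 1, 0]
```

The resulting triples are in the form a binary classifier trained with cross-entropy would consume.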

The authors found that training a separate goal recognition network outperformed simply adding a 'stop' action to the action space of the policy network.

===Ablations and Baselines===

To summarize, the GSP formulation is composed of (a) recurrent variable-length skill policy network, (b) explicitly encoding the previous action in the recurrence, (c) goal recognizer, (d) forward consistency loss function, and (e) learning forward dynamics in the feature space instead of raw observation space.

To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:

# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM.
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstrations in the rope manipulation setup.
# '''GSP-NoPrevAction-NoFwdConst''' refers to the paper's recurrent GSP without previous action history and without the forward consistency loss.
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without the forward consistency objective.
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
# '''GSP''' refers to the complete method with all the components.

==Experiments==

The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.

===Rope Manipulation===

Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.

In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration, the robot picked up the rope at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper.

For this experiment, the Baxter robot is set up exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. In testing, the robot was only provided with images of intermediate states of the rope, and not the actions taken by the human trainer. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.

The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, the direction of displacement and the magnitude of displacement. All models were optimized using Adam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.

The approach of this paper is compared to (Nair et al., 2017), who ran similar experiments using an inverse model. The results in Figure 3 show that for knot-tying, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.

[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The robotics system setup. (b) The sequence of human demonstration images provided by the human during inference for the task of knot-tying (top row), and the sequences of observation states reached by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’ shape. Our agent is able to successfully imitate the demonstration.]]

[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]

===Navigation in Indoor Office Environments===
In this experiment, the robot was shown a single image or multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided in the appendix of the paper). The robot was then placed on an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.

The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with a learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.

Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fares much better than all previous baselines.

[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image (top-left). Since the initial and goal image has no overlap, the robot first explores the environment by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42 onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and such exploratory behavior naturally emerged from the self-supervised learning.]]

[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image of goal in an unseen environment. Each column represents a different run of our system for a different initial/goal image pair. The full GSP model takes longer to reach the goal on average given a successful run but reaches the goal successfully at a much higher rate.]]

Figure 5 and Table 2 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. This was required when the end goal was far away from the robot, such as in another room. Notably, zero-shot visual imitation is robust to a changing environment in which not every frame matches the demonstrated frame. This is achieved by providing sparse landmarks.

[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of images (top row). The TurtleBot is positioned in a manner such that the first image in the demonstration has no overlap with its current observation. Even under this condition, the robot is able to move closer to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from WayPoint-1.]]

[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three runs of two different demonstrations. Results show that our method outperforms the baselines. Note that 3 more trials of the loop demonstration were tested under significantly different lighting conditions and neither model succeeded. Detailed results are available in the supplementary materials.]]

===3D Navigation in VizDoom===

To round off the experiments, a VizDoom simulation environment was used to test the GSP. VizDoom is a popular Doom-based reinforcement learning testbed. It allows agents to play the Doom game using only the screen buffer. It is a 3D simulation environment that is traditionally considered to be harder than 2D domains like Atari. The goal was to measure the robustness of each method with proper error bars, the role of initial self-supervised data collection and the quantitative difference in modeling forward consistency loss in feature space in comparison to raw visual space.

Data were collected using two methods: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.

Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.

[[File:8-Table3.png | 550px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]

==Discussion==

This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.

A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending to third-person demonstrations may yield a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen.

The expert demonstrations are also purely imitated; that is, the agent does not learn from the demonstrations. Future work could look into learning from the demonstrations so as to enrich its exploration techniques.

This work used a sequence of images to provide a demonstration but the work, in general, does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work. Future work would also explore how multiple tasks can be combined into a single model, where different tasks might come from different contexts. Finally, it would be exciting to explore explicit handling of domain shift in future work, so as to handle large differences in embodiment and learn skills directly from videos of human demonstrators obtained, for example, from the Internet.

==Critique==
1. The paper is well written and easy to understand. In addition, the experimental evaluations are promising. The proposed method is also novel and interesting, and could be used as an alternative to pure RL.

2. The authors do not clearly explain why zero-shot imitation should be used instead of a trained reinforcement learning model. More details on this issue are needed.

3. It is notable that the experimental evaluations were carried out on real robots. However, the scalability of the method is not demonstrated: it is unclear how to extend it to higher-dimensional action spaces and whether doing so would be expensive.

==References==

[1] D.Pathak, P.Mahmoudieh, G.Luo, P.Agrawal, D.Chen, Y.Shentu, E.Shelhamer, J.Malik, A.A.Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.

[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning
from demonstration. Robotics and autonomous systems, 2009.

[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood
Cliffs, NJ, 1977.

[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke
by poking: Experiential learning of intuitive physics. NIPS, 2016.

[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination
for robotic grasping with large-scale data collection. In ISER, 2016.

[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and
700 robot hours. ICRA, 2016.

[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.
ICRA, 2017.

[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration
by self-supervised prediction. In ICML, 2017.

Zero-Shot Visual Imitation

2018-11-27T03:48:01Z

H454chen: /* Learning to Imitate Without Expert Supervision */

This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach the desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time, which the agent forms into observation-action pairs and then distills into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable, especially because every new task needs a new set of demonstrations for the robot to learn from. In this paper, an alternative paradigm is pursued wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss.
Videos, models, and more details are available at [https://pathak22.github.io/zeroshot-imitation/].

===Paper Overview===
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images, instead of observation-action pairs. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.

[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Multi-step GSP with previous action history; (c) Multi-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Multi-step GSP with forward consistency loss proposed in this work.]]

This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'', and is learned in a self-supervised way by re-labeling states that the agent visited as goals and the actions the agent took as prediction targets. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.
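The re-labeling step described above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name <code>relabel_trajectory</code>, the <code>max_steps</code> cap, and the string placeholders for observations and actions are assumptions introduced here.

```python
import random

def relabel_trajectory(observations, actions, max_steps, rng):
    """Sample one (start, goal, action-sequence) training example.

    observations: [x_1, ..., x_T]
    actions:      [a_1, ..., a_{T-1}], where a_t moves the agent from x_t to x_{t+1}
    """
    assert len(observations) == len(actions) + 1
    i = rng.randrange(len(actions))                        # index of the start observation
    k = rng.randint(1, min(max_steps, len(actions) - i))   # number of steps K to the goal
    start, goal = observations[i], observations[i + k]     # a visited state re-labeled as goal
    targets = actions[i:i + k]                             # actions actually taken: prediction targets
    return start, goal, targets

rng = random.Random(0)
obs = ["x1", "x2", "x3", "x4", "x5"]
acts = ["a1", "a2", "a3", "a4"]
start, goal, targets = relabel_trajectory(obs, acts, max_steps=3, rng=rng)
```

Because the goal is a state the agent really reached, the action sequence between the two observations is known without any expert labels.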

A major challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'', which essentially says that reaching the goal is more important than how it is reached. First, a forward model that predicts the next observation from the given action and current observation is learned. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss does not inadvertently penalize actions that are ''consistent'' with the ground-truth action, even though the actions are not exactly the same (but lead to the same next state).

As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.
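The same intuition can be reproduced in a few lines of code. The sketch below is a toy under stated assumptions: the world is a ring of four cells, the forward model is known exactly rather than learned (in the paper it is learned from data), and all function names are hypothetical.

```python
def forward_model(x, a, n_cells=4):
    """Toy, exactly known dynamics: move `a` cells around a ring of `n_cells`."""
    return (x + a) % n_cells

def action_loss(a_true, a_pred):
    """0/1 penalty for not matching the demonstrated action."""
    return 0.0 if a_true == a_pred else 1.0

def forward_consistency_loss(x, x_next, a_pred):
    """Squared distance between the true next state and the outcome of a_pred."""
    return float((x_next - forward_model(x, a_pred)) ** 2)

x_t, a_true = 1, 2
x_next = forward_model(x_t, a_true)   # ground-truth next state

a_other_way = -2   # a different action that reaches the same cell
a_wrong = 1        # an action that ends in a different cell

same_outcome = (action_loss(a_true, a_other_way),
                forward_consistency_loss(x_t, x_next, a_other_way))
diff_outcome = (action_loss(a_true, a_wrong),
                forward_consistency_loss(x_t, x_next, a_wrong))
```

Here the action-matching loss penalizes <code>a_other_way</code> even though it reaches the same state, while the forward-consistency term is zero for it and positive only for <code>a_wrong</code>.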

Of course, when introducing something like forward-consistent loss, issues related to the number of steps needed to reach a certain goal become of interest since different goals require different numbers of steps. To address this, the paper pairs the GSP with a goal recognizer (as an optimizer) to determine whether the goal has been satisfied with respect to some metric. Figure 1 shows various GSPs, with diagram (d) showing the forward-consistent loss proposed in this paper.

The paper refers to this method as zero-shot, as the agent never has access to expert actions regardless of being in the training or task demonstration phase. This is different from one-shot imitation learning, where agents have full knowledge of actions and expert demos during the training phase. The agent learns to imitate instead of learning by imitation. The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation, and a series of navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.

===Related Work===
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.

Imitation Learning: The two main threads are behavioral cloning and inverse reinforcement learning. Recent work in imitation learning requires access to expert actions, whereas the method in this paper does not.

Visual Demonstration: Several papers have focused on relaxing this supervision to visual observations alone, with end-to-end learning improving results.

Forward/Inverse Dynamics and Consistency: Forward dynamics models have been learned for planning actions, but prior work does not optimize for consistency between the forward and inverse dynamics.

Goal Conditioning: In this paper, systems work from high-dimensional visual inputs instead of knowledge of the true states and do not use a task reward during training.

==Learning to Imitate Without Expert Supervision==

In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described.

Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment. This exploration data is used to learn the GSP policy.

<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div>

The learned GSP policy (<math display="inline">π</math>) takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs a sequence of actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> starting from the current observation <math display="inline">x_i</math>. The states (observations) <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and need not be consecutive. Given the start and stop states, the number of actions <math display="inline">K</math> is also known. <math display="inline">π</math> can be thought of as a deep network with parameters <math display="inline">θ_π</math>.

At test time, the expert demonstrates a task from which the agent captures a sequence of observations. This set of images is denoted by <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math>. The sequence needs to have at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses its learned policy to start from initial state <math display="inline">x_0</math> and generate actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to follow the observations in <math display="inline">D</math>.

The agent does not have access to the sequence of actions performed by the expert. Hence, it must use the observations to determine if it has reached the goal. A separate ''goal recognizer'' network is needed to ascertain if the current observation is close to the current goal or not. This is because multiple actions might be required to reach close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and if not, the agent executes
<math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. Then after reaching sufficiently close to <math display="inline">x_1^d</math>, the agent sets <math display="inline">x_2^d</math> as the goal and executes actions. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached.
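The test-time procedure above amounts to a nested loop over demonstration images and policy steps. The following is a minimal sketch under loud assumptions: observations are integers on a line, and <code>policy</code>, <code>step</code>, <code>is_goal_reached</code>, and the <code>max_steps_per_goal</code> cap are hypothetical stand-ins for the learned GSP, the environment, and the goal recognizer.

```python
def follow_demonstration(x0, demo, policy, step, is_goal_reached,
                         max_steps_per_goal=20):
    """Follow the demonstration images one at a time, re-planning after each action."""
    x = x0
    visited = [x]
    for goal in demo:                          # x_1^d, x_2^d, ..., x_N^d
        for _ in range(max_steps_per_goal):    # cap on steps per sub-goal
            if is_goal_reached(x, goal):       # goal recognizer decides when to switch
                break
            a = policy(x, goal)                # a = pi(x, goal)
            x = step(x, a)                     # execute and observe the new state
            visited.append(x)
    return x, visited

# Toy stand-ins: move one cell toward the goal on an integer line.
def policy(x, goal):
    return 1 if goal > x else -1

def step(x, a):
    return x + a

def is_goal_reached(x, goal):
    return x == goal

final, visited = follow_demonstration(0, [3, 1, 5], policy, step, is_goal_reached)
```

Only observations are exchanged between the demonstration and the agent; the expert's actions never appear in the loop.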

===Learning the Goal-Conditioned Skill Policy (GSP)===

In this section, the one-step version of the GSP policy is described first. It is then extended to the multi-step version.

A one-step trajectory can be described as <math display="inline">(x_t, a_t, x_{t+1})</math>. Given <math display="inline">(x_t, x_{t+1})</math> the GSP policy estimates an action, <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math>. During training, cross-entropy loss is used to learn GSP parameters <math display="inline">θ_π</math>:

<div style="text-align: center;"><math>L(a_t, \hat{a}_t) = p(a_t|x_t, x_{t+1}) \log( \hat{a}_t)</math></div>

<math display="inline">a_t</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted actions respectively. The conditional distribution <math display="inline">p</math> is not readily available so it needs to be empirically approximated using the data. In a standard deep learning problem it is common to assume <math display="inline">p</math> as a delta function at <math display="inline">a_t</math>; given a specific input, the network outputs a single output. However, in this problem multiple actions can lead to the same output. Multiple outputs given a single input can be modeled using a variational auto-encoder. However, the authors use a different approach, explained in sections 2.2-2.4 of the paper and summarized in the following sections.

===Forward Consistency Loss===

To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (the observation from executing the action predicted by GSP <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients (for actions that result in the same next observation) and aid the learning process. This is what is denoted as ''forward consistency loss''.

To operationalize the forward consistency loss, we need a differentiable "forward dynamics" model that can reliably predict results of an action. The forward dynamics <math display="inline">f</math> are learned from the data by another model. Given an observation and the action performed, <math display="inline">f</math> predicts the next observation, <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math> so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters of <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground truth transitions while the second term ensures consistency with the GSP network. In summary, the loss function is given below:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f}{\min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div>
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div>
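To make the three terms of this objective concrete, the sketch below evaluates them for a single transition. It is a toy under stated assumptions, not the paper's implementation: the forward model is a hypothetical linear one whose "parameters" are per-action displacement vectors, the GSP output is a probability vector over discrete actions, and the action term is written as the negative log-likelihood of the ground-truth action (the usual cross-entropy form with a one-hot target).

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def forward(x, action_probs, displacements):
    """Toy forward model f: state plus the expected displacement under action_probs."""
    return [xi + sum(p * d[i] for p, d in zip(action_probs, displacements))
            for i, xi in enumerate(x)]

def gsp_objective(x_t, x_next, a_true, a_pred_probs, displacements, lam):
    one_hot = [1.0 if j == a_true else 0.0 for j in range(len(a_pred_probs))]
    x_tilde = forward(x_t, one_hot, displacements)        # f(x_t, a_t)
    x_hat = forward(x_t, a_pred_probs, displacements)     # f(x_t, a_hat_t)
    forward_term = sq_dist(x_next, x_tilde)               # forward model explains the data
    consistency_term = sq_dist(x_next, x_hat)             # predicted action reaches x_{t+1}
    action_term = -math.log(a_pred_probs[a_true])         # cross-entropy, one-hot target
    return forward_term + lam * consistency_term + action_term

displacements = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical forward-model parameters
loss_uncertain = gsp_objective([0.0, 0.0], [1.0, 0.0], 0, [0.5, 0.5], displacements, lam=0.1)
loss_confident = gsp_objective([0.0, 0.0], [1.0, 0.0], 0, [1.0, 0.0], displacements, lam=0.1)
```

In a real implementation both networks would be differentiable and the gradient of the consistency term would flow through the forward model into the GSP parameters.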

Past works have shown that learning forward dynamics in the feature space as opposed to raw observation space is more robust. This paper incorporates this by making the GSP predict feature representations denoted <math>\phi(x_t), \phi(x_{t+1})</math> rather than working in the input space.

Learning the two models <math>θ_π,θ_f</math> simultaneously from scratch can cause noisier gradient updates. This is addressed by pre-training the forward model with the first term and GSP separately by blocking gradient flow. Fine-tuning is then done with <math>θ_π,θ_f</math> jointly.

The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{\min} \sum_{t=i}^{T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div>

<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div>

The forward consistency loss is computed at each time step, t, and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network that takes as input the current state, the goal state, the action at the previous time step, and the internal hidden representation <math> h_{t-1}</math>, and outputs the actions to take.

===Goal Recognizer===

The goal recognizer network was introduced to figure out if the current goal is reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition was taken as a binary classification problem; given an observation and the goal, is the observation close to the goal or not. Additionally, a maximum number of iterations is also used to prevent the sequence of actions from getting too long.

The goal recognizer was trained on data from the agent's random exploration. Pseudo-goal states were sampled from the visited states, and all observations within a few timesteps of these were considered as positive results (close to the goal). The goal classifier was trained using the standard cross-entropy loss.
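The construction of the goal recognizer's training set can be sketched as below. This is a hedged illustration: the function name, the <code>window</code> parameter, and the choice to treat every observation outside the window as a negative example are assumptions made here, not details confirmed by the summary.

```python
import random

def make_goal_recognizer_examples(trajectory, window, n_goals, rng):
    """Build (observation, goal, label) triples from one exploration trajectory.

    label is 1 when the observation lies within `window` timesteps of the
    sampled pseudo-goal, and 0 otherwise (assumed negative).
    """
    examples = []
    for _ in range(n_goals):
        g = rng.randrange(len(trajectory))          # sample a visited state as pseudo-goal
        for t, obs in enumerate(trajectory):
            label = 1 if abs(t - g) <= window else 0
            examples.append((obs, trajectory[g], label))
    return examples

rng = random.Random(0)
trajectory = list(range(10))    # placeholder observations; index equals timestep
examples = make_goal_recognizer_examples(trajectory, window=2, n_goals=3, rng=rng)
```

A binary classifier trained on such triples with cross-entropy loss then plays the role of the goal recognizer at test time.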

The authors found that training a separate goal recognition network outperformed simply adding a 'stop' action to the action space of the policy network.

===Ablations and Baselines===

To summarize, the GSP formulation is composed of (a) recurrent variable-length skill policy network, (b) explicitly encoding the previous action in the recurrence, (c) goal recognizer, (d) forward consistency loss function, and (e) learning forward dynamics in the feature space instead of raw observation space.

To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:

# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM.
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup.
# '''GSP-NoPrevAction-NoFwdConst''' refers to the paper's recurrent GSP without previous action history and without the forward consistency loss.
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without the forward consistency objective.
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
# '''GSP''' refers to the complete method with all the components.

==Experiments==

The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.

===Rope Manipulation===

Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.

In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration, the robot picked up the rope at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper.

For this experiment, the Baxter robot is set up exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. In testing, the robot was only provided with images of intermediate states of the rope, and not the actions taken by the human trainer. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.

The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, the direction of displacement and the magnitude of displacement. All models were optimized using Adam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.

The approach of this paper is compared to (Nair et al., 2017) where they did similar experiments using an inverse model. The results in Figure 3 show that for the 'S' shape construction, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.

[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The
robotics system setup. (b) The sequence of human demonstration images provided by the human
during inference for the task of knot-tying (top row), and the sequences of observation states reached
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’
shape. Our agent is able to successfully imitate the demonstration.]]

[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]

===Navigation in Indoor Office Environments===
In this experiment, the robot was shown a single image or multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided in the appendix of the paper). The robot was then placed on an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.

The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with a learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.

Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fares much better than all previous baselines.

[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image
(top-left). Since the initial and goal image has no overlap, the robot first explores the environment
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and
such exploratory behavior naturally emerged from the self-supervised learning.]]

[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image
of goal in an unseen environment. Each column represents a different run of our system for a
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given
a successful run but reaches the goal successfully at a much higher rate.]]

Figure 5 and Table 2 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. This was required when the end goal was far away from the robot, such as in another room. It is worth noting that zero-shot visual imitation is robust to a changing environment where every frame need not match the demonstrated frame. This is achieved by providing sparse landmarks.

[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of
images (top row). The TurtleBot is positioned in a manner such that the first image in the demonstration
has no overlap with its current observation. Even under this condition, the robot is able to move closer
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from
WayPoint-1.]]

[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three
runs of two different demonstrations. Results show that our method outperforms the baselines. Note
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions
and neither model succeeded. Detailed results are available in the supplementary materials.]]

===3D Navigation in VizDoom===

To round off the experiments, a VizDoom simulation environment was used to test the GSP. VizDoom is a popular Doom-based reinforcement learning testbed. It allows agents to play the Doom game using only a screen buffer. It is a 3D simulation environment that is traditionally considered to be harder than 2D domains like Atari. The goals were to measure the robustness of each method with proper error bars, the role of initial self-supervised data collection, and the quantitative difference between modeling the forward consistency loss in feature space and in raw visual space.

Data were collected using two methods: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.

Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.

[[File:8-Table3.png | 550px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]

==Discussion==

This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.

A further limitation is that the method requires first-person view demonstrations. Extending it to third-person demonstrations may yield a more general framework. Also, the current framework assumes that the visual observations of the expert and the agent are similar. When the expert performs a demonstration in one setting, such as daylight, and the agent performs the task in another, such as the evening, results may worsen.

The expert demonstrations are also purely imitated; that is, the agent does not learn the demonstrations. Future work could look into learning from the demonstrations so as to enrich the agent's exploration techniques.

This work used a sequence of images to provide a demonstration but the work, in general, does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work. Future work would also explore how multiple tasks can be combined into a single model, where different tasks might come from different contexts. Finally, it would be exciting to explore explicit handling of domain shift in future work, so as to handle large differences in embodiment and learn skills directly from videos of human demonstrators obtained, for example, from the Internet.

==Critique==
1. The paper is well written and easy to understand. In addition, the experimental evaluations are promising. The proposed method is also novel and interesting, and could be used as an alternative to pure RL.

2. The authors do not clearly explain why zero-shot imitation should be used instead of a trained reinforcement learning model, so more detail on this point is needed.

3. It is notable that the experimental evaluations are carried out on real robots. However, the scalability of the method is not demonstrated: it is unclear how to extend it to higher-dimensional action spaces and whether doing so would be expensive.

==References==

[1] D.Pathak, P.Mahmoudieh, G.Luo, P.Agrawal, D.Chen, Y.Shentu, E.Shelhamer, J.Malik, A.A.Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.

[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 2009.

[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood Cliffs, NJ, 1977.

[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. NIPS, 2016.

[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In ISER, 2016.

[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. ICRA, 2016.

[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. ICRA, 2017.

[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.

Wasserstein Auto-encoders

2018-11-23T06:41:42Z

H454chen: /* Future Work */

The first version of this work was published in 2017, and this version (the third revision) was presented at ICLR 2018. Source code for the first version is available [https://github.com/tolstikhin/wae here].

=Introduction=
Early successes in the field of representation learning were based on supervised approaches, which used large labeled datasets to achieve impressive results. On the other hand, popular unsupervised generative modeling methods mainly consisted of probabilistic approaches focusing on low dimensional data. In recent years, models have been proposed which try to combine these two approaches. One such popular method is the variational auto-encoder (VAE). VAEs are theoretically elegant but have the major drawback of generating blurry samples when used for modeling natural images. In comparison, generative adversarial networks (GANs) produce much sharper samples but have their own problems: they lack an encoder, they are harder to train, and they suffer from "mode collapse", i.e., the inability of the model to capture all the variability in the true data distribution. There has been a lot of recent activity around finding and evaluating GAN architectures and combining VAEs and GANs, but a model which combines the best of both is yet to be discovered.

The work done in this paper builds upon the theoretical work done in [4]. The authors tackle generative modeling using optimal transport (OT). The OT cost is a measure of distance between probability distributions. One beneficial feature of the OT cost is that it induces a much weaker topology than other costs, including the f-divergences associated with the original GAN algorithms.
This particular feature is crucial in applications where the data is usually supported on low dimensional manifolds in the input space. This results in a problem with the stronger notions of distances such as f-divergences as they often max out and provide no useful gradients for training. In comparison, the OT cost has been claimed to behave much more nicely [5, 8]. Despite the preceding claim, the implementation, which is similar to GANs, still requires the addition of a constraint or a regularization term into the objective function.

==Original Contributions==
Let <math>P_X</math> be the true but unknown data distribution, <math>P_G</math> be the latent variable model specified by the prior distribution <math>P_Z</math> of latent codes <math>Z \in \mathcal{Z}</math> and the generative model <math>P_G(X|Z)</math> of the data points <math>X \in \mathcal{X}</math> given <math>Z</math>. The goal in this paper is to minimize the OT cost <math>W_c(P_X, P_G)</math>.

The main contributions are given below:

* A new class of auto-encoders called Wasserstein Auto-Encoders (WAE). WAEs minimize the optimal transport cost <math>W_c(P_X, P_G)</math> for any cost function <math>c</math>. As is the case with VAEs, the WAE objective function is made up of two terms: the c-reconstruction cost and a regularizer term <math>\mathcal{D}_Z(P_Z, Q_Z)</math> which penalizes the discrepancy between two distributions in <math>\mathcal{Z}</math>: <math>P_Z</math> and <math>Q_Z</math>. <math>Q_Z</math> is the distribution of encoded points, i.e. <math>Q_Z := \mathbb{E}_{P_X}[Q(Z|X)]</math>. Note that when <math>c</math> is the squared cost and the regularizer term is the GAN objective, WAE is equivalent to the adversarial auto-encoders described in [2].

* Experimental results of using WAE on MNIST and CelebA datasets with squared cost <math>c(x, y) = ||x - y||_2^2</math>. The results of these experiments show that WAEs have the good features of VAEs such as stable training, encoder-decoder architecture, and a nice latent manifold structure while simultaneously improving the quality of the generated samples.

* Two different regularizers. One is based on GANs and adversarial training in the latent space <math>\mathcal{Z}</math>. The other is based on the maximum mean discrepancy (MMD), which is known to perform well when matching high dimensional standard normal distributions. The second regularizer also makes the problem a fully adversary-free min-min optimization problem.

* The final contribution is the mathematical analysis used to derive the WAE objective function. In particular, the analysis shows that in the case of generative models, the primal form of <math>W_c(P_X, P_G)</math> is equivalent to a problem involving the optimization of a probabilistic encoder <math>Q(Z|X)</math>.

=Proposed Method=
The method proposed by the authors uses a novel auto-encoder architecture to minimize the optimal transport cost <math>W_c(P_X, P_G)</math>. In the optimization problem that follows, the decoder tries to accurately reconstruct the data points as measured by the cost function <math>c</math>. The encoder tries to achieve the following two conflicting goals at the same time: (1) try to match the distribution of the encoded data points <math>Q_Z := \mathbb{E}_{P_X}[Q(Z|X)]</math> to the prior distribution <math>P_Z</math> as measured by the divergence <math>\mathcal{D}_Z(P_Z, Q_Z)</math> and, (2) make sure that the latent space vectors encoded contain enough information so that the reconstruction of the data points are of high quality. The figure below illustrates this:

[[File:ka2khan_figure_1.png|800px|thumb|center|Figure 1]]

Figure 1: Both VAE and WAE have objectives composed of two terms: the reconstruction cost and a regularizer which penalizes the divergence between <math>P_Z</math> and <math>Q_Z</math>. VAE forces <math>Q(Z|X = x)</math> to match <math>P_Z</math> for all the different training examples <math>x</math> drawn from <math>P_X</math>. As shown in the figure above, every red ball, representing <math>Q(Z|X = x)</math>, is forced to match <math>P_Z</math>, depicted as the whitish triangles. This causes the red balls to intersect, which results in reconstruction problems. On the other hand, WAE forces the mixture <math>Q_Z := \int{Q(Z|X)\ dP_X}</math> to match <math>P_Z</math>. This gives the encoded latent codes a better chance of staying apart from each other, and as a consequence higher reconstruction quality is achieved.

==Preliminaries and Notations==
The authors use calligraphic letters to denote sets (for example, <math>\mathcal{X}</math>), capital letters for random variables (for example, <math>X</math>), and lower case letters for their values (for example, <math>x</math>). Probability distributions are also denoted with capital letters (for example, <math>P(X)</math>) and the corresponding densities are denoted with lowercase letters (for example, <math>p(x)</math>).

Several measures of difference between probability distributions are also used by the authors. These include f-divergences given by <math>D_f(p_X||p_G) := \int{f\left(\frac{p_X(x)}{p_G(x)}\right)p_G(x)}dx</math>, where <math>f:(0, \infty) \to \mathcal{R}</math> is any convex function satisfying <math>f(1) = 0</math>. Classical examples include the KL divergence (<math>D_{KL}</math>) and the Jensen-Shannon divergence (<math>D_{JS}</math>).
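
As a small illustration of the f-divergence definition, the following sketch evaluates it for discrete distributions, where the integral becomes a sum. The function names and the example distributions are illustrative assumptions and do not come from the paper. Choosing <math>f(t) = t \log t</math> recovers the KL divergence, and a second generator recovers the Jensen-Shannon divergence.

```python
import numpy as np

def f_divergence(p, q, f):
    """Discrete analogue of D_f(p || q) = sum_x f(p(x) / q(x)) * q(x).

    Assumes p and q are strictly positive and each sums to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(f(p / q) * q))

def f_kl(t):
    # f(t) = t log t gives the KL divergence; f(1) = 0 as required.
    return t * np.log(t)

def f_js(t):
    # Generator of the Jensen-Shannon divergence; it also satisfies f(1) = 0.
    return 0.5 * (t * np.log(t) - (t + 1.0) * np.log((t + 1.0) / 2.0))

# Two hypothetical distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = f_divergence(p, q, f_kl)
js = f_divergence(p, q, f_js)
```

Both values are zero when the two distributions coincide, and the Jensen-Shannon value never exceeds <math>\log 2</math>.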

==Optimal Transport and its Dual Formulations==

A rich class of distances between probability distributions is motivated by the optimal transport problem. One formulation of this problem is Kantorovich's formulation, given by:

<math>
W_c(P_X, P_G) := \underset{\Gamma \in \mathcal{P}(X \sim P_X ,Y \sim P_G)}{\inf} \mathbb{E}_{(X,Y) \sim \Gamma}[c(X,Y)],
\quad \text{where} \ c(x, y): \mathcal{X} \times \mathcal{X} \to \mathcal{R}_{+}
</math>

is any measurable cost function and <math>\mathcal{P}(X \sim P_X, Y \sim P_G)</math> is the set of all joint distributions of <math>(X, Y)</math> with marginals <math>P_X</math> and <math>P_G</math>, respectively.

A particularly interesting case is when <math>(\mathcal{X}, d)</math> is a metric space and <math>c(x, y) = d^p(x, y)</math> for <math>p \geq 1</math>. In this case <math>W_p</math>, the <math>p</math>-th root of <math>W_c</math>, is called the p-Wasserstein distance.

When <math>c(x, y) = d(x, y)</math> the following Kantorovich-Rubinstein duality holds:

<math>W_1(P_X, P_G) = \underset{f \in \mathcal{F}_L}{\sup}\ \mathbb{E}_{X \sim P_X}[f(X)] - \mathbb{E}_{Y \sim P_G}[f(Y)],</math>
where <math>\mathcal{F}_L</math> is the class of all bounded 1-Lipschitz functions on <math>(\mathcal{X}, d)</math>.
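
To make the p-Wasserstein distance concrete, the sketch below computes it for two one-dimensional empirical distributions. It relies on a standard fact that is not specific to this paper: on the real line with <math>d(x, y) = |x - y|</math> and two equally sized, equally weighted samples, the optimal coupling matches the sorted samples to each other. The function name and the data are illustrative assumptions.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two 1-D empirical distributions.

    Assumes x and y hold the same number of equally weighted samples; in that
    case the optimal coupling pairs the sorted samples with each other.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch needs equally sized samples")
    # W_p is the p-th root of the transport cost with c(x, y) = |x - y|^p.
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([0.0, 1.0, 3.0])
y = np.array([5.0, 4.0, 6.0])   # the order of the samples does not matter

w1 = wasserstein_1d(x, y, p=1)
w2 = wasserstein_1d(x, y, p=2)
```

For these samples the sorted differences are 4, 4 and 3, so <math>W_1 = 11/3</math> and <math>W_2 = \sqrt{41/3}</math>. Shifting a sample by a constant moves it by exactly that amount in Wasserstein distance, which is the well-behaved property that f-divergences lack for distributions with disjoint supports.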

==Application to Generative Models: Wasserstein auto-encoders==
The intuition behind modern generative models like VAEs and GANs is that they try to minimize specific distance measures between the data distribution <math>P_X</math> and the model <math>P_G</math>. Unfortunately, with the current knowledge and tools, it is usually really hard or even impossible to calculate most of the standard discrepancy measures especially when <math>P_X</math> is not known and <math>P_G</math> is parametrized by deep neural networks. Having said that, there are certain tricks available which can be employed to get around that difficulty.

For KL-divergence <math>D_{KL}(P_X, P_G)</math> minimization, or equivalently the marginal log-likelihood <math>\mathbb{E}_{P_X}[\log p_G(X)]</math> maximization, one can use the variational lower bound, which provides a theoretically grounded framework and has been used successfully by VAEs. In the general case of minimizing an f-divergence <math>D_f(P_X, P_G)</math>, one can use its dual formulation together with adversarial training, as in f-GANs. Finally, the OT cost <math>W_c(P_X, P_G)</math> can be minimized by using the Kantorovich-Rubinstein duality expressed as an adversarial objective. The Wasserstein GAN implements this idea.

In this paper, the authors focus on latent variable models <math>P_G</math> given by a two step procedure. First, a code <math>Z</math> is sampled from a fixed distribution <math>P_Z</math> on a latent space <math>\mathcal{Z}</math>. Second, <math>Z</math> is mapped to the image <math>X \in \mathcal{X} = \mathcal{R}^d</math> with a (possibly random) transformation. This gives a density of the form

<math>
p_G(x) := \int\limits_{\mathcal{Z}}{p_G(x|z)p_z(z)}dz,\ \forall x \in \mathcal{X},
</math>

provided all the probabilities involved are properly defined. In order to keep things simple, the authors focus on non-random decoders, i.e., the generative models <math>P_G(X|Z)</math> deterministically map <math>Z</math> to <math>X = G(Z)</math> using a fixed map <math>G: \mathcal{Z} \to \mathcal{X}</math>. Similar results hold for random decoders, as shown by the authors in Appendix B.1.

Working under the model defined in the preceding paragraph, the authors find that the OT cost takes a much simpler form because the transportation plan factors through the map <math>G</math>. Instead of finding a coupling <math>\Gamma</math> between two random variables in the <math>\mathcal{X}</math> space, one distributed according to <math>P_X</math> and the other according to <math>P_G</math>, it is enough to find a conditional distribution <math>Q(Z|X)</math> such that its <math>Z</math> marginal, <math>Q_Z(Z) := \mathbb{E}_{X \sim P_X}[Q(Z|X)]</math>, is the same as the prior distribution <math>P_Z</math>. This is formalized by the theorem below, which was proven by the authors in [4].

'''Theorem 1.''' For <math>P_G</math> defined as above with deterministic <math>P_G(X|Z)</math> and any function <math>G:\mathcal{Z} \to \mathcal{X}</math>

<math>
\underset{\Gamma \in \mathcal{P}(X \sim P_X ,Y \sim P_G)}{\inf} \mathbb{E}_{(X,Y) \sim \Gamma}[c(X,Y)] = \underset{Q: Q_Z = P_Z}{\inf} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))]
</math>

where <math>Q_Z</math> is the marginal distribution of <math>Z</math> when <math>X \sim P_X</math> and <math>Z \sim Q(Z|X)</math>.

According to the authors, the result above allows optimization over random encoders <math>Q(Z|X)</math> instead of optimizing over all couplings of <math>X</math> and <math>Y</math>. Both problems are still constrained. To find a numerical solution, the authors relax the constraint on <math>Q_Z</math> by adding a regularizer term to the objective. This gives the WAE objective:

<math>
D_{WAE}(P_X, P_G) := \underset{Q(Z|X) \in \mathcal{Q}}{\inf} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))] + \lambda \cdot \mathcal{D}_Z(Q_Z, P_Z)
</math>

where <math>\mathcal{Q}</math> is any nonparametric set of probabilistic encoders, <math>\mathcal{D}_Z</math> is an arbitrary measure of distance between <math>Q_Z</math> and <math>P_Z</math>, and <math>\lambda > 0</math> is a hyperparameter. As is the case with the VAEs, the
authors propose using deep neural networks to parameterize both encoders <math>Q</math> and decoders <math>G</math>. Note that, unlike VAEs, WAE allows for non-random encoders deterministically mapping their inputs to their latent codes.

The authors propose two different regularizers <math>\mathcal{D}_Z(Q_Z, P_Z)</math>, described in the two subsections that follow.

===GAN-based <math>\mathcal{D}_z</math>===
One option is to use <math>\mathcal{D}_Z(Q_Z, P_Z) = \mathcal{D}_{JS}(Q_Z, P_Z)</math> along with adversarial training for estimation. In particular, a discriminator (adversary) is used in the latent space <math>\mathcal{Z}</math> to separate "true" points sampled from <math>P_Z</math> from "fake" ones sampled from <math>Q_Z</math>. This leads to the WAE-GAN described in Algorithm 1 below. Even though WAE-GAN still uses max-min optimization, one positive feature is that it moves the adversary from the input (pixel) space <math>\mathcal{X}</math> to the latent space <math>\mathcal{Z}</math>. Additionally, the prior <math>P_Z</math> might have a nice shape with a single mode (for a Gaussian prior), making the matching task much easier than matching an unknown, complex, and possibly multi-modal distribution, as is usually the case in GANs. This observation also motivates the second penalty.

===MMD-based <math>\mathcal{D}_z</math>===
For a positive-definite reproducing kernel <math>k: \mathcal{Z} \times \mathcal{Z} \to \mathcal{R}</math>, the maximum mean discrepancy (MMD) is defined as

<math>
MMD_k(P_Z, Q_Z) = \left \Vert \int \limits_{\mathcal{Z}} {k(z, \cdot)dP_Z(z)} - \int \limits_{\mathcal{Z}} {k(z, \cdot)dQ_Z(z)} \right \|_{\mathcal{H}_k}
</math>,

where <math>\mathcal{H}_k</math> is the RKHS (reproducing kernel Hilbert space) of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathcal{R}</math>. If <math>k</math> is characteristic, then <math>MMD_k</math> defines a metric and can be used as a distance measure. The authors propose to use <math>\mathcal{D}_Z(P_Z, Q_Z) = MMD_k(P_Z, Q_Z)</math>. MMD also has an unbiased U-statistic estimator which can be used along with stochastic gradient descent (SGD) methods. This gives WAE-MMD, described in Algorithm 2 below. Note that MMD is known to perform well when matching high dimensional standard normal distributions, so this penalty is expected to work well when the prior <math>P_Z</math> is Gaussian.

[[File:ka2khan_figure_2.png|800px|thumb|center|Algorithms]]
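
The MMD penalty can be sketched in a few lines of NumPy. The code below is a minimal illustration of an unbiased U-statistic estimate of the squared MMD between a sample from the prior and a sample of encoded codes; it is not the authors' implementation. The inverse multiquadratic kernel, its scale <code>c</code>, the sample sizes, and all function names are assumptions made for this example.

```python
import numpy as np

def imq_kernel(a, b, c=1.0):
    """Inverse multiquadratic kernel k(a, b) = c / (c + ||a - b||^2).

    The kernel and the scale c are illustrative choices for this sketch.
    """
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return c / (c + sq_dist)

def mmd2_unbiased(z_prior, z_enc, kernel=imq_kernel):
    """Unbiased U-statistic estimate of MMD_k^2 from two samples of size n."""
    n = z_prior.shape[0]
    if z_enc.shape[0] != n or n < 2:
        raise ValueError("need two samples of equal size n >= 2")
    k_pp = kernel(z_prior, z_prior)
    k_qq = kernel(z_enc, z_enc)
    k_pq = kernel(z_prior, z_enc)
    # Drop the diagonal (i == j) terms of the two within-sample sums.
    within = (k_pp.sum() - np.trace(k_pp) + k_qq.sum() - np.trace(k_qq)) / (n * (n - 1))
    cross = 2.0 * k_pq.sum() / (n * n)
    return float(within - cross)

rng = np.random.default_rng(0)
z_prior = rng.standard_normal((200, 8))          # samples from P_Z = N(0, I)
z_match = rng.standard_normal((200, 8))          # stand-in codes that match the prior
z_shift = rng.standard_normal((200, 8)) + 2.0    # stand-in codes far from the prior

mmd_match = mmd2_unbiased(z_prior, z_match)
mmd_shift = mmd2_unbiased(z_prior, z_shift)
```

The estimate stays close to zero when the codes follow the prior and grows when they drift away from it. In a WAE-MMD training loop this quantity would be multiplied by <math>\lambda</math> and added to the reconstruction cost; because the estimator is unbiased, it can be slightly negative for matching samples.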

=Related Work=
==Literature on auto-encoders==
Classical unregularized auto-encoders have an objective function which only tries to minimize the reconstruction cost. This results in distinct data points being encoded into distinct zones distributed chaotically across the latent space <math>\mathcal{Z}</math>. The latent space <math>\mathcal{Z}</math> in this scenario contains huge "holes" for which the decoder <math>P_G(X|Z)</math> has never been trained. In general, encoders trained this way do not provide useful representations, and sampling from the latent space <math>\mathcal{Z}</math> becomes a difficult task [12].

VAEs [1] minimize a variational bound on the KL-divergence <math>D_{KL}(P_X, P_G)</math>, which consists of the reconstruction cost and the regularizer <math>\mathbb{E}_{P_X}[D_{KL}(Q(Z|X), P_Z)]</math>. The regularizer penalizes the difference between the encoded training images and the prior <math>P_Z</math>. However, this penalty does not guarantee that the overall encoded distribution matches the prior distribution, as the WAE penalty does. In addition, VAEs require non-degenerate (i.e. non-deterministic) Gaussian encoders along with random decoders. A later paper [11] proposed a method which allows the use of non-Gaussian encoders with VAEs. In contrast, WAE minimizes <math>W_{c}(P_X, P_G)</math> and allows both probabilistic and deterministic encoder and decoder pairs.

When parameters are appropriately defined, WAE generalizes AAE in two ways: it can use any cost function in the input space, and it can use any discrepancy measure <math>\mathcal{D}_Z</math> in the latent space <math>\mathcal{Z}</math>, not only the adversarial one.

There has also been work on regularized auto-encoders called InfoVAE [14], whose objective is similar to that of [4] but is derived from different motivations and arguments.

WAEs explicitly define the cost function <math>c(x,y)</math>, whereas VAEs rely on an implicit cost defined through a negative log-likelihood term. In theory this term can induce an arbitrary cost function, but in practice it can require estimating a normalizing constant that may differ across values of <math>z</math>.

==Literature on OT==
[15] provides methods for computing the OT cost for large-scale data using SGD and sampling. The WGAN [5] proposes a generative model which minimizes the 1-Wasserstein distance <math>W_1(P_X, P_G)</math>. The WGAN algorithm does not provide an encoder and cannot be easily applied to an arbitrary cost <math>W_c</math>. The model proposed in [5] uses the dual form, whereas the model proposed in this paper uses the primal form. The primal form allows the use of an arbitrary cost function <math>c</math> and naturally comes with an encoder.

In order to compute <math>W_c(P_X, P_G)</math> or <math>W_1(P_X, P_G)</math>, the model needs to handle various non-trivial constraints. Various methods have been proposed in the literature ([5], [2], [8], [16], [15], [17], [18]) to avoid this difficulty.

==Literature on GANs==
A lot of the GAN variations proposed in the literature come without an encoder. Examples include WGAN and f-GAN. These models fall short in cases where latent codes must be reconstructed in order to use the learned manifold.

There have been numerous models proposed in the literature which try to combine the adversarial training of GANs with auto-encoder architectures. Some examples are [19], [20], [21], and [22]. There has also been work in which reproducing kernels have been used in the context of GANs ([23], [24]).

=Experiments=
Experiments were used to empirically evaluate the proposed WAE model. The authors conducted experiments using the following two real-world datasets: (1) MNIST [27] made up of 70k images, and (2) CelebA [28] consisting of approximately 203k images.

The main evaluation criteria were to see if the WAE model can simultaneously achieve:

<ol>
<li>accurate reconstruction of the data points</li>
<li>reasonable geometry of the latent manifold</li>
<li>generation of high quality random samples</li>
</ol>

For the model to generalize well, (1) and (2) should be met on both the training and test data sets.

The proposed model achieves reasonably good results, as highlighted in the figures below:

[[File:ka2khan_figure_3.png|800px|thumb|center|Using CelebA dataset]]

[[File:ka2khan_figure_4.png|800px|thumb|center|Using CelebA dataset, FID (Fréchet Inception Distance [32]): smaller is better, sharpness: larger is better]]

=Conclusion=
The authors proposed a new class of algorithms for building generative models, called Wasserstein Auto-Encoders, based on the optimal transport cost. They related the newly proposed model to the existing probabilistic modeling techniques. They empirically evaluated the proposed models using two real-world datasets. They compared the results obtained using their proposed model with the results obtained using VAEs on the same datasets to show that the proposed models generate sample images of higher quality, in addition to being easy to train and having good reconstruction quality.

The authors state that in future work they will further explore the criteria for matching the encoded distribution <math>Q_Z</math> to the prior distribution <math>P_Z</math>, evaluate whether it is feasible to adversarially train the cost function <math>c</math> in the input space <math>\mathcal{X}</math>, and carry out a theoretical analysis of the dual formulations for WAE-GAN and WAE-MMD.

=Future Work=
Following the work of this paper, another generative model based on optimal transport was introduced in [34]. Optimal transport measures the distance between probability distributions by the minimal cost of transporting one distribution onto the other (hence the name). The model, called the "Sliced-Wasserstein Autoencoder" (SWAE), is simple to implement and provides the capabilities of Wasserstein Autoencoders.

([https://openreview.net/forum?id=HkL7n1-0b]) The results on the MNIST and CelebA datasets look convincing, though the paper could include additional evaluation comparing the adversarial loss with the straightforward MMD metric and discuss their pros and cons. Given the challenges in evaluating and comparing closely related auto-encoder solutions, the authors could also design demonstrative experiments for cases where the Wasserstein distance helps, and possibly for its limitations.

=References=
[1] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[2] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In ICLR, 2016.

[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[4] O. Bousquet, S. Gelly, I. Tolstikhin, C. J. Simon-Gabriel, and B. Schölkopf. From optimal transport to generative modeling: the VEGAN cookbook, 2017.

[5] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017.

[6] C. Villani. Topics in Optimal Transportation. AMS Graduate Studies in Mathematics, 2003.

[7] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.

[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs, 2017.

[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

[10] F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.

[11] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks, 2017.

[12] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, 35, 2013.

[13] M. D. Hoffman and M. Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop on Advances in Approximate Bayesian Inference, 2016.

[14] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders, 2017.

[15] A. Genevay, M. Cuturi, G. Peyré, and F. R. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3432–3440, 2016.

[16] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[17] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: geometry and kantorovich formulation. arXiv preprint arXiv:1508.05216, 2015.

[18] Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-transport problems and a new hellinger-kantorovich distance between positive measures. arXiv preprint arXiv:1508.07941, 2015.

[19] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.

[20] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.

[21] D. Ulyanov, A. Vedaldi, and V. Lempitsky. It takes (only) two: Adversarial generator-encoder networks, 2017.

[22] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks, 2017.

[23] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015.

[24] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.

[25] R. Reddi, A. Ramdas, A. Singh, B. Poczos, and L. Wasserman. On the high-dimensional power of a linear-time two sample test under mean-shift alternatives. In AISTATS, 2015.

[26] C. L. Li, W. C. Chang, Y. Cheng, Y. Yang, and B. Poczos. Mmd gan: Towards deeper understanding of moment matching network, 2017.

[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86(11), pages 2278–2324, 1998.

[28] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

[29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014.

[30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[31] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

[32] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.

[33] B. Poole, A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for GANs, 2016.

[34] S. Kolouri, C. E. Martin, and G. K. Rohde. Sliced-wasserstein autoencoder: An embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.

conditional neural process

2018-11-23T06:24:10Z

H454chen: /* Critiques */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of f. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and let the test set <math display="inline"> T = \{x_i\}_{i = n} ^ {n + m - 1} \subset X</math> consist of unlabelled points.

Let P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Then P defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

A common assumption made on P is that all function evaluations of <math display="inline"> f </math> are jointly Gaussian distributed. This class of random functions is known as Gaussian Processes (GPs). The stochastic process framework allows a model to be data efficient; however, appropriate priors are hard to design and stochastic processes are computationally expensive, scaling poorly with <math>n</math> and <math>m</math>. GPs are one example, with running time <math>O((n+m)^3)</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>, given a set of observations <math display="inline">O</math>. As for stochastic processes, the authors assume that <math display="inline">Q_{\theta}</math> is invariant to permutations, i.e., <math display="inline">Q_\theta(f(T) | O, T)= Q_\theta(f(T') | O, T')=Q_\theta(f(T) | O', T) </math> when <math> O', T'</math> are permutations of <math display="inline">O</math> and <math display="inline">T </math>, respectively. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure, which is the easiest way to ensure a valid stochastic process. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>. The framework can, however, be extended to non-factored distributions.

In detail, the following architecture is used

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_i * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, the architecture supports streaming observations with minimal overhead.
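
To illustrate the streaming claim, the sketch below assumes the commutative operation is a mean (an assumption for this example; the definition above only requires commutativity). The hypothetical <code>RunningMean</code> class folds each new representation into the aggregate in constant time per observation, without storing earlier ones.

```python
class RunningMean:
    """Incrementally maintained mean of d-dimensional representations r_i."""

    def __init__(self, d):
        self.n = 0
        self.r = [0.0] * d

    def update(self, r_i):
        # Standard incremental mean: r <- r + (r_i - r) / n
        self.n += 1
        self.r = [r + (x - r) / self.n for r, x in zip(self.r, r_i)]
        return self.r

agg = RunningMean(2)
for r_i in [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]:
    agg.update(r_i)
# agg.r now holds the mean of the three vectors, [3.0, 4.0]
```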

The model <math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution <math display="inline">P</math> given a set of observations. The authors let <math display="inline"> f \sim P</math>, <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and <math display="inline">N \sim \text{uniform}[0, 1, \ldots, n-1]</math>. The subset <math display="inline"> O_N = \{(x_i, y_i)\}_{i = 0} ^{N} \subset O</math>, which consists of the leading elements of <math display="inline">O</math>, is used as the conditioning set. The negative conditional log probability is given by

<math>\mathcal{L}(\theta)=-\mathbb{E}_{f \sim P}\left[\mathbb{E}_{N}\left[\log Q_\theta(\{y_i\}_{i = 0} ^{n-1}|O_{N}, \{x_i\}_{i = 0} ^{n-1})\right]\right]</math>

Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>.
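
A single Monte Carlo sample of this loss can be sketched as follows, assuming a factored Gaussian output. The function names and the <code>predict(context, x)</code> interface are hypothetical, and <code>toy_predict</code> is only a stand-in for a trained CNP so that the example is self-contained.

```python
import math
import random

def gaussian_log_prob(y, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def cnp_loss_sample(points, predict, rng):
    """One sample of the loss for one function f, given as a list of (x, y) pairs."""
    n = len(points)
    N = rng.randrange(n)          # N ~ uniform{0, ..., n-1}
    context = points[:N + 1]      # O_N: elements with index 0..N
    log_q = 0.0
    for x, y in points:           # score all n targets, observed and unobserved
        mu, sigma = predict(context, x)
        log_q += gaussian_log_prob(y, mu, sigma)
    return -log_q

def toy_predict(context, x):
    """Placeholder model (assumption): predict the mean context value with unit std."""
    mean_y = sum(y for _, y in context) / len(context)
    return mean_y, 1.0

rng = random.Random(0)
points = [(0.0, 1.0), (1.0, 1.0)]
loss = cnp_loss_sample(points, toy_predict, rng)
```

In an actual training loop this quantity would be averaged over many sampled functions and differentiated with respect to the model parameters.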

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the authors emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.

In summary,

# A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions <math display="inline">f \sim P</math>.
# A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.
# A CNP is scalable, achieving a running time complexity of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math> observations.

== Related Work ==

===Gaussian Process Framework===

A Gaussian Process (GP) is a non-parametric method used extensively for regression and classification problems in the machine learning community. A GP is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.
A standard approach is to model data as <math>y = m(X, \phi) + \epsilon</math>,
where <math>m</math> is the mean function with parameter vector <math>\phi</math>, and <math>\epsilon</math> represents independent and identically distributed (i.i.d.) Gaussian noise: <math>\epsilon \sim N(0,\sigma^2)</math>.

For more info on Gaussian Process Framework:
[https://arxiv.org/abs/1506.07304 A Gaussian process framework for modelling instrumental systematics: application to transmission spectroscopy]

Several papers attempt to address various issues with GPs. These include:
* Using sparse GPs to aid in scaling (Snelson & Ghahramani, 2006)
* Using Deep GPs to achieve more expressivity (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017)
* Using neural networks to learn more expressive kernels (Wilson et al., 2016)

A Python resource for Gaussian Process Framework implementation: [https://github.com/SheffieldML/GPy Gaussian Process Framework in Python]

The goal of this paper is to incorporate ideas from standard neural networks with Gaussian processes in order to overcome drawbacks of both. Bayesian techniques work better with less data, but complex Bayesian networks become intractable on even moderately sized datasets. NNs, on the other hand, cannot make use of prior knowledge and often have to be retrained from scratch. Without sufficient data, they also perform poorly. Combining both frameworks gives Conditional Neural Processes, which learn the kernels of the Gaussian Process through neural networks and use these learned kernels in a framework similar to GPs for prediction.

===Meta Learning===

Meta-Learning attempts to allow neural networks to learn more generalizable functions, as opposed to only approximating one function. This can be done by learning deep generative models which can do few-shot estimations of data. This can be implemented with attention mechanisms or additional memory.

Classification is another common task in meta-learning. Few-shot classification algorithms usually rely on some distance metric in feature space to compare target images and the observations. Matching networks (Vinyals et al., 2016; Bartunov & Vetrov, 2016) are closely related to CNPs.

Finally, the latest variant of Conditional Neural Process can also be seen as an approximated amortized version of Bayesian DL (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017). For example, Gal & Ghahramani (2016) develop a theoretical framework casting dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Their theory extracts information from existing models and provides tools to model uncertainty.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is used as a common baseline for GPs. They generated two different datasets that consisted of functions generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with different kernel parameters. At every training step, they sampled a curve from the GP, selected a subset of n points as observations, and selected a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer MLP. The decoder outputs a Gaussian mean and variance for the target outputs. The model is trained to maximize the log-likelihood of the target points using the Adam optimizer.
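
The data generation for one training step (fixed-kernel case) can be sketched as below. This is a sketch under stated assumptions, not the authors' setup: it takes "exponential kernel" literally as <math display="inline">k(a, b) = \sigma^2 \exp(-|a - b| / \ell)</math>, and the length scale, input range, number of points, and jitter are illustrative values. The helper names are hypothetical.

```python
import math
import random

def exp_kernel(a, b, length_scale=0.4, variance=1.0):
    """Exponential kernel (assumed form): variance * exp(-|a - b| / length_scale)."""
    return variance * math.exp(-abs(a - b) / length_scale)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_gp_curve(xs, rng, jitter=1e-6):
    """Draw function values at xs from a zero-mean GP prior via y = L z, z ~ N(0, I)."""
    K = [[exp_kernel(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]

def make_task(rng, num_points=20, max_context=10):
    """Sample one curve, then split it into observations and target points."""
    xs = sorted(rng.uniform(-2.0, 2.0) for _ in range(num_points))
    ys = sample_gp_curve(xs, rng)
    pairs = list(zip(xs, ys))
    rng.shuffle(pairs)
    n = rng.randint(1, max_context)   # number of observations for this step
    return pairs[:n], pairs           # observations, targets (targets include observations)

rng = random.Random(0)
observations, targets = make_task(rng)
```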

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the CNP's performance. Although the prediction generated by the GP is smoother than the CNP's prediction for both the mean and the variance, the model is able to learn to regress from a few context points for both the fixed kernels and the switching kernels. It provides a good approximation whose accuracy improves, and whose approximated uncertainty decreases, as the number of context points grows. Crucially, the model learns to estimate its own uncertainty given the observations very accurately.
Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for a CNP only the dataset used for training has to change.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test set to evaluate its performance. As shown in the above figure, the model learns to make good predictions of the underlying digit even for a small number of context points. Crucially, when conditioned on only one non-informative context point, the model's prediction corresponds to the average over all MNIST digits. As the number of context points increases, the predictions become more similar to the underlying ground truth. This demonstrates the model's capacity to extract dataset-specific prior knowledge. It is worth mentioning that even with a complete set of observations, the model does not achieve pixel-perfect reconstruction, as there is a bottleneck at the representation level.
Since this implementation of CNP returns factored outputs, the best prediction it can produce given limited context information is to average over all possible predictions that agree with the context. An alternative to this is to add latent variables in the model such that they can be sampled conditioned on the context to produce predictions with high probability in the data distribution.

An important aspect of the model is its ability to estimate the uncertainty of the prediction. As shown in the bottom row of the above figure, as more observations are added, the variance shifts from being almost uniformly spread over the digit positions to being localized around areas that are specific to the underlying digit, specifically its edges. Being able to model the uncertainty given some context can be helpful for many tasks. One example is active exploration, where the model has a choice over where to observe.
They tested this by comparing the predictions of CNP when the observations are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active exploration, but it already produces better prediction results than selecting the conditioning points at random.
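
The uncertainty-driven choice can be sketched as a greedy rule: among the pixels not yet observed, pick the one with the largest predicted variance. The function name and the flat list of per-pixel variances are assumptions for this illustration; in a real loop the variances would be recomputed by the model after each new observation.

```python
def choose_next_pixel(variances, observed):
    """Return the index of the unobserved pixel with the highest predicted variance."""
    candidates = [i for i in range(len(variances)) if i not in observed]
    return max(candidates, key=lambda i: variances[i])

variances = [0.10, 0.90, 0.40, 0.70]   # hypothetical per-pixel predictive variances
observed = {1}                          # pixel 1 has already been observed
next_pixel = choose_next_pixel(variances, observed)
```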

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture the complex shapes and colors of this dataset, with predictions conditioned on less than 10% of the pixels being already close to the ground truth. As before, given a few context points the model averages over all possible faces, but as the number of context pairs increases the predictions capture image-specific details like face orientation and facial expression. Furthermore, as the number of context points increases the variance is shifted towards the edges in the image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is their flexibility not only in the number of observations and targets they receive but also with regard to their input values. It is interesting to compare this property to GPs on one hand, and to trained generative models (van den Oord et al., 2016; Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on subsets that the model has not encountered during training. Consider conditioning the model on one half of the image, for example. This forces the model to not only predict the pixel values according to some stationary smoothness property of the images, but also according to global spatial properties, e.g. symmetry and the relative location of different parts of faces. As seen in the first row of the figure, CNPs are able to capture those properties. A GP with a stationary kernel cannot capture this, and in the absence of observations would revert to its mean (the mean itself can be non-stationary, but usually this would not be enough to capture the interesting properties).

In addition, the model is flexible with regard to the target input values. This means, e.g., that the model can be queried at resolutions it has not seen during training. The authors take a model that has only been trained using pixel coordinates of a specific resolution and predict at test time subpixel values for targets between the original coordinates. As shown in Figure 5 of the paper, with one forward pass the model can be queried at different resolutions. While GPs also exhibit this type of flexibility, it is not the case for trained generative models, which can only predict values for the pixel coordinates on which they were trained. In this sense, CNPs capture the best of both worlds: they are flexible with regard to the conditioning and prediction task and have the capacity to extract domain knowledge from a training set.

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models: kNNs and GPs. As shown in the above table, CNPs outperform both when the number of context points is small (empirically, when half of the image or less is provided as context). When the majority of the image is given as context, exact methods like GPs and kNN perform better. From the table it can also be seen that the order in which the context points are provided is less important for CNPs, since providing the context points in order from top to bottom still results in good performance. Both insights point to the fact that CNPs learn a data-specific 'prior' that will generate good samples even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. The authors used 1,200 randomly selected classes as their training set and the remainder as the testing data set.

Additionally, to apply data augmentation the authors cropped the image from 32 × 32 to 28 × 28, applied small random
translations and rotations to the inputs, and also increased
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.
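
The label assignment for one N-way training step can be sketched as follows. The function name is hypothetical and the class identifiers are placeholders; only the procedure (choose N random classes, then assign the labels 0 to N − 1 arbitrarily) is taken from the description above.

```python
import random

def sample_episode_labels(class_ids, n_way, rng):
    """Pick n_way distinct classes at random and map each to a label in 0..n_way-1."""
    chosen = rng.sample(class_ids, n_way)
    return {cls: label for label, cls in enumerate(chosen)}

rng = random.Random(0)
class_ids = list(range(1200))   # placeholder ids for the 1,200 training classes
label_of = sample_episode_labels(class_ids, 5, rng)
```

Because the mapping is redrawn at every step, the same character class can receive a different label in different episodes.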

[[File:008.jpg|400px|center]]

Given that the input points are images, they modified the architecture of the encoder h to include convolution layers, as mentioned in Section 2 of the paper. In addition, they aggregated only over inputs of the same class, using the information provided by the input label. The aggregated class-specific representations are then concatenated to form the final representation. Given that both the size of the class-specific representations and the number of classes are constant, the size of the final representation is still constant and thus the O(n + m) runtime still holds.
The results of the classification are summarized in the accompanying table. CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, they reached those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of O(n + m) at test time, as opposed to O(nm).
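
The class-wise aggregation described above can be sketched as below, assuming a mean as the per-class aggregation and that every class has at least one example in the context. The function name and the toy two-dimensional representations are hypothetical.

```python
def classwise_representation(reps, labels, n_way):
    """Average the representations of each class, then concatenate in label order."""
    d = len(reps[0])
    final = []
    for c in range(n_way):
        members = [r for r, lab in zip(reps, labels) if lab == c]
        class_mean = [sum(r[j] for r in members) / len(members) for j in range(d)]
        final.extend(class_mean)
    return final   # length is n_way * d, independent of the number of inputs

reps = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy encoder outputs
labels = [0, 0, 1]
final = classwise_representation(reps, labels, 2)
# final == [2.0, 3.0, 5.0, 6.0]
```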

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes, a model that is both flexible at test time and has the capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks including regression, classification and image completion. They compared CNPs to Gaussian Processes on one hand, and deep learning methods on the other, and also discussed the relation to meta-learning and few-shot learning. It is important to note that the specific CNP implementations described here are just simple proofs-of-concept and can be substantially extended, e.g. by including more elaborate architectures in line with modern deep learning advances.

To summarize, this work can be seen as a step towards learning high-level abstractions, one of the grand challenges of contemporary machine learning. Functions learned by most conventional deep learning models are tied to a specific, constrained statistical context at any stage of training. A trained CNP is more general, in that it encapsulates the high-level statistics of a family of functions. As such it constitutes a high-level abstraction that can be reused for multiple tasks. In future work, the authors plan to explore how far these models can help in tackling the many key machine learning problems that seem to hinge on abstraction, such as transfer learning, meta-learning, and data efficiency.

== Critiques ==

This paper introduces a method for reducing the computational complexity of the better-known Gaussian Process model, but the authors state a complexity of O(n + m), which is almost the same order as that of an RBF-kernel GP. With respect to performance across the sequence of tasks, the authors have not made metric comparisons to GP methods to prove the superiority of their approach.

It appears that the proposed model is effective in making accurate predictions using lower quality inputs, for example a dataset with fewer data points or an image with fewer pixels. However, it is not clear whether the proposed algorithm can be trained with a smaller amount of input data.

== Other Sources ==
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and Rezende, D. J.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

Rezende, D. J., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
Rezende, D. J., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

conditional neural process

2018-11-23T06:23:32Z

H454chen: /* Critiques */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of f. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and let the test set <math display="inline"> T = \{x_i\}_{i = n} ^ {n + m - 1} \subset X</math> consist of unlabelled points.

Let P be a probability distribution over functions <math display="inline"> f : X \to Y</math>, formally known as a stochastic process. Thus, P defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

A common assumption made on P is that all function evaluations of <math display="inline"> f </math> are jointly Gaussian distributed. This class of random functions is called Gaussian Processes (GPs). The stochastic-process framework allows a model to be data efficient; however, it is hard to design appropriate priors, and stochastic processes are computationally expensive, scaling poorly with <math>n</math> and <math>m</math>. GPs are one example, with a running time of <math>O((n+m)^3)</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>, given a set of observations <math display="inline">O</math>. As for stochastic processes, the authors assume that <math display="inline">Q_{\theta}</math> is invariant to permutations, so that <math display="inline">Q_\theta(f(T) | O, T)= Q_\theta(f(T') | O, T')=Q_\theta(f(T) | O', T) </math> when <math display="inline"> O', T'</math> are permutations of <math display="inline">O</math> and <math display="inline">T </math>, respectively. In this work, the authors generally enforce permutation invariance with respect to <math display="inline">T</math> by assuming a factored structure, which is the easiest way to ensure a valid stochastic process. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>. Moreover, this framework can be extended to non-factored distributions.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, the architecture supports streaming observations with minimal overhead.

The model <math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution <math display="inline">P</math> given a set of observations. The authors let <math display="inline"> f \sim P</math>, <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and <math display="inline">N \sim \text{uniform}[0, 1, \ldots, n-1]</math>. The subset <math display="inline"> O_N = \{(x_i, y_i)\}_{i = 0} ^{N} \subset O</math>, which consists of the leading elements of <math display="inline">O</math>, is used as the conditioning set. The negative conditional log probability is given by

<math>\mathcal{L}(\theta)=-\mathbb{E}_{f \sim P}\left[\mathbb{E}_{N}\left[\log Q_\theta(\{y_i\}_{i = 0} ^{n-1}|O_{N}, \{x_i\}_{i = 0} ^{n-1})\right]\right]</math>

Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>.

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the authors emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.

In summary,

# A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions <math display="inline">f \sim P</math>.
# A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.
# A CNP is scalable, achieving a running time complexity of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math> observations.

== Related Work ==

===Gaussian Process Framework===

A Gaussian Process (GP) is a non-parametric method used extensively for regression and classification problems in the machine learning community. A GP is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.
A standard approach is to model data as <math>y = m(X, \phi) + \epsilon</math>,
where <math>m</math> is the mean function with parameter vector <math>\phi</math>, and <math>\epsilon</math> represents independent and identically distributed (i.i.d.) Gaussian noise: <math>\epsilon \sim N(0,\sigma^2)</math>.

For more info on Gaussian Process Framework:
[https://arxiv.org/abs/1506.07304 A Gaussian process framework for modelling instrumental systematics: application to transmission spectroscopy]

Several papers attempt to address various issues with GPs. These include:
* Using sparse GPs to aid in scaling (Snelson & Ghahramani, 2006)
* Using Deep GPs to achieve more expressivity (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017)
* Using neural networks to learn more expressive kernels (Wilson et al., 2016)

A Python resource for Gaussian Process Framework implementation: [https://github.com/SheffieldML/GPy Gaussian Process Framework in Python]

The goal of this paper is to incorporate ideas from standard neural networks with Gaussian processes in order to overcome drawbacks of both. Bayesian techniques work better with less data, but complex Bayesian networks become intractable on even moderately sized datasets. NNs, on the other hand, cannot make use of prior knowledge and often have to be retrained from scratch. Without sufficient data, they also perform poorly. Combining both frameworks gives Conditional Neural Processes, which learn the kernels of the Gaussian Process through neural networks and use these learned kernels in a framework similar to GPs for prediction.

===Meta Learning===

Meta-Learning attempts to allow neural networks to learn more generalizable functions, as opposed to only approximating one function. This can be done by learning deep generative models which can do few-shot estimations of data. This can be implemented with attention mechanisms or additional memory.

Classification is another common task in meta-learning. Few-shot classification algorithms usually rely on some distance metric in feature space to compare target images with the observations. Matching networks (Vinyals et al., 2016; Bartunov & Vetrov, 2016) are closely related to CNPs.

Finally, the latent variant of Conditional Neural Process can also be seen as an approximated amortized version of Bayesian DL (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017). For example, Gal & Ghahramani (2016) develop a theoretical framework that casts dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Their theory extracts information from existing models and provides tools to model uncertainty.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
The authors generated two different datasets consisting of functions
sampled from a GP with an exponential kernel. In the first dataset the kernel parameters were fixed; in the second dataset, the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step, they sampled a curve from the GP and selected
a subset of <math>n</math> points as observations and a subset of <math>t</math> points as target points. The observed points are encoded using a three-layer MLP encoder <math>h</math> with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum_{i} r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder <math>g</math> consisting of a five-layer
MLP. The decoder outputs a Gaussian mean and variance for each target output. The model is trained to maximize the log-likelihood of the target points using the Adam optimizer.
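
The data-generation step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the summary only says "exponential kernel", so the squared-exponential form, the length scale, the input range and the helper names <code>sq_exp_kernel</code> and <code>sample_task</code> are all assumptions.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=0.4, signal_std=1.0):
    # Squared-exponential kernel (an assumed choice for illustration).
    diff = xa[:, None] - xb[None, :]
    return signal_std ** 2 * np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_task(rng, num_points=100, max_context=10, num_targets=20):
    """Sample one curve from a zero-mean GP and split it into context and targets."""
    x = np.sort(rng.uniform(-2.0, 2.0, size=num_points))
    # A small jitter on the diagonal keeps the Cholesky factorization stable.
    cov = sq_exp_kernel(x, x) + 1e-6 * np.eye(num_points)
    y = np.linalg.cholesky(cov) @ rng.standard_normal(num_points)
    n = rng.integers(1, max_context + 1)      # number of observations
    perm = rng.permutation(num_points)
    ctx, tgt = perm[:n], perm[n:n + num_targets]
    return (x[ctx], y[ctx]), (x[tgt], y[tgt])

rng = np.random.default_rng(0)
(ctx_x, ctx_y), (tgt_x, tgt_y) = sample_task(rng)
```

A fresh call to <code>sample_task</code> at every training step gives a new curve and a new context size, which is what lets the model learn to regress from varying amounts of context.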

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the CNP's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction for both the mean
and the variance, the model is able to learn to regress from a few
context points for both the fixed and the switching kernels.
As the number of context points grows, the accuracy
of the model improves and its approximated uncertainty
decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately. While it does not match the GP, it nonetheless provides a good approximation
that increases in accuracy as the number of context points
increases.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for CNPs only the dataset used
for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned on only one non-informative context point, the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases, the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations, the model does not achieve pixel-perfect
reconstruction, because there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative is to add
latent variables to the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as more observations are added, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
The authors tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty with those obtained from randomly chosen pixels. This is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
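
The uncertainty-driven selection described above can be sketched as a greedy loop that repeatedly observes the pixel with the highest predicted standard deviation. The summary does not give the authors' exact procedure, so this is only one plausible reading. <code>predict_std</code> is a hypothetical callable, and <code>toy_std</code> is a stand-in (distance to the nearest observed pixel) used so that the sketch runs without a trained CNP.

```python
import numpy as np

def greedy_active_selection(predict_std, image, num_steps):
    """Observe `num_steps` pixels, each time picking the most uncertain one."""
    observed = np.zeros(image.shape, dtype=bool)
    for _ in range(num_steps):
        std = predict_std(observed, image)          # (H, W) predicted std
        std = np.where(observed, -np.inf, std)      # never re-pick a pixel
        idx = np.unravel_index(np.argmax(std), std.shape)
        observed[idx] = True
    return observed

def toy_std(observed, image):
    # Stand-in for a trained model: uncertainty grows with the distance
    # to the nearest observed pixel (assumption, for illustration only).
    ys, xs = np.nonzero(observed)
    if len(ys) == 0:
        return np.ones(observed.shape)
    gy, gx = np.indices(observed.shape)
    dist = np.sqrt((gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2)
    return dist.min(axis=-1)

image = np.zeros((8, 8))
mask = greedy_active_selection(toy_std, image, num_steps=5)
```

With a real CNP, <code>predict_std</code> would run the model on the currently observed pixels and return the predicted standard deviation at every pixel location.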

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported the performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colors of this dataset, with predictions
conditioned on less than 10% of the pixels being
already close to the ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context points increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases, the variance shifts towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
their flexibility, not only in the number of observations and
targets they receive but also with regard to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to predict the pixel
values not only according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary, but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regard to the target
input values. This means, for example, that the model can be queried
at resolutions it has not seen during training. The authors take a
model that has only been trained using pixel coordinates of
a specific resolution and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5 of the paper, with one forward pass the model can be queried at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and they have the capacity to extract domain
knowledge from a training set.
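
Because a CNP takes target inputs as continuous coordinates, querying at a new resolution only requires a denser grid of target points. The sketch below builds such grids. The normalization to the unit square and the helper name <code>pixel_grid</code> are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def pixel_grid(height, width):
    """Return (height * width, 2) pixel-centre coordinates in the unit square."""
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([gy.ravel(), gx.ravel()], axis=-1)

train_grid = pixel_grid(32, 32)   # coordinates seen during training
query_grid = pixel_grid(64, 64)   # denser grid with subpixel locations
```

Passing <code>query_grid</code> as the target inputs of a model trained on <code>train_grid</code> would produce a prediction at twice the resolution in a single forward pass.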

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
both when the number of context points is small (empirically,
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNN perform better. The table
also shows that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. The authors used 1,200 randomly selected classes as their training set and the remainder as the testing data set.

Additionally, to apply data augmentation the authors cropped the image from 32 × 32 to 28 × 28, applied small random
translations and rotations to the inputs, and also increased
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.
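
The N-way label assignment described above can be sketched as follows. Classes and examples are represented by integer indices only, the rotation-based class augmentation is omitted, and the function name and the one-shot-plus-one-query layout are assumptions.

```python
import numpy as np

def sample_episode(rng, num_classes, n_way, examples_per_class=20):
    """Pick N random classes and assign them the labels 0, ..., N-1 arbitrarily."""
    classes = rng.choice(num_classes, size=n_way, replace=False)
    labels = rng.permutation(n_way)
    support, query = [], []
    for cls, label in zip(classes, labels):
        # One example is observed (one-shot); a different one is held out as target.
        shot, held_out = rng.choice(examples_per_class, size=2, replace=False)
        support.append((int(cls), int(shot), int(label)))
        query.append((int(cls), int(held_out), int(label)))
    return support, query

rng = np.random.default_rng(0)
support, query = sample_episode(rng, num_classes=1200, n_way=5)
```

Since the labels are reassigned at every step, the model cannot memorize a fixed class-to-label mapping and has to rely on the observed examples.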

[[File:008.jpg|400px|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition, they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes is constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reach those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of <math display="inline">O(n + m)</math> at test time,
as opposed to <math display="inline">O(nm)</math>.
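
The class-wise aggregation described above can be sketched as below. It assumes mean aggregation within each class and a zero vector for classes with no observation; neither detail is specified in this summary, and the function name is hypothetical.

```python
import numpy as np

def classwise_representation(r_i, labels, n_way):
    """Aggregate encodings per class, then concatenate the class representations."""
    d = r_i.shape[1]
    parts = []
    for c in range(n_way):
        mask = labels == c
        # Mean over the observations of class c (zeros if the class is unobserved).
        parts.append(r_i[mask].mean(axis=0) if mask.any() else np.zeros(d))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
n_way, d = 5, 16
small = classwise_representation(rng.normal(size=(5, d)), np.arange(5), n_way)
large = classwise_representation(rng.normal(size=(50, d)), np.repeat(np.arange(5), 10), n_way)
```

Both <code>small</code> and <code>large</code> have length <code>n_way * d</code>, which illustrates why the size of the final representation does not depend on the number of observations.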

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work, the authors plan to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Critiques ==

This paper introduces a method for reducing the computational complexity of the better-known Gaussian Process model, but the stated complexity of <math display="inline">O(n + m)</math> is, in the view of this critique, almost of the same order as that of an RBF kernel GP. With respect to performance across the sequence of tasks, the authors have not made metric comparisons to GP methods that would prove the superiority of their approach.

It appears that the proposed model is effective in making accurate predictions from lower-quality inputs, for example a dataset with fewer data points or an image with fewer pixels. However, it is not clear whether the proposed algorithm can be trained with less input data.

== Other Sources ==
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.


conditional neural process

2018-11-23T06:02:38Z

H454chen: /* Introduction */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of f. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1} \subset X \times Y</math>, and let the test set be <math display="inline"> T = \{x_i\}_{i = n} ^ {n + m - 1} \subset X</math>, a set of unlabelled points.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

A common assumption made on <math display="inline">P</math> is that any finite set of function evaluations of <math display="inline"> f </math> is jointly Gaussian distributed. This class of random functions is called Gaussian Processes (GPs). The stochastic process framework allows a model to be data efficient; however, it is hard to design appropriate priors, and stochastic processes are computationally expensive, scaling poorly with <math>n</math> and <math>m</math>. GPs, for example, have a running time of <math>O((n+m)^3)</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>, given a set of observations <math display="inline">O</math>. As for stochastic processes, the authors assume that <math display="inline">Q_{\theta}</math> is invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>, that is, <math display="inline">Q_\theta(f(T) | O, T)= Q_\theta(f(T') | O, T')=Q_\theta(f(T) | O', T) </math> when <math display="inline"> O', T'</math> are permutations of <math display="inline">O</math> and <math display="inline">T </math>. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure, which is the easiest way to ensure a valid stochastic process. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>. Moreover, this framework can be extended to non-factored distributions.

In detail, the following architecture is used

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math> and updated incrementally when a new observation arrives, this architecture supports streaming observations with minimal overhead.
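
The three steps above can be sketched in NumPy as follows. This is a minimal sketch with untrained random weights, assuming the mean as the commutative operation, a Gaussian output parametrized by a mean and a softplus-transformed standard deviation, and small illustrative layer sizes. The helper names are not from the paper.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (untrained) weights for an MLP with the given layer sizes."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, k)), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers only
    return x

def encode(enc, ctx_x, ctx_y):
    # r_i = h(x_i, y_i) for every observation, shape (n, d)
    return mlp(enc, np.concatenate([ctx_x, ctx_y], axis=-1))

def decode(dec, r, tgt_x):
    # Phi_i = g(x_i, r): a mean and a positive standard deviation per target
    r_tiled = np.tile(r, (tgt_x.shape[0], 1))
    phi = mlp(dec, np.concatenate([tgt_x, r_tiled], axis=-1))
    return phi[:, 0], 0.1 + 0.9 * np.logaddexp(0.0, phi[:, 1])

rng = np.random.default_rng(0)
d = 8
enc, dec = init_mlp(rng, [2, 16, d]), init_mlp(rng, [1 + d, 16, 2])
ctx_x = rng.uniform(-2.0, 2.0, size=(5, 1))
ctx_y = np.sin(ctx_x)
tgt_x = np.linspace(-2.0, 2.0, 7)[:, None]

r = encode(enc, ctx_x, ctx_y).mean(axis=0)      # aggregate with the mean
mean, std = decode(dec, r, tgt_x)

# Permuting the observations leaves r unchanged.
perm = rng.permutation(5)
r_perm = encode(enc, ctx_x[perm], ctx_y[perm]).mean(axis=0)

# Streaming: a running mean absorbs each new observation in constant time.
r_stream = np.zeros(d)
for k, r_k in enumerate(encode(enc, ctx_x, ctx_y), start=1):
    r_stream += (r_k - r_stream) / k
```

Encoding costs one pass over the <math display="inline">n</math> observations and decoding one pass over the <math display="inline">m</math> targets, which is where the <math display="inline">O(n + m)</math> scaling comes from.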

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution
<math display="inline">P</math> given a set of observations. The authors let <math display="inline"> f \sim P</math>, <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and <math display="inline">N \sim \text{uniform}[0, \ldots, n-1]</math>. The subset <math display="inline"> O_N = \{(x_i, y_i)\}_{i = 0} ^{N}</math>, consisting of the elements of <math display="inline">O</math> with index at most <math display="inline">N</math>, is used as the conditioning set. The negative conditional log probability is given by
\[\mathcal{L}(\theta)=-\mathbb{E}_{f \sim P}[\mathbb{E}_{N}[\log Q_\theta(\{y_i\}_{i = 0} ^{n-1}|O_{N}, \{x_i\}_{i = 0} ^{n-1})]]\]
Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed
and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>.
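
A single Monte Carlo sample of this loss can be sketched as below, assuming a factored Gaussian <math display="inline">Q_\theta</math>. <code>predict</code> is a hypothetical callable standing in for the model, and <code>toy_predict</code> is a placeholder so that the sketch runs on its own.

```python
import numpy as np

def gaussian_nll(y, mean, std):
    """Negative log-density of y under N(mean, std^2), elementwise."""
    return 0.5 * np.log(2.0 * np.pi * std ** 2) + 0.5 * ((y - mean) / std) ** 2

def cnp_loss_sample(predict, x, y, rng):
    n = len(x)
    big_n = rng.integers(0, n)                    # N ~ uniform{0, ..., n-1}
    ctx_x, ctx_y = x[:big_n + 1], y[:big_n + 1]   # O_N: indices 0, ..., N
    mean, std = predict(ctx_x, ctx_y, x)          # score all n points
    return gaussian_nll(y, mean, std).sum()

def toy_predict(ctx_x, ctx_y, tgt_x):
    # Placeholder model: predicts the context mean with unit std everywhere.
    return np.full(len(tgt_x), ctx_y.mean()), np.ones(len(tgt_x))

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=10)
y = np.sin(x)
loss = np.mean([cnp_loss_sample(toy_predict, x, y, rng) for _ in range(100)])
```

Averaging over repeated draws of <code>big_n</code> (and, in training, over sampled functions) gives the Monte Carlo estimate of the expectation in the formula above.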

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still, we emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Related Work ==

===Gaussian Process Framework===

A Gaussian Process (GP) is a non-parametric method used extensively for regression and classification problems in the machine learning community. A GP is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.
A standard approach is to model data as <math>y = m(X, \phi) + \epsilon</math>,
where <math>m</math> is the mean function with parameter vector <math>\phi</math>, and <math>\epsilon</math> represents independent and identically distributed (i.i.d.) Gaussian noise: <math>\epsilon \sim N(0,\sigma^2)</math>.

For more info on Gaussian Process Framework:
[https://arxiv.org/abs/1506.07304 A Gaussian process framework for modelling instrumental systematics: application to transmission spectroscopy]

Several papers attempt to address various issues with GPs. These include:
* Using sparse GPs to aid in scaling (Snelson & Ghahramani, 2006)
* Using Deep GPs to achieve more expressivity (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017)
* Using neural networks to learn more expressive kernels (Wilson et al., 2016)

A Python resource for implementing Gaussian processes: [https://github.com/SheffieldML/GPy GPy, a Gaussian process framework in Python]

The goal of this paper is to combine ideas from standard neural networks and Gaussian processes in order to overcome the drawbacks of both. Bayesian techniques work better with less data, but complex Bayesian models become intractable even on moderately sized datasets. Neural networks, on the other hand, cannot easily make use of prior knowledge, often have to be retrained from scratch, and perform poorly without sufficient data. Conditional Neural Processes combine both frameworks: they use neural networks to learn what plays the role of the GP kernel, and then use what was learned for prediction in a framework similar to GPs.

===Meta Learning===

Meta-Learning attempts to allow neural networks to learn more generalizable functions, as opposed to only approximating one function. This can be done by learning deep generative models which can do few-shot estimations of data. This can be implemented with attention mechanisms or additional memory.

Classification is another common task in meta-learning. Few-shot classification algorithms usually rely on some distance metric in feature space to compare target images with the observations. Matching networks (Vinyals et al., 2016; Bartunov & Vetrov, 2016) are closely related to CNPs.

Finally, the latent variant of Conditional Neural Process can also be seen as an approximated amortized version of Bayesian DL (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017). For example, Gal & Ghahramani (2016) develop a theoretical framework that casts dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Their theory extracts information from existing models and provides tools to model uncertainty.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
The authors generated two different datasets consisting of functions
sampled from a GP with an exponential kernel. In the first dataset the kernel parameters were fixed; in the second dataset, the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step, they sampled a curve from the GP and selected
a subset of <math>n</math> points as observations and a subset of <math>t</math> points as target points. The observed points are encoded using a three-layer MLP encoder <math>h</math> with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum_{i} r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder <math>g</math> consisting of a five-layer
MLP. The decoder outputs a Gaussian mean and variance for each target output. The model is trained to maximize the log-likelihood of the target points using the Adam optimizer.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the CNP's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction for both the mean
and the variance, the model is able to learn to regress from a few
context points for both the fixed and the switching kernels.
As the number of context points grows, the accuracy
of the model improves and its approximated uncertainty
decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately. While it does not match the GP, it nonetheless provides a good approximation
that increases in accuracy as the number of context points
increases.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for CNPs only the dataset used
for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned on only one non-informative context point, the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases, the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations, the model does not achieve pixel-perfect
reconstruction, because there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative is to add
latent variables to the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate the uncertainty of the prediction. As shown in the bottom row of the above figure, as more observations are added, the variance shifts from being almost uniformly spread over the digit positions to being localized around areas that are specific to the underlying digit, specifically its edges. Being able to model the uncertainty given some context can be helpful for many tasks. One example is active exploration, where the model has a choice over where to observe.
The authors tested this by comparing the predictions of CNP when the observations are chosen according to uncertainty versus when random pixels are chosen. This is a very simple way of doing active exploration, but it already produces better prediction results than selecting the conditioning points at random.
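
The uncertainty-driven selection described above could be sketched as follows. The summary only states that observations are "chosen according to uncertainty", so the greedy highest-variance rule and the helper name <code>next_pixel_to_observe</code> are assumptions made for illustration.

```python
def next_pixel_to_observe(variances, observed):
    """Pick the unobserved pixel with the largest predicted variance.

    variances: per-pixel predictive variances (flat list).
    observed:  set of pixel indices already used as context.
    """
    candidates = [i for i in range(len(variances)) if i not in observed]
    if not candidates:
        return None  # every pixel is already part of the context
    return max(candidates, key=lambda i: variances[i])

# Toy example: pixel 3 is the most uncertain but is already observed,
# so the next query goes to pixel 1.
variances = [0.02, 0.40, 0.15, 0.90]
observed = {3}
choice = next_pixel_to_observe(variances, observed)
```

After each new observation the model would be re-run to refresh the variances before choosing the next pixel.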

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture the complex shapes and colors of this dataset, with predictions conditioned on less than 10% of the pixels already being close to the ground truth. As before, given only a few context points the model averages over all possible faces, but as the number of context pairs increases the predictions capture image-specific details like face orientation and facial expression. Furthermore, as the number of context points increases, the variance shifts towards the edges in the image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is their flexibility, not only in the number of observations and targets they receive but also with regard to their input values. It is interesting to compare this property to GPs on one hand, and to trained generative models (van den Oord et al., 2016; Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on subsets that the model has not encountered during training. Consider conditioning the model on one half of the image, for example. This forces the model to predict the pixel values not only according to some stationary smoothness property of the images, but also according to global spatial properties, e.g. symmetry and the relative location of different parts of faces. As seen in the first row of the figure, CNPs are able to capture those properties. A GP with a stationary kernel cannot capture this, and in the absence of observations would revert to its mean (the mean itself can be non-stationary, but usually this would not be enough to capture the interesting properties).

In addition, the model is flexible with regard to the target input values. This means, for example, that the model can be queried at resolutions it has not seen during training. The authors take a model that has only been trained using pixel coordinates of a specific resolution and, at test time, predict subpixel values for targets between the original coordinates. As shown in Figure 5 of the paper, the model can be queried at different resolutions with one forward pass. While GPs also exhibit this type of flexibility, it is not the case for trained generative models, which can only predict values for the pixel coordinates on which they were trained. In this sense, CNPs capture the best of both worlds: they are flexible with regard to the conditioning and prediction task, and they have the capacity to extract domain knowledge from a training set.

[[File:010.jpg|400px|center]]

The authors compared CNPs quantitatively to two related models: kNNs and GPs. As shown in the above table, CNPs outperform these baselines when the number of context points is small (empirically, when half of the image or less is provided as context). When the majority of the image is given as context, exact methods like GPs and kNN perform better. The table also shows that the order in which the context points are provided is less important for CNPs, since providing the context points in order from top to bottom still results in good performance. Both insights point to the fact that CNPs learn a data-specific ‘prior’ that will generate good samples even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. The authors used 1,200 randomly selected classes as their training set and the remainder as the testing data set.

Additionally, to apply data augmentation the authors cropped the image from 32 × 32 to 28 × 28, applied small random
translations and rotations to the inputs, and also increased
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.
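
The label-generation step described above can be sketched in a few lines. This is a minimal illustration that assumes classes are identified by integer ids; the function name <code>sample_n_way_labels</code> is hypothetical.

```python
import random

def sample_n_way_labels(class_ids, n, rng):
    """Choose n distinct classes and map them to labels 0..n-1 in arbitrary order."""
    chosen = rng.sample(class_ids, n)
    return {class_id: label for label, class_id in enumerate(chosen)}

# One 5-way task drawn from 1,200 training classes (ids are placeholders).
rng = random.Random(0)
mapping = sample_n_way_labels(list(range(1200)), 5, rng)
```

Because the assignment is redrawn at every training step, a label index carries no meaning across tasks, and the model has to rely on the context examples to tell the classes apart.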

[[File:008.jpg|400px|center]]

Given that the input points are images, the authors modified the architecture of the encoder h to include convolution layers, as mentioned in section 2. In addition, they only aggregated over inputs of the same class by using the information provided by the input label. The aggregated class-specific representations are then concatenated to form the final representation. Given that both the size of the class-specific representations and the number of classes are constant, the size of the final representation is still constant and thus the O(n + m) runtime still holds.
The results of the classification are summarized in the accompanying table. CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, the authors reached those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of O(n + m) at test time, as opposed to O(nm).
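
The class-wise aggregation described above might look like the following sketch. It assumes the aggregator is a mean over the encoded context points of each class and that embeddings are plain lists of floats; both are illustrative assumptions, as is the function name.

```python
def class_wise_representation(embeddings, labels, num_classes):
    """Average embeddings per class, then concatenate the class means in label order."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, label in zip(embeddings, labels):
        counts[label] += 1
        for j, value in enumerate(vec):
            sums[label][j] += value
    representation = []
    for c in range(num_classes):
        # Assumption: a class with no context examples contributes zeros.
        denom = counts[c] if counts[c] else 1
        representation.extend(s / denom for s in sums[c])
    return representation

# Two context points of class 0 and one of class 1, each with a 2-d embedding.
embeddings = [[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]]
labels = [0, 0, 1]
r = class_wise_representation(embeddings, labels, num_classes=2)
```

The output length is <code>num_classes * dim</code> regardless of how many context points are supplied, which is why the representation size stays constant.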

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes, a model that is both flexible at test time and has the capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks including regression, classification and image completion. They compared CNPs to Gaussian Processes on one hand, and deep learning methods on the other, and also discussed the relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work, the authors plan to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Critiques ==

This paper introduces a method for reducing the computational complexity of the better-known Gaussian Process model. However, the authors quote a complexity of O(n + m), which is almost the same order as that of an RBF kernel GP. With respect to performance across the sequence of tasks, the authors have not made metric comparisons to GP methods to prove the superiority of their approach.

== Other Sources ==
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and Rezende, D. J.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

Rezende, D. J., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
Rezende, D. J., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

Learning to Navigate in Cities Without a Map

2018-11-23T05:59:01Z

H454chen: /* Critique */

Paper:
[https://arxiv.org/pdf/1804.00168.pdf Learning to Navigate in Cities Without a Map]
A video of the paper is available [https://sites.google.com/view/streetlearn here].

== Introduction ==
Navigation is an attractive topic in many research disciplines and technology-related domains such as neuroscience and robotics. The majority of navigation algorithms are based on the following two steps.

1. Building an explicit map

2. Planning and acting using that map.

Motivated by the fact that humans can learn to navigate through cities without special tools such as maps or GPS, the authors propose methods showing that a neural network agent can do the same using only visual observations. To do so, they design an interactive environment built from Google StreetView images, together with a dual pathway agent architecture. As shown in Figure 1, parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate from visual input with deep reinforcement learning (RL) has been shown to be successful in some domains such as games and simulated environments, it suffers from data inefficiency and sensitivity to changes in the environment. It is therefore unclear whether this approach can be used for large-scale navigation, which is the question investigated in this paper.
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (neither the map nor the current position is used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]

==Contribution==
This paper has made the following contributions:

1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigation.

2. Using goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.

3. Leveraging a recurrent neural architecture. This not only makes navigation through a city possible, but also lets the model scale to navigation in new cities. The architecture supports both locale-specific learning and general, transferable navigation behaviour. The authors achieve this by separating out a recurrent neural pathway, which receives and interprets the current goal and also encapsulates and memorizes the features of a single region.

4. Using a new environment built on top of Google StreetView images. This provides real-world images for the agent’s observations. Using this environment, the agent can navigate from an arbitrary starting point to a goal, then to another goal, and so on. London, Paris, and New York City are chosen for navigation.

==Related Work==

1. Localization from real-world imagery. For example, in (Weyand et al., 2016) a CNN was able to achieve excellent results on a geolocation task. Other works improve on this by exploiting spatiotemporal continuity, or by estimating camera pose or depth from pixels. These methods rely on supervised training with ground-truth labels, which is not possible in every environment. This paper is novel in that it does not use supervised training with ground-truth labels and in that it includes planning as a goal.

2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several synthetic mazes. Other studies used text descriptions to incorporate goal instructions. Researchers have also developed realistic, higher-fidelity environment simulations, but these still lack diversity. In contrast to many related papers in this area, this paper makes use of real-world data. The data is diverse and visually realistic, but it does not contain dynamic elements, and the street topology cannot be regenerated or altered.

3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an RL agent with external memory that represents a global map; other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. An explicit neural mapper and navigation planner with joint training has also been used. Among all these works, target-driven visual navigation with a goal-conditional policy is the approach most closely related to the method of this paper.

4. To make simulations resemble reality, researchers have developed higher-fidelity simulated environments (Dosovitskiy et al., 2017; Kolve et al., 2017; Shah et al., 2018; Wu et al., 2018). However, in spite of their photo-realism, simulated environments have inherent problems: the limited diversity of the environments and the idealistic cleanliness of the observations.

==Environment==
Google StreetView provides both high-resolution 360-degree imagery and graph connectivity, and it offers a public API. These features make it a valuable resource. In this work, the authors chose large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes (and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m, and cover a range of up to 5km (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees the RGB images that are visible in StreetView (Figure 1) and is not aware of the underlying graph.

[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]

==Agent Interface and the Courier Task==
In an RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. A first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84 × 84 pixels. The action space consists of 5 movements: “slow” rotate left or right (±22.5°), “fast” rotate left or right (±67.5°), or move forward (implemented as a ''noop'' in the case where this is not a viable action). The most central edge is chosen if there are multiple edges in the agent’s viewing cone.

There are many ways to specify the goal to the agent. In this paper, the current goal is represented in terms of its proximity to a set <math>L</math> of fixed landmarks <math>L=\{(Lat_k, Long_k)\}_k</math>, which are specified using the latitude and longitude coordinate system. Given the distances <math>\{d_{t,k}^g\}_k</math> from the goal to each landmark, the <math>i</math>-th entry of the goal vector is <math>g_{t,i}=\frac{\exp(-\alpha d_{t,i}^g)}{\sum_k \exp(-\alpha d_{t,k}^g)}</math>, with <math>\alpha=0.002</math> (Figure 3).

[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]
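
As a rough illustration of the goal encoding, the softmax over landmark distances can be computed as in the sketch below. It assumes distances are given in metres and uses the value α = 0.002 quoted above; the function name <code>goal_vector</code> and the toy distances are hypothetical.

```python
import math

def goal_vector(distances_m, alpha=0.002):
    """Softmax over negative scaled distances from the goal to each landmark."""
    logits = [-alpha * d for d in distances_m]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three landmarks at 100m, 500m and 2000m from the goal:
# nearer landmarks receive larger weights, and the weights sum to 1.
g = goal_vector([100.0, 500.0, 2000.0])
```

With α = 0.002, a landmark that is 400m closer than another receives exp(0.8) ≈ 2.2 times its weight, so the code is dominated by nearby landmarks while distant ones still contribute.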

This form of representation has several advantages:

1. It could easily be extended to new environments.

2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.

3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.

In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. Each episode consists of 1000 agent steps, and the agent starts from a random place with a random orientation. When the agent gets within 100 meters of the goal, the next goal is randomly chosen. The reward is proportional to the shortest path between the agent and the goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.

==Methods==

===Goal-dependent Actor-Critic Reinforcement Learning===
In this paper, the learning problem is formulated as a Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} \to \mathbb{R}</math>. As is typical in reinforcement learning, the objective is to find the policy that maximizes the expected return, defined as the sum of discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. The expected return from a state <math>s_t</math> also depends on the goals that are sampled. The policy is defined as a distribution over the actions (denoted <math>\alpha</math> here), given the current state <math>s_t</math> and the goal <math>g_t</math>:

\begin{align}
\pi(\alpha|s,g)=\Pr(\alpha_t=\alpha|s_t=s, g_t=g)
\end{align}

The value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:

\begin{align}
V^{\pi}(s,g)=\mathbb{E}[R_t]=\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k} \mid s_t=s, g_t=g\right]
\end{align}
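
For a single finite trajectory, the discounted sum inside the expectation can be computed with a backward recursion. The sketch below is only an illustration of that sum; the function name <code>discounted_return</code> and the toy rewards are assumptions, and the value function itself is the expectation of this quantity over trajectories.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    total = 0.0
    # Work backwards: R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# A single reward of 1 received two steps in the future, with gamma = 0.5.
ret = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```

Here the only reward arrives two steps ahead, so it is discounted by 0.5 squared.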

An architecture with multiple pathways is designed to support the two types of learning that are required for this problem. First, the agent needs an internal representation that is general and gives an understanding of a scene. Second, the agent needs to remember unique features of each scene, which then help it to organize and recall the scenes.

===Architectures===

[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]

The authors use neural networks to parameterize the policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network, and the output of the convolutional network is fed, together with the past reward <math>r_{t-1}</math> and the previous action <math>\alpha_{t-1}</math>, to a Long Short-Term Memory (LSTM).

Three different architectures are described below.

The '''GoalNav''' architecture (Fig. 4a) consists of a convolutional architecture and a policy LSTM. The goal description <math>g_t</math>, the previous action, and the reward are the inputs of this LSTM.

The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information.

The '''MultiCityNav''' architecture (Fig. 4c) is an extension of CityNav for learning in different cities. This is done by connecting goal LSTMs in parallel, one per city, to encapsulate locale-specific features. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities, so only a new goal LSTM needs to be trained for a new city.

In this paper, the authors train the agents with IMPALA [1], which achieves performance similar to A3C [2].

===Prior on agent training: IMPALA and A3C===

IMPALA (Importance Weighted Actor-Learner Architecture) is an actor-critic implementation of deep reinforcement learning that decouples acting from learning. IMPALA achieves performance comparable to A3C (Asynchronous Advantage Actor-Critic, an earlier DeepMind algorithm) on a single-city task, but it has been shown to handle multi-task learning better than A3C. The authors use 256 actors for CityNav and 512 actors for MultiCityNav, with batch sizes of 256 or 512 respectively, and sequences are unrolled to length 50.

===Curriculum Learning===
In curriculum learning, the model is first trained on simple examples. Once the model has learned those, progressively more complex and difficult examples are fed to it. In this paper, this approach is used to teach the agent to navigate to increasingly distant destinations. The courier task suffers from a common problem of RL tasks, namely sparse rewards (similar to Montezuma’s Revenge). To overcome this problem, a natural curriculum scheme is defined. In phase 1, each new goal is sampled within 500m of the agent’s position. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan).
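
The two-phase curriculum can be sketched as a schedule for the maximum goal-sampling distance. The summary only says the range is "gradually increased", so the linear ramp, the step counts, and the function name <code>max_goal_distance_m</code> below are assumptions made for illustration.

```python
def max_goal_distance_m(step, phase1_steps, phase2_steps, full_range_m,
                        start_range_m=500.0):
    """Maximum distance (in metres) at which a new goal may be sampled.

    Phase 1: fixed at start_range_m.
    Phase 2: grows linearly (an assumed schedule) up to full_range_m.
    phase2_steps must be positive.
    """
    if step < phase1_steps:
        return start_range_m
    progress = min(1.0, (step - phase1_steps) / phase2_steps)
    return start_range_m + progress * (full_range_m - start_range_m)

# Hypothetical schedule: 100 steps of phase 1, 200 steps of phase 2, 5km graph.
early = max_goal_distance_m(0, 100, 200, 5000.0)
midway = max_goal_distance_m(200, 100, 200, 5000.0)
late = max_goal_distance_m(1000, 100, 200, 5000.0)
```

Goals would then be drawn only from nodes whose distance to the agent does not exceed the returned value.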

Curriculum learning was first introduced by Bengio et al. in 2009. It serves as a continuation method for non-convex optimization, and it can reduce training time by presenting clean, easy examples before harder and noisier ones. One example of curriculum learning outside this paper is outlined below:

1. We aim to classify shapes into the following three classes: triangles, ellipses, and rectangles. We can create a curriculum by first starting with a simplified dataset that consists of only special cases of these three classes: equilateral triangles, circles, and squares. By first training on these special cases, and then introducing the full dataset, we can allow the algorithm to converge more quickly towards a local minimum before providing "harder" examples. Feeding only these specialized examples also makes the classes fall on more distinct manifold locations; with less overlap, these networks will also perform better when noise is added later.

==Results==
In this section, the performance of the proposed architectures on the courier task is shown.

[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum; we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way as in London) are comparable and the agent achieved mean goal reward 426.]]

It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:

1. Goal Navigation agent.

2. City Navigation Agent.

3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.

A lower bound (Heuristic) and an upper bound (Oracle) on the performance are also considered. As stated in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search." As is clear from Figure 5, the CityNav architecture attains a higher performance and is more stable than the simpler GoalNav agent.
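
The Oracle's path computation can be illustrated with a standard breadth-first search on a toy street graph. The graph, node names, and function name below are hypothetical; the sketch minimises the number of edges, which is a reasonable proxy for distance only when nodes are roughly evenly spaced.

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return the fewest-edges path from start to goal, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

# Toy adjacency list standing in for the StreetView connectivity graph.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
path = bfs_shortest_path(toy_graph, "A", "E")
```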

The trajectories of the trained agent over two 1000-step episodes and the value function of the agent during navigation to a destination are shown in Figure 6.

[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]

Figure 7 shows that the agent successfully learns a navigation policy for reaching St Paul’s Cathedral in London and Washington Square in New York.
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach a goal (Washington Square in New York or St Paul’s Cathedral in London) from 100 start locations vs. the straight-line distance to the goal in meters. One agent step corresponds to a forward movement of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]

The authors mask 25% of the possible goals and train on the remaining ones in order to investigate the generalisation capability of a trained agent. Figure 8 shows that the agent is still able to traverse these areas; it just never samples a goal there.
[[File:fff8.png|600px|center]]

A critical test for this article is transferring the model to new cities by learning a new set of landmarks, without re-learning the visual representation, behaviors, etc. To this end, the MultiCityNav agent is trained on a number of cities, after which both the policy LSTM and the convolutional encoder are frozen and a new locale-specific goal LSTM is trained. Performance is compared across three training regimes, illustrated in Fig. 9: training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible, and that training the model on more cities increases its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". The skip connection is useful when training in a single city, but not in multi-city transfer.
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 9). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]

The article also investigates giving early rewards before the agent reaches the goal and adding random rewards (coins) to encourage exploration. Figure 11a suggests that coins by themselves are ineffective, as the task does not benefit from wide exploration. Based on the results, the authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to assess the importance of goal conditioning, a Goal-less CityNav agent is trained by removing the inputs <math>g_t</math>. The poor performance of this agent is clear in Figure 11b. Furthermore, as Figure 11b shows, reducing the density of the landmarks to 50%, 25%, and 12.5% does not reduce the performance much. Finally, some alternatives for goal representation are investigated:

a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.

b) Binned representation.

The latitude and longitude scalar goal representations perform the best. However, since the landmark-based representation performs well while remaining independent of the coordinate system, the authors use it as the canonical one.
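The two alternative goal representations can be sketched as follows. This is a minimal illustration, assuming a rectangular bounding box for the environment and a hypothetical number of bins; neither the function names nor these values come from the paper.

```python
def normalize_coordinate(value, low, high):
    """Scale a latitude or longitude into [0, 1] given the region bounds."""
    return (value - low) / (high - low)


def binned_one_hot(normalized, num_bins):
    """Turn a normalized coordinate into a one-hot vector over equal-width bins."""
    # The upper edge (1.0) is assigned to the last bin.
    index = min(int(normalized * num_bins), num_bins - 1)
    return [1.0 if i == index else 0.0 for i in range(num_bins)]


# Hypothetical bounding box for one coordinate axis.
lat = normalize_coordinate(40.75, 40.70, 40.80)
lat_bins = binned_one_hot(lat, num_bins=4)
```

The scalar form gives the agent a regression-style target, while the binned form turns the same information into a classification-style input.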

[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]

==Conclusion==
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented through the use of Google StreetView for its photographic content and worldwide coverage. Furthermore, the authors discussed a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities. A successful navigation architecture is presented which relies on integration of general policies with locale-specific knowledge.

==Critique==
1. It is not clear how this model is applicable to the real world. A real-world navigation problem requires detecting objects, people, and cars, but it is not clear whether these are modeled. From what I understood, the authors do not handle collisions, which undermines their claim that this is a real-world problem.

2. This paper uses only static Google Street View images as its primary source of data. For a realistic model of navigation in the world, the authors should at least complement this with dynamic data such as traffic and road blockage information. Also, while it is understandable not to use maps, it is not clear why the authors did not use GPS to determine the agent's position, and perhaps even to build up a map. This could be useful in an emergency or for investigating places that are unknown or inaccessible. The resulting map could easily be compared with the real one and could also be used in training to achieve higher performance. Availability should not be a serious problem: if a real city is being simulated and Google images are available, why would GPS not be? At the very least, a complementary explanation of this choice would be helpful.

3. The 'Transfer in Multi-City Experiments' results could be strengthened significantly via cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.

4. The proposed navigation model could be limited by its reliance on pre-defined landmarks, which appear to be strategically placed and evenly spread across each city. This could limit the agent's deployability to new cities.

==Reference==
[1] Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih, Volodymyr, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain, Legg, Shane, and Kavukcuoglu, Koray. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

[2] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling

2018-11-23T05:33:24Z

H454chen: /* AlphaGo Lee */

This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here]

= Introduction and Motivation =

In recent years, reinforcement learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to outperform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states are finite.

Interacting with the real world (e.g., a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems, because deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement.

This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.

Curling is chosen as the domain on which to test the network because of its large action space, potential for complicated strategies, and need for precise interactions.

== Curling ==

Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections will provide a background on the gameplay, and potential challenges/concerns for learning algorithms. A terminology section follows.

=== Gameplay ===

A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.

When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).

Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:

* Draw: Throw a rock to a target location
* Freeze: Draw a rock up against another rock
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)

=== Challenges for AI ===

Curling offers many challenges for AI based on its physics and rules. This section lists a few concerns.

The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.

Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.

Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (for example, if already ahead in points), as scoring would hand the hammer, and the advantage that comes with it, to the opposing team.

Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.

=== Terminology ===

* End: A round of the game
* House: The end of the sheet of ice, which contains the concentric rings used for scoring
* Hammer: The team that throws the last rock of an end 'has the hammer'
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.
* Back Line: line drawn just behind the house. Rocks that pass this line are removed from play.

== Related Work ==

=== AlphaGo Lee ===

AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol. Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.

The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert gameplay data and improved through self-play.

The value network evaluates the probability of winning given a board configuration. It consists of a CNN with 14 hidden layers, and it is trained using self-play data from the policy network.

Finally, the two networks are combined using Monte-Carlo Tree Search, which performs lookahead search to select the actions for gameplay.

The use of both policy and value networks is reflected in this paper's work.

=== AlphaGo Zero ===

AlphaGo Zero (Silver et al., 2017, [6]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need of expert training.

The unification of networks and self-play are also reflected in this paper.

=== Curling Algorithms ===

Some past algorithms have been proposed to deal with continuous action spaces. For example, Yamamoto et al. (2015, [7]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.

=== Monte Carlo Tree Search ===

Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states with knowledge of paths of execution through past games. One such MCTS variant, KR-UCT, has been applied to continuous action spaces; it uses kernel regression (KR) and kernel density estimation (KDE) to estimate rewards from neighborhood information, which allows it to find effective selections.

For bandit problems, hierarchical optimistic optimization (HOO) has been used to build a cover tree that divides the action space into small ranges at different depths, with the most promising nodes receiving finer-grained estimates.

=== Curling Physics and Simulation ===

Several references in the paper concern the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. Since modelling the changes in friction on ice is not possible, a fixed friction coefficient was predefined in the simulation. The behaviour of the stones was also modelled. Important parameters were learned from professional players, and the authors used the same parameters in this paper.

== General Background of Algorithms ==

=== Policy and Value Functions ===

A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.

POLICY IMPROVEMENT: LEARNING ACTION POLICY

Action policy <math> p_{\rho}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. Policy gradient reinforcement learning can be used to train the action policy. Its parameters <math> \rho </math> are updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step <math> t </math>,
\[ \Delta \rho \propto \frac{\partial \log p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]
where <math> r(s_t) </math> is the return.
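A minimal sketch of this kind of update, assuming a softmax policy over a small discrete action set and the log-likelihood (REINFORCE) form of the policy gradient. The function names, learning rate, and toy inputs are illustrative and not taken from the paper.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, ret, lr=0.1):
    """One gradient-ascent step on log p(action) * ret for a softmax policy."""
    probs = softmax(logits)
    # For a softmax policy: d log p(a) / d logit_i = 1[i == a] - p_i
    return [
        logit + lr * ret * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# A positive return makes the taken action more likely, a negative one less.
start = [0.0, 0.0, 0.0]
after_win = softmax(reinforce_update(start, action=1, ret=1.0))
after_loss = softmax(reinforce_update(start, action=1, ret=-1.0))
```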

POLICY EVALUATION: LEARNING VALUE FUNCTIONS

A value function <math> v_{\theta}(s) </math> with parameters <math> \theta </math> is trained to estimate the value of being in a certain state. It is trained on records of state-outcome pairs <math> (s, r(s)) </math> by using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]
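The update above can be sketched for the simplest case, a linear value function <math> v_{\theta}(s) = \theta \cdot \phi(s) </math>, where the gradient with respect to <math> \theta </math> is just the feature vector. The linear form, the feature vector, and the learning rate are assumptions made for illustration; the paper uses a neural network.

```python
def value(theta, features):
    """Linear value estimate: dot product of parameters and state features."""
    return sum(t * f for t, f in zip(theta, features))


def value_update(theta, features, outcome, lr=0.1):
    """One SGD step that moves v(s) toward the observed outcome r(s)."""
    error = outcome - value(theta, features)
    # For a linear model, dv/dtheta is the feature vector itself.
    return [t + lr * error * f for t, f in zip(theta, features)]


# Repeated updates on one (state, outcome) pair drive the error toward zero.
theta = [0.0, 0.0]
for _ in range(50):
    theta = value_update(theta, [1.0, 2.0], outcome=1.0)
```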

=== Monte Carlo Tree Search ===

Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (e.g., in curling, only 16 moves, or stone throws, are taken each end).

MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as Go and curling.

Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node is considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.

MCTS begins with the '''selection''' phase, which involves traversing known states/actions. Beginning at the root node, the child with the highest 'score' is selected. From each successive node, a path down to a leaf node is explored in a similar fashion.

The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (ie: the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.

Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.

The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (i.e., win/lose) is updated in the statistics of all parent nodes.

A selection function known as Upper Confidence bounds applied to Trees (UCT) can be used to choose which node to select. The formula is shown below [[https://www.baeldung.com/java-monte-carlo-tree-search source]]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, will grow when sibling nodes are expanded. This means that unexplored nodes will gradually increase their UCT score, and be selected in the future.

[[File:mcts_uct_equation.png | 500px | center]]

Sources: 2,3,4
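The UCT selection rule can be sketched in a few lines. Since the formula is only given as an image here, the sketch assumes the standard form, average reward plus <math>c\sqrt{\ln N / n_i}</math>; the function names and the default exploration constant are illustrative.

```python
import math


def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """Average reward (exploitation) plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited children are tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits):
    """children: list of (wins, visits) pairs; returns the index to descend into."""
    scores = [uct_score(w, n, parent_visits) for w, n in children]
    return scores.index(max(scores))
```

The bonus term shrinks as a child accumulates visits and grows as the parent is visited through other children, which is what lets neglected moves eventually get selected.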

=== Kernel Regression ===

Kernel regression is a form of weighted averaging. Given two data points '''x''', each of which has an associated value '''y''', the kernel function outputs a weighting factor. An estimate of the value of a new, unseen point is then calculated as the weighted average of the values of surrounding points.

A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).

[[File:gaussian_kernel.png | 400 px]]

[[File:kernel_regression.png | 350 px]]

In this case, the combination of the two acts to weight the scores of samples closest to '''x''' more strongly.
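A minimal sketch of kernel regression with a Gaussian kernel, restricted to one-dimensional inputs for readability. The bandwidth parameter and function names are assumptions for illustration; the paper's kernel operates on shot parameters rather than scalars.

```python
import math


def gaussian_kernel(x, x_i, bandwidth=1.0):
    """Weight that decays with the squared distance between x and x_i."""
    return math.exp(-((x - x_i) ** 2) / (2.0 * bandwidth ** 2))


def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Estimate the value at x as a kernel-weighted average of observed ys."""
    weights = [gaussian_kernel(x, x_i, bandwidth) for x_i in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Observed samples; the estimate at a new point is pulled toward nearby values.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.0]
estimate = kernel_regression(1.0, xs, ys)
```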

= Methods =

== Variable Definitions ==

The following variables are used often in the paper:

* <math>s</math>: A state in the game, as described below as the input to the network.
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game
* <math>a_t</math>: The action taken in state <math>s_t</math>
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS
* <math>n_{a_t}</math>: The number of visits to node <math>a_t</math> in MCTS
* <math>v_{a_t}</math>: The MCTS value estimate of a node

== Network Design ==

The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.

=== Shared Structure ===

The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:

[[File:curling_network_layers.png]]

The input to this network is the following:
* Location of stones
* Order to tee (the center of the house)
* A 32x32 grid representation of the ice sheet, indicating which stones are present in each grid cell.

The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.

=== Policy Network ===

The policy head is created by adding 2 convolutional layers with 2 3x3 filters to the main body of the network. The policy head outputs a probability distribution over a 32x32x2 set of actions, from which the best shot is selected. The actions represent target locations in the grid and spin direction of the stone.
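To make the size of this discrete action set concrete, the sketch below maps a flat index to a (row, column, spin) triple and back. The memory layout, spin-major then row-major, is a hypothetical choice; the paper does not specify how the 32x32x2 output is indexed.

```python
GRID = 32   # target cells per side
SPINS = 2   # two spin directions


def encode_action(row, col, spin):
    """Flatten (row, col, spin) into a single index in [0, 2048)."""
    return spin * GRID * GRID + row * GRID + col


def decode_action(index):
    """Inverse of encode_action."""
    spin, cell = divmod(index, GRID * GRID)
    row, col = divmod(cell, GRID)
    return row, col, spin


num_actions = GRID * GRID * SPINS  # 2048 discrete shots
```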

=== Value Network ===

The value head is created by adding a convolution layer with 1 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability distribution over scores in the range [-8,8], which are the possible scores at each end of a curling game.
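A 17-way score distribution can be reduced to a single number by taking its expectation. The following sketch shows that calculation; the helper name and the example distributions are made up for illustration.

```python
SCORES = list(range(-8, 9))  # the 17 possible one-end scores, -8 .. 8


def expected_score(probs):
    """Expected one-end score under a 17-way probability distribution."""
    return sum(s * p for s, p in zip(SCORES, probs))


# Hypothetical outputs: a uniform distribution and a confident "+2" prediction.
uniform = [1.0 / 17] * 17
plus_two = [1.0 if s == 2 else 0.0 for s in SCORES]
```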

== Continuous Action Search ==

The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.

Actions in the continuous space are generated using an MCTS algorithm, with the following steps:

=== Selection ===

From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.

[[File:curling_kernel_equations.png | 500px]]

The UCB formula is then used to select an action to expand.

The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.
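The selection step can be sketched as follows. Because the equations appear only as an image here, the sketch assumes the usual KR-UCT form: the value of an action is a kernel- and visit-weighted average over all visited actions, its effective visit count is the kernel-weighted sum of visits, and both are plugged into a UCB rule. Actions are one-dimensional and all names and constants are illustrative.

```python
import math


def kernel(a, b, bandwidth=1.0):
    """Gaussian similarity between two (scalar) actions."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def kr_estimates(actions, values, visits, bandwidth=1.0):
    """Kernel-regression value and kernel-density visit estimate per action."""
    est_values, est_visits = [], []
    for a in actions:
        weights = [kernel(a, b, bandwidth) * n for b, n in zip(actions, visits)]
        w_total = sum(weights)
        est_visits.append(w_total)
        est_values.append(sum(w * v for w, v in zip(weights, values)) / w_total)
    return est_values, est_visits


def kr_ucb_select(actions, values, visits, c=1.0, bandwidth=1.0):
    """Index of the action maximizing estimated value plus exploration bonus.

    Assumes every action has at least one visit, so the log term is >= 0.
    """
    est_values, est_visits = kr_estimates(actions, values, visits, bandwidth)
    log_total = math.log(sum(est_visits))
    scores = [v + c * math.sqrt(log_total / w)
              for v, w in zip(est_values, est_visits)]
    return scores.index(max(scores))
```

Sharing statistics between nearby actions is what makes the estimates usable in a continuous space, where the exact same shot is almost never sampled twice.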

=== Expansion ===

The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times. The authors utilize a widening approach to overcome problems with standard UCT performing a shallow search when there is a large action space.
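A widening rule of this kind is often written as a simple threshold on the number of children relative to the visit count. The sketch below uses a generic progressive-widening criterion; the constants and the exact form are assumptions, not the condition used in the paper.

```python
def should_expand(num_children, total_visits, c=1.0, alpha=0.5):
    """Allow a new child only while children < c * visits**alpha (at least 1)."""
    limit = max(1.0, c * total_visits ** alpha)
    return num_children < limit
```

With these example constants a node gets its first child immediately, a second once it has been visited more than once, and a third only after more than four visits, so the search deepens existing branches before adding new ones.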

=== Simulation ===

Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.

=== Backpropagation ===

Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.
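A minimal sketch of this update, assuming each node stores a visit count and a running mean value. The dictionary layout is hypothetical, and the sign flip between alternating players is omitted for brevity.

```python
def backpropagate(path, result):
    """Update visit counts and running mean values along a root-to-leaf path."""
    for node in reversed(path):
        node["visits"] += 1
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        node["value"] += (result - node["value"]) / node["visits"]


# Hypothetical two-node path: a root visited once before, and a fresh child.
root = {"visits": 1, "value": 0.0}
child = {"visits": 0, "value": 0.0}
backpropagate([root, child], result=1.0)
```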

== Supervised Learning ==

During supervised training, data is gathered from the program AyumuGAT'16 ([8]). This program is also based on an MCTS algorithm and is a high-performance AI curling program. 400 000 state-action pairs were generated for this training.

=== Policy Network ===

The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.

=== Value Network ===

The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state ''d'' steps ahead, <math>s_{t+d}</math>, was generated. This process handled uncertainty by treating every action in the rollout as having no execution uncertainty except the last action, ''a<sub>t+d-1</sub>'', for which uncertainty was allowed. The value network is used to predict the value for this state, <math>z_t</math>, and this value is used for learning the value at ''s<sub>t</sub>''.

=== Policy-Value Network ===

The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:

* Algorithm: stochastic gradient descent
* Batch size: 256
* Momentum: 0.9
* L2 regularization: 0.0001
* Training time: ~100 epochs
* Learning rate: initialized at 0.01, reduced twice

A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:

[[File:curling_loss_function.png | 300px]]
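A sketch of such a multi-task loss, written for plain Python lists so that it is self-contained. The function names and the small epsilon for numerical safety are illustrative; any weighting between the two terms is omitted because the summary does not state one.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def policy_value_loss(policy_target, policy_pred, value_target, value_pred):
    """Sum of the policy-head and value-head cross-entropy losses."""
    return (cross_entropy(policy_target, policy_pred)
            + cross_entropy(value_target, value_pred))


# Toy example with two actions and two score classes.
loss = policy_value_loss([0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.5, 0.5])
```

In supervised training the policy target would be a one-hot vector for the shot actually taken, as described above.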

== Self-Play Reinforcement Learning ==

After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.

At a game state ''s<sub>t</sub>'':

1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is an estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.

2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.

It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.

The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.

It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.

== Long-Term Strategy Learning ==

Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.

The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.
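The iterative computation can be sketched as a small dynamic program over the number of ends remaining. This is a simplified illustration: it ignores which team holds the hammer, counts a tie after the final end as half a win, and uses a made-up one-end score distribution, all of which are assumptions rather than details from the paper.

```python
def terminal_value(diff):
    """Outcome when no ends remain: win, loss, or tie counted as 0.5."""
    if diff > 0:
        return 1.0
    if diff < 0:
        return 0.0
    return 0.5


def win_table(score_probs, max_ends, max_diff=16):
    """table[k][d] = win probability with k ends left and score difference d.

    score_probs maps a one-end score (from the current team's view) to its
    probability. Differences are clamped to [-max_diff, max_diff].
    """
    def clamp(d):
        return max(-max_diff, min(max_diff, d))

    diffs = range(-max_diff, max_diff + 1)
    table = [{d: terminal_value(d) for d in diffs}]
    for _ in range(max_ends):
        prev = table[-1]
        table.append({
            d: sum(p * prev[clamp(d + s)] for s, p in score_probs.items())
            for d in diffs
        })
    return table


# Hypothetical distribution: each end is won or lost by one point, 50/50.
table = win_table({-1: 0.5, 1: 0.5}, max_ends=2)
```

A lookup such as `table[1][1]` then answers "how likely is a win when one point ahead with one end to play", which is the kind of quantity a multi-end strategy can optimize instead of the single-end score.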

== Final Algorithms ==

The authors make use of the following versions of their algorithm:

=== KR-DL ===

''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.

=== KR-DRL ===

''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (i.e., initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUs and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.

=== KR-DRL-MES ===

''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.

= Testing and Results =
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [9]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.

== Comparison of KR-DL-UCT and DL-UCT ==

The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.

As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm frozen after supervised training: 1000 games with the algorithm taking the first shot, and 1000 with it taking the second. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.

[[File:curling_KR_test.png | 400px]]

== Matches ==

Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'' also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform this. The figure below shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors.

[[File:curling_ratings.png | 400px]]
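For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo expected-score formula, which relates a rating gap to an expected win rate. It is the textbook formula, not necessarily the exact procedure the authors used to fit their ratings, and the example ratings are made up.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings: a 400-point gap corresponds to roughly a 91% win rate.
p_stronger = elo_expected(1900, 1500)
```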

= Critique =

== Strengths ==

This algorithm out-performs other high-performance algorithms (including past competition champions).

I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.

The authors do seem to be able to adapt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.

== Weaknesses ==

I found this paper difficult to follow at times. One problem was that the algorithms were introduced first, and how they were used was described only afterwards. So when the paper stated that self-play shots were taken after 400 simulations, it was unclear what simulations were being run, and at what stage of the algorithm (e.g., MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.

While I think the comparison of different algorithms was done well, I believe it still lacked some detail. There were one-off mentions in the paper which would have been nice to see as results. These include the statement that having a single policy-value network in place of two networks led to better performance.

At this point, the algorithms used still rely on initialization by a pre-made program.

There was little theoretical development or justification done in this paper.

While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key points of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges but offered more high-fidelity simulations/training.

While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that, to the best of my knowledge, throwing a stone with zero spin is not common.

=References=
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)
# https://www.baeldung.com/java-monte-carlo-tree-search
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
# https://int8.io/monte-carlo-tree-search-beginners-guide/
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, pp. 484–489, 2016.
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, pp. 354–359, 2017.
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.
# Ohto, K. and Tanaka, T. A curling agent based on the Monte-Carlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.

Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples

2018-11-20T20:29:11Z

H454chen: /* Critique */

= Introduction =
Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small, humanly imperceptible changes are made to images (that are originally correctly classified), which cause these models to misclassify them with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios.

The seriousness of this threat has generated major interest in both designing these attacks and defending against them. Recently, many new defenses have been proposed that claim robustness against iterative white-box adversarial attacks. This result is somewhat surprising, given that iterative white-box attacks are among the most difficult classes of adversarial attacks to defend against. In this paper, the authors identify a common flaw, masked gradients, in many of these defenses that causes them to ''appear'' to achieve high accuracy on adversarial images. This flaw is so prevalent that 7 of the 9 defenses proposed at ICLR 2018 were found to rely on it. The authors develop three attacks that specifically target masked gradients and show that the actual accuracy of these defenses is much lower than claimed. In fact, the majority of these defenses were found to be ineffective against true iterative white-box attacks.

= Methodology =

The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.

== Background ==

==== Adversarial Images Mathematically ====
Given an image <math>x</math> and a classifier <math>f(x)</math>, an adversarial image <math>x'</math> satisfies two properties:
# <math>D(x,x') < \epsilon </math>
# <math>c(x') \neq c^*(x) </math>

Where <math>D</math> is some distance metric, <math>\epsilon </math> is a small constant, <math>c(x')</math> is the output ''class'' predicted by the model, and <math>c^*(x)</math> is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.

==== Adversarial Attacks Terminology ====
#Adversarial attacks can be either '''black-box''' or '''white-box'''. In black-box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back-propagation to modify inputs (as opposed to the weights) with respect to the loss function.
#In '''untargeted''' attacks, the objective is to ''maximize'' the loss of the true class, <math>x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x))))</math>. In '''targeted''' attacks, the objective is to ''minimize'' the loss for a target class <math>c^t(x)</math> that is different from the true class, <math>x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x))))</math>. Here, <math>\nabla_xL()</math> is the gradient of the loss function with respect to the input, <math>\lambda</math> is a small step size and <math>sign()</math> is the element-wise sign of the gradient.
# An attacker may be allowed to use a single step of back-propagation ('''single step''') or multiple ('''iterative''') steps. Iterative attackers can generate more powerful adversarial images. Typically, a distance measure is used to bound the total perturbation an iterative attacker can apply.

In this paper, the authors focus on the more difficult attacks: white-box iterative targeted and untargeted attacks.
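The untargeted and targeted update rules above can be illustrated on a toy model. The following is a minimal sketch, assuming a two-feature logistic-regression classifier with cross-entropy loss; the model, the weights and the step size are hypothetical choices for illustration and do not come from the paper.

```python
import math

def sign(v):
    # Element-wise sign: returns -1, 0 or +1.
    return (v > 0) - (v < 0)

def predict_prob(w, b, x):
    # Probability of class 1 under a toy logistic-regression model (assumption).
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad_wrt_input(w, b, x, label):
    # Gradient of the cross-entropy loss with respect to the INPUT x
    # (not the weights): dL/dx = (p - label) * w.
    p = predict_prob(w, b, x)
    return [(p - label) * wi for wi in w]

def untargeted_step(w, b, x, true_label, lam):
    # x' = x + lam * sign(grad_x L(x, true class)): increase the true-class loss.
    g = loss_grad_wrt_input(w, b, x, true_label)
    return [xi + lam * sign(gi) for xi, gi in zip(x, g)]

def targeted_step(w, b, x, target_label, lam):
    # x' = x - lam * sign(grad_x L(x, target class)): decrease the target-class loss.
    g = loss_grad_wrt_input(w, b, x, target_label)
    return [xi - lam * sign(gi) for xi, gi in zip(x, g)]

# Hypothetical weights and input; the clean input is classified as class 1.
w, b = [2.0, -1.0], 0.0
x = [0.3, 0.1]
x_untargeted = untargeted_step(w, b, x, true_label=1, lam=0.2)
x_targeted = targeted_step(w, b, x, target_label=0, lam=0.2)
```

In this two-class toy case, pushing away from the true class and pulling toward the only other class give the same perturbation; with more classes the two attacks generally differ.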

== Obfuscated Gradients ==
If gradients are masked, they cannot be followed to generate adversarial images. Gradient masking is known to be an incomplete defense against adversarial images [Papernot et al., 2017; Tramer et al., 2018]. A defense method may appear to provide robustness when, in reality, the gradients in the network simply cannot be followed to generate strong adversarial images. Adversarial images generated from these networks are much weaker and, when used to evaluate model robustness, give a false sense of security against adversarial attacks. Obfuscated gradients are a case of gradient masking that arises from the way a defense is constructed. Among the defenses proposed at ICLR 2018, there are three ways in which defenses obfuscate gradients:

# '''Shattered gradients''': Non-differentiable operations are introduced into the model, causing the gradient to be nonexistent or incorrect. They can also be introduced by operations where following the gradient does not maximize the classification loss globally.
# '''Stochastic gradients''': A stochastic process is added into the model at test time, causing the gradients to become randomized. Introduced by either randomly transforming inputs before feeding them to the classifier, or randomly permuting the network itself.
# '''Vanishing/exploding gradients''': Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, effective gradients at the input are small and not very useful. Introduced by using multiple iterations of neural network evaluation, where the output of one network is fed as the input to the next.

'''Detecting Obfuscated Gradients''':

The authors propose a number of tests that might help detect when a defense relies on obfuscated gradients:

# Iterative attacks should work better than single-step attacks, since iterative attacks are strictly stronger than single-step attacks.
# White-box attacks should perform better than black-box attacks, since the black-box threat model is a strict subset of the white-box threat model.
# Attacks with unbounded distortion should find adversarial examples with 100% success.
# Optimization-based attacks should perform better than brute-force sampling of nearby inputs (sampling within an ϵ-ball).

These tests may not cover all cases of obfuscated gradients, but they indicate when intuitive properties start to break down. All defenses with obfuscated gradients discussed by the authors fail at least one test.

== The Attacks ==
To circumvent these gradient masking techniques, the authors propose:
# '''Backward Pass Differentiable Approximation (BPDA)''': For defenses that introduce non-differentiable components, the authors replace each such component with a differentiable approximation on the backward pass. In a white-box setting, the attacker has full access to any added transformation and can construct its approximation.
# '''Expectation over Transformation (EOT) [Athalye, 2017]''': For defenses that add some form of test-time randomness, the authors propose to use the expectation over transformation technique in the backward pass. Rather than moving along a single sampled gradient at every step, several gradients are sampled and the step is taken in the average direction. This can help with any stochastic misdirection from individual gradients. The technique is similar to using mini-batch gradient descent but applied in the construction of adversarial images.
# '''Re-parameterize the exploration space''': For defenses that rely on vanishing or exploding gradients, the authors propose to re-parameterize and search over the range where the gradient does not explode/vanish.
They assume that given a classifier <math display = "inline">f(g(x))</math>, <math display = "inline">g(\cdot)</math> performs some optimization loop to transform the input x to a new input <math display = "inline">\hat x</math>. Often, differentiating through <math display = "inline">g(\cdot)</math> yields exploding or vanishing gradients.

To resolve this, they make a change of variable <math display = "inline">x = h(z)</math> for some function <math display = "inline">h(\cdot)</math> such that <math display = "inline">g(h(z)) = h(z)</math> for all z, but <math display = "inline">h(\cdot)</math> is differentiable. This allows them to compute gradients through f(h(z)) and hence circumvent the defense.
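The averaging idea behind expectation over transformation can be sketched as follows. This is a minimal illustration, assuming a hypothetical randomized defense whose per-sample input gradient equals a fixed true gradient plus zero-mean Gaussian noise; the noise model, the numbers and the function names are assumptions for illustration, not the paper's setup.

```python
import random

def noisy_gradient(true_grad, sigma, rng):
    # One gradient sample from the hypothetical randomized defense:
    # the true gradient corrupted by zero-mean Gaussian noise.
    return [g + rng.gauss(0.0, sigma) for g in true_grad]

def eot_gradient(true_grad, sigma, n_samples, rng):
    # Average several sampled gradients; the noise averages out and the
    # estimate approaches the expected gradient over the randomness.
    total = [0.0] * len(true_grad)
    for _ in range(n_samples):
        sample = noisy_gradient(true_grad, sigma, rng)
        total = [t + s for t, s in zip(total, sample)]
    return [t / n_samples for t in total]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
true_grad = [1.0, -0.5]         # hypothetical expected gradient
estimate = eot_gradient(true_grad, sigma=3.0, n_samples=2000, rng=rng)
```

A single sample with noise this large frequently points in the wrong direction, while the averaged estimate stays close to the expected gradient, which is why the attacker steps along the average.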

= Main Results =
[[File:Summary_Table.png|600px|center]]

The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defense targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics (<math>\ell_{\infty}</math> and <math>\ell_{2}</math>) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For <math>\ell_{\infty}</math> adversarial images, each pixel is allowed to vary by a maximum amount. For example, an <math>\ell_{\infty}</math> bound of <math>\epsilon=0.031</math> specifies that each pixel, on a 0 to 255 scale, can vary by about <math>255 \times 0.031 \approx 8</math> from its original value. <math>\ell_{2}</math> distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for ImageNet, 1000 test images were randomly selected and used to generate targeted adversarial images.
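As a small illustration of the two distance metrics, the sketch below computes both for a three-pixel "image" with values in [0, 1]. The pixel values and helper names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def linf_distance(x, x_adv):
    # Largest change made to any single pixel.
    return max(abs(a - b) for a, b in zip(x, x_adv))

def l2_distance(x, x_adv):
    # Euclidean magnitude of the total distortion over all pixels.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_adv)))

def within_linf_ball(x, x_adv, eps):
    # True if every pixel moved by at most eps.
    return linf_distance(x, x_adv) <= eps

x = [0.50, 0.50, 0.50]          # hypothetical clean pixels in [0, 1]
x_adv = [0.53, 0.46, 0.50]      # hypothetical perturbed pixels

# On a 0-255 scale, eps = 0.031 corresponds to roughly 8 intensity levels.
levels = round(0.031 * 255)
```

Here the perturbation changes one pixel by 0.04, so it falls outside an <math>\ell_{\infty}</math> ball of radius 0.031 even though its <math>\ell_{2}</math> size is only 0.05.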

Standard models were used in evaluating the accuracy of defense strategies under the attacks:
# MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)
# CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)
# ImageNet: InceptionV3 (78.0% top-1 accuracy)

The last column shows the accuracy each defense method achieved on the adversarial test set. Except for [Madry, 2018], all defense methods could only achieve an accuracy of <10%. Furthermore, the accuracy of most methods was 0%. The results for [Samangouei, 2018] (double asterisk) show that the attack was not as successful against this defense. The authors claim that this is a result of implementation imperfections and that, theoretically, the defense can be circumvented using their proposed method.

==== The defense that worked - Adversarial Training [Madry, 2018] ====

As a defense mechanism, [Madry, 2018] proposes training the neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], in their formulation the problem is set up in a more systematic way using a min-max formulation:
\begin{align}
\theta^* = \arg \min_{\theta} \mathbb{E}_{x} \bigg[ \max_{\delta \in [-\epsilon,\epsilon]} L(x+\delta,y;\theta)\bigg]
\end{align}

where <math>\theta</math> denotes the parameters of the model, <math>\theta^*</math> is the optimal set of parameters, <math>y</math> is the true label of the input image <math>x</math>, and <math>\delta</math> is a small perturbation to <math>x</math> that is bounded by <math>[-\epsilon,\epsilon]</math>.

Training proceeds in the following way. For each clean input image, a distorted version of the image is found by approximately solving the inner maximization problem for a fixed number of iterations. After each gradient step, the perturbation is projected back into the allowed range (projected gradient descent). Next, the model parameters are updated by taking a step on the outer minimization problem using the distorted image.
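The training loop described above can be sketched on a toy model. This is a minimal illustration, assuming a logistic-regression classifier, signed-gradient ascent for the inner problem and one plain gradient step for the outer problem; the model, step sizes and function names are assumptions and not the configuration used by [Madry, 2018].

```python
import math

def prob(w, b, x):
    # Probability of class 1 under a toy logistic-regression model (assumption).
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Cross-entropy loss for a binary label y in {0, 1}.
    p = prob(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def sign(v):
    return (v > 0) - (v < 0)

def inner_maximization(w, b, x, y, eps, step, iters):
    # Projected gradient ascent on the loss with respect to the perturbation.
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        p = prob(w, b, x_adv)
        grad_x = [(p - y) * wi for wi in w]
        # Take a signed step, then project back into [-eps, eps].
        delta = [max(-eps, min(eps, di + step * sign(gi)))
                 for di, gi in zip(delta, grad_x)]
    return delta

def outer_minimization_step(w, b, x_adv, y, lr):
    # One gradient-descent step on the parameters using the distorted input.
    p = prob(w, b, x_adv)
    new_w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
    new_b = b - lr * (p - y)
    return new_w, new_b

# Hypothetical parameters and a single training example.
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.1], 1
delta = inner_maximization(w, b, x, y, eps=0.1, step=0.04, iters=5)
x_adv = [xi + di for xi, di in zip(x, delta)]
w_new, b_new = outer_minimization_step(w, b, x_adv, y, lr=0.1)
```

In practice the two steps alternate over mini-batches for many epochs; the sketch shows a single example only to make the roles of the inner and outer problems explicit.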

The attacks proposed in this paper did not reduce the accuracy that this defense claims under its stated <math>\ell_{\infty}</math> threat model.

==== How to check for Obfuscated Gradients ====
For future defense proposals, it is recommended to avoid relying on masked gradients. To assist with this, the authors propose a set of warning signs that can help identify whether a defense is relying on masked gradients:
# Weaker one-step attacks perform better than iterative attacks.
# Black-box attacks find stronger adversarial images than white-box attacks.
# Unbounded iterative attacks do not reach 100% success.
# Random brute-force search is better than gradient-based methods at finding adversarial images.
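The checklist above can be turned into a small helper that flags suspicious evaluation results. This is only a sketch: the dictionary keys, the function name and the example numbers are hypothetical, and each value is assumed to be an attack success rate between 0 and 1 measured by the evaluator.

```python
def obfuscation_warning_signs(rates):
    # rates: attack success rates in [0, 1], keyed by attack type (assumed layout).
    signs = []
    if rates["single_step"] > rates["iterative"]:
        signs.append("one-step attack beats iterative attack")
    if rates["black_box"] > rates["white_box"]:
        signs.append("black-box attack beats white-box attack")
    if rates["unbounded"] < 1.0:
        signs.append("unbounded attack does not reach 100% success")
    if rates["random_search"] > rates["gradient_based"]:
        signs.append("random search beats gradient-based attack")
    return signs

# Hypothetical measurements for a defense under evaluation.
example_rates = {
    "single_step": 0.40,
    "iterative": 0.25,
    "black_box": 0.30,
    "white_box": 0.35,
    "unbounded": 0.90,
    "random_search": 0.10,
    "gradient_based": 0.35,
}
flags = obfuscation_warning_signs(example_rates)
```

An empty result does not prove that gradients are not obfuscated; as noted in the methodology, the tests may not cover all cases.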

= Detailed Results =

As a case study for evaluating the prevalence of obfuscated gradients, the authors studied the ICLR 2018 non-certified defenses that argue robustness in a white-box threat model. Each of these defenses argues a high robustness to adaptive, white box attacks. It is reported that seven of these nine defenses depend on this phenomenon, and the authors demonstrate that their techniques can completely circumvent six of those (and partially circumvent one) that depend on obfuscated gradients.

== Non-obfuscated Gradients ==

==== Cascade Adversarial Training, [Na, 2018] ====
'''Defense''': Similar to the method of [Madry, 2018], the authors of [Na, 2018] propose adversarial training. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a separate model is first trained and used to generate adversarial images. These adversarial images are used to augment the training set of another model.

'''Attack''': The authors found that this technique does not use obfuscated gradients. They were not able to reduce the performance of this method. However, they point out that the claimed accuracy is much lower (15%) compared with [Madry, 2018] under the same perturbation setting.

== Gradient Shattering ==

==== Thermometer Coding, [Buckman, 2018] ====
'''Defense''': Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors. The quantization performed was termed thermometer encoding.

Given an image, for each pixel value <math>x_{i,j,c}</math>, the <math>k</math>th bit of an <math>l</math>-dimensional thermometer code is given by:
\begin{align}
\tau(x_{i,j,c})_k = \begin{cases}
1 & \text{if } x_{i,j,c} > \dfrac{k}{l} \\
0 & \text{otherwise}
\end{cases}
\end{align}
Here it is assumed <math>x_{i,j,c} \in [0, 1] </math> and <math>i, j, c</math> are the row, column and channel index of the pixel respectively. This encoding is like one-hot encoding, except that every bit whose threshold <math>k/l</math> lies below the pixel value (not just one bit) is set to 1. This quantization technique preserves pairwise ordering between pixels.
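The encoding defined above can be written directly for a single pixel. The following is a minimal sketch; the function name and the example values are illustrative.

```python
def thermometer_encode(x, levels):
    # Bit k (for k = 1..levels) is 1 when the pixel value x exceeds k / levels.
    return [1 if x > k / levels else 0 for k in range(1, levels + 1)]

# A pixel value of 0.66 with a 10-level code exceeds the thresholds 0.1 to 0.6.
code = thermometer_encode(0.66, 10)

# Brighter pixels never switch on fewer bits, so pairwise ordering is preserved.
ordered = sum(thermometer_encode(0.2, 10)) <= sum(thermometer_encode(0.8, 10))
```

Because the output only changes when the pixel crosses a threshold, the encoding is piecewise constant in the input, which is what shatters the gradient.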

On CIFAR-10, the model gave 50% accuracy against <math>\ell_\infty</math> adversarial images with <math>\epsilon=0.031</math>.

'''Attack''': The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, <math>\tau(x)</math>, they develop a differentiable counterpart,
\begin{align}
\hat{\tau}(x_{i,j,c})_k = \min \left( \max \left( x_{i,j,c} - \frac{k}{l}, 0 \right), 1 \right)
\end{align}
and use it in place of <math>\tau(x)</math> on the backward pass. With their modifications they were able to bring the accuracy of the model down to 0%.
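To see why the substitution helps, the sketch below compares the true derivative of a score built on the hard thermometer code with the BPDA surrogate derivative. It assumes a hypothetical linear score on top of the code with weights `v`; the score function and the numbers are illustrative and not taken from the paper.

```python
def thermometer(x, levels):
    # Hard encoding used on the forward pass.
    return [1.0 if x > k / levels else 0.0 for k in range(1, levels + 1)]

def soft_thermometer_grad(x, levels):
    # Derivative of min(max(x - k/levels, 0), 1) with respect to x:
    # 1 on the linear segment, 0 where the clamp is active.
    return [1.0 if 0.0 < x - k / levels < 1.0 else 0.0
            for k in range(1, levels + 1)]

def score(x, v, levels):
    # Hypothetical downstream model: a linear score on the hard code.
    return sum(vk * tk for vk, tk in zip(v, thermometer(x, levels)))

def bpda_input_gradient(x, v, levels):
    # Backward pass: chain rule through the score, with the derivative of the
    # hard encoding replaced by the derivative of its differentiable counterpart.
    return sum(vk * dk for vk, dk in zip(v, soft_thermometer_grad(x, levels)))

x, levels = 0.66, 10
v = [1.0] * levels              # hypothetical weights

# Finite-difference derivative of the true forward pass: zero away from thresholds.
h = 1e-4
true_slope = (score(x + h, v, levels) - score(x, v, levels)) / h

# BPDA surrogate: a usable, non-zero direction for the attacker.
surrogate = bpda_input_gradient(x, v, levels)
```

The forward pass still uses the real encoding, so the attack evaluates the defended model exactly; only the gradient signal is approximated.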

==== Input Transformation, [Guo, 2018] ====
'''Defense''': [Guo, 2018] investigated the effect of including different input transformations on the robustness to adversarial images. In particular, they found that two techniques provided the greatest resistance: total variation minimization and image quilting. Total variation minimization is a technique that removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database.

Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. The best model achieved 60% accuracy on adversarial images with <math>\ell_{2}=0.05</math> perturbations. However, both approaches are non-differentiable and contain test-time randomness, as the modifications made are input dependent. As a result, the gradient with respect to the input is both shattered and randomized.

'''Attack''': The authors used the BPDA attack, where the input transformations were replaced by an identity function on the backward pass. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.

==== Local Intrinsic Dimensionality, [Ma, 2018] ====
'''Defense''': Local intrinsic dimensionality (LID) is a distance-based measure computed from the distances between a point and its nearest neighbors in a high-dimensional space. Given a set of points, let the distance between sample <math>x</math> and its <math>i</math>th nearest neighbor be <math>r_i(x)</math>; then the LID under the chosen distance metric is given by

\begin{align}
LID(x) = - \bigg( \frac{1}{k}\sum^k_{i=1}\log \frac{r_i(x)}{r_k(x)} \bigg)^{-1}
\end{align}
where k is the number of nearest neighbors considered and <math>r_k(x)</math> is the distance to the farthest of these k neighbors.
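The estimator above can be computed from a list of neighbor distances. The following is a minimal sketch, assuming the distances from a sample to its neighbors have already been computed; the function name and the example distances are hypothetical.

```python
import math

def lid_estimate(distances, k):
    # Keep the k smallest distances; the largest of them plays the role of r_k.
    # Assumes all distances are positive and not all equal (otherwise the
    # mean log-ratio is zero and the estimate is undefined).
    nearest = sorted(distances)[:k]
    r_k = nearest[-1]
    mean_log_ratio = sum(math.log(r / r_k) for r in nearest) / k
    return -1.0 / mean_log_ratio

# Hypothetical distances from one sample to its three nearest neighbors.
lid = lid_estimate([4.0, 1.0, 2.0], k=3)
```

For these distances the log-ratios are log(1/4), log(1/2) and 0, so the mean is -log 2 and the estimate equals 1 / log 2.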

First, <math>L_2</math> distances were computed for all training and adversarial images. Next, the LID scores for the training and adversarial images were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defense method, but a method to study the properties of adversarial images.

'''Attack''': Instead of attacking this method directly, the authors show that it is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.

== Stochastic Gradients ==

==== Stochastic Activation Pruning, [Dhillon, 2018] ====
'''Defense''': [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Because adversarial perturbations resemble noise, randomly dropping activations can decrease their collective impact on the classifier. Within a layer, the activations of component nodes are randomly dropped with a probability proportional to their absolute value. The remaining activations are scaled up to preserve accuracy. This is akin to test-time dropout. This technique was found to drop accuracy slightly on clean images, but improved performance on adversarial images.

'''Attack''': The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack, they were able to reduce the accuracy of this method down to 0% on CIFAR-10.

==== Mitigation Through Randomization, [Xie, 2018] ====
'''Defense''': [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], first the image is randomly re-scaled to <math>r \times r</math> with <math>r \in [299,331]</math>. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. [Xie, 2018] claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, but many of them, which are picked randomly). Because of the introduced randomness, they claim the model builds some robustness to other types of attacks as well.
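The randomization layer described above can be sketched on a tiny image. This is a rough illustration under stated assumptions: nearest-neighbor resizing, a uniformly random placement of the resized image inside the zero-padded canvas, and toy sizes (4 and 6 in place of 299 and 331) are all choices made here for illustration and are not confirmed details of [Xie, 2018].

```python
import random

def random_resize_and_pad(img, out_size, rng):
    # img: square image as a list of rows. out_size: side of the padded output.
    n = len(img)
    # Pick a random side length between the input size and out_size - 1 (assumption).
    r = rng.randint(n, out_size - 1)
    # Nearest-neighbor resize to r x r (assumed interpolation).
    resized = [[img[(i * n) // r][(j * n) // r] for j in range(r)]
               for i in range(r)]
    # Place the resized image at a random offset inside a zero canvas (assumption).
    top = rng.randint(0, out_size - r)
    left = rng.randint(0, out_size - r)
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(r):
        for j in range(r):
            out[top + i][left + j] = resized[i][j]
    return out

rng = random.Random(0)
image = [[1] * 4 for _ in range(4)]      # hypothetical 4x4 image of ones
padded = random_resize_and_pad(image, out_size=6, rng=rng)
```

Each call produces a different scale and offset, so the gradient computed for one forward pass need not match the transformation applied on the next, which is the source of the stochastic gradients.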

'''Attack''': The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using <math>\ell_{\infty}</math> perturbations with <math>\epsilon=0.031</math>.

== Vanishing and Exploding Gradients ==

==== Pixel Defend, [Song, 2018] ====
'''Defense''': [Song, 2018] argue that adversarial images lie in low-probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high-probability regions before feeding them into a classifier. They chose to do this by using a generative model (PixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [Van den Oord, 2016],

\begin{align}
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})
\end{align}

The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient all the way to the input would not be possible because of the vanishing/exploding gradient problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with <math>\ell_{\infty} (\epsilon=0.031) </math> perturbations.

'''Attack''': The model was attacked using the BPDA technique, where back-propagating through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring down the accuracy to 9% under the same kind of perturbations.

==== Defense-GAN, [Samangouei, 2018] ====

'''Defense''': Before classifying a sample, Defense-GAN projects it onto the data manifold using a GAN. The intuition behind this approach is similar to that of PixelDefend, except that it uses a GAN instead of a PixelCNN.

'''Attack''': The authors evaluated on MNIST because Defense-GAN is not argued to be secure on CIFAR-10. They found that adversarial examples exist on the generator manifold and that they can construct such examples. A perfect projector would not modify such an example; however, the imperfect gradient-descent-based projection does not perfectly preserve points on the manifold. Therefore, the authors attacked Defense-GAN using BPDA, but achieved only a 45% success rate.

= Conclusion =
In this paper, it was found that gradient masking is a common flaw in many defenses claiming robustness against white-box adversarial attacks. This leads to a perceived robustness against adversarial attacks when in reality it only results in weaker adversarial image construction. The authors develop three attacks that can overcome gradient masking. With their attacks, they found that the actual robustness of 7 of the 9 defenses proposed at ICLR 2018 is significantly lower than claimed. In fact, many defenses were found to be completely ineffective.

Future work suggested by this paper includes designing defenses that do not rely on obfuscated gradients for perceived robustness, and using the proposed evaluation approach to detect when a defense does. Early categorization of attacks using supervised techniques could also help in the critical evaluation of incoming data.

= Critique =
# The third attack method, reparameterization of the input distortion search space, was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018], proved to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.
# The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attacks using gradient masking techniques, provided the details of the network are kept secret.
## A notable exception to this case could be applications that are built using open-source (or even published) models that are paired with model-agnostic defense mechanisms. For example, a ResNet-50 using the model-agnostic 'input transformations' technique by [Guo, 2018] may be used in many different image classification tasks, but could still be successfully attacked using BPDA.
# The BPDA algorithm requires replacing a non-differentiable part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.
# In general, the research field of adversarial attack would benefit from having an all-encompassing benchmark or dataset, so that the various approaches can be objectively compared and evaluated.

= Other Sources =
# Their re-implementation of each of the defenses and implementations of the attacks are available [https://github.com/anishathalye/obfuscated-gradients here].

= References =
#'''[Madry, 2018]''' Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
#'''[Buckman, 2018]''' Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.
#'''[Guo, 2018]''' Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.
#'''[Xie, 2018]''' Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.
#'''[Song, 2018]''' Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.
#'''[Szegedy, 2013]''' Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
#'''[Samangouei, 2018]''' Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.
#'''[van den Oord, 2016]''' van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).
#'''[Athalye, 2017]''' Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.
#'''[Ma, 2018]''' Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).
# '''[Na, 2018]''' Na, T., Ko, J.H. and Mukhopadhyay, S., 2017. Cascade Adversarial Machine Learning Regularized with a Unified Embedding. arXiv preprint arXiv:1708.02582.
# '''[Papernot et al., 2017]''' Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4.
# '''[Tramer et al., 2018]''' Tramer, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018.

Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples

2018-11-20T06:13:50Z

H454chen:

= Introduction =
Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small, humanly imperceptible changes are made to images (that are originally correctly classified), which cause these models to misclassify them with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios.

The seriousness of this threat has generated major interest in both designing these attacks and defending against them. Recently, many new defenses have been proposed that claim robustness against iterative white-box adversarial attacks. This result is somewhat surprising, given that iterative white-box attacks are among the most difficult classes of adversarial attacks to defend against. In this paper, the authors identify a common flaw, masked gradients, in many of these defenses that causes them to ''appear'' to achieve high accuracy on adversarial images. This flaw is so prevalent that 7 of the 9 defenses proposed at ICLR 2018 were found to rely on it. The authors develop three attacks that specifically target masked gradients and show that the actual accuracy of these defenses is much lower than claimed. In fact, the majority of these defenses were found to be ineffective against true iterative white-box attacks.

= Methodology =

The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.

== Background ==

==== Adversarial Images Mathematically ====
Given an image <math>x</math> and a classifier <math>f(x)</math>, an adversarial image <math>x'</math> satisfies two properties:
# <math>D(x,x') < \epsilon </math>
# <math>c(x') \neq c^*(x) </math>

Where <math>D</math> is some distance metric, <math>\epsilon </math> is a small constant, <math>c(x')</math> is the output ''class'' predicted by the model, and <math>c^*(x)</math> is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.

==== Adversarial Attacks Terminology ====
#Adversarial attacks can be either '''black''' or '''white-box'''. In black box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back propagation to modify inputs (as opposed to the weights) with respect to the loss function.
#In '''untargeted''' attacks, the objective is to ''maximize'' the loss of the true class, <math>x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x))))</math>, while in '''targeted''' attacks, the objective is to ''minimize'' the loss for a target class <math>c^t(x)</math> that is different from the true class, <math>x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x))))</math>. Here, <math>\nabla_xL()</math> is the gradient of the loss function with respect to the input, <math>\lambda</math> is a small step size and <math>sign()</math> is the element-wise sign of the gradient.
# An attacker may be allowed to use a single step of back-propagation ('''single step''') or multiple ('''iterative''') steps. Iterative attackers can generate more powerful adversarial images. Typically, to bound iterative attackers a distance measure is used.

In this paper, the authors focus on the more difficult attacks: white-box iterative targeted and untargeted attacks.
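
The untargeted and targeted update rules above can be illustrated with a minimal sketch. It assumes a toy logistic-regression classifier (the weights <code>w</code>, bias <code>b</code> and step size <code>lam</code> are made up for illustration), not any model from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(x, label, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(np.dot(w, x) + b)
    loss = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    grad_x = (p - label) * w  # d loss / d x for logistic regression
    return loss, grad_x

def untargeted_step(x, true_label, w, b, lam):
    """Step in the direction that increases the loss of the true class."""
    _, g = loss_and_input_grad(x, true_label, w, b)
    return x + lam * np.sign(g)

def targeted_step(x, target_label, w, b, lam):
    """Step in the direction that decreases the loss of the target class."""
    _, g = loss_and_input_grad(x, target_label, w, b)
    return x - lam * np.sign(g)

# Hypothetical toy model and input (assumptions for illustration only).
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.2, 0.4, 0.6])
x_untargeted = untargeted_step(x, 1, w, b, lam=0.05)
x_targeted = targeted_step(x, 0, w, b, lam=0.05)
```

In both cases every input coordinate moves by exactly the step size, so a single step stays within an <math>\ell_\infty</math> distance of <code>lam</code> from the original input.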

== Obfuscated Gradients ==
If gradients are masked, they cannot be followed to generate adversarial images. Gradient masking is known to be an incomplete defense against adversarial images [Papernot et al., 2017; Tramer et al., 2018]. A defense method may appear to provide robustness when, in reality, the gradients in the network simply cannot be followed to generate strong adversarial images. Adversarial images generated from these networks are much weaker and, when used to evaluate model robustness, give a false sense of security against adversarial attacks. Some defenses are constructed in a way that inevitably leads to gradient masking; the authors refer to this as obfuscated gradients. Among the defenses proposed at ICLR 2018, there are three ways in which defenses obfuscate gradients:

# '''Shattered gradients''': Non-differentiable operations are introduced into the model, causing a gradient to be nonexistent or incorrect. Introduced by using operations where following the gradient doesn't maximize classification loss globally.
# '''Stochastic gradients''': A stochastic process is added into the model at test time, causing the gradients to become randomized. Introduced by either randomly transforming inputs before feeding to the classifier, or randomly permuting the network itself.
# '''Vanishing gradients''': Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, effective gradients at the input are small and not very useful. Introduced by using multiple iterations of neural network evaluation, where the output of one network is fed as the input to the next.

== The Attacks ==
To circumvent these gradient masking techniques, the authors propose:
# '''Backward Pass Differentiable Approximation (BPDA)''': For defenses that introduce non-differentiable components, the authors replace these components with an approximate function that is differentiable on the backward pass. In a white-box setting, the attacker has full access to any added transformation and can find its approximation.
# '''Expectation over Transformation [Athalye, 2017]''': For defenses that add some form of test time randomness, the authors propose to use expectation over transformation technique in the backward pass. Rather than moving along the gradient every step, several gradients are sampled and the step is taken in the average direction. This can help with any stochastic misdirection from individual gradients. The technique is similar to using mini-batch gradient descent but applied in the construction of adversarial images.
# '''Re-parameterize the exploration space''': For very deep networks that rely on vanishing or exploding gradients, the authors propose to re-parameterize and search over the range where the gradient does not explode/vanish.
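
The expectation over transformation idea from the list above can be sketched as follows. This is only an illustration under stated assumptions: the "defense" here is a hypothetical random 0/1 mask applied to the input of a toy logistic classifier, and the names <code>randomized_grad</code> and <code>eot_grad</code> are not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def randomized_grad(x, label, w, rng):
    """Loss gradient w.r.t. x for one random draw of a toy input transformation.

    The hypothetical defense multiplies each input feature by a random 0/1 mask
    before a logistic classifier, so a single gradient is noisy.
    """
    mask = rng.integers(0, 2, size=x.shape).astype(float)
    p = sigmoid(np.dot(w, mask * x))
    return (p - label) * w * mask  # chain rule through the sampled mask

def eot_grad(x, label, w, rng, n_samples=200):
    """Expectation over Transformation: average the gradient over many draws."""
    grads = [randomized_grad(x, label, w, rng) for _ in range(n_samples)]
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
w = np.array([1.5, -2.0, 0.5, 1.0])   # assumed toy weights
x = np.array([0.3, 0.1, 0.8, 0.5])    # assumed toy input
g = eot_grad(x, 1, w, rng)
x_adv = x + 0.05 * np.sign(g)  # one untargeted step along the averaged gradient
```

A single draw can zero out the gradient for any coordinate, whereas the averaged gradient recovers a consistent direction, which is the point of sampling several gradients per step.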

= Main Results =
[[File:Summary_Table.png|600px|center]]

The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defense targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics (<math>\ell_{\infty}</math> and <math>\ell_{2}</math>) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For <math>\ell_{\infty}</math> adversarial images, each pixel is allowed to vary by a maximum amount. For example, <math>\ell_{\infty}=0.031</math> specifies that each pixel can vary by about <math>256 \times 0.031 \approx 8</math> intensity levels from its original value. <math>\ell_{2}</math> distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for ImageNet, 1000 test images were randomly selected and used to generate targeted adversarial images.
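
As a rough illustration of the two distance budgets, the sketch below projects an arbitrary perturbed image back into an <math>\ell_\infty</math> or <math>\ell_2</math> ball around the original. The helper names and the random test image are assumptions made for this example.

```python
import numpy as np

def project_linf(x_adv, x, eps):
    """Clip x_adv so that every pixel differs from x by at most eps."""
    return np.clip(x_adv, x - eps, x + eps)

def project_l2(x_adv, x, eps):
    """Rescale the perturbation so that its Euclidean norm is at most eps."""
    delta = x_adv - x
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)
    return x + delta

eps = 0.031
levels = eps * 256  # roughly 8 intensity levels on a 256-level scale

rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))                         # stand-in image with values in [0, 1]
x_adv = x + rng.normal(scale=0.1, size=x.shape)   # arbitrary perturbation
x_linf = project_linf(x_adv, x, eps)
x_l2 = project_l2(x_adv, x, 0.05)
```

The <math>\ell_\infty</math> projection bounds each pixel separately, while the <math>\ell_2</math> projection bounds the total distortion and lets individual pixels change by more.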

Standard models were used in evaluating the accuracy of defense strategies under the attacks:
# MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)
# CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)
# Imagenet: InceptionV3 (78.0% top-1 accuracy)

The last column shows the accuracies each defense method achieved over the adversarial test set. Except for [Madry, 2018], all defense methods could only achieve an accuracy of <10%. Furthermore, the accuracy of most methods was 0%. The results for [Samangouei, 2018] (double asterisk) show that the attack on this defense was not as successful. The authors claim that this is a result of implementation imperfections but that, theoretically, the defense can be circumvented using their proposed method.

==== The defense that worked - Adversarial Training [Madry, 2018] ====

As a defense mechanism, [Madry, 2018] proposes training neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], in their formulation the problem is set up in a more systematic way using a min-max formulation:
\begin{align}
\theta^* = \arg \underset{\theta} \min \mathop{\mathbb{E_x}} \bigg{[} \underset{\delta \in [-\epsilon,\epsilon]}\max L(x+\delta,y;\theta)\bigg{]}
\end{align}

where <math>\theta</math> is the parameter of the model, <math>\theta^*</math> is the optimal set of parameters and <math>\delta</math> is a small perturbation to the input image <math>x</math> and is bounded by <math>[-\epsilon,\epsilon]</math>.

Training proceeds in the following way. For each clean input image, a distorted version of the image is found by approximately solving the inner maximization problem for a fixed number of iterations. After each gradient step, the perturbation is constrained to fall within the allowed range (projected gradient descent). Next, the model parameters are updated on the distorted images by taking a step on the outer minimization problem.
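
The training loop described above can be sketched on a toy problem. The example assumes a logistic-regression model without bias and synthetic two-cluster data; the hyperparameters (<code>eps</code>, <code>lr</code>, number of steps) are arbitrary choices for illustration and not the settings of [Madry, 2018].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_perturb(X, y, w, eps, step, n_steps):
    """Approximate the inner maximization with projected gradient ascent (l_inf ball)."""
    delta = np.zeros_like(X)
    for _ in range(n_steps):
        p = sigmoid((X + delta) @ w)
        grad_x = (p - y)[:, None] * w[None, :]   # d loss / d input, per example
        delta = np.clip(delta + step * np.sign(grad_x), -eps, eps)
    return X + delta

def adversarial_train(X, y, eps=0.1, lr=0.5, epochs=200):
    """Outer minimization: gradient descent on the loss of the perturbed inputs."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        X_adv = pgd_perturb(X, y, w, eps, step=eps / 4, n_steps=8)
        p = sigmoid(X_adv @ w)
        w -= lr * (X_adv.T @ (p - y)) / len(y)
    return w

# Synthetic, well-separated two-class data (an assumption for this sketch).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.3, (50, 2)), rng.normal(1.0, 0.3, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w = adversarial_train(X, y)
X_adv = pgd_perturb(X, y, w, eps=0.1, step=0.025, n_steps=8)
robust_acc = np.mean((sigmoid(X_adv @ w) > 0.5) == y)
```

Here <code>robust_acc</code> is the accuracy on inputs perturbed by the same inner attack used during training, which is how robustness would be measured under this threat model.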

In the authors' evaluation, this approach remained resilient to the attacks they considered.

==== How to check for Obfuscated Gradients ====
For future defense proposals, it is recommended to avoid relying on masked gradients. To assist with this, the authors list a set of warning signs that can help identify whether a defense relies on masked gradients:
# Weaker one-step attacks perform better than iterative attacks.
# Black-box attacks find stronger adversarial images than white-box attacks.
# Unbounded iterative attacks do not reach 100% success.
# Random brute-force search finds adversarial images more effectively than gradient-based methods.

= Detailed Results =

As a case study for evaluating the prevalence of obfuscated gradients, the authors studied the ICLR 2018 non-certified defenses that argue robustness in a white-box threat model. Each of these defenses argues a high robustness to adaptive, white box attacks. It is reported that seven of these nine defenses depend on this phenomenon, and the authors demonstrate that their techniques can completely circumvent six of those (and partially circumvent one) that depend on obfuscated gradients.

== Non-obfuscated Gradients ==

==== Cascade Adversarial Training, [Na, 2018] ====
'''Defense''': Similar to the method of [Madry, 2018], the authors of [Na, 2018] propose adversarial training. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a separate model is first trained and used to generate adversarial images. These adversarial images are used to augment the train set of another model.

'''Attack''': The authors found that this technique does not use obfuscated gradients. They were not able to reduce the performance of this method. However, they point out that the claimed accuracy is much lower (15%) compared with [Madry, 2018] under the same perturbation setting.

== Gradient Shattering ==

==== Thermometer Coding, [Buckman, 2018] ====
'''Defense''': Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors. The quantization performed was termed thermometer encoding.

Given an image, for each pixel value <math>x_{i,j,c}</math> and an <math>l</math>-dimensional thermometer code, the <math>k</math>-th bit is given by:
\begin{align}
\tau(x_{i,j,c})_k = \begin{cases}
1 & \text{if } x_{i,j,c} > \dfrac{k}{l} \\
0 & \text{otherwise}
\end{cases}
\end{align}
Here it is assumed <math>x_{i,j,c} \in [0, 1] </math> and <math>i, j, c</math> are the row, column and channel index of the pixel respectively. This encoding is like one-hot encoding, except that every bit whose threshold <math>k/l</math> lies below the pixel value (not just one bit) is set to 1. This quantization technique preserves pairwise ordering between pixels.
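
A minimal sketch of this encoding follows. It assumes the bit index <math>k</math> runs from 1 to <math>l</math>, which the text above does not state explicitly; the function name is illustrative.

```python
import numpy as np

def thermometer_encode(x, levels):
    """Return bits tau(x)_k = 1 if x > k / levels, for k = 1..levels (assumed range).

    x holds pixel values in [0, 1]; the output gains a trailing axis of length `levels`.
    """
    thresholds = np.arange(1, levels + 1) / levels
    return (x[..., None] > thresholds).astype(np.float32)

pixels = np.array([0.05, 0.33, 0.72])
codes = thermometer_encode(pixels, levels=10)
```

A brighter pixel sets at least as many bits as a darker one, which is why the ordering between pixel values is preserved.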

On CIFAR-10, the model gave 50% accuracy against <math>\ell_\infty</math> adversarial images with <math>\epsilon=0.031</math> attacks.

'''Attack''': The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, <math>\tau(x)</math>, they develop a differentiable counterpart,
\begin{align}
\hat{\tau}(x_{i,j,c})_k = \min \left( \max \left( x_{i,j,c} - \frac{k}{l}, 0 \right), 1 \right)
\end{align}
and use it in place of <math>\tau(x)</math> on the backward pass. With their modifications they were able to bring the accuracy of the model down to 0%.
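
The BPDA substitution for thermometer encoding might look roughly like the sketch below: the forward pass keeps the hard thresholding, and the backward pass uses the slope of the clipped linear surrogate <math>\min(\max(x - k/l, 0), 1)</math>. The <code>upstream</code> array is a stand-in for the gradient of the loss with respect to the code bits, and <math>k = 1, \ldots, l</math> is an assumption.

```python
import numpy as np

def thermometer_forward(x, levels):
    """Non-differentiable forward pass: tau(x)_k = 1 if x > k / levels."""
    thresholds = np.arange(1, levels + 1) / levels
    return (x[..., None] > thresholds).astype(float)

def thermometer_backward_approx(x, upstream, levels):
    """BPDA backward pass using tau_hat(x)_k = min(max(x - k / levels, 0), 1).

    tau_hat has slope 1 where 0 < x - k / levels < 1 and slope 0 elsewhere, so the
    gradient w.r.t. x sums the upstream gradient over the bits in that range.
    """
    thresholds = np.arange(1, levels + 1) / levels
    shifted = x[..., None] - thresholds
    slope = ((shifted > 0) & (shifted < 1)).astype(float)
    return np.sum(upstream * slope, axis=-1)

x = np.array([0.05, 0.33, 0.72])
codes = thermometer_forward(x, levels=10)      # what the classifier sees
upstream = np.ones_like(codes)                 # stand-in for d loss / d codes
grad_x = thermometer_backward_approx(x, upstream, levels=10)
```

The true derivative of the hard threshold is zero almost everywhere, so without the surrogate an attacker would receive no useful signal at the input.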

==== Input Transformation, [Guo, 2018] ====
'''Defense''': [Guo, 2018] investigated the effect of different input transformations on robustness to adversarial images. In particular, they found that two techniques provided the greatest resistance: total variance minimization and image quilting. Total variance minimization is a technique that removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database.

Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. The best model achieved 60% accuracy on adversarial images with <math>l_{2}=0.05</math> perturbations. However, both approaches are non-differentiable and contain test-time randomness, as the modifications made are input dependent. Gradient flow to the input is therefore blocked by non-differentiable, randomized operations.

'''Attack''': The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.

==== Local Intrinsic Dimensionality, [Ma, 2018] ====
'''Defense''': Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high dimensional space. Given a set of points, let the distance between sample <math>x</math> and its <math>i</math>-th nearest neighbor be <math>r_i(x)</math>; then the LID under the chosen distance metric is given by,

\begin{align}
LID(x) = - \bigg{(} \frac{1}{k}\sum^k_{i=1}\log \frac{r_i(x)}{r_k(x)} \bigg{)}^{-1}
\end{align}
where <math>k</math> is the number of nearest neighbors considered and <math>r_k(x)</math> is the largest distance from <math>x</math> to any of its <math>k</math> nearest neighbors.
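
The LID estimate can be computed directly from sorted neighbor distances. The sketch below assumes Euclidean distance and a query point that is not itself part of the reference set (otherwise a zero distance would make the logarithm undefined); the data are synthetic.

```python
import numpy as np

def lid_estimate(x, reference, k):
    """LID(x) = -( (1/k) * sum_i log(r_i(x) / r_k(x)) )^(-1) over the k nearest neighbors."""
    dists = np.linalg.norm(reference - x, axis=1)
    r = np.sort(dists)[:k]           # r_1 <= ... <= r_k
    return -1.0 / np.mean(np.log(r / r[-1]))

rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 5))    # synthetic points filling a 5-dimensional space
query = np.zeros(5)
lid = lid_estimate(query, reference, k=50)
```

For points that locally fill a 5-dimensional space, the estimate is expected to land in the neighborhood of 5, although it fluctuates with the sample and the choice of <math>k</math>.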

First, <math>L_2</math> distances were computed for all training and adversarial images. Next, the LID scores for the training and adversarial images were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defense method, but a method to study the properties of adversarial images.

'''Attack''': Instead of attacking this method, the authors show that this method is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.

== Stochastic Gradients ==

==== Stochastic Activation Pruning, [Dhillon, 2018] ====
'''Defense''': [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Because adversarial perturbations behave like noise, randomly dropping activations can decrease their collective impact on the classifier. Within a layer, the activations of component nodes are randomly dropped with a probability proportional to their absolute value. The remaining activations are scaled up to preserve accuracy. This is akin to test-time dropout. This technique was found to drop accuracy slightly on clean images, but improved performance on adversarial images.

'''Attack''': The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack, they were able to reduce the accuracy of this method down to 0% on CIFAR-10.

==== Mitigation Through Randomization, [Xie, 2018] ====
'''Defense''': [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], first the image is randomly re-scaled to size <math>r \in [299,331]</math>. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. The authors claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, of which many are picked randomly). Because of the introduced randomness, the authors claim the model builds some robustness to other types of attacks as well.
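
The randomization layer can be sketched as below. Several details are assumptions made for illustration: the padded output size is taken to be 331, the padding offset is drawn at random, and a nearest-neighbor index lookup stands in for a proper image resize.

```python
import numpy as np

def random_resize_and_pad(image, rng, out_size=331):
    """Randomly rescale a square H x W x C image, then zero-pad it to out_size at a random offset."""
    in_size = image.shape[0]
    new_size = int(rng.integers(in_size, out_size + 1))   # r drawn from [in_size, out_size]
    # Nearest-neighbor resize via index lookup (a stand-in for a proper image resize).
    idx = np.arange(new_size) * in_size // new_size
    resized = image[idx][:, idx]
    padded = np.zeros((out_size, out_size, image.shape[2]), dtype=image.dtype)
    top = int(rng.integers(0, out_size - new_size + 1))
    left = int(rng.integers(0, out_size - new_size + 1))
    padded[top:top + new_size, left:left + new_size] = resized
    return padded

rng = np.random.default_rng(0)
image = rng.random((299, 299, 3))      # stand-in input image
out = random_resize_and_pad(image, rng)
```

Because the scale and offset change on every call, two forward passes on the same image generally see different inputs, which is what makes single gradients unreliable for an attacker.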

'''Attack''': The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using <math>L_{\infty}(\epsilon=0.031)</math> perturbations.

== Vanishing and Exploding Gradients ==

==== PixelDefend, [Song, 2018] ====
'''Defense''': [Song, 2018] argue that adversarial images lie in low probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high probability regions before feeding them into a classifier. They chose to do this by using a generative model (PixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [Van den Oord, 2016],

\begin{align}
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})
\end{align}

The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient all the way to the input would not be possible because of the vanishing/exploding gradient problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with <math>l_{\infty} (\epsilon=0.031) </math> perturbations.

'''Attack''': The model was attacked using the BPDA technique where back-propagating through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring down the accuracy to 9% under the same kind of perturbations.

==== Defense-GAN, [Samangouei, 2018] ====

Before classifying the samples, Defense-GAN projects them onto the data manifold using a GAN. The intuition behind this approach is similar to that of PixelDefend, except that it uses a GAN instead of a PixelCNN.

The authors evaluated on MNIST because the defense is not argued to be secure on CIFAR-10. They found that adversarial examples exist on the generator manifold and that they can construct such an example. A perfect projector would not modify this example; however, the imperfect gradient descent approach used for projection does not perfectly preserve points on the manifold. Therefore, the authors attacked Defense-GAN using BPDA, but could only achieve a 45% success rate.

= Conclusion =
In this paper, it was found that gradient masking is a common flaw in many defenses claiming robustness against white-box adversarial attacks. This leads to a perceived robustness against adversarial attacks when in reality it results in weaker adversarial image construction. The authors develop three attacks that can overcome gradient masking. With their attacks, they found that the actual robustness of 7 out of the 9 defenses proposed at ICLR 2018 is significantly lower than claimed. In fact, many defenses were found to be completely ineffective.

Future work that can come out of this paper includes avoiding reliance on obfuscated gradients for perceived robustness and using the evaluation approach to detect when the attack occurs. Early categorization of attacks using some supervised techniques can also help in critical evaluations of incoming data.

= Critique =
# The third attack method, reparameterization of the input distortion search space, was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018], proved to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.
# The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attack using gradient masking techniques, provided the details of the network are kept secret.
## A notable exception to this case could be applications that are built using open-source (or even published) models that are paired with model-agnostic defense mechanisms. For example, a ResNet-50 using the model-agnostic 'input transformations' technique by [Guo, 2018] may be used in many different image classification tasks, but could still be successfully attacked using BPDA.
# The BPDA algorithm requires replacing a non-linear part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.

= Other Sources =
# Their re-implementation of each of the defenses and implementations of the attacks are available [https://github.com/anishathalye/obfuscated-gradients here].

= References =
#'''[Madry, 2018]''' Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
#'''[Buckman, 2018]''' Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.
#'''[Guo, 2018]''' Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.
#'''[Xie, 2018]''' Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.
#'''[Song, 2018]''' Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.
#'''[Szegedy, 2013]''' Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
#'''[Samangouei, 2018]''' Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.
#'''[van den Oord, 2016]''' van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).
#'''[Athalye, 2017]''' Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.
#'''[Ma, 2018]''' Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).
# '''[Na, 2018]''' Na, T., Ko, J.H. and Mukhopadhyay, S., 2017. Cascade Adversarial Machine Learning Regularized with a Unified Embedding. arXiv preprint arXiv:1708.02582.
# '''[Papernot et al., 2017]''' Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4.
# '''[Tramer et al., 2018]''' Tramer, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018.

Countering Adversarial Images Using Input Transformations

2018-11-20T06:05:51Z

H454chen:

The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]

==Motivation ==
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations of the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al.) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.

[[File:Panda.png|center]]

==Introduction==
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier.
Generally, defenses against adversarial examples fall into two main categories:

# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm.
# Model-Agnostic – They try to remove adversarial perturbations from the input.

Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate Kerckhoffs's principle, which holds that a system should remain secure even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, they investigated the following image transformations as a means for protecting against adversarial images:

# Image Cropping and Re-scaling (Graese et al., 2016)
# Bit Depth Reduction (Xu et al., 2017)
# JPEG Compression (Dziugaite et al., 2016)
# Total Variance Minimization (Rudin et al., 1992)
# Image Quilting (Efros & Freeman, 2001)

These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack.

From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random which makes it difficult for an adversary to get around them.

==Previous Work==
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images. This is done by randomly nullifying features within images. Tramer et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training process not only with adversarial images constructed from their own model but also with adversarial images generated from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPEG compression on adversarial images.

==Terminology==

'''Gray Box Attack''' : Model Architecture and parameters are Public

'''Black Box Attack''': Adversary does not have access to the model.

'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.

'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified as a ''target'' class by the network.

'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example h(x') equal to the prediction on the corresponding clean example h(x).

== Problem Definition ==
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math>, such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. In the best case scenario, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but usually the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is used.

From a set of N clean images <math>[x_{1}, \ldots, x_{N}]</math>, an adversarial attack aims to generate images <math>[x'_{1}, \ldots, x'_{N}]</math>, such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.

The success rate of an attack is given as:

[[File:Attack.PNG|200px |]],

which is the proportion of predictions that were altered by an attack.

The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math>

A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, remains low.
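
Both quantities are simple to compute. The sketch below assumes images stacked in an array of shape (N, H, W, C) and predictions given as integer class labels; the function names are illustrative.

```python
import numpy as np

def normalized_l2_dissimilarity(clean, adversarial):
    """Mean over images of ||x_n - x'_n||_2 / ||x_n||_2; inputs have shape (N, H, W, C)."""
    n = clean.shape[0]
    diff = np.linalg.norm((clean - adversarial).reshape(n, -1), axis=1)
    base = np.linalg.norm(clean.reshape(n, -1), axis=1)
    return float(np.mean(diff / base))

def attack_success_rate(pred_clean, pred_adv):
    """Proportion of predictions that were altered by the attack."""
    return float(np.mean(np.asarray(pred_clean) != np.asarray(pred_adv)))

clean = np.ones((2, 2, 2, 1))
adversarial = clean * 1.1                       # every pixel scaled by 10%
dissimilarity = normalized_l2_dissimilarity(clean, adversarial)
success = attack_success_rate([0, 1, 2, 3], [0, 2, 2, 1])
```

Plotting the success rate against this dissimilarity gives the kind of attack-strength curve that the paper uses to compare defenses.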

==Adversarial Attacks==

For experimental purposes, the following four attacks are studied in the paper:

1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let <math>x</math> be a source input with true label <math>y</math>, and let <math>l(\cdot,\cdot)</math> be the differentiable loss function used to train the classifier <math>h(\cdot)</math>. Then the corresponding adversarial example is given by:

<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math>

for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.

2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:

<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math>

where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.

Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.

3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by binary classifier h(.) for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:

[[File:DeepFool.PNG|400px |]]

4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to the unconstrained optimization problem given as:

[[File:Carlini.PNG|500px |]]

As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.

All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping.
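
As an illustration of the iterative update together with value clipping, the following sketch runs I-FGSM against a toy logistic classifier until the predicted label changes. The classifier, its weights and the step size are assumptions made for this example and are unrelated to the networks used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    return int(sigmoid(np.dot(w, x) + b) > 0.5)

def i_fgsm(x, y, w, b, eps, max_iters=100):
    """Repeat x <- clip(x + eps * sign(grad_x loss)) until the prediction changes."""
    x_adv = x.copy()
    for m in range(1, max_iters + 1):
        p = sigmoid(np.dot(w, x_adv) + b)
        grad_x = (p - y) * w                      # gradient of the cross-entropy loss w.r.t. x
        x_adv = np.clip(x_adv + eps * np.sign(grad_x), 0.0, 1.0)  # stay inside [0, 1]
        if predict(x_adv, w, b) != predict(x, w, b):
            return x_adv, m
    return x_adv, max_iters

w = np.array([2.0, -1.0, 1.0])    # assumed toy weights
b = -0.5
x = np.array([0.6, 0.2, 0.4])     # assumed toy input, classified as 1
x_adv, n_iters = i_fgsm(x, y=1, w=w, b=b, eps=0.02)
```

Stopping at the first iteration that changes the prediction mirrors the rule that M is set such that <math>h(x) \neq h(x')</math>, and keeps the perturbation as small as this step size allows.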

The figure below shows adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.

[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2- dissimilarity for all four attacks.]]

==Defenses==
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.
Five image transformations that alter the structure of these perturbations have been studied:
# Image Cropping and Re-scaling,
# Bit Depth Reduction,
# JPEG Compression,
# Total Variance Minimization,
# Image Quilting.

'''Image cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation. In this study, images are cropped and re-scaled during training time as part of data-augmentation. At test time, the predictions over random image crops are averaged.

'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.
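
One possible way to reduce the bit depth is sketched below. The exact rounding scheme is an assumption (uniform levels on [0, 1] with rounding to the nearest level); the paper's implementation may differ.

```python
import numpy as np

def reduce_bit_depth(image, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(image * levels) / levels

x = np.array([0.10, 0.45, 0.80])
x_adv = x + np.array([0.01, -0.01, 0.01])   # small perturbation
same = np.array_equal(reduce_bit_depth(x), reduce_bit_depth(x_adv))
```

Perturbations that do not push a pixel across a quantization boundary vanish after this step, although a pixel that already sits near a boundary can still flip to the neighboring level.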

'''JPEG Compression and Decompression (Dziugaite et al., 2016)''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.

'''Total Variance Minimization (Rudin et al.) [9]''':
This combines pixel dropout with total variance minimization. This approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, the authors first select a random set of pixels by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is maintained when <math>X(i, j, k) = 1</math>. Next, they use total variation minimization to construct an image z that is similar to the (perturbed) input image x for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:

[[File:TV!.png|300px|]] ,

where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :

[[File:TV2.png|500px|]]

The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image.
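The two formulas above are only available as images, so the following sketch rests on assumptions: total variation is taken in its anisotropic form (the <math>L_p</math> norms of differences between neighbouring rows and columns, summed over channels), and the objective is a masked <math>L_2</math> reconstruction error plus <math>\lambda_{TV}</math> times the total variation. The convention that the mask equals 1 for kept pixels is also an assumption. The code only evaluates the objective; it does not implement the minimization itself.

```python
import numpy as np

def total_variation(z, p=2):
    """Anisotropic L_p total variation of an image z with shape (H, W, C)."""
    row_diff = z[1:, :, :] - z[:-1, :, :]    # differences between neighbouring rows
    col_diff = z[:, 1:, :] - z[:, :-1, :]    # differences between neighbouring columns
    # p-norm of every row/column difference vector, summed over positions and channels
    tv_rows = np.sum(np.linalg.norm(row_diff, ord=p, axis=1))
    tv_cols = np.sum(np.linalg.norm(col_diff, ord=p, axis=0))
    return tv_rows + tv_cols

def tvm_objective(z, x, mask, lam=0.03, p=2):
    """Masked reconstruction error plus lam * TV_p(z); mask is 1 for kept pixels."""
    return np.linalg.norm(mask * (z - x)) + lam * total_variation(z, p)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 3))                    # stand-in for a perturbed image
mask = rng.binomial(1, 0.5, size=x.shape)    # Bernoulli pixel selection
print(tvm_objective(x, x, mask))             # pure TV penalty, no reconstruction error
```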

'''Image Quilting (Efros & Freeman, 2001) [8]'''
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary, since the database of real patches is unlikely to contain the structures that appear in adversarial images.
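The patch-selection step can be sketched as below. This is a simplified illustration: it handles a single patch, uses brute-force nearest-neighbour search, and leaves out the grid placement and minimum graph cuts. The function name and the toy database are hypothetical.

```python
import numpy as np

def quilt_patch(adv_patch, clean_patches, k=3, rng=None):
    """Return a random choice among the k clean patches nearest to adv_patch."""
    if rng is None:
        rng = np.random.default_rng()
    # squared Euclidean distance in pixel space to every database patch
    flat_db = clean_patches.reshape(len(clean_patches), -1)
    dists = np.sum((flat_db - adv_patch.reshape(-1)) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    return clean_patches[rng.choice(nearest)]

# Toy database of three constant 2x2 patches.
database = np.stack([np.full((2, 2), v) for v in (0.0, 0.5, 1.0)])
adversarial_patch = np.full((2, 2), 0.45)
print(quilt_patch(adversarial_patch, database, k=1))
```

Because the output is always copied from the clean database, no pixel of the adversarial patch reaches the classifier directly.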

=Experiments=

Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, the adversary is able to read the model architecture and parameters but not the defense strategy, and the defenses are applied to the adversarial input images before they reach the convolutional network. In the black-box setting, the convolutional network is replaced by a network trained with image transformations, to which the adversary has no access. The final experiment compares the authors' defenses with prior work.

'''Set up:'''
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the different kinds of attacks mentioned in Section 5. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity for each attack was set as below:

- FGSM. Increasing the step size <math>\epsilon</math> increases the normalized L2-dissimilarity.

- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.

- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.

- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation

The hyperparameters of the defenses have been fixed in all the experiments. Specifically, the pixel dropout probability was set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.
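The normalized L2-dissimilarity used above to measure attack strength is not defined in this section. The sketch below assumes it is the average over images of <math>\|x - x'\|_2 / \|x\|_2</math>, where <math>x</math> is the clean image and <math>x'</math> its adversarial counterpart; the function name is illustrative.

```python
import numpy as np

def normalized_l2_dissimilarity(clean, adversarial):
    """Mean over images of ||x - x'||_2 / ||x||_2 (assumed definition)."""
    clean = clean.reshape(len(clean), -1)
    adversarial = adversarial.reshape(len(adversarial), -1)
    num = np.linalg.norm(clean - adversarial, axis=1)
    den = np.linalg.norm(clean, axis=1)
    return float(np.mean(num / den))

# Two toy "images" whose pixels are all scaled up by 10 percent.
clean = np.ones((2, 4))
adversarial = 1.1 * clean
print(normalized_l2_dissimilarity(clean, adversarial))
```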

The figure below shows how the setup differs between the experiments. The network is trained either on a) regular images or b) transformed images. The different settings are marked 8.1, 8.2 and 8.3.
[[File:models3.png |center]]

==GrayBox - Image Transformation at Test Time==
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent, with an ensemble size of 30; and the accuracy of the Image Quilting defense hardly deteriorates as the strength of the adversary increases, although it does reduce accuracy on non-adversarial examples.

[[File:sFig4.png|center|600px |]]

==BlackBox - Image Transformation at Training and Test Time==
A ResNet-50 model was trained on transformed ImageNet training images. Before the images were fed to the network for training, standard data augmentation (from He et al.) was applied along with bit depth reduction, JPEG compression, TV minimization, or image quilting. The classification accuracy on the same adversarial images as in the previous case is shown in the figure below. (The adversary cannot use this trained model to generate new images, hence this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.

[[File:sFig5.png|center|600px |]]

==Blackbox - Ensembling==
Four networks, ResNet-50, ResNet-101, DenseNet-169, and Inception-v4, were studied along with an ensemble of defenses, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table show that Inception-v4 performs best, possibly because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy were found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.

[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]

==GrayBox - Image Transformation at Training and Test Time ==
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods from the networks trained in the "BlackBox - Image Transformation at Training and Test Time" experiment. The results show that Bit Depth Reduction and JPEG Compression are weak defenses in such a gray-box setting. In contrast, image cropping and re-scaling, total variation minimization, and image quilting are more robust against adversarial images in this setting.
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.

[[File:sFig6.png|center| 600px |]]

==Comparison With Ensemble Adversarial Training==
The results of the experiment are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model released by Tramèr et al. [2] is an Inception-ResNet-v2 trained on adversarial examples generated by FGSM against Inception-ResNet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using the image cropping, total variance minimization, and image quilting defenses. Two differences in assumptions should be noted: the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble training and the preprocessing techniques from this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.

[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]

=Discussion/Conclusions=
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Re-scaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, Total Variance Minimization and Image Quilting, possess both properties. It is also concluded that randomness is particularly important in developing strong defenses. As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.

=Critiques=
1. The terminology of Black Box, White Box, and Gray Box attacks is not precisely defined.

2. White Box attacks could have been considered, where the adversary has full access to the model as well as the pre-processing techniques.

3. Though the authors did considerable work in showing the effect of four attacks on the ImageNet database, much stronger attacks (Madry et al.) [7] could have been evaluated.

4. Authors claim that the success rate is generally measured as a function of the magnitude of perturbations, performed by the attack using the L2- dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.

=References=

1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.

2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.

3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016.

4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.

5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017.

6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.

7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.

8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.

9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.

10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.

11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.

12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017

13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? Adversarial examples against the iCub humanoid. CoRR, abs/1708.06939, 2017.

14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.

15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.

16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.

17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.

Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias

2018-11-20T06:00:11Z

H454chen:

==Introduction==

The use of data-driven approaches in robotics has increased in the last decade. Instead of using hand-designed models, these data-driven approaches work on large-scale datasets and learn appropriate policies that map from high-dimensional observations to actions. Since collecting data using an actual robot in real-time is very expensive, most of the data-driven approaches in robotics use simulators in order to collect simulated data. The concern here is whether these approaches have the capability to be robust enough to domain shift and to be used for real-world data. It is an undeniable fact that there is a wide reality gap between simulators and the real world.

This has motivated the robotics community to increase their efforts in collecting real-world physical interaction data for a variety of tasks. This effort has been accelerated by the declining costs of hardware. This approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau fast. Furthermore, robotic action data has not led to gains similar to those seen in other areas such as computer vision and natural language processing. The paper claims that the solution to these obstacles is using “real data”: current robotic datasets lack diversity of environment. Learning-based approaches need to move out of simulators and labs and into real environments such as real homes so that they can learn from real datasets.

Collecting real-world data is made difficult by a number of problems. First, there is a need for cheap and compact robots to collect data in homes, but current industrial robots (e.g. Sawyer and Baxter) are too expensive. Secondly, cheap robots are not accurate enough to collect reliable data. Also, there is a lack of constant supervision for data collection in homes. Finally, there is a circular dependency problem in home robotics: real-world data is needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges, in addition to some other external factors, will likely result in noisy data collection. This paper presents a first systematic effort at collecting a dataset inside homes. In accomplishing this goal, the authors:

1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes

2. Collect training data in 6 different homes and testing data in 3 homes

3. Propose a method for modelling the noise in the labelled data

4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation

[[File:aa1.PNG|600px|thumb|center|]]

==Overview==

This paper emphasizes the importance of diversifying the data for robotic learning in order to have a greater generalization, by focusing on the task of grasping. A diverse dataset also allows for removing biases in the data. By considering these facts, the paper argues that even for simple tasks like grasping, datasets which are collected in labs suffer from strong biases such as simple backgrounds and same environment dynamics. Hence, the learning approaches cannot generalize the models and work well on real datasets.

To make large-scale data collection inside a large number of homes feasible, a low-cost robot is needed. For this reason, the authors introduced a customized mobile manipulator. They used a Dobot Magician, a robotic arm, mounted on a Kobuki, a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. They also added an Intel R200 RGBD camera to the robot at a height of 1m above the ground. An onboard laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.

As there is always a trade-off, when we gain a low-cost robot, we are actually losing accuracy for controlling it. So, the low-cost robot which is built from cheaper components than the expensive setups such as Baxter and Sawyer suffers from higher calibration errors and execution errors. This means that the dataset collected with this approach is diverse and huge but it has noisy labels. To illustrate, consider when the robot wants to grasp at location <math> {(x, y)}</math>. Since there is a noise in the execution, the robot may perform this action in the location <math> {(x + \delta_{x}, y+ \delta_{y})}</math> which would assign the success or failure label of this action to a wrong place. Therefore, to solve the problem, they used an approach to learn from noisy data. They modeled noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.

==Learning on low-cost robot data==

This paper uses a patch grasping framework in its proposed architecture. Also, as mentioned before, datasets collected by inaccurate and cheap robots are prone to noisy labels. The noise in the labels can be caused by hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The following subsections explain the parts of the architecture that disentangle the noise between the low-cost robot’s actual and commanded executions.

===Grasping Formulation===

Planar grasping is the setting of interest in this architecture. This means that all objects are grasped at the same height and vertically to the ground (i.e. with a fixed end-effector pitch). The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degree of freedom (roll of the end-effector). For the purpose of comparison, they used a model which does not predict <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math>. The angle of grasp <math> {\theta}</math> is then predicted from these patches. Also, in order to allow multi-modal predictions, the angle <math> {\theta}</math> is discretized into steps <math> {\theta_{D}}</math>.

Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label g. Then, the image <math> {I}</math> and the angle <math> {\theta}</math> are converted to image patch <math> {I_{P}}</math> and angle <math> {\theta_{D}}</math>. Then, to minimize the classification error, a binary cross entropy loss is used which minimizes the error between the predicted and ground truth label <math> g </math>. A convolutional neural network with weight initialization from pre-training on Imagenet is used for this formulation.
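A small sketch of the two conversions described above: discretizing the grasp angle and scoring a prediction with binary cross entropy. The number of bins (18) and the assumption that grasp angles are equivalent modulo 180 degrees are illustrative choices that this summary does not state; the function names are hypothetical.

```python
import math

def angle_to_bin(theta_deg, n_bins=18):
    """Map a grasp angle in degrees to one of n_bins equal bins over [0, 180)."""
    width = 180.0 / n_bins
    return int((theta_deg % 180.0) // width)

def binary_cross_entropy(p_success, label):
    """BCE between a predicted success probability and a 0/1 grasp label g."""
    eps = 1e-12                                  # guard against log(0)
    p = min(max(p_success, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

theta_d = angle_to_bin(95.0)                     # discrete angle index for the executed grasp
loss = binary_cross_entropy(0.8, 1)              # network predicted 0.8, grasp succeeded
print(theta_d, loss)
```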

(Note: On Cross Entropy:

If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool. This is optimal, in that we can't encode the symbols using fewer bits on average.
In contrast, cross entropy is the number of bits we'll need if we encode symbols from <math> {y}</math> using the wrong tool <math> {\hat y}</math>. This consists of encoding the <math> {i}</math>-th symbol using <math> {\log(\frac{1}{{\hat y_i}})}</math> bits instead of <math> {\log(\frac{1}{{ y_i}})}</math> bits. We of course still take the expected value with respect to the true distribution <math> {y}</math>, since it's the distribution that truly generates the symbols:

\begin{align}
H(y,\hat y) = \sum_i{y_i\log{\frac{1}{\hat y_i}}}
\end{align}

Cross entropy is never smaller than entropy; encoding symbols according to the wrong distribution <math> {\hat y}</math> will make us use more bits. The only exception is the trivial case where <math> {y}</math> and <math> {\hat y}</math> are equal, and in this case entropy and cross entropy are equal.)
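A short numeric check of the note above, using base-2 logarithms so the values are in bits. The two distributions are made up for illustration.

```python
import math

def entropy(y):
    """Expected code length in bits when symbols from y are coded with y itself."""
    return sum(p * math.log2(1.0 / p) for p in y if p > 0)

def cross_entropy(y, y_hat):
    """Expected code length in bits when symbols from y are coded with y_hat."""
    return sum(p * math.log2(1.0 / q) for p, q in zip(y, y_hat) if p > 0)

y = [0.5, 0.25, 0.25]        # true distribution
y_hat = [0.25, 0.25, 0.5]    # wrong distribution used for encoding

print(entropy(y))                # 1.5
print(cross_entropy(y, y_hat))   # 1.75
print(cross_entropy(y, y))       # 1.5, equal to the entropy
```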

===Modeling noise as latent variable===

In order to tackle the problem of inaccurate position control and calibration caused by the cheap robot, they looked for structure in the noise, which depends on the robot and its design. They modeled this structured noise as a latent variable and decoupled it from the grasp prediction during training. The approach is shown in Figure 2:

[[File:aa2.PNG|600px|thumb|center|]]

The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}; \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.

The conditional probability of grasping at a noisy image patch <math>I_P</math> for this model is computed by:

\[ P(g \mid I_{P},\theta_{D}, \mathcal{R}) = \sum_{\widehat{I_P} \in \mathcal{P}} P(g \mid z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} \mid \theta_{D},I_P,\mathcal{R}) \]

Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>. <math> P(z=\widehat{I_P} \mid \theta_D,I_P,\mathcal{R})</math> represents the noise which can be caused by the <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g \mid z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> represents the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing over the outputs of the GPN and the NMN.
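The marginalization can be illustrated numerically. In this sketch the GPN and NMN outputs are hard-coded toy numbers over three candidate patches rather than network predictions, and the function name is hypothetical.

```python
def robust_grasp_probability(gpn_probs, nmn_probs):
    """Sum over candidate patches z of P(g | z) * P(z | R)."""
    if abs(sum(nmn_probs) - 1.0) > 1e-9:
        raise ValueError("NMN output must be a probability distribution")
    return sum(pg * pz for pg, pz in zip(gpn_probs, nmn_probs))

gpn = [0.9, 0.2, 0.1]   # P(g | z) for the commanded patch and two neighbours
nmn = [0.7, 0.2, 0.1]   # P(z | R): how likely each patch was the one actually executed
print(robust_grasp_probability(gpn, nmn))
```

If the NMN puts all its mass on the commanded patch, the result reduces to the plain GPN prediction for that patch.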

===Learning the latent noise model===

They assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math> given the global information <math>\mathcal{R}</math>, i.e. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>. They used direct optimization to learn both the NMN and the GPN from noisy labels. The inputs of the NMN are the entire image of the scene and the environment information, as well as the robot ID and the raw-pixel grasp location. The output of the NMN is a probability distribution over the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied to the marginalized output of these two networks and the true grasp label g.

===Training details===

They implemented their model in PyTorch using a pretrained ResNet-18 model. For the NMN, they concatenated the 512-dimensional ResNet feature with a one-hot vector of the robot ID and the raw pixel location of the grasp. The inputs of the GPN are the original noisy patch plus 8 other patches equidistant from the original one.
Their training process starts with training only the GPN for 5 epochs over the data. Then, the NMN and the marginalization operator are added to the model, and the NMN and GPN are trained simultaneously for the remaining 25 epochs.
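The NMN input described above can be sketched as a simple concatenation. The number of robots, the example values, and the function name are placeholders; a real implementation would take the 512-dimensional feature from the ResNet-18 backbone rather than a zero vector.

```python
import numpy as np

def nmn_input(resnet_feature, robot_id, n_robots, grasp_pixel):
    """Concatenate image feature, one-hot robot ID and raw grasp pixel location."""
    one_hot = np.zeros(n_robots)
    one_hot[robot_id] = 1.0
    pixel = np.asarray(grasp_pixel, dtype=float)
    return np.concatenate([resnet_feature, one_hot, pixel])

feature = np.zeros(512)      # placeholder for the 512-dimensional ResNet feature
vec = nmn_input(feature, robot_id=2, n_robots=5, grasp_pixel=(120, 64))
print(vec.shape)             # (519,) = 512 + 5 + 2
```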

==Results==

In the results part of the paper, they show that collecting data in homes is essential for generalizing to unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. Because the home environments are unstructured, they used an object detector to locate objects; tiny-YOLO was chosen because of the LCA's limited memory and computational capabilities. Once an object location was detected, the class information was discarded and a grasp was attempted. The grasp location in 3D was computed using PointCloud data. They scattered different objects in homes within a 2m area to prevent collisions of the robot with obstacles, and let the robot move randomly and grasp objects. Finally, they collected a dataset with 28K grasp results.

[[File:aa3.PNG|600px|thumb|center|]]

To evaluate their approach in a more quantitative way, they used three test settings:

- The first one is binary classification on held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of grasping, given a location and the angle. Using binary classification allows for testing a lot of models without running them on real robots. They collected held-out datasets using the LCA in the lab and in homes, along with a held-out dataset for the Baxter robot.

- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this test measures the generalization of the model.

- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running the model on the Sawyer robot which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.

They used baselines for both their data, which is collected in homes, and their model, Robust-Grasp. Two datasets were used as data baselines: the dataset collected by a Baxter robot in the lab (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).
They compared their Robust-Grasp model with the noise-independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) as a strong real-world grasping baseline.

===Experiment 1: Performance on held-out data===

Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA has a good performance on both lab data and home environment.

[[File:aa4.PNG|600px|thumb|center|]]

===Experiment 2: Performance on Real LCA Robot===

In Table 2, the performance of the model trained on Home-LCA is compared against a pre-trained DexNet and the model trained on Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be explained by the possible noise in the depth images caused by natural light. DexNet, which requires high-quality depth sensing, cannot perform well in these scenarios. Because the LCA uses cheap commodity RGBD cameras, the noise in the depth images is not a matter of concern, as the model has no expectation of high-quality sensing.

[[File:aa5.PNG|600px|thumb|center|]]

===Performance on Real Sawyer===

To compare the performance of the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they used Lab-Baxter for benchmarking, since the Baxter is a more accurate and better-calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to that reported in several recent papers, even though this model was trained and tested in different environments. The Robust-Grasp model also outperforms Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 show that the corrections depend on both the pixel locations of the noisy grasp and the robot.

[[File:aa6.PNG|600px|thumb|center|]]

[[File:aa7.PNG|600px|thumb|center|]]

==Related work==

Over the last few years, interest in scaling up robot learning with large-scale datasets has increased, and many papers have been published in this area. A hand-annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large-scale datasets for grasping. The work mentioned above used high-cost hardware and data labeling mechanisms. There were also many papers that worked on other robotic tasks such as material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes; they all used lab data.

Furthermore, since grasping is one of the basic problems in robotics, there have been efforts to improve it. Classical approaches focused on physics-based issues of grasping and required 3D models of the objects. Recent works, however, focus on data-driven approaches which learn to grasp objects from visual observations. Both simulation and real-world robots have been used for large-scale data collection. A versatile grasping model was proposed that achieves 90% performance on a bin-picking task. The point here is that these approaches usually require high-quality depth as input. High-quality depth sensing is costly to implement in hardware and is therefore a barrier to the practical use of robots in real environments.

Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, little research has used low-cost robots. One example is learning to stack multiple blocks using a cheap, inaccurate robot. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether learning approaches are used in them alongside mapping and planning.

Learning from noisy inputs is another challenge specifically in computer vision. A controversial question which is often raised in this area is whether learning from noise can improve the performance. Some works show it could have bad effects on the performance; however, some other works find it valuable when the noise is independent or statistically dependent on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.

==Conclusion==

All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach by using a mobile manipulator which is a lot cheaper than the existing industrial robots. They collected a dataset of 28K grasps in six different homes. In order to solve the problem of noisy labels which were caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with home dataset showed 43.7% improvement over the models trained with lab data. Their results also showed that their model can improve the grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.

==Critiques==

This paper does not contain a significant algorithmic contribution; it mainly combines a large number of data engineering techniques for the robot learning problem. The authors claim 43.7% higher accuracy than baseline models, but the comparison does not seem fair: the data for the other methods was collected in simulated settings in the lab, whereas the authors use the home dataset. The authors should also have discussed the safety issues of training robots in real environments as opposed to simulated environments like labs. They encourage other researchers to look outside the labs, but do not discuss the critical safety issues of this approach.

Another strange finding is that the paper states that they "follow a model architecture similar to [Pinto and Gupta [4]]"; however, the proposed model is in fact a fine-tuned ResNet-18 architecture. Pinto and Gupta implement a version similar to AlexNet, as shown below in Figure 5.

[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]

The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that this noise can be modeled as a latent variable and that doing so improves grasping performance. Although learning from noisy data while achieving good performance is valuable, it would be better to test the noise modeling network on other robots as well. Since the network takes robot information as an input, testing it on different inaccurate robots would show whether it generalizes.

They did not mention other aspects of their comparison; for example, they could have reported their training time compared to other models or the size of other datasets.

==References==

#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537, 2017.
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor-critic for image-based robot learning." Robotics Science and Systems, 2018.
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509.06825.
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.
#Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.
#Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/abs/1606.07419.
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.

Learning to Teach

2018-11-06T19:04:48Z

H454chen: /* Critique */

=Introduction=

This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.

In modern human society, the role of teaching is heavily implicated in our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data that corresponds to the appropriate teaching materials (e.g., textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.

Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the Machine Learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the Reinforcement Learning process to update its parameters. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra efforts on re-training.

To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks including multi-layer perceptron (MLP), convolutional neural networks (CNNs) and recurrent neural networks (RNNs), for different applications including image classification and text understanding.

=Related Work=
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)

The second is teaching, which can be classified into either machine-teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes an order of data from easy instances to hard ones, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data.

The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.

=Learning to Teach=
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.

In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.

==Problem Definition==
The student model, denoted μ(), takes the set of training data <math> D </math>, the function class <math> Ω </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{ω^*} </math> whose parameter <math>ω^*</math> minimizes the risk <math>R(ω)</math>, as in:

\begin{align*}
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu(D, L, \Omega)
\end{align*}
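
As a rough illustration of the definition above, the student μ(D, L, Ω) can be sketched as a plain empirical risk minimizer. The sketch assumes a toy one-parameter model f_ω(x) = ωx and a finite grid of candidate values standing in for Ω; these choices, the toy data, and the names student and squared_loss are illustrative assumptions, not anything taken from the paper.

```python
def squared_loss(y, y_hat):
    """Loss function L: squared error between the label and the prediction."""
    return (y - y_hat) ** 2


def student(D, L, Omega):
    """mu(D, L, Omega): return the parameter in Omega with the lowest summed loss on D.

    Assumes the toy model f_w(x) = w * x (an illustrative choice).
    """
    def empirical_risk(w):
        return sum(L(y, w * x) for x, y in D)

    return min(Omega, key=empirical_risk)


# Toy training set D, roughly following y = 2x.
D = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
# Hypothesis space Omega: candidate slopes 0.0, 0.1, ..., 4.0.
Omega = [i / 10 for i in range(41)]

w_star = student(D, squared_loss, Omega)
print(w_star)
```

A teacher in the sense of the next paragraph would act by changing any of the three arguments passed to student.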

The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.

::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g., middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of Ω lead to different errors and optimization problems (Mohri et al., 2012).

==Framework==
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. The L2T process is outlined in the figure below:

[[File: L2T_process.png | 500px|center]]

* <math> s_t ∈ S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.
* <math> a_t ∈ A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.
* The student model takes <math> a_t </math> as input and outputs a function <math> f_t </math> using conventional ML techniques.

Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.
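
The interaction described by the bullets above can be sketched as a simple loop. This is a minimal toy under stated assumptions: the student is a single weight w fitted by SGD to y = 2x, the state is just the per-example loss, and fixed_policy is a hand-written stand-in for the learned policy φ_θ. All function names are hypothetical, and the reward signal and teacher update are deliberately left out.

```python
def sgd_step(w, examples, lr=0.05):
    """Student update: per-example SGD on squared loss for f_w(x) = w * x."""
    for x, y in examples:
        w -= lr * 2 * (w * x - y) * x
    return w


def make_state(w, batch):
    """State s_t: here only the student's current loss on each arriving example."""
    return [(w * x - y) ** 2 for x, y in batch]


def fixed_policy(state, threshold=1e-6):
    """Stand-in for phi_theta: keep only examples the student has not yet fit."""
    return [loss > threshold for loss in state]


def run_episode(policy, batches, w=0.0):
    """One episode: s_t -> a_t = policy(s_t) -> student update giving f_t."""
    history = []
    for batch in batches:
        s_t = make_state(w, batch)                # built from the current student
        a_t = policy(s_t)                         # teaching action: keep/drop mask
        kept = [ex for ex, keep in zip(batch, a_t) if keep]
        w = sgd_step(w, kept)                     # conventional ML update
        history.append((s_t, a_t))
    return w, history


batches = [[(1.0, 2.0), (2.0, 4.0)] for _ in range(50)]
w_final, history = run_episode(fixed_policy, batches)
print(w_final, history[-1][1])
```

In the full framework the recorded (state, action) pairs would be combined with a reward from the student to update θ; here they are only stored.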

=Application=

There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks, so the student feedback measure is classification accuracy. The student's learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of each mini-batch will be fed to the student. To reach convergence faster, the reward is set to relate to the speed at which the student model learns.

The authors also designed a state feature vector <math> g(s) </math> to efficiently represent the current state, which includes the arrived training data and the student model. The state features fall into three categories: data features, student model features, and features that combine the data and the student model. The state feature vector is computed each time a mini-batch of data arrives.
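
To make the three feature categories concrete, the following sketch builds a small g(s) for one training example. The specific features chosen here (one-hot label, training progress, average loss, per-example loss and margin) are assumptions picked for illustration; this summary does not list the exact features used in the paper, and state_features is a hypothetical name.

```python
import math


def state_features(label, num_classes, iteration, max_iterations,
                   avg_train_loss, predicted_probs):
    """Build an illustrative state feature vector g(s) for one example."""
    # Data features: depend only on the example (here, its one-hot label).
    data_feats = [1.0 if k == label else 0.0 for k in range(num_classes)]

    # Student model features: depend only on the student's training status.
    model_feats = [iteration / max_iterations, avg_train_loss]

    # Combined features: how the current student behaves on this example.
    p_true = predicted_probs[label]
    p_other = max(p for k, p in enumerate(predicted_probs) if k != label)
    combined_feats = [-math.log(p_true), p_true - p_other]

    return data_feats + model_feats + combined_feats


g = state_features(label=1, num_classes=3, iteration=50, max_iterations=100,
                   avg_train_loss=0.7, predicted_probs=[0.2, 0.5, 0.3])
print(g)
```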

The objective for training the teacher model is to maximize the expected reward:

\begin{align}
J(θ) = E_{φ_θ(a|s)}[R(s,a)]
\end{align}

This objective is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4].
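
The likelihood ratio trick rests on the standard identity ∇_θ J(θ) = E[R(s,a) ∇_θ log φ_θ(a|s)], which lets the gradient be estimated from sampled actions alone. The sketch below applies it to a keep/drop decision. It assumes a Bernoulli policy with a logistic keep probability over the features g(s) and a single scalar reward per episode; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import math


def keep_prob(theta, g):
    """phi_theta(a=1 | s): logistic probability of keeping an example with features g."""
    z = sum(t * x for t, x in zip(theta, g))
    return 1.0 / (1.0 + math.exp(-z))


def grad_log_prob(theta, g, a):
    """Gradient of log phi_theta(a | s) w.r.t. theta for a Bernoulli-logistic policy."""
    p = keep_prob(theta, g)
    return [(a - p) * x for x in g]


def reinforce_update(theta, episode, reward, lr=0.1):
    """One ascent step: theta + lr * reward * sum_t grad log phi_theta(a_t | s_t)."""
    total = [0.0] * len(theta)
    for g, a in episode:
        for i, d in enumerate(grad_log_prob(theta, g, a)):
            total[i] += d
    return [t + lr * reward * d for t, d in zip(theta, total)]


theta = [0.0, 0.0]
g = [1.0, 0.5]
episode = [(g, 1)]          # the teacher kept this example once
theta_up = reinforce_update(theta, episode, reward=1.0)
theta_down = reinforce_update(theta, episode, reward=-1.0)
print(keep_prob(theta_up, g), keep_prob(theta_down, g))
```

A positive reward makes the sampled action more likely in the same state and a negative reward makes it less likely, which is all the update needs from the student.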

==Experiments==

The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN).

The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset.

The strategy will be benchmarked against the following teaching strategies:

::'''NoTeach''': Training the student without any teaching strategy, i.e., the conventional training process on all of the data.
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness.
::'''L2T''': The Learning to Teach framework.
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).
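
The two filtering baselines in the list above can be sketched as follows. The geometric threshold schedule in spl_threshold and all function names are assumptions made for illustration; the list only says that SPL starts with low-loss data and gradually admits harder data, and that RandTeach reuses a logged filtering ratio.

```python
import random


def spl_filter(batch_losses, threshold):
    """Self-paced learning: keep indices whose loss is at most the current threshold."""
    return [i for i, loss in enumerate(batch_losses) if loss <= threshold]


def spl_threshold(epoch, start=0.5, growth=1.5):
    """Hypothetical schedule: the admitted hardness grows geometrically per epoch."""
    return start * growth ** epoch


def rand_filter(batch_size, keep_ratio, rng):
    """RandTeach: keep a random subset of the batch.

    keep_ratio is one minus the logged ratio of filtered instances for this epoch.
    """
    n_keep = round(batch_size * keep_ratio)
    return sorted(rng.sample(range(batch_size), n_keep))


losses = [0.1, 0.9, 0.4, 2.0]
early = spl_filter(losses, spl_threshold(0))   # only the easy examples
later = spl_filter(losses, spl_threshold(2))   # harder examples admitted
random_kept = rand_filter(10, 0.3, random.Random(0))
print(early, later, random_kept)
```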

===Training a New Student===

In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher: the teacher trains a student network on that half of the data, with a certain portion held out for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:

[[File: L2T_speed.png | 1100px|center]]

===Filtration Number===

When investigating the number of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for these tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.

[[File: L2T_fig3.png | 1100px|center]]

===Teaching New Student with Different Model Architecture===

In this part, a teacher model is first trained by interacting with a student model. The trained teacher is then used to teach another student model with a different architecture. The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.

[[File: L2T_fig4.png | 1100px|center]]

===Training Time Analysis===

The learning curves show that L2T reaches a given accuracy in less training time than the other strategies. This is especially evident during the earlier training stages.

[[File: L2T_fig5.png | 600px|center]]

===Accuracy Improvement===

When comparing training accuracy on the IMDB sentiment classification task, the L2T teaching policy improves on both NoTeach and SPL.

[[File: L2T_t1.png | 500px|center]]

=Future Work=

Several directions for future work follow from this paper:

1) Recent advances in multi-agent reinforcement learning could be tried on the reinforcement learning problem formulation of this paper.

2) Some human-in-the-loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) could give better results for the same framework.

3) It would be interesting to try out the L2T framework in imperfect-information and partially observable settings.

4) As the authors have focused on data teaching, exploring loss function teaching would be interesting.

=Critique=

While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction cannot be assessed for loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR-100 in their experiments, which would have provided more insight.

The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. In the CIFAR-10 training scenario, for example, the L2T model achieves 85% accuracy after 2 million training examples, which is only about 3% higher than the no-teacher model. Perhaps in the future, the L2T framework can be improved to produce better performance.

Learning to Teach

2018-11-06T18:55:48Z

H454chen: /* Critique */

=Introduction=

This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.

In modern human society, the role of teaching is heavily implicated in our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (ie. designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data that corresponds to the appropriate teaching materials (e.g. textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.

Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the Machine Learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the Reinforcement Learning process to update its parameters. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra efforts on re-training.

To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most
suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks including multi-layer perceptron (MLP), convolutional neural networks (CNNs)
and recurrent neural networks (RNNs), for different applications including image classification and text understanding.

=Related Work=
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)

The second is the teaching, which can be classified into either machine-teaching (Zhu, 2015) [2] or hardness based methods. The former seeks to construct a minimal training set for the student to learn a target model (ie. an oracle). The latter assumes an order of data from easy instances to hard ones, hardness being determined in different ways. In curriculum learning (CL) (Bengio et al, 2009; Spitkovsky et al. 2010; Tsvetkov et al, 2016) [3] measures hardness through heuristics of the data while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by loss on data.

The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.

=Learning to Teach=
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.

In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.

==Problem Definition==
The student model, denoted μ(), takes the set of training data <math> D </math>, the function class <math> Ω </math>, and loss function <math> L </math> as input to output a function, <math> f(ω) </math>, with parameter <math>ω^*</math> which minimizes risk <math>R(ω)</math> as in:

\begin{align*}
ω^* = arg min_{w \in \Omega} \sum_{x,y \in D} L(y, f_ω(x)) =: \mu (D, L, \Omega)
\end{align*}

The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk R(ω) or progresses as fast as possible.

::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, eg. middle school students taught math with basic algebra while undergraduate students are taught with calculus. Different Ω leads to different errors and optimization problem (Mohri et al., 2012).

==Framework==
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. The L2T process is outlined in figure below:

[[File: L2T_process.png | 500px|center]]

* <math> s_t ∈ S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.
* <math> a_t ∈ A </math> represents action taken the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.
* <math> φ_θ : S → A </math> is policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>
* Student model takes <math> a_t </math> as input and outputs function <math> f_t </math>, by using the conventional ML techniques.

Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.

=Application=

There are different approaches to training the teacher model, this paper will apply reinforcement learning with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure will be classification accuracy. Its learning rule will be mini-batch stochastic gradient descent, where batches of data will arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of the mini-batch of data will be fed to the student. In order to reach the convergence faster, the reward was set to relate to the speed the student model learns.

The authors also designed a state feature vector <math> g(s) </math> in order to efficiently represent the current states which include arrived training data and the student model. Within the State Features, there are three categories including Data features, student model features and the combination of both data and learner model. This state feature will be computed when each mini-batch of data arrives.

The optimizer for training the teacher model is the maximum expected reward:

\begin{align}
J(θ) = E_{φ_θ(a|s)}[R(s,a)]
\end{align}

Which is non-differentiable w.r.t. <math> θ </math>, thus a likelihood ratio policy gradient algorithm is used to optimize <math> J(θ) </math> (Williams, 1992) [4]

==Experiments==

The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and Long-Short-Term-Memory network (RNN).

The student tasks are Image classification for MNIST, for CIFAR-10, and sentiment classification for IMDB movie review dataset.

The strategy will be benchmarked against the following teaching strategies:

::'''NoTeach''': Outputting a good training set D, analogous to human teachers providing students with proper learning materials such as textbooks
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness.
::'''L2T''': The Learning to Teach framework.
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).

===Training a New Student===

In the first set of experiments, the datasets or divided into two folds. The first folder is used to train the teacher; This is done by having the teacher train a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed, and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, we observe that L2T achieves significantly faster convergence than other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:

[[File: L2T_speed.png | 1100px|center]]

===Filtration Number===

When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for the two image classification tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for training while for the natural language task, the student model must first learn from easy data instances.

[[File: L2T_fig3.png | 1100px|center]]

===Teaching New Student with Different Model Architecture===

In this part, first a teacher model is trained by interacting with a student model. Then using the teacher model, another student model
which has a different model architecture is taught.
The results of Applying the teacher trained on ResNet32 to teach other architectures is shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.

[[File: L2T_fig4.png | 1100px|center]]

===Training Time Analysis===

The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.

[[File: L2T_fig5.png | 600px|center]]

===Accuracy Improvement===

When comparing training accuracy on the IMDB sentiment classification task, L2T improves on teaching policy over NoTeach and SPL.

[[File: L2T_t1.png | 500px|center]]

=Future Work=

There is some useful future work that can be extended from this work:

1) Recent advances in multi-agent reinforcement learning could be tried on the Reinforcement Learning problem formulation of this paper.

2) Some human in the loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) should give better results for the same framework.

3) It would be interesting to try out the framework suggested in this paper (L2T) in Imperfect information and partially observable settings.

4) As they have focused on data teaching exploring loss function teaching would be interesting.

=Critique=

While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'' which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies whereas the difference in accuracy less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task, the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper. They could have included larger datasets such as ImageNet and CIFAR100 in their experiments which would have provided some more insight.

The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model should facilitate transfer learning and significantly reduce student training time. The L2T framework seems to fall short in that aspect. For example, in comparison with other learning approaches, it still requires

Learning to Teach

2018-11-06T18:35:51Z

H454chen:

=Introduction=

This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.

In modern human society, the role of teaching is heavily implicated in our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can select training data that corresponds to the appropriate teaching materials (e.g., textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.

Thus, the training phase of L2T would have several episodes of interactions between the teacher and the student model. Based on the state information in each step, the teacher model would update the teaching actions so that the student model could perform better on the Machine Learning problem. The student model would then provide reward signals back to the teacher model. These reward signals are used by the teacher model as part of the Reinforcement Learning process to update its parameters. This process is end-to-end trainable and the authors are convinced that once converged, the teacher model could be applied to new learning scenarios and even new students, without extra efforts on re-training.

To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), for different applications including image classification and text understanding.

=Related Work=
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012), which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and to designing general optimizers and neural network architectures (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017).

The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes an order of data from easy instances to hard ones, with hardness being determined in different ways. Curriculum learning (CL) (Bengio et al., 2009; Spitkovsky et al., 2010; Tsvetkov et al., 2016) [3] measures hardness through heuristics on the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data.

The limitations of these works include the lack of a formally defined teaching problem, and the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.

=Learning to Teach=
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.

In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.

==Problem Definition==
The student model, denoted <math> \mu </math>, takes the set of training data <math> D </math>, the function class <math> \Omega </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{\omega^*} </math> whose parameter <math>\omega^*</math> minimizes the risk <math>R(\omega)</math>, as in:

\begin{align*}
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)
\end{align*}

The teaching model, denoted φ, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk <math> R(ω) </math> or progresses as fast as possible.

::'''Training Data''': Outputting a good training set <math> D </math>, analogous to human teachers providing students with proper learning materials such as textbooks.
::'''Loss Function''': Designing a good loss function <math> L </math> , analogous to providing useful assessment criteria for students.
::'''Hypothesis Space''': Defining a good function class <math> Ω </math> which the student model can select from. This is analogous to human teachers providing appropriate context, e.g., middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of <math> Ω </math> lead to different errors and optimization problems (Mohri et al., 2012).

==Framework==
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. The L2T process is outlined in the figure below:

[[File: L2T_process.png | 500px|center]]

* <math> s_t ∈ S </math> represents information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t−1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.
* <math> a_t ∈ A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space.
* <math> φ_θ : S → A </math> is the policy used by the teacher model to generate its action <math> φ_θ(s_t) = a_t </math>.
* The student model takes <math> a_t </math> as input and outputs the function <math> f_t </math> using conventional ML techniques.
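
The interaction described by the bullets above can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the threshold policy <code>teacher_policy</code>, the one-parameter least-squares student, and the use of per-sample losses as the state are all assumptions made for the example (the paper's state also includes the teaching history).

```python
import random

def teacher_policy(theta, state):
    """Hypothetical policy phi_theta: keep a sample if its loss is at least theta."""
    return [loss >= theta for loss in state]

def student_step(w, batch, keep, lr=0.1):
    """One SGD step of a toy student f_w(x) = w * x on the kept samples only."""
    kept = [(x, y) for (x, y), k in zip(batch, keep) if k]
    if not kept:
        return w
    grad = sum(2 * (w * x - y) * x for x, y in kept) / len(kept)
    return w - lr * grad

def run_episode(theta, data, steps=200, batch_size=8, seed=0):
    """One teaching episode: state -> teacher action -> student update, repeated."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        state = [(w * x - y) ** 2 for x, y in batch]  # s_t (simplified): per-sample losses
        action = teacher_policy(theta, state)          # a_t = phi_theta(s_t)
        w = student_step(w, batch, action)             # student produces f_t
    return w

# Toy data generated from y = 3x; with theta = 0 the teacher keeps every sample.
data = [(x / 10, 3.0 * x / 10) for x in range(-10, 11)]
w_final = run_episode(theta=0.0, data=data)
```

With <code>theta = 0</code> this reduces to ordinary mini-batch SGD; a learned teacher would instead adjust its parameters between episodes based on the student's feedback.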

Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.

=Application=

There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with the ''environment'' <math> S </math>. The paper applies data teaching to train a deep neural network student, <math> f </math>, on several classification tasks, so the student feedback measure is classification accuracy. The student's learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (a subset) of each mini-batch will be fed to the student. To reach convergence faster, the reward is set to relate to the speed at which the student model learns.

The authors also designed a state feature vector <math> g(s) </math> to efficiently represent the current state, which includes the arrived training data and the student model. The state features fall into three categories: data features, student model features, and features that combine both the data and the student model. The state features are computed when each mini-batch of data arrives.

The teacher model is trained to maximize the expected reward:

\begin{align}
J(θ) = E_{φ_θ(a|s)}[R(s,a)]
\end{align}

This objective is non-differentiable w.r.t. <math> θ </math>, so a likelihood ratio policy gradient algorithm (Williams, 1992) [4] is used to optimize <math> J(θ) </math>.
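
As a rough illustration of the likelihood ratio (REINFORCE) update, the sketch below trains a Bernoulli keep/filter policy on a toy problem. Everything specific here is an assumption for the example rather than the paper's setup: the two-dimensional feature vector, the toy reward (the fraction of "correct" keep decisions), the constant baseline, and the names <code>keep_probability</code>, <code>reinforce_update</code>, and <code>train_teacher</code>.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def keep_probability(theta, features):
    """phi_theta(a=1 | s): probability of keeping a sample with state features g(s)."""
    return sigmoid(sum(t * f for t, f in zip(theta, features)))

def reinforce_update(theta, trajectory, reward, lr=0.1):
    """Likelihood-ratio step: theta += lr * R * sum_t grad log phi_theta(a_t | s_t).

    For a Bernoulli policy with a sigmoid, grad log phi = (a - p) * g(s).
    """
    grad = [0.0] * len(theta)
    for features, action in trajectory:
        p = keep_probability(theta, features)
        for i, f in enumerate(features):
            grad[i] += (action - p) * f
    return [t + lr * reward * g for t, g in zip(theta, grad)]

def train_teacher(episodes=2000, seed=0):
    """Toy environment (an assumption): keeping a sample is 'correct' when its
    single feature is positive, and the reward is the fraction of correct decisions."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # weight on the feature, plus a bias term
    for _ in range(episodes):
        trajectory, correct = [], 0
        for _ in range(10):
            x = rng.uniform(-1, 1)
            features = [x, 1.0]
            action = 1 if rng.random() < keep_probability(theta, features) else 0
            trajectory.append((features, action))
            correct += int(action == (x > 0))
        reward = correct / 10 - 0.5  # constant baseline to reduce variance
        theta = reinforce_update(theta, trajectory, reward)
    return theta

theta = train_teacher()
```

After training, the weight on the feature should be positive, meaning the toy teacher has learned to keep samples with positive features. In the paper the reward instead reflects how quickly the student reaches a target accuracy.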

==Experiments==

The L2T framework is tested on the following student models: a multi-layer perceptron (MLP), ResNet (CNN), and a Long Short-Term Memory network (LSTM, an RNN).

The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset.

The strategy will be benchmarked against the following teaching strategies:

::'''NoTeach''': The student model is trained with the conventional procedure, without any teaching strategy (i.e., no data is filtered).
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss value to train the student with "easy" data and gradually increases the hardness.
::'''L2T''': The Learning to Teach framework.
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).
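
To make the SPL baseline concrete, here is a minimal sketch of loss-based filtering with a growing hardness threshold. The geometric schedule and the function names <code>spl_filter</code> and <code>spl_schedule</code> are hypothetical choices for illustration; the source does not specify how the threshold is increased.

```python
def spl_filter(losses, threshold):
    """Return indices of samples whose loss is below the current hardness threshold."""
    return [i for i, loss in enumerate(losses) if loss < threshold]

def spl_schedule(initial, growth, epoch):
    """Hypothetical schedule: the threshold grows geometrically with the epoch."""
    return initial * growth ** epoch

losses = [0.1, 0.9, 0.4, 2.5, 0.05]
early = spl_filter(losses, spl_schedule(0.5, 2.0, epoch=0))  # threshold 0.5: easy data only
later = spl_filter(losses, spl_schedule(0.5, 2.0, epoch=3))  # threshold 4.0: all data
```

Early in training only the low-loss ("easy") samples pass the filter; as the threshold grows, harder samples are admitted.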

===Training a New Student===

In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher: the teacher trains a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:

[[File: L2T_speed.png | 1100px|center]]

===Filtration Number===

When investigating the number of filtered data instances per epoch, the L2T teacher filters an increasing amount of data as training goes on for the two image classification tasks. The authors' intuition is that, for these tasks, the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.

[[File: L2T_fig3.png | 1100px|center]]

===Teaching New Student with Different Model Architecture===

In this part, a teacher model is first trained by interacting with a student model. The trained teacher is then used to teach another student model that has a different architecture. The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. The L2T algorithm can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach algorithms.

[[File: L2T_fig4.png | 1100px|center]]

===Training Time Analysis===

The learning curves show that L2T reaches a given accuracy more efficiently than the other strategies. This is especially evident during the earlier training stages.

[[File: L2T_fig5.png | 600px|center]]

===Accuracy Improvement===

When comparing training accuracy on the IMDB sentiment classification task, the L2T teaching policy improves over both NoTeach and SPL.

[[File: L2T_t1.png | 500px|center]]

=Future Work=

Several directions for future work follow from this paper:

1) Recent advances in multi-agent reinforcement learning could be tried on the reinforcement learning problem formulation of this paper.

2) Some human-in-the-loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) could be tried within the same framework and may give better results.

3) It would be interesting to try out the framework suggested in this paper (L2T) in imperfect-information and partially observable settings.

4) As the authors have focused on data teaching, exploring loss function teaching would be interesting.

=Critique=

While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR100 in their experiments, which would have provided some more insight.

I admire the idea of having a generalizable teacher model to enhance student learning. In some ways, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. However, I presume one of the intended benefits of having a teacher model is to help students learn faster, and it seems like the L2T framework is not able to deliver that benefit.

File:Oct30 associative embedding appendix fig2.jpg

2018-11-03T03:26:18Z

H454chen: Reference figure from paper: Alejandro Newell and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. arXiv preprint arXiv:1611.05424, 2016.

Reference figure from paper:
Alejandro Newell and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. arXiv preprint arXiv:1611.05424, 2016.

Pixels to Graphs by Associative Embedding

2018-11-03T03:24:40Z

H454chen:

== Introduction ==
The paper presents a novel approach to generating a scene graph. A scene graph, as it relates to an image, is a graph with a vertex for each object identified in the image and an edge for each relationship between objects.

An example of a scene graph:

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Scene Graph.PNG]]</div>

Current state-of-the-art techniques break down the construction of scene graphs by first identifying objects and then predicting the edges for any given pair of identified objects. This limits reasoning over the full graph. In contrast, this paper introduces an architecture that defines the entire graph directly from the image, enabling the network to reason across the entirety of the image to understand relationships, as opposed to only predicting relationships using object labels.

A key concern, given that the new architecture produces both vertices (objects) and edges (relationships), is connecting the two. Specifically, the output of the network is some set of relationships E, and some set of vertices V. The network needs to also output the “source” and “destination” of each relationship, so that the final graph can be formed. In the image above, for example, the network would also need to tell us that “holding” comes from “person” and goes to “Frisbee”. To do this, the paper uses associative embeddings. Specifically, the network outputs a particular “embedding vector” for each vertex, as well as a “source embedding” and “destination embedding” for each relationship. A final post-processing step finds the vertex embedding closest to each of the source/destination embeddings of each relationship and in this way assigns the edges to pairs of vertices.

== Previous Work ==

In the field of relationship detection, the following are the existing state of the art advances:

1) Framing the task of identifying objects using localization from referential expressions, detection of human-object interactions, or the more general tasks of Visual Relationship Detection (VRD) and scene graph generation.

2) Visual relationship detection methods like message passing RNNs and predicting over triplets of bounding boxes.

In the field of associative embedding, the following are some interesting applications:

1) Vector embeddings to group together body joints for multi-person pose estimation.

2) Vector embeddings to detect body joints of the various people in an image.

Reference Figure from the paper "Associative embedding: End-to-end learning for joint detection and grouping."

[[File:Oct30_associative_embedding_appendix_fig2.jpg | center]]

== The Architecture: ==
: '''1. Detecting Graph Elements'''

Given an image of dimensions h x w, a stacked hourglass (Appendix 2) is used to generate an h x w x f representation of the image. It should be noted that the dimension of the output (which is non-trainable) needs to fulfill certain criteria. Specifically, the resolution must be large enough to minimize the number of pixels with multiple detections while also being small enough to ensure that each 1 x 1 x f vector still contains the information needed for subsequent inference.

A 1x1 convolution and sigmoid activation is performed on this result to generate a heat map (one for objects and one for relationships, using separately determined convolutions). The value at a given pixel can be interpreted as the likelihood of detection at that particular pixel in the original image.

In order to claim that there is an element at some pixel, we need a likelihood threshold, denoted p-hat. If a given pixel in the map has a value at or above p-hat, we claim that there is an element at that pixel. This threshold is calculated by using binary cross-entropy loss on the final values in the heat map.
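
The thresholding step can be sketched as follows. The nested-list heat map of raw scores and the helper name <code>detect_elements</code> are assumptions for illustration only; the threshold value of 0.5 is likewise arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def detect_elements(logits, threshold):
    """Return (row, col, likelihood) for every pixel whose sigmoid
    likelihood is at or above the threshold."""
    detections = []
    for r, row in enumerate(logits):
        for c, z in enumerate(row):
            p = sigmoid(z)
            if p >= threshold:
                detections.append((r, c, p))
    return detections

# A 2 x 2 map of raw scores, as would come out of the 1x1 convolution.
logits = [[-3.0, 0.2], [2.0, -1.0]]
detections = detect_elements(logits, threshold=0.5)
positions = [(r, c) for r, c, _ in detections]
```

Only the two pixels with positive raw scores exceed a likelihood of 0.5 and are reported as detections.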

Finally, for each element that we detected, we extract the 1 x 1 x f feature vector. This is then used as an input to a set of Feed Forward Neural Networks (FFNNs), where we have a separate network for each characteristic of interest, and for each network, there's one hidden layer with f nodes. The object class and relationship (edges) could be supervised by softmax loss. Furthermore, in order to predict the bounding box of the object, we can use the approach proposed by the Faster-RCNN model[3]. The following image summarizes the process.

[[File:Extraction Process.PNG|center]]

:'''2. Connecting Elements with Associative Embeddings'''
As explained earlier, to construct the scene graph, we need to know the source and destination of each edge. This is done through associative embeddings.

First, let us define an embedding <math>h_i \in \mathbb{R}^d</math> produced for some vertex <math>i</math>, and let us assume that we have <math>n</math> object detections in a particular image. Now, define <math>h_{ik}</math>, for <math>k = 1, \dots, K_i</math> (where <math>K_i</math> is the number of edges in the graph that touch vertex <math>i</math>), as the embedding associated with an edge that touches vertex <math>i</math>. We define two loss functions on these sets.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 1.PNG]]</div>

The goal of <math>L_{pull}</math> is to minimize the squared differences between the embedding of a given vertex and the embedding of an edge that references said vertex.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 2.PNG]]</div>

On the other hand, minimizing <math>L_{push}</math> pushes the embeddings of different vertices apart. The further apart two embeddings are, the lower the output of the max becomes, until it eventually reaches 0. Here, <math>m</math> is a constant margin. In the paper, the values used were <math>m = 8</math> and <math>d = 8</math> (that is, 8D embeddings). Combining these two loss functions (and weighing them equally) accomplishes the task of predicting embeddings such that vertices are differentiated, but the embedding of an edge reference is most similar to the vertex it references.
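
The two losses and the nearest-embedding lookup can be sketched in plain Python. Since the loss formulas appear only as images on this page, the exact forms below are assumptions based on the prose: the pull loss is taken as the mean squared distance between each vertex embedding and its edge references, and the push loss as a hinge on pairwise vertex distances with margin <math>m</math>. The function names are hypothetical.

```python
import math

def pull_loss(vertex_emb, edge_refs):
    """Mean squared distance between each vertex embedding h_i and the
    edge embeddings h_ik that reference it (edge_refs[i] lists them)."""
    total, count = 0.0, 0
    for h_i, refs in zip(vertex_emb, edge_refs):
        for h_ik in refs:
            total += sum((a - b) ** 2 for a, b in zip(h_i, h_ik))
            count += 1
    return total / count if count else 0.0

def push_loss(vertex_emb, margin):
    """Hinge penalty on every pair of vertex embeddings closer than the margin."""
    total = 0.0
    n = len(vertex_emb)
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += max(0.0, margin - math.dist(vertex_emb[i], vertex_emb[j]))
    return total

def nearest_vertex(ref, vertex_emb):
    """Post-processing: index of the vertex whose embedding is closest to ref."""
    return min(range(len(vertex_emb)), key=lambda i: math.dist(ref, vertex_emb[i]))

vertices = [[0.0, 0.0], [10.0, 0.0]]         # two vertex embeddings (d = 2 for brevity)
edge_refs = [[[1.0, 0.0]], [[10.0, 1.0]]]    # one edge reference per vertex
```

In this toy example the vertices are 10 units apart, so with <math>m = 8</math> the push loss is already zero, and each edge reference is matched back to the vertex it was generated for.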

:'''3. Support for Overlapping Detections'''
An obvious concern is how the network would operate if there was more than one detection (be it object or relationship) in a given pixel. For example, detections of “shirt” and “person” may be centered at the exact same pixel. To account for this, the architecture is modified to allow for “slots” at each pixel. Specifically, <math>s_o</math> object detections and <math>s_r</math> relationship detections are allowed at a given pixel.

In order to allow for this, some changes are required after the feature extraction step. Specifically, we now use the 1x1xf vector as the input for <math>s_o</math> (or <math>s_r</math>) different sets of 4 FFNNs, where the output of the first three is as shown in figure 2, and the final FFNN outputs the probability of a detection existing in that particular slot, at that particular pixel. This new network is trained exclusively on whether or not a detection has been made in that slot and, in prediction, is used to determine the number of slots to output at a given pixel. It is critical to note that these <math>s_o</math> (or <math>s_r</math>) sets of FFNNs share absolutely no weights, and each is trained for detection in its assigned slot.

It is important to note that this implies a change in the training procedure. We now have <math>s_o</math> (or <math>s_r</math>) different predictions (be it class, or class + bounding box) that we need to match with our set of ground-truth detections at a given pixel. Without this step, we would not be able to assign a value to the error for that sample. To do this, we construct a one-hot encoded vector of the ground-truth class and bounding box anchor (the reference vector) for each ground-truth detection, and then match these vectors with the <math>s_o</math> (or <math>s_r</math>) outputs provided at a given pixel. The Hungarian method is used to ensure maximum matching between the outputs and the reference vectors while ensuring we do not assign the same detection to multiple slots.
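
The matching step can be illustrated with a small sketch. Note that this is not the Hungarian method itself: for the handful of slots involved, the sketch simply searches all one-to-one assignments with <code>itertools.permutations</code>, which finds the same minimum-cost assignment. The cost matrix values and the name <code>match_slots</code> are made up for the example.

```python
from itertools import permutations

def match_slots(cost):
    """cost[g][s] is the cost of assigning ground-truth detection g to slot s.

    Returns (assignment, total), where assignment[g] is the slot chosen for
    detection g and no slot is used twice. Requires len(cost) <= number of slots.
    """
    n_gt = len(cost)
    n_slots = len(cost[0]) if cost else 0
    best, best_total = None, float("inf")
    for slots in permutations(range(n_slots), n_gt):
        total = sum(cost[g][s] for g, s in enumerate(slots))
        if total < best_total:
            best, best_total = slots, total
    return best, best_total

# Two ground-truth detections at one pixel, three available slots.
cost = [[0.9, 0.1, 0.5],
        [0.2, 0.3, 0.8]]
assignment, total = match_slots(cost)
```

Here the first detection is assigned to slot 1 and the second to slot 0. The brute-force search grows factorially with the number of slots, so a real implementation would use a proper assignment solver.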

==Results==
A quick note on notation: R@50 indicates what percentage of ground-truth subject-predicate-object tuples appeared in a proposal of 50 such tuples. Since R@100 offers more possibilities, it will necessarily be at least as high. A value of 6.7, for example, indicates that 6.7% of the ground-truth tuples appeared in the proposals of the network.
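
As a small worked example of the metric, the sketch below computes recall at K for one image. It assumes proposals are already sorted by confidence and that tuples match exactly; the tuples themselves and the name <code>recall_at_k</code> are invented for illustration.

```python
def recall_at_k(ground_truth, proposals, k):
    """Fraction of ground-truth (subject, predicate, object) tuples that
    appear among the top-k proposals (proposals sorted by confidence)."""
    top_k = set(proposals[:k])
    hits = sum(1 for t in ground_truth if t in top_k)
    return hits / len(ground_truth) if ground_truth else 0.0

ground_truth = [("person", "holding", "frisbee"),
                ("person", "wearing", "shirt")]
proposals = [("person", "wearing", "hat"),
             ("person", "holding", "frisbee"),
             ("person", "wearing", "shirt")]

r_at_2 = recall_at_k(ground_truth, proposals, k=2)
r_at_3 = recall_at_k(ground_truth, proposals, k=3)
```

Allowing more proposals can only add hits, which is why recall at a larger K is never lower.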

The authors tested the network against two other architectures designed to develop a semantic understanding of images. For this, they used the Visual Genome dataset, with <math>s_o = 3</math> and <math>s_r = 6</math>. Overall, the new architecture vastly outperformed past models. The results were as follows:

[[File:Results Table.PNG|center|600px]]

The table can be interpreted as follows:

::'''SGGen (no RPN)''': The accuracy of the scene graph proposed for a given image, without the use of a Region Proposal Network (RPN). No class predictions are provided.
::'''SGGen (with RPN)''': Same as above, except that the output of the Region Proposal Network is used to enhance the input for a given image. No class predictions are provided.
::'''SGCls''': Ground-truth object bounding boxes are provided. The network is asked to classify them and determine relationships.
::'''PredCls''': As above, except the classes are also provided. The only goal is to predict relationships.

Further analysis into the accuracy, when looking at predicates individually, shows that the architecture is very sensitive to over-represented relationship predicates.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Results - Part 2.PNG]]</div>

As shown in Figure 5, for many ground-truth predicates (those that do not appear often in the ground truth), the network does poorly. Even when allowed to propose 100 tuples, the network does not offer the predicate. Figure 4 simply observes the fact that certain sets of relationship predicates appear predominantly in a subset of slots. No general explanation has been offered for this behavior.

== Conclusion ==
In conclusion, the paper offers a novel approach that enables the extraction of image semantics while reasoning over the entire context of the image. Associative embeddings are used to connect object and predicate relationships, and parallel “slots” allow for multiple detections in one pixel. While this approach offers noticeable improvements in accuracy, it is clear that work needs to be done to account for the non-uniform distributions of relationships in the dataset.

== Critiques ==

The paper's contributions towards patterning unordered network outputs and using associative embeddings for connecting vertices and edges are commendable. However, it should be noted that this paper is only an incremental improvement over existing well-studied architectures such as the hourglass architecture. The modifications also seem ad hoc. The authors say that they make a slight modification to the hourglass design, double the number of features, and weight all the losses equally. No scientific justification for why this is needed is given. Also, the choice of 3 and 6 for <math display = "inline"> s_o</math> and <math display = "inline"> s_r</math> is not clearly justified, as the authors leave out a fraction of the cases. I am not sure if the changes made are truly a critical advance, as the experiments are conducted only on a single dataset and no generalizability arguments are made by the authors. So the methods might work well only for this dataset. The theoretical analysis done in the paper comes directly from the hourglass literature and cannot be counted as a novel contribution.
== Appendices ==

'''Appendix 1: Sample Outputs'''
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Sample Pixel Graph Outputs.PNG]]</div>

'''Appendix 2: Stacked Hourglass Architecture'''
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Hourglass.PNG]]</div>

Although this goes beyond the focus of the paper, I would like to add a brief overview of the stacked hourglass architecture used to generate the heat map. This architecture is unique in that it allows cyclical top-down, bottom-up inference and recombination of features. While most architectures focus on optimizing the bottom-up portion (reducing dimensionality), the stacked-hourglass gives the network more flexibility in how it generates a representation by allowing it to learn a series of down-sampling / up-sampling steps.

== References ==
1. Alejandro Newell and Jia Deng, “Pixels to Graphs by Associative Embedding,” in NIPS, 2017

2. Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation. ECCV, 2016

3. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS, pages 91–99, 2015.

File:Oct30 associative embedding appendix fig2.JPG

2018-11-03T03:21:04Z

H454chen: Figure from Alejandro Newell and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. arXiv preprint arXiv:1611.05424, 2016.

Figure from
Alejandro Newell and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. arXiv preprint arXiv:1611.05424, 2016.

File:DeepVO Presentation Henry.pdf

2018-11-03T03:16:41Z

H454chen: University of Waterloo Fall 2018 STAT948 Presentation

University of Waterloo
Fall 2018 STAT948
Presentation

stat946F18

2018-11-03T03:15:46Z

H454chen:

== [[F18-STAT946-Proposal| Project Proposal ]] ==

=Paper presentation=

[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]

= Record your contributions here [https://docs.google.com/spreadsheets/d/1SxkjNfhOg_eXWpUnVHuIP93E6tEiXEdpm68dQGencgE/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique on the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="700pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]
|-
|Oct 25 || Dhruv Kumar || 1 || Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs || [https://openreview.net/pdf?id=rkRwGg-0Z Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs Summary]
|-
|Oct 25 || Amirpasha Ghabussi || 2 || DCN+: Mixed Objective And Deep Residual Coattention for Question Answering || [https://openreview.net/pdf?id=H1meywxRW Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering Summary]
|-
|Oct 25 || Juan Carrillo || 3 || Hierarchical Representations for Efficient Architecture Search || [https://arxiv.org/abs/1711.00436 Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search Summary]
[https://wiki.math.uwaterloo.ca/statwiki/images/1/15/HierarchicalRep-slides.pdf Slides]
|-
|Oct 30 || Manpreet Singh Minhas || 4 || End-to-end Active Object Tracking via Reinforcement Learning || [http://proceedings.mlr.press/v80/luo18a/luo18a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End_to_end_Active_Object_Tracking_via_Reinforcement_Learning Summary]
|-
|Oct 30 || Marvin Pafla || 5 || Fairness Without Demographics in Repeated Loss Minimization || [http://proceedings.mlr.press/v80/hashimoto18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization Summary]
|-
|Oct 30 || Glen Chalatov || 6 || Pixels to Graphs by Associative Embedding || [http://papers.nips.cc/paper/6812-pixels-to-graphs-by-associative-embedding Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pixels_to_Graphs_by_Associative_Embedding Summary]
|-
|Nov 1 || Sriram Ganapathi Subramanian || 7 ||Differentiable plasticity: training plastic neural networks with backpropagation || [http://proceedings.mlr.press/v80/miconi18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity Summary]
[https://wiki.math.uwaterloo.ca/statwiki/images/3/3c/Deep_learning_course_presentation.pdf Slides]
|-
|Nov 1 || Hadi Nekoei || 8 || Synthesizing Programs for Images using Reinforced Adversarial Learning || [http://proceedings.mlr.press/v80/ganin18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Synthesizing_Programs_for_Images_usingReinforced_Adversarial_Learning Summary]
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Synthesizing_Programs_for_Images_using_Reinforced_Adversarial_Learning.pdf Slides]
|-
|Nov 1 || Henry Chen || 9 || DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks || [https://ieeexplore.ieee.org/abstract/document/7989236 Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DeepVO_Towards_end_to_end_visual_odometry_with_deep_RNN Summary]
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DeepVO_Presentation_Henry.pdf Slides]
|-
|Nov 6 || Nargess Heydari || 10 ||Wavelet Pooling For Convolutional Neural Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks Summary]
|-
|Nov 6 || Aravind Ravi || 11 || Towards Image Understanding from Deep Compression Without Decoding || [https://openreview.net/forum?id=HkXWCMbRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding Summary]
|-
|Nov 6 || Ronald Feng || 12 || Learning to Teach || [https://openreview.net/pdf?id=HJewuJWCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach Summary]
|-
|Nov 8 || Neel Bhatt || 13 || Annotating Object Instances with a Polygon-RNN || [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN Summary]
|-
|Nov 8 || Jacob Manuel || 14 || Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels || [https://arxiv.org/pdf/1804.06872.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Co-Teaching Summary]
|-
|Nov 8 || Charupriya Sharma|| 15 || Tighter Variational Bounds are Not Necessarily Better || [https://arxiv.org/pdf/1802.04537.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Tighter_Variational_Bounds_are_Not_Necessarily_Better Summary]
|-
|Nov 13 || Sagar Rajendran || 16 || Zero-Shot Visual Imitation || [https://openreview.net/pdf?id=BkisuzWRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Zero-Shot_Visual_Imitation Summary]
|-
|Nov 13 || Jiazhen Chen || 17 || || ||
|-
|Nov 13 || Neil Budnarain || 18 || Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data || [https://openreview.net/pdf?id=ryBnUWb0b Paper] ||
|-
|Nov 15 || Zheng Ma || 19 || Reinforcement Learning of Theorem Proving || [https://arxiv.org/abs/1805.07563 Paper] ||
|-
|Nov 15 || Abdul Khader Naik || 20 || || ||
|-
|Nov 15 || Johra Muhammad Moosa || 21 || Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin || [https://papers.nips.cc/paper/7255-attend-and-predict-understanding-gene-regulation-by-selective-attention-on-chromatin.pdf Paper] ||
|-
|Nov 20 || Zahra Rezapour Siahgourabi || 22 || || ||
|-
|Nov 20 || Shubham Koundinya || 23 || || ||
|-
|Nov 20 || Salman Khan || 24 || Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples || [http://proceedings.mlr.press/v80/athalye18a.html Paper] ||
|-
|Nov 22 ||Soroush Ameli || 25 || Learning to Navigate in Cities Without a Map || [https://arxiv.org/abs/1804.00168 Paper] ||
|-
|Nov 22 ||Ivan Li || 26 || Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction || [https://arxiv.org/pdf/1802.05451v3.pdf Paper] ||
|-
|Nov 22 ||Sigeng Chen || 27 || || ||
|-
|Nov 27 || Aileen Li || 28 || Spatially Transformed Adversarial Examples ||[https://openreview.net/pdf?id=HyydRMZC- Paper] ||
|-
|Nov 27 ||Xudong Peng || 29 || Multi-Scale Dense Networks for Resource Efficient Image Classification || [https://openreview.net/pdf?id=Hk2aImxAb Paper] ||
|-
|Nov 27 ||Xinyue Zhang || 30 || An Inference-Based Policy Gradient Method for Learning Options || [http://proceedings.mlr.press/v80/smith18a/smith18a.pdf Paper] ||
|-
|Nov 29 ||Junyi Zhang || 31 || Autoregressive Convolutional Neural Networks for Asynchronous Time Series || [http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Paper] ||
|-
|Nov 29 ||Travis Bender || 32 || Automatic Goal Generation for Reinforcement Learning Agents || [http://proceedings.mlr.press/v80/florensa18a/florensa18a.pdf Paper] ||
|-
|Nov 29 ||Patrick Li || 33 || Near Optimal Frequent Directions for Sketching Dense and Sparse Matrices || [https://www.cse.ust.hk/~huangzf/ICML18.pdf Paper] ||
|-
|Makeup || Ruijie Zhang || 34 || Searching for Efficient Multi-Scale Architectures for Dense Image Prediction || [https://arxiv.org/pdf/1809.04184.pdf Paper]||
|-
|Makeup || Ahmed Afify || 35 ||Don't Decay the Learning Rate, Increase the Batch Size || [https://openreview.net/pdf?id=B1Yy1BxCZ Paper]||
|-
|Makeup || Gaurav Sahu || 36 || TBD || ||
|-
|Makeup || Kashif Khan || 37 || Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] ||
|-
|Makeup || Shala Chen || 38 || A Neural Representation of Sketch Drawings || ||
|-
|Makeup || Ki Beom Lee || 39 || Detecting Statistical Interactions from Neural Network Weights|| [https://openreview.net/forum?id=ByOfBggRZ Paper] ||
|-
|Makeup || Wesley Fisher || 40 || Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling || [http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling Summary]

stat946F18

2018-11-01T16:14:18Z

H454chen:

== [[F18-STAT946-Proposal| Project Proposal ]] ==

=Paper presentation=

[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]

= Record your contributions here [https://docs.google.com/spreadsheets/d/1SxkjNfhOg_eXWpUnVHuIP93E6tEiXEdpm68dQGencgE/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique on the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="700pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|-
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]
|-
|Oct 25 || Dhruv Kumar || 1 || Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs || [https://openreview.net/pdf?id=rkRwGg-0Z Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs Summary]
|-
|Oct 25 || Amirpasha Ghabussi || 2 || DCN+: Mixed Objective And Deep Residual Coattention for Question Answering || [https://openreview.net/pdf?id=H1meywxRW Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering Summary]
|-
|Oct 25 || Juan Carrillo || 3 || Hierarchical Representations for Efficient Architecture Search || [https://arxiv.org/abs/1711.00436 Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search Summary]
[https://wiki.math.uwaterloo.ca/statwiki/images/1/15/HierarchicalRep-slides.pdf Slides]
|-
|Oct 30 || Manpreet Singh Minhas || 4 || End-to-end Active Object Tracking via Reinforcement Learning || [http://proceedings.mlr.press/v80/luo18a/luo18a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End_to_end_Active_Object_Tracking_via_Reinforcement_Learning Summary]
|-
|Oct 30 || Marvin Pafla || 5 || Fairness Without Demographics in Repeated Loss Minimization || [http://proceedings.mlr.press/v80/hashimoto18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization Summary]
|-
|Oct 30 || Glen Chalatov || 6 || Pixels to Graphs by Associative Embedding || [http://papers.nips.cc/paper/6812-pixels-to-graphs-by-associative-embedding Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pixels_to_Graphs_by_Associative_Embedding Summary]
|-
|Nov 1 || Sriram Ganapathi Subramanian || 7 ||Differentiable plasticity: training plastic neural networks with backpropagation || [http://proceedings.mlr.press/v80/miconi18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity Summary]
|-
|Nov 1 || Hadi Nekoei || 8 || Synthesizing Programs for Images using Reinforced Adversarial Learning || [http://proceedings.mlr.press/v80/ganin18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Synthesizing_Programs_for_Images_usingReinforced_Adversarial_Learning Summary]
|-
|Nov 1 || Henry Chen || 9 || DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks || [https://ieeexplore.ieee.org/abstract/document/7989236 Paper] ||
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DeepVO_Towards_end_to_end_visual_odometry_with_deep_RNN Summary]
[https://docs.google.com/presentation/d/1-ix4Afx4o2A1CeofE9chAeIh40z8BmWtII7ZljGj1bk/edit?usp=sharing Slides]
|-
|Nov 6 || Nargess Heydari || 10 ||Wavelet Pooling For Convolutional Neural Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks Summary]
|-
|Nov 6 || Aravind Ravi || 11 || Towards Image Understanding from Deep Compression Without Decoding || [https://openreview.net/forum?id=HkXWCMbRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding Summary]
|-
|Nov 6 || Ronald Feng || 12 || Learning to Teach || [https://openreview.net/pdf?id=HJewuJWCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach Summary]
|-
|Nov 8 || Neel Bhatt || 13 || Annotating Object Instances with a Polygon-RNN || [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf Paper] ||
|-
|Nov 8 || Jacob Manuel || 14 || Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels || [https://arxiv.org/pdf/1804.06872.pdf Paper] ||
|-
|Nov 8 || Charupriya Sharma|| 15 || Tighter Variational Bounds are Not Necessarily Better || [https://arxiv.org/pdf/1802.04537.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Tighter_Variational_Bounds_are_Not_Necessarily_Better Summary]
|-
|Nov 13 || Sagar Rajendran || 16 || Zero-Shot Visual Imitation || [https://openreview.net/pdf?id=BkisuzWRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Zero-Shot_Visual_Imitation Summary]
|-
|Nov 13 || Jiazhen Chen || 17 || || ||
|-
|Nov 13 || Neil Budnarain || 18 || Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data || [https://openreview.net/pdf?id=ryBnUWb0b Paper] ||
|-
|Nov 15 || Zheng Ma || 19 || Reinforcement Learning of Theorem Proving || [https://arxiv.org/abs/1805.07563 Paper] ||
|-
|Nov 15 || Abdul Khader Naik || 20 || || ||
|-
|Nov 15 || Johra Muhammad Moosa || 21 || Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin || [https://papers.nips.cc/paper/7255-attend-and-predict-understanding-gene-regulation-by-selective-attention-on-chromatin.pdf Paper] ||
|-
|Nov 20 || Zahra Rezapour Siahgourabi || 22 || || ||
|-
|Nov 20 || Shubham Koundinya || 23 || || ||
|-
|Nov 20 || Salman Khan || 24 || Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples || [http://proceedings.mlr.press/v80/athalye18a.html Paper] ||
|-
|Nov 22 ||Soroush Ameli || 25 || Learning to Navigate in Cities Without a Map || [https://arxiv.org/abs/1804.00168 Paper] ||
|-
|Nov 22 ||Ivan Li || 26 || Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction || [https://arxiv.org/pdf/1802.05451v3.pdf Paper] ||
|-
|Nov 22 ||Sigeng Chen || 27 || || ||
|-
|Nov 27 || Aileen Li || 28 || Spatially Transformed Adversarial Examples ||[https://openreview.net/pdf?id=HyydRMZC- Paper] ||
|-
|Nov 27 ||Xudong Peng || 29 || Multi-Scale Dense Networks for Resource Efficient Image Classification || [https://openreview.net/pdf?id=Hk2aImxAb Paper] ||
|-
|Nov 27 ||Xinyue Zhang || 30 || An Inference-Based Policy Gradient Method for Learning Options || [http://proceedings.mlr.press/v80/smith18a/smith18a.pdf Paper] ||
|-
|Nov 29 ||Junyi Zhang || 31 || Autoregressive Convolutional Neural Networks for Asynchronous Time Series || [http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Paper] ||
|-
|Nov 29 ||Travis Bender || 32 || Automatic Goal Generation for Reinforcement Learning Agents || [http://proceedings.mlr.press/v80/florensa18a/florensa18a.pdf Paper] ||
|-
|Nov 29 ||Patrick Li || 33 || Near Optimal Frequent Directions for Sketching Dense and Sparse Matrices || [https://www.cse.ust.hk/~huangzf/ICML18.pdf Paper] ||
|-
|Makeup || Ruijie Zhang || 34 || Searching for Efficient Multi-Scale Architectures for Dense Image Prediction || [https://arxiv.org/pdf/1809.04184.pdf Paper]||
|-
|Makeup || Ahmed Afify || 35 ||Don't Decay the Learning Rate, Increase the Batch Size || [https://openreview.net/pdf?id=B1Yy1BxCZ Paper]||
|-
|Makeup || Gaurav Sahu || 36 || TBD || ||
|-
|Makeup || Kashif Khan || 37 || Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] ||
|-
|Makeup || Shala Chen || 38 || A Neural Representation of Sketch Drawings || ||
|-
|Makeup || Ki Beom Lee || 39 || Detecting Statistical Interactions from Neural Network Weights|| [https://openreview.net/forum?id=ByOfBggRZ Paper] ||
|-
|Makeup || Wesley Fisher || 40 || Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling || [http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling Summary]

DeepVO Towards end to end visual odometry with deep RNN

2018-10-30T03:52:46Z

H454chen: Added Critiques

== Introduction ==
Visual Odometry (VO) is a computer vision technique for estimating an object’s position and orientation from camera images. It is commonly used for “pose estimation and robot localization”, with notable applications on the Mars Exploration Rovers and autonomous vehicles [x1] [x2]. While the research field of VO is broad, this paper focuses on monocular visual odometry. In particular, the authors examine prominent VO methods and argue that mainstream geometry-based monocular VO methods should be amended with deep learning approaches. The paper then proposes a novel deep learning-based end-to-end VO algorithm and empirically demonstrates its viability.

== Related Work ==

Visual odometry algorithms can be grouped into two main categories. The first consists of conventional methods, which are based on established principles of geometry. Specifically, an object’s position and orientation are obtained by identifying reference points and calculating how those points change over the image sequence. Algorithms in this category can be further divided into sparse feature-based methods and direct methods, which differ in how they select reference points. Sparse feature-based methods establish reference points using salient image features, such as corners and edges [8]. Direct methods, on the other hand, make use of the whole image and consider every pixel as a reference point [11]. Recently, semi-direct methods that combine the benefits of both approaches have been gaining popularity [16].

Today, most state-of-the-art VO algorithms belong to the geometry family. However, they suffer from significant limitations. For example, direct methods assume “photometric consistency” [11], i.e., that a scene point keeps the same brightness from frame to frame. Sparse feature methods are also prone to “drifting” because of outliers and noise. As a result, the paper argues that geometry-based methods are difficult to engineer and calibrate, which limits their practicality. Figure 1 illustrates the general architecture of geometry-based algorithms and outlines the processing steps they require, such as Camera Calibration, Feature Detection, Feature Matching (tracking), Outlier Rejection, Motion Estimation, Scale Estimation, and Local Optimization (bundle adjustment).

[[File:DeepVO_Figure_1.png]]

Figure 1. Architectures of the conventional geometry-based monocular VO method.

The second category of VO algorithms is based on learning. Namely, they try to learn an object’s motion model from labeled optical flows. Initially, these models were trained using classic machine learning techniques such as KNN [15], Gaussian Processes [16], and Support Vector Machines [17]. However, these models could not handle highly non-linear and high-dimensional inputs efficiently, leading to poor performance in comparison with geometry-based methods. For this reason, deep learning-based approaches now dominate research in this field and have produced many promising results. For example, CNN-based models can now recognize places based on appearance [18] and detect direction and velocity from stereo inputs [20]. Moreover, a deep learning model even achieved robust VO with blurred and under-exposed images [21]. While these successes are encouraging, the authors observe that a CNN-based architecture is “incapable of modeling sequential information”. They therefore propose using an RNN to tackle this problem.

== End-to-End Visual odometry through RCNN ==

=== Architecture Overview ===
An end-to-end monocular VO model is proposed that uses a deep Recurrent Convolutional Neural Network (RCNN). Figure 2 depicts the end-to-end model, which comprises three main stages. First, the model takes a monocular video as input and pre-processes the image sequence by “subtracting the mean RGB values of all frames” from each frame. Then, consecutive images are stacked to form tensors, which become the inputs to the CNN stage. The purpose of the CNN stage is to extract salient features from the image tensors. The structure of the CNN is inspired by FlowNet [24], a model designed to extract optical flow. Details of the CNN structure are shown in Table 1. Using the CNN features as input, the RNN stage tries to estimate the temporal and sequential relations among the features. It does this with two Long Short-Term Memory (LSTM) networks, which estimate the pose at each time step using both long-term and short-term dependencies. Figure 3 illustrates the RNN architecture.

[[File:DeepVO_Figure_2.png]]

Figure 2. Architectures of the proposed RCNN based monocular VO system.

[[File:DeepVO_Table_1.png]]

Table 1. CNN structure

[[File:DeepVO_Figure_3.png]]

Figure 3. Folded and unfolded LSTMs and their internal structure.
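The pre-processing stage described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name <code>preprocess</code>, the (T, H, W, 3) array layout, and the choice to stack each frame with its successor along the channel axis are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def preprocess(frames):
    """Mean-subtract a video and stack consecutive frames.

    frames: array of shape (T, H, W, 3) holding T RGB frames (assumed layout).
    Returns an array of shape (T - 1, H, W, 6).
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Mean RGB value over all frames and all pixels, one value per channel.
    mean_rgb = frames.mean(axis=(0, 1, 2))
    centered = frames - mean_rgb
    # Assumption: each frame is paired with the next one along the channel axis.
    return np.concatenate([centered[:-1], centered[1:]], axis=-1)

# Tiny synthetic "video": 5 frames of 4 x 6 pixels.
video = np.random.default_rng(0).integers(0, 256, size=(5, 4, 6, 3))
pairs = preprocess(video)
print(pairs.shape)  # (4, 4, 6, 6)
```

Each of the T - 1 stacked tensors would then be passed to the CNN stage, so that the network sees two neighbouring frames at once and can pick up motion cues between them.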

=== Training and Optimisation ===
The proposed RCNN model can be represented as a conditional probability of poses given an image sequence: p(Y_t | X_t) = p(y_1, ..., y_t | x_1, ..., x_t). Since this probability function is expressed by a deep RCNN, the problem can be interpreted as finding the network parameters (weights) that minimize the loss between the actual and predicted poses. Here, “the loss function is composed of Mean Square Error (MSE) of all positions and orientations”.
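A loss of this kind might look like the following sketch. The summary only states that the loss is an MSE over positions and orientations, so several details here are assumptions: the function name <code>pose_mse_loss</code>, the representation of each pose as a 3-vector position plus a 3-vector orientation, and the weight <code>kappa</code> that balances the two error terms.

```python
import numpy as np

def pose_mse_loss(pred_pos, true_pos, pred_ori, true_ori, kappa=1.0):
    """Mean over time steps of squared position error plus weighted
    squared orientation error. All inputs have shape (T, 3) (assumed)."""
    pred_pos = np.asarray(pred_pos, dtype=float)
    true_pos = np.asarray(true_pos, dtype=float)
    pred_ori = np.asarray(pred_ori, dtype=float)
    true_ori = np.asarray(true_ori, dtype=float)
    # Squared Euclidean error per time step.
    pos_err = np.sum((pred_pos - true_pos) ** 2, axis=-1)
    ori_err = np.sum((pred_ori - true_ori) ** 2, axis=-1)
    # kappa is a hypothetical scale factor balancing the two terms.
    return float(np.mean(pos_err + kappa * ori_err))

# Two time steps: one position error of 1 unit, one orientation error of 0.1.
zeros = np.zeros((2, 3))
loss = pose_mse_loss([[1, 0, 0], [0, 0, 0]], zeros,
                     [[0, 0, 0], [0, 0.1, 0]], zeros, kappa=100.0)
print(loss)  # approximately 1.0
```

Orientation errors are usually much smaller in magnitude than position errors, which is why a weight such as <code>kappa</code> is a common way to keep either term from dominating the gradient.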

== Experiments and Results ==
The paper evaluates the proposed RCNN VO model by comparing it empirically with the open-source VO library LIBVISO2 [7], a well-known geometry-based model. The comparison is done using the KITTI VO/SLAM benchmark [3], which contains 22 image sequences, 11 of which are labeled with ground truth. Two separate experiments are performed.

1. Quantitative analysis is performed using only the labeled image sequences. Namely, 4 of the 11 image sequences were used for training and the others were reserved for testing. Table 2 and Figure 6 outline the results, showing that the proposed RCNN model performs consistently better than the monocular VISO2_M model. However, it performs worse than the stereo VISO2_S model.

[[File:DeepVO_Table_2.png]]

[[File:DeepVO_Figure_6.png]]

2. The generalizability of the proposed RCNN model is evaluated using the unlabeled image sequences. Figure 8 outlines the test result, showing that the proposed model is able to generalize better than the monocular VISO2_M model and performs roughly the same as the stereo VISO2_S model.

[[File:DeepVO_Figure_8.png]]

== Conclusions ==
The paper presents a new RCNN VO model that combines CNNs with RNNs. Although it is considered a viable approach, it is not expected to be a replacement for the classic geometry-based approach. The main contribution of the paper is threefold: 1) The authors demonstrate that the monocular VO problem can be addressed in an end-to-end fashion based on DL, i.e., directly estimating poses from raw RGB images. Neither prior knowledge nor parameters are needed to recover the absolute scale. 2) The authors propose an RCNN architecture enabling the DL-based VO algorithm to be generalised to totally new environments by using the geometric feature representation learnt by the CNN. 3) Sequential dependence and complex motion dynamics of an image sequence, which are of importance to VO but cannot be explicitly or easily modelled by humans, are implicitly encapsulated and automatically learnt by the RCNN.

== Critiques ==

This paper cannot be considered a critical advance over the state of the art, as the authors merely suggest a method combining CNNs and RNNs for the visual odometry problem. The authors themselves state that deep learning, in the form of simple feed-forward neural networks and CNNs, has already been applied to this problem; only an RNN approach seems not to have been tried. Towards the end of the paper, the authors propose combining the RCNN with a geometry-based approach, but it is not intuitive how these two potentially very different methods could be combined, and the authors do not explain how they would do so. The authors also do not build a compelling case against the state-of-the-art methods or convincingly prove the superiority of the RCNN or of a combined method. For example, as the authors mention, both the RCNN and state-of-the-art geometry-based methods lose accuracy when the images show a large open area. The authors put forth some techniques to solve this problem for the geometry-based approaches, but they state that they do not have a similar method for the deep learning-based approaches. Thus, in such scenarios, the methods proposed by the authors do not seem to work at all.

The paper advances the field of deep learning-based VO by creating a pioneering end-to-end model that is capable of extracting features and learning sequential dynamics from monocular videos. While the new model clearly outperforms the LIBVISO2_M algorithm, it fails to demonstrate any advantage over the LIBVISO2_S algorithm. Hence, one may question whether the complexity of deep learning-based monocular VO methods is justified, and whether designers of robots or autonomous vehicles should opt for stereo vision wherever possible. Nonetheless, this end-to-end model is beneficial in situations where monocular VO is the only viable option. Furthermore, the paper could have benefited from including a qualitative comparison of the algorithm’s computation requirements, such as hardware specification, engineering time, and training time. Finally, the justification for the input sequence pre-processing is unclear. Future work could involve adapting the model for real-time visual odometry.

== References ==
[1] S. Wang, R. Clark, H. Wen and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 2043-2050.

[2] M. Maimone, Y. Cheng, and L. Matthies, "Two years of Visual Odometry on the Mars Exploration Rovers," Journal of Field Robotics. 24 (3): 169–186, 2007.

[3] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[7] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3D reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011.

[8] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.

[11] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2320–2327.

[15] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch, “Memory-based learning for visual odometry,” in Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 47–52.

[16] V. Guizilini and F. Ramos, “Semi-parametric learning for visual odometry,” The International Journal of Robotics Research, vol. 32, no. 5, pp. 526–546, 2013.

[17] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci, “Evaluation of non-geometric methods for visual odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014.

[18] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” in Proceedings of Robotics: Science and Systems (RSS), 2015.

[20] A. Kendall, M. Grimes, and R. Cipolla, “Convolutional networks for real-time 6-DoF camera relocalization,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.

[21] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring representation learning with CNNs for frame-to-frame ego-motion estimation,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp.18–25, 2016.

[24] A. Dosovitskiy, P. Fischer, E. Ilg, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox et al., “FlowNet: Learning optical flow with convolutional networks,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, pp. 2758–2766.

DeepVO Towards end to end visual odometry with deep RNN

2018-10-30T03:51:12Z

H454chen: Minor Revision of Experiments and Results

== Introduction ==
Visual Odometry (VO) is a computer vision technique for estimating an object’s position and orientation from camera images. It is an important technique commonly used for “pose estimation and robot localization”, with notable applications on the Mars Exploration Rovers and Autonomous Vehicles [x1] [x2]. While the research field of VO is broad, this paper focuses on the topic of monocular visual odometry. Particularly, the authors examine prominent VO methods and argue mainstream geometry based monocular VO methods should be amended with deep learning approaches. Subsequently, the paper proposes a novel deep-learning based end-to-end VO algorithm, and then empirically demonstrates its viability.

== Related Work ==

Visual odometry algorithms can be grouped into two main categories. The first is known as the conventional methods, and they are based on established principles of geometry. Specifically, an object’s position and orientation are obtained by identifying reference points and calculating how those points change over the image sequence. Algorithms in this category can be further divided into two sparse feature based methods and direct methods, which differ by how they select reference points. On the one hand, sparse feature based methods establish reference points using image salient features, such as corners and edges [8]. Direct methods, on the other hand, make use of the whole image and consider every pixel as a reference point [11]. Recenly, semi-direct methods that combine the benefits of both approaches are gaining popularity [16].

Today, most of state-of-the-art VO algorithms belong to the geometry family. However, they suffer significant limitations. For example, direct methods assume “photometric consistency” [11]. Sparse feature methods are also prone to “drifting” because of outliers and noises. As a result, the paper argues that geometry-based methods are difficult to engineer and calibrate, limiting its practicality. Figure 1 illustrates the general architecture of geometry-based algorithms, and it outlines necessary drift correction techniques such as Camera Calibration, Feature Detection, Feature Matching (tracking), Outlier Rejection, Motion Estimation, Scale Estimation, and Local optimization (bundle adjustment).

[[File:DeepVO_Figure_1.png]]

Figure 1. Architectures of the conventional geometry-based monocular VO method.

The second category of VO algorithms is based on learning. Namely, they try to learn an object’s motion model from labeled optical flows. Initially, these models are trained using classic Machine Learning techniques such as KNN [15], Gaussian Process [16], and Support Vector Machines[17]. However, these models were inefficient to handle highly non-linear and high-dimensional inputs, leading to poor performance in comparison with geometry-based methods. For this reason, Deep Learning-based approaches are dominating research in this field and producing many promising results. For example, CNN based models can now recognize places based on appearance [18] and detect direction and velocity from stereo inputs [20]. Moreover, a deep learning model even achieved robust VO with blurred and under-exposed images [21]. While these successes are encouraging, the authors observe that a CNN based architecture is “incapable of modeling sequential information”. Instead, they proposed to use RNN to tackle this problem.

== End-to-End Visual odometry through RCNN ==

=== Architecture Overview ===
An end-to-end monocular VO model is proposed by utilizing deep Recurrence Convolutional Neural Network (RCNN). Figure 2 depicts the end-to-end model, which is comprised of three main stages. First, the model takes a monocular video as input and pre-processes the image sequences by “subtracting the mean RGB values of all frames” from each frame. Then, consecutive image sequences are stacked to form tensors, which become the inputs for the CNN stage. The purpose of the CNN stages is to extract salient features from the image tensors. The structure of the CNN is inspired by FlowNet [24], which is a model design to extract optical flows. Details of the CNN structure is shown in Table 1. Using CNN optical flow features as input, the RNN stage tries to estimate the temporal and sequential relations among the features. The RNN stage does this by utilizing two Long Short-Term Memory networks (LSTM), which estimate object poses for each time step using both long-term and short-term dependencies. Figure 3 illustrated the RNN architecture.

[[File:DeepVO_Figure_2.png]]

Figure 2. Architectures of the proposed RCNN based monocular VO system.

[[File:DeepVO_Table_1.png]]

Table 1. CNN structure

[[File:DeepVO_Figure_3.png]]

Figure 3. Folded and unfolded LSTMs and its internal structure.

=== Training and Optimisation ===
The proposed RCNN model can be represented as a conditional probability of poses given an image sequence: p(Yt|Xt) = p(y1,...,yt|x1,...,xt). Given this probability function is expressed by a deep RCNN, the problem can be interpreted as finding the hyperparameters or network weights that minimize the loss function between actual and predicted poses. Such that, “the loss function is composed of Mean Square Error (MSE) of all positions and orientations”.

== Experiments and Results ==
The paper evaluates the proposed RCNN VO model by comparing it empirically with the open-source VO library LIBVISO2 [7], a well-known geometry-based model. The comparison is done using the KITTI VO/SLAM benchmark [3], which contains 22 image sequences, 11 of which are labeled with ground truth. Two separate experiments are performed.

1. Quantitative analysis is performed using only the labeled image sequences. Namely, 4 of the 11 image sequences were used for training and the others were reserved for testing. Table 2 and Figure 6 outline the results, showing that the proposed RCNN model performs consistently better than the monocular VISO2_M model. However, it performs worse than the stereo VISO2_S model.

[[File:DeepVO_Table_2.png]]

[[File:DeepVO_Figure_6.png]]

2. The generalizability of the proposed RCNN model is evaluated using the unlabeled image sequences. Figure 8 outlines the test results, showing that the proposed model generalizes better than the monocular VISO2_M model and performs roughly the same as the stereo VISO2_S model.

[[File:DeepVO_Figure_8.png]]

== Conclusions ==
The paper presents a new RCNN VO model that combines CNNs with RNNs. Although it is considered a viable approach, it is not expected to replace the classic geometry-based approach. The main contribution of the paper is threefold: 1) The authors demonstrate that the monocular VO problem can be addressed in an end-to-end fashion based on DL, i.e., by directly estimating poses from raw RGB images. Neither prior knowledge nor parameters are needed to recover the absolute scale. 2) The authors propose an RCNN architecture that enables the DL-based VO algorithm to generalise to totally new environments by using the geometric feature representation learnt by the CNN. 3) Sequential dependence and complex motion dynamics of an image sequence, which are important to VO but cannot be explicitly or easily modelled by humans, are implicitly encapsulated and automatically learnt by the RCNN.

== Critiques ==

This paper cannot be considered a critical advance in the state of the art, as the authors merely suggest a method combining CNNs and RNNs for the visual odometry problem. The authors themselves state that deep learning, in the form of simple feed-forward neural networks and CNNs, has already been applied to this problem; only an RNN approach appears not to have been tried. The authors propose combining the RCNN with geometry-based approaches towards the end of the paper, but it is not intuitive how these two potentially very different methods could be combined, and the authors do not explain any method for doing so. The authors also do not build a compelling case against the state-of-the-art methods or convincingly demonstrate the superiority of the RCNN or a combined method. For example, as the authors mention, both the RCNN and state-of-the-art geometry-based methods suffer from lower accuracy when the images show a large open area. The authors put forth some techniques to solve this problem for the geometry-based approaches, but they state that they do not have a similar method for the deep learning-based approaches. Thus, in such scenarios, the methods proposed by the authors do not seem to work at all.

== References ==
[1] S. Wang, R. Clark, H. Wen and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 2043-2050.

[2] M. Maimone, Y. Cheng, and L. Matthies, "Two years of Visual Odometry on the Mars Exploration Rovers," Journal of Field Robotics. 24 (3): 169–186, 2007.

[3] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[7] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3D reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011.

[8] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.

[11] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2320–2327.

[15] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch, “Memory-based learning for visual odometry,” in Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 47–52.

[16] V. Guizilini and F. Ramos, “Semi-parametric learning for visual odometry,” The International Journal of Robotics Research, vol. 32, no. 5, pp. 526–546, 2013.

[17] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci, “Evaluation of non-geometric methods for visual odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014.

[18] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” in Proceedings of Robotics: Science and Systems (RSS), 2015.

[20] A. Kendall, M. Grimes, and R. Cipolla, “Convolutional networks for real-time 6-DoF camera relocalization,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.

[21] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring representation learning with CNNs for frame-to-frame ego-motion estimation,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 18–25, 2016.

[24] A. Dosovitskiy, P. Fischer, E. Ilg, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox et al., “Flownet: Learning optical flow with convolutional networks,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, pp. 2758–2766.

DeepVO Towards end to end visual odometry with deep RNN

2018-10-30T03:48:28Z

H454chen: Minor Revision of Architecture Overview and Training and Optimisation

== Introduction ==
Visual Odometry (VO) is a computer vision technique for estimating an object’s position and orientation from camera images. It is an important technique commonly used for “pose estimation and robot localization”, with notable applications on the Mars Exploration Rovers and Autonomous Vehicles [x1] [x2]. While the research field of VO is broad, this paper focuses on monocular visual odometry. In particular, the authors examine prominent VO methods and argue that mainstream geometry-based monocular VO methods should be complemented by deep learning approaches. The paper then proposes a novel deep learning-based end-to-end VO algorithm and empirically demonstrates its viability.

== Related Work ==

Visual odometry algorithms can be grouped into two main categories. The first comprises the conventional methods, which are based on established principles of geometry. Specifically, an object’s position and orientation are obtained by identifying reference points and calculating how those points change over the image sequence. Algorithms in this category can be further divided into sparse feature-based methods and direct methods, which differ in how they select reference points. On the one hand, sparse feature-based methods establish reference points using salient image features, such as corners and edges [8]. Direct methods, on the other hand, make use of the whole image and consider every pixel as a reference point [11]. Recently, semi-direct methods that combine the benefits of both approaches have been gaining popularity [16].

Today, most state-of-the-art VO algorithms belong to the geometry family. However, they suffer from significant limitations. For example, direct methods assume “photometric consistency” [11]. Sparse feature methods are also prone to “drifting” because of outliers and noise. As a result, the paper argues that geometry-based methods are difficult to engineer and calibrate, which limits their practicality. Figure 1 illustrates the general architecture of geometry-based algorithms, and it outlines necessary drift correction techniques such as Camera Calibration, Feature Detection, Feature Matching (tracking), Outlier Rejection, Motion Estimation, Scale Estimation, and Local optimization (bundle adjustment).

[[File:DeepVO_Figure_1.png]]

Figure 1. Architecture of the conventional geometry-based monocular VO method.

The second category of VO algorithms is based on learning. Namely, they try to learn an object’s motion model from labeled optical flows. Initially, these models were trained using classic Machine Learning techniques such as KNN [15], Gaussian Processes [16], and Support Vector Machines [17]. However, these models handled highly non-linear, high-dimensional inputs poorly and therefore underperformed geometry-based methods. For this reason, Deep Learning-based approaches now dominate research in this field and are producing many promising results. For example, CNN-based models can now recognize places based on appearance [18] and detect direction and velocity from stereo inputs [20]. Moreover, a deep learning model has even achieved robust VO with blurred and under-exposed images [21]. While these successes are encouraging, the authors observe that a CNN-based architecture is “incapable of modeling sequential information”. To address this limitation, they propose using an RNN.

== End-to-End Visual odometry through RCNN ==

=== Architecture Overview ===
An end-to-end monocular VO model is proposed based on a deep Recurrent Convolutional Neural Network (RCNN). Figure 2 depicts the end-to-end model, which comprises three main stages. First, the model takes a monocular video as input and pre-processes the image sequence by “subtracting the mean RGB values of all frames” from each frame. Then, consecutive images are stacked to form tensors, which become the inputs for the CNN stage. The purpose of the CNN stage is to extract salient features from the image tensors. The structure of the CNN is inspired by FlowNet [24], a model designed to extract optical flow. Details of the CNN structure are shown in Table 1. Using the CNN optical flow features as input, the RNN stage tries to estimate the temporal and sequential relations among the features. It does this with two Long Short-Term Memory (LSTM) networks, which estimate the pose at each time step using both long-term and short-term dependencies. Figure 3 illustrates the RNN architecture.
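
The pre-processing step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name <code>preprocess_and_stack</code> is hypothetical, frames are assumed to arrive as an array of shape (T, H, W, 3), and stacking is assumed to pair each frame with its successor along the channel axis. The summary does not specify these details.

```python
import numpy as np

def preprocess_and_stack(frames):
    """Mean-subtract a video and stack consecutive frames.

    frames: array of shape (T, H, W, 3), an assumed layout.
    Returns an array of shape (T - 1, H, W, 6).
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Mean RGB value over all frames and all pixels: one value per channel.
    mean_rgb = frames.mean(axis=(0, 1, 2))
    centered = frames - mean_rgb
    # Assumption: frame t and frame t+1 are concatenated along the channel axis.
    return np.concatenate([centered[:-1], centered[1:]], axis=-1)

video = np.arange(4 * 2 * 2 * 3).reshape(4, 2, 2, 3)
stacked = preprocess_and_stack(video)
```

Each element of <code>stacked</code> would then be one input tensor for the CNN stage.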

[[File:DeepVO_Figure_2.png]]

Figure 2. Architecture of the proposed RCNN-based monocular VO system.

[[File:DeepVO_Table_1.png]]

Table 1. CNN structure

[[File:DeepVO_Figure_3.png]]

Figure 3. Folded and unfolded LSTMs and their internal structure.

=== Training and Optimisation ===
The proposed RCNN model can be represented as a conditional probability of poses given an image sequence: <math>p(Y_{t}|X_{t}) = p(y_{1},\ldots,y_{t}|x_{1},\ldots,x_{t})</math>. Given that this probability function is expressed by a deep RCNN, the problem can be interpreted as finding the network parameters (weights) that minimize the loss between the ground-truth and predicted poses, where “the loss function is composed of Mean Square Error (MSE) of all positions and orientations”.
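
As a rough illustration of such a loss, the sketch below averages the squared position and orientation errors over a sequence of poses. The function name <code>pose_mse_loss</code> and the weighting factor <code>kappa</code>, which balances orientation against position, are assumptions made for this example; the summary only states that the loss is an MSE over all positions and orientations.

```python
def pose_mse_loss(predicted, ground_truth, kappa=1.0):
    """Mean squared pose error over a sequence.

    predicted, ground_truth: lists of (position, orientation) pairs,
    where each position and orientation is a 3-element sequence.
    kappa: assumed scale factor weighting the orientation error.
    """
    total = 0.0
    for (p_hat, phi_hat), (p, phi) in zip(predicted, ground_truth):
        position_error = sum((a - b) ** 2 for a, b in zip(p_hat, p))
        orientation_error = sum((a - b) ** 2 for a, b in zip(phi_hat, phi))
        total += position_error + kappa * orientation_error
    return total / len(predicted)

predicted = [((1.0, 0.0, 0.0), (0.0, 0.0, 0.0))]
ground_truth = [((0.0, 0.0, 0.0), (0.0, 0.0, 0.5))]
loss = pose_mse_loss(predicted, ground_truth)
```

In training, this quantity would be minimized with respect to the network weights by gradient descent.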

== Experiments and Results ==
The paper evaluates the proposed RCNN VO model by comparing it empirically with the open-source VO library LIBVISO2 [7], a well-known geometry-based model. The comparison is carried out using the KITTI VO/SLAM benchmark [3]. In total, the KITTI VO/SLAM benchmark contains 22 image sequences, 11 of which are labeled with ground truth. Two separate experiments are performed.

1. Quantitative analysis is performed using only the labeled image sequences. Namely, 4 of those image sequences were used for training and the others for testing. Table 2 and Figure 6 outline the results, which show that the proposed RCNN model performs consistently better than the monocular VISO2_M model. However, it performs worse than the stereo VISO2_S model.

[[File:DeepVO_Table_2.png]]

[[File:DeepVO_Figure_6.png]]

2. The generalizability of the proposed RCNN model to a new environment is evaluated using the unlabeled image sequences. Figure 8 outlines the results, which show that the proposed model generalizes better than the monocular VISO2_M model and performs roughly the same as the stereo VISO2_S model.

[[File:DeepVO_Figure_8.png]]

== Conclusions ==
The paper presents a new RCNN VO model that combines CNNs with RNNs. Although it is considered a viable approach, it is not expected to replace the classic geometry-based approach. The main contribution of the paper is threefold: 1) The authors demonstrate that the monocular VO problem can be addressed in an end-to-end fashion based on DL, i.e., by directly estimating poses from raw RGB images. Neither prior knowledge nor parameters are needed to recover the absolute scale. 2) The authors propose an RCNN architecture that enables the DL-based VO algorithm to generalise to totally new environments by using the geometric feature representation learnt by the CNN. 3) Sequential dependence and complex motion dynamics of an image sequence, which are important to VO but cannot be explicitly or easily modelled by humans, are implicitly encapsulated and automatically learnt by the RCNN.

== Critiques ==

This paper cannot be considered a critical advance in the state of the art, as the authors merely suggest a method combining CNNs and RNNs for the visual odometry problem. The authors themselves state that deep learning, in the form of simple feed-forward neural networks and CNNs, has already been applied to this problem; only an RNN approach appears not to have been tried. The authors propose combining the RCNN with geometry-based approaches towards the end of the paper, but it is not intuitive how these two potentially very different methods could be combined, and the authors do not explain any method for doing so. The authors also do not build a compelling case against the state-of-the-art methods or convincingly demonstrate the superiority of the RCNN or a combined method. For example, as the authors mention, both the RCNN and state-of-the-art geometry-based methods suffer from lower accuracy when the images show a large open area. The authors put forth some techniques to solve this problem for the geometry-based approaches, but they state that they do not have a similar method for the deep learning-based approaches. Thus, in such scenarios, the methods proposed by the authors do not seem to work at all.

== References ==
[1] S. Wang, R. Clark, H. Wen and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 2043-2050.

[2] M. Maimone, Y. Cheng, and L. Matthies, "Two years of Visual Odometry on the Mars Exploration Rovers," Journal of Field Robotics. 24 (3): 169–186, 2007.

[3] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[7] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3D reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011.

[8] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.

[11] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2320–2327.

[15] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch, “Memory-based learning for visual odometry,” in Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 47–52.

[16] V. Guizilini and F. Ramos, “Semi-parametric learning for visual odometry,” The International Journal of Robotics Research, vol. 32, no. 5, pp. 526–546, 2013.

[17] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci, “Evaluation of non-geometric methods for visual odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014.

[18] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” in Proceedings of Robotics: Science and Systems (RSS), 2015.

[20] A. Kendall, M. Grimes, and R. Cipolla, “Convolutional networks for real-time 6-DoF camera relocalization,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.

[21] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring representation learning with CNNs for frame-to-frame ego-motion estimation,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 18–25, 2016.

[24] A. Dosovitskiy, P. Fischer, E. Ilg, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox et al., “Flownet: Learning optical flow with convolutional networks,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, pp. 2758–2766.

End to end Active Object Tracking via Reinforcement Learning

2018-10-29T21:57:22Z

H454chen:

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, so there is no need to handle camera control during tracking. Although passive tracking is very useful and is the subject of much of the existing work, it is inapplicable to scenarios such as tracking performed by a mobile robot with a mounted camera or by a drone.
Active tracking, on the other hand, involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune a pipeline with two separate subtasks. Tracking may require considerable human effort for bounding box labeling. Camera control is non-trivial and can incur many expensive rounds of trial and error in the real world.

To address these challenges, the paper presents an end-to-end active tracking solution based on deep reinforcement learning. More specifically, a ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action, and then the environment returns the updated state (the next visual frame). A3C, a modern Reinforcement Learning algorithm, is adopted to train the agent, with a customized reward function designed to encourage the agent to follow the object closely.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to check the generalization ability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

In state-of-the-art pipelines, the action module and the object tracking module are completely separate, which makes either one extremely difficult to train, because it is impossible to know which module caused the error observed at the end of the episode. At a high level, both modules serve the same function, since both aim at efficient navigation. It therefore makes sense to have a joint module that contains both the observation and the action-taking sub-modules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with common practice in Deep Reinforcement Learning, where the CNNs used to extract features from Atari game frames are combined with the Q-networks (in the case of DQN). The CNNs are trained concurrently with the feed-forward Q-networks, where the error function is the difference between the observed Q values and the target Q values.

=Related Work=

In the domain of object tracking, there are both active and passive approaches. The following list summarizes the advanced passive object tracking approaches:

1) Subspace learning was adopted to update the appearance model of an object.

:Earlier object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. 2008 introduce a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking [2].

2) Multiple instance learning was employed to track an object.

:Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. So, Babenko et al. 2009 introduce a novel algorithm that updates its appearance model using a “bag” of positive and negative examples. Subsequently, they show that tracking algorithms using weaker classifiers can still obtain superior performance [3].

3) Correlation filter based object tracking has achieved success in real-time object tracking.

:Correlation filter based object tracking algorithms attempt to “model the appearance of an object using filters”. At each frame, a small tracking window representing the target object is produced, and the tracker will correlate the windows over the image sequences, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called Minimum Output Sum of Squared Error (MOSSE) filter [4].

4) Structured output prediction was used to constrain object tracking and to avoid converting positions to labels of training samples.

:Hare et al. 2016 argue that the “sliding-window” approach used by popular object tracking algorithms is flawed because “the objective of the classifier (predicting labels for sliding-windows) is decoupled from the objective of the tracker (estimating object position).” Instead, they introduce a novel algorithm that uses “a kernelized structured output support vector machine (SVM) to avoid the need for intermediate classification”. Subsequently, they show that the approach outperforms traditional trackers in various benchmarks [5].

5) Tracking, learning, and Detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

:Long-term tracking is the task of recognizing and tracking an object as it “moves in and out of a camera’s field of view”. This task is made difficult by problems such as an object reappearing in the scene with a changed appearance, scale, or illumination. Kalal et al. 2012 proposed a unified tracking framework (TLD) that accomplishes long-term tracking by “decomposing the task into tracking, learning, and detection”. Specifically, “the tracker follows an object from frame-to-frame; the detector localizes the object’s appearances; and, the learner improves the detector by learning from errors.” Altogether, the TLD framework outperforms previous state-of-the-art tracking approaches [6].

6) Deep learning models like stacked autoencoder have been used to learn good representations for object tracking.

:In recent years, Deep Learning approaches have gained prominence in the field of object tracking. For example, Wang et al. 2013 obtain outstanding results using a deep learning-based algorithm that combines offline feature extraction and online tracking using stacked denoising autoencoders, whereas Wang et al. 2016 introduce a sequentially trained convolutional network that can efficiently transfer offline learned features to online visual tracking applications.

For the active approaches, camera control and object tracking have been treated as separate components, which makes them difficult to tune. This paper tackles object tracking and camera control simultaneously in an end-to-end manner that is easier to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games like Go and Atari games. They have also been used in computer vision tasks like object localization, region proposal, and visual tracking. All of these advances pertain to passive tracking, whereas this paper focuses on active tracking using Deep RL, which has not been tried before.

=Approach=
Virtual tracking scenes are generated for both training and testing. The Asynchronous Advantage Actor-Critic (A3C) algorithm was used to train the tracker. The RGB screen frame of the first-person perspective was chosen as the state for the study. The tracker observes a visual state and takes one action from the following set of 6 actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the updated screen frame as well as the current reward.
==Tracking Scenarios==
The following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016; ViZ) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives a state and a reward from the environment at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), composed of an object (a monster) and background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by the ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to follow the object closely.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine and has a broad influence in the game industry. It provides realistic scenarios which can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study, which provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.
==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step t, <math>s_{t}</math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by A, with size K = |A|. An action <math>a_{t} \in A</math> is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R}</math> according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step t+1 is subject to a certain but unknown state transition function <math> s_{t+1} = f(s_{t}, a_{t}) </math>, governed by the environment.
A trace consisting of a sequence of triplets can therefore be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R}</math> denotes the expected accumulated future reward given state <math>s_{t}</math> (referred to as the critic). The policy function <math>\pi(\cdot)</math> and the value function <math>V(\cdot)</math> are jointly modeled by a neural network and are rewritten as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math>, with parameters <math>\theta</math> and <math>{\theta}'</math> respectively. The parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression:
[[File:equation12.PNG|500px|center]]
Here, <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H(\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.
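
The discounted sum <math>R_{t}</math> can be computed with a single backward pass over the rewards, as in the minimal sketch below. The function name <code>discounted_return</code> is illustrative and does not come from the paper's code.

```python
def discounted_return(rewards, gamma):
    """Compute R_t for rewards = [r_t, r_{t+1}, ..., r_{t+T-1}].

    Uses the recursion R = r + gamma * R_next, starting from the last reward,
    which equals the sum of gamma ** (t' - t) * r_{t'} over the T steps.
    """
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With <math>\gamma = 1</math> the result reduces to the plain sum of the rewards.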

==Network Architecture==
The tracker is a ConvNet-LSTM neural network as shown in Fig. 2, where the architecture specification is given in the following table. The FC6 and FC1 layers correspond to the 6-action policy <math>\pi (\cdot|s_{t})</math> and the value <math>V (s_{t})</math>, respectively. The screen is resized to 84 × 84 × 3 RGB images as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function utilizes a two-dimensional local coordinate system (S). The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is at the agent’s position, and system S is parallel to the floor. The object’s local coordinates (x, y) and orientation a are expressed with respect to system S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
Where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent with distance d and exhibits no rotation.
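
Because the reward equation is only available as an image here, the sketch below uses an assumed functional form that is consistent with the description: the reward equals A when the object is at (0, d) with zero rotation, and it decreases with positional error (scaled by c) and with rotation (scaled by λ). The function name <code>tracking_reward</code> and the default parameter values are placeholders for illustration.

```python
import math

def tracking_reward(x, y, a, A=1.0, c=200.0, d=90.0, lam=0.001):
    """Assumed reward: A minus a distance penalty and a rotation penalty.

    (x, y): object position in the agent's local coordinate system S.
    a: object orientation relative to S.
    A, c, d, lam: positive tuning parameters (placeholder defaults).
    """
    # Distance between the object and the ideal position (0, d) in front of the agent.
    distance_error = math.sqrt(x ** 2 + (y - d) ** 2)
    return A - (distance_error / c + lam * abs(a))

best = tracking_reward(0.0, 90.0, 0.0)
off_center = tracking_reward(30.0, 90.0, 0.0)
```

Under this form, any deviation from the ideal position or orientation strictly lowers the reward.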
Environment Augmentation: To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments. For ViZDoom, (x, y, a) defines the system state. For augmentation, the initial system state is perturbed N times by editing the map with the ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. In addition, the screen frame may be flipped left-right (with the left-right actions swapped accordingly). As a result, 2N environments are obtained from one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode.
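
The ViZDoom augmentation scheme can be sketched as follows. This is a toy illustration, not the authors' ACS-based implementation: environments are represented as plain dictionaries, the perturbation ranges and the names <code>make_environments</code> and <code>env_action</code> are hypothetical, and only the bookkeeping (N perturbed states, each with and without a left-right flip) follows the description.

```python
import random

# Left-right mirror of each action, used when the screen frame is flipped.
FLIP_ACTION = {
    "turn-left": "turn-right",
    "turn-right": "turn-left",
    "turn-left-and-move-forward": "turn-right-and-move-forward",
    "turn-right-and-move-forward": "turn-left-and-move-forward",
    "move-forward": "move-forward",
    "no-op": "no-op",
}

def make_environments(n, rng, max_offset=50.0):
    """Build 2n environments: n perturbed initial states, each unflipped and flipped."""
    envs = []
    for _ in range(n):
        # Hypothetical perturbation of the initial (x, y, a) system state.
        state = (rng.uniform(-max_offset, max_offset),
                 rng.uniform(-max_offset, max_offset),
                 rng.uniform(-180.0, 180.0))
        envs.append({"init_state": state, "flip": False})
        envs.append({"init_state": state, "flip": True})
    return envs

def env_action(agent_action, env):
    """Translate the agent's action into the action applied to the unflipped map."""
    return FLIP_ACTION[agent_action] if env["flip"] else agent_action

rng = random.Random(0)
environments = make_environments(21, rng)
# One environment is sampled at the beginning of every episode.
episode_env = rng.choice(environments)
```

With N = 21 this yields the 42 training environments mentioned in the experimental setup.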
For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, randomly chosen background objects are hidden. Every episode starts from the position where the agent failed in the previous episode. This makes the environment and starting point differ from episode to episode, which augments the variation seen during training.
=Experimental Results=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, the training map shown in Fig. 4 (left column) is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in Fig. 4 (middle and right columns). In all maps, the path of the target is pre-specified, as indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path. Instead, it sometimes randomly moves in a “zig-zag” way during the course, which is a built-in game engine behavior. This poses an additional difficulty for the tracking problem.
For UE, an environment named Square, with randomly chosen background objects made invisible, and a target named Stefani walking along a fixed path are generated for training. For testing, four other environments, named Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2), are made. As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set as -450 and the maximum length as 3000, respectively.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature. An AR that is too small leads to termination of the episode because it essentially means a failure of tracking. EL roughly measures the duration of good tracking and is analogous to the metric Successfully Tracked Frames in conventional tracking applications. The theoretical maximum for both AR and EL is 3000 when letting A = 1.0 in the reward function (because of the termination criterion).
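The two metrics and the termination criterion can be illustrated with a short sketch. The function name <code>run_episode</code> is hypothetical, and the per-step rewards are supplied as a plain list rather than produced by a simulator; the threshold and maximum length are the values stated above.

```python
def run_episode(rewards, reward_threshold=-450.0, max_length=3000):
    """Return (AR, EL) for a stream of per-step rewards.

    The episode stops when the accumulated reward drops below the
    threshold or when the episode length reaches the maximum.
    """
    accumulated = 0.0
    length = 0
    for r in rewards:
        accumulated += r
        length += 1
        if accumulated < reward_threshold or length >= max_length:
            break
    return accumulated, length

# Perfect tracking with A = 1.0: every step earns the maximum reward.
perfect = run_episode([1.0] * 5000)
# Complete failure: a constant reward of -1 per step.
failed = run_episode([-1.0] * 5000)
```

In the perfect case both AR and EL are capped at 3000 by the maximum episode length, which is why 3000 is the theoretical maximum for both metrics.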

=Results=
Two training protocols were followed, namely RandomizedEnv (with environment augmentation) and SingleEnv (without the augmentation technique). Most of the results reported in the paper are for RandomizedEnv.
Only one table reports results from SingleEnv training; it shows that SingleEnv performs worse than RandomizedEnv training. The variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results on the testing environments are reported in Tab. 2.
[[File:msm_table2.PNG|400px|center]]
Following are the findings from the testing results:
1. The tracker generalizes well in the case of target appearance changing (Zombie, Cacodemon).
2. The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor).
3. The tracker does not lose a target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly.
4. The tracker is insensitive to a distracting object (Noise1) even when the “bait” is very close to the path (Noise2).

The proposed tracker is compared against several conventional trackers, each paired with a PID-like camera control module to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module is implemented such that, in the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The module compares the two successive bounding boxes as per the algorithm and decides which action to take.
The results show that the proposed solution outperforms the simulated active trackers, which soon lose their targets. The Meanshift tracker works well when there is no camera shift between consecutive frames. Both the KCF and Correlation trackers seem incapable of handling such a large camera shift, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, although it easily drifts when the object turns suddenly.
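To make the idea of a simulated active tracker concrete, the sketch below maps a predicted bounding box to one of the six actions. It is a simplified rule-based stand-in and not the camera control algorithm used in the paper: the function name <code>camera_control</code>, the reference area, the dead zone and the area tolerance are all hypothetical choices made for illustration.

```python
def camera_control(bbox, frame_width, ref_area, dead_zone=0.1, area_tol=0.2):
    """Pick an action from a bounding box (x, y, w, h) in pixel units.

    ref_area is the box area at the desired distance (for example the
    area of the manually given first-frame box). All thresholds are
    illustrative assumptions.
    """
    x, y, w, h = bbox
    # Horizontal offset of the box centre, normalised to [-1, 1].
    offset = (x + w / 2.0 - frame_width / 2.0) / (frame_width / 2.0)
    # A box much smaller than the reference suggests the target is far away.
    too_far = w * h < ref_area * (1.0 - area_tol)
    if offset < -dead_zone:
        return "turn-left-and-move-forward" if too_far else "turn-left"
    if offset > dead_zone:
        return "turn-right-and-move-forward" if too_far else "turn-right"
    return "move-forward" if too_far else "no-op"

centered = camera_control((32, 32, 20, 20), frame_width=84, ref_area=400)
left_far = camera_control((0, 32, 10, 10), frame_width=84, ref_area=400)
right_near = camera_control((60, 32, 20, 20), frame_width=84, ref_area=400)
```

A controller of this kind depends entirely on the quality of the passive tracker's box, which is one plausible reason such pipelines are hard to tune.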
Testing in the UE environment is tabulated in Table 5.
[[File:table5.PNG|400px|center]]
1. Comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained only with the target Stefani, revealing that it does not overfit to a specialized appearance.
2. The active tracker performs well when changing the path (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path.
3. When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track objects robustly (the EL value is comparable to that in the previous environments), which supports its generalization potential.
4. In most cases, the proposed tracker outperforms the simulated active trackers, or achieves comparable results when it is not the best. The results of the simulated active trackers also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the results of TLD).

Real world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained on UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part. The vertical axis indicates the size of the target, i.e., the area of the ground truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows, 1) when the target resides on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center; 2) when the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

A video of the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

=Conclusion=
In the paper, an end-to-end active tracker via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, saving the effort of human labeling and of trial and error in the real world. It shows good generalization to unseen environments. The tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, which is object tracking. The virtual environment ViZDoom, though used for training and testing, seems to have little or no generalization ability in real-world scenarios, and its maps are very simple. The comparisons presented in the paper for ViZDoom testing with changes in the environmental parameters look positive, but the relatively simple nature of the environment needs to be considered when looking at these results. Also, when the floor is replaced by the ceiling, the tracker performs worst in comparison to the other cases in the table, which seems to indicate that the model is somewhat over-fitted to the floor and ceiling. The tracker trained in the UE environment is tested against simulated trackers, and the results show that the proposed solution performs better. However, since the trackers are simulated using a camera control algorithm written for this specific comparison, further testing is required for benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which are inadequate for deployment in the real world because the tracker cannot adapt to different moving speeds of the target. It is also believed that training the tracker in the UE simulator alone is sufficient for a successful real-world deployment, but it is better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination condition of the scene, the trajectory of the target, and the speed of the target.
The results on the real-world videos are encouraging regarding the generalization ability of the models in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.

=Future Work=
The authors extended this work in a follow-up paper [1]. They deployed the tracker on a real robot and enhanced the system to deal with the virtual-to-real gap. Specifically, 1) more advanced environment augmentation techniques are proposed to boost environment diversity, which improves transferability to the real world; 2) a more appropriate action space than that of the conference paper is developed, and the use of a continuous action space for active tracking is investigated; 3) a mapping from the neural network prediction to the robot control signal is established so as to successfully deliver end-to-end tracking.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

[2] Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

[3] Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

[4] Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.

[5] Hare, Sam, Golodetz, Stuart, Saffari, Amir, Vineet, Vibhav, Cheng, Ming-Ming, Hicks, Stephen L, and Torr, Philip HS. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.

[6] Kalal, Zdenek, Mikolajczyk, Krystian, and Matas, Jiri. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.

[7] Wang, Naiyan and Yeung, Dit-Yan. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pp. 809–817, 2013.

[8] Wang, Lijun, Ouyang, Wanli, Wang, Xiaogang, and Lu, Huchuan. Stct: Sequentially training convolutional networks for visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 1373–1381, 2016.

End to end Active Object Tracking via Reinforcement Learning

2018-10-29T21:56:57Z

H454chen:

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, so there is no need to handle camera control during tracking. Although passive tracking is very useful and much of the existing work has been done on it, it is inapplicable in settings such as tracking performed by a camera-mounted mobile robot or by a drone.
On the other hand, active tracking involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune a pipeline with two separate subtasks. Tracking may require considerable human effort for bounding box labeling, and camera control is non-trivial and can incur many expensive trial-and-error runs in the real world.

To address these challenges, the paper presents an end-to-end active tracking solution via deep reinforcement learning. More specifically, a ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action, and then the environment returns the updated state (the next visual frame). A3C, a modern reinforcement learning algorithm, is adopted to train the agent, where a customized reward function is designed to encourage the agent to closely follow the object.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to check the generalization ability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

As in the case of the state-of-the-art models, if the action module and the object tracking module are completely separate, it is extremely difficult to train either one, because it is impossible to know which is causing the error observed at the end of the episode. At a high level, both modules serve the same function, as both aim for efficient navigation. It therefore makes sense to have a joint module that contains both the observation and the action-taking submodules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with the common practice in deep reinforcement learning, where the CNNs used to extract features in Atari games are combined with the Q networks (in the case of DQN). These CNNs are trained concurrently with the feed-forward Q networks, where the error function is the difference between the observed Q value and the target Q value.

=Related Work=

In the domain of object tracking, there are both active and passive approaches. The following summarizes the advanced passive object tracking approaches:

1) Subspace learning was adopted to update the appearance model of an object.

:Formerly, object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. 2008 introduced a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking [2].

2) Multiple instance learning was employed to track an object.

:Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. So, Babenko et al. 2009 introduce a novel algorithm that updates its appearance model using a “bag” of positive and negative examples. Subsequently, they show that tracking algorithms using weaker classifiers can still obtain superior performance [3].

3) Correlation filter based object tracking has achieved success in real-time object tracking.

:Correlation filter based object tracking algorithms attempt to “model the appearance of an object using filters”. At each frame, a small tracking window representing the target object is produced, and the tracker will correlate the windows over the image sequences, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called Minimum Output Sum of Squared Error (MOSSE) filter [4].

4) Structured output prediction was used to constrain object tracking, avoiding the conversion of positions to labels of training samples.

:Hare et al. 2016 argue that the “sliding-window” approach used by popular object tracking algorithms is flawed because “the objective of the classifier (predicting labels for sliding-windows) is decoupled from the objective of the tracker (estimating object position).” Instead, they introduce a novel algorithm that uses “a kernelized structured output support vector machine (SVM) to avoid the need for intermediate classification”. Subsequently, they show that the approach outperforms traditional trackers in various benchmarks [5].

5) Tracking, learning, and Detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

:Long-term tracking is the task of recognizing and tracking an object as it “moves in and out of a camera’s field of view”. This task is made difficult by problems such as an object reappearing in the scene and changing its appearance, scale, or illumination. Kalal et al. 2012 proposed a unified tracking framework (TLD) that accomplishes long-term tracking by “decomposing the task into tracking, learning, and detection”. Specifically, “the tracker follows an object from frame-to-frame; the detector localizes the object’s appearances; and, the learner improves the detector by learning from errors.” Altogether, the TLD framework outperforms previous state-of-the-art tracking approaches [6].

6) Deep learning models like stacked autoencoder have been used to learn good representations for object tracking.

For the active approaches, camera control and object tracking were considered as separate components. These approaches are difficult to tune. This paper tackles object tracking and camera control simultaneously in an end to end manner and is easy to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games like Go and Atari games. They have also been used in computer vision tasks like object localization, region proposal, and visual tracking. All of these advancements pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which reportedly had not been tried before.

=Approach=
Virtual tracking scenes are generated for both training and testing. The Asynchronous Advantage Actor-Critic (A3C) algorithm is used to train the tracker. The RGB screen frame from the first-person perspective is chosen as the state. The tracker observes a visual state and takes one action from the following set of 6 actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the updated screen frame as well as the current reward.
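As an illustration of how one of the six actions might be drawn at each step, the sketch below samples from a categorical distribution over the action set. It assumes, hypothetically, that the policy network outputs one unnormalised score (logit) per action and that a softmax turns these scores into probabilities; the names <code>ACTIONS</code>, <code>softmax</code> and <code>sample_action</code> are not from the paper.

```python
import math
import random

ACTIONS = [
    "turn-left", "turn-right", "turn-left-and-move-forward",
    "turn-right-and-move-forward", "move-forward", "no-op",
]

def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Draw one action according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(ACTIONS, weights=probs, k=1)[0]

rng = random.Random(0)
probs = softmax([0.0] * len(ACTIONS))   # uniform policy: 1/6 per action
action = sample_action([0.0, 0.0, 0.0, 0.0, 5.0, 0.0], rng)
```

Sampling (rather than always taking the most probable action) keeps the agent exploring during training.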
==Tracking Scenarios==
The following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016; ViZ) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives from the environment a state and a reward at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), composed of an object (a monster) and background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by the ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to closely follow the object.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine and has a broad influence in the game industry. It provides realistic scenarios which can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study, which provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.
==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step t, <math>s_{t} </math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by A, of size K = |A|. An action, <math>a_{t} </math> ∈ A, is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R} </math>, according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step t+1 is subject to a certain but unknown state transition function <math> s_{t+1} = f(s_{t}, a_{t}) </math>, governed by the environment.
A trace consisting of a sequence of triplets can then be observed. \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R} </math> denotes the expected accumulated reward in the future given state <math>s_{t}</math> (referred to as the critic). The policy function <math>\pi(\cdot)</math> and the value function <math>V(\cdot)</math> are then jointly modeled by a neural network, and are rewritten as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math> with parameters <math>\theta</math> and <math>{\theta}'</math> respectively. The parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression.
[[File:equation12.PNG|500px|center]]
Where <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H(\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.
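The quantities named above can be computed with a few lines of plain Python. The sketch follows the definition of <math>R_{t}</math> given in the text; the use of the difference <math>R_{t} - V(s_{t})</math> as the weighting signal is an assumption based on the standard A3C formulation, since the update equations themselves appear only in the figure. The function names are hypothetical.

```python
import math

def discounted_return(rewards, t, T, gamma):
    """R_t = sum over t' = t .. t+T-1 of gamma^(t'-t) * r_t'.

    If the trace ends before t+T-1, only the available rewards are summed.
    """
    window = rewards[t:t + T]
    return sum((gamma ** k) * r for k, r in enumerate(window))

def advantage(rewards, values, t, T, gamma):
    """Difference between the T-step return and the critic's estimate V(s_t)."""
    return discounted_return(rewards, t, T, gamma) - values[t]

def entropy(probs):
    """Entropy H of a discrete policy; larger values mean more exploration."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

R0 = discounted_return([1.0, 1.0, 1.0], t=0, T=3, gamma=0.5)   # 1 + 0.5 + 0.25
adv0 = advantage([1.0, 1.0, 1.0], [1.0, 0.0, 0.0], t=0, T=3, gamma=0.5)
h_uniform = entropy([0.5, 0.5])
```

A positive difference means the sampled actions did better than the critic expected, so their probabilities are pushed up; the entropy term discourages the policy from collapsing to a single action too early.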

==Network Architecture==
The tracker is a ConvNet-LSTM neural network as shown in Fig. 2, with the architecture specification given in the following table. FC6 and FC1 correspond to the 6-action policy <math>\pi (\cdot|s_{t})</math> and the value <math>V (s_{t})</math>, respectively. The screen is resized to an 84 × 84 × 3 RGB image as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function utilizes a two-dimensional local coordinate system (S). The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is where the agent is. System S is parallel to the floor. The object’s local coordinate (x, y) and orientation a are expressed with regard to system S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
Where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent with distance d and exhibits no rotation.
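A minimal sketch of such a reward is given below. It assumes the form <math>r = A - \left(\frac{\sqrt{x^{2} + (y-d)^{2}}}{c} + \lambda |a|\right)</math>, which matches the description above (the maximum A is reached at position (0, d) with no rotation) but should be checked against the equation in the figure. The default values of c, d and λ are placeholders chosen for illustration, and the orientation a is assumed to be in radians.

```python
import math

def tracking_reward(x, y, a, A=1.0, c=200.0, d=100.0, lam=0.5):
    """Reward for an object at local position (x, y) with orientation a.

    A is the maximum reward, d the desired distance in front of the agent,
    c normalises the position error and lam weights the rotation penalty.
    Defaults other than A = 1.0 are illustrative placeholders.
    """
    position_error = math.sqrt(x ** 2 + (y - d) ** 2) / c
    rotation_error = lam * abs(a)
    return A - (position_error + rotation_error)

best = tracking_reward(0.0, 100.0, 0.0)       # object exactly at (0, d), no rotation
off_side = tracking_reward(200.0, 100.0, 0.0) # object displaced sideways by c
rotated = tracking_reward(0.0, 100.0, 0.5)    # correct position, rotated object
```

The reward decreases linearly with the distance from the ideal position and with the absolute rotation, so it can become negative when the target drifts far away, which is what eventually triggers episode termination.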
Environment Augmentation: To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments. For ViZDoom, (x, y, a) defines the system state. For augmentation, the initial system state is perturbed N times by editing the map with ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. In addition, the screen frame may be flipped left-right (with the left and right actions swapped accordingly). As a result, 2N environments are obtained out of one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode.
For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, randomly chosen background objects are hidden. Every episode starts from the position where the agent failed in the previous episode. This makes the environment and starting point differ from episode to episode, which augments the variation of the environment during training.
=Experimental Results=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, the training map shown in the left column of Fig. 4 is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in the middle and right columns of Fig. 4. In all maps, the path of the target is pre-specified, indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path. Instead, it sometimes randomly moves in a “zig-zag” way during the course, which is a built-in game engine behavior. This poses an additional difficulty to the tracking problem.
For UE, a training environment named Square is generated, with randomly chosen background objects made invisible and a target named Stefani walking along a fixed path. For testing, four other environments are made: Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2). As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set to -450 and the maximum length to 3000.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature. An AR that is too small leads to termination of the episode because it essentially means a failure of tracking. EL roughly measures the duration of good tracking and is analogous to the metric Successfully Tracked Frames in conventional tracking applications. The theoretical maximum for both AR and EL is 3000 when letting A = 1.0 in the reward function (because of the termination criterion).

=Results=
Two training protocols were followed, namely RandomizedEnv (with environment augmentation) and SingleEnv (without the augmentation technique). Most of the results reported in the paper are for RandomizedEnv.
Only one table reports results from SingleEnv training; it shows that SingleEnv performs worse than RandomizedEnv training. The variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results on the testing environments are reported in Tab. 2.
[[File:msm_table2.PNG|400px|center]]
Following are the findings from the testing results:
1. The tracker generalizes well in the case of target appearance changing (Zombie, Cacodemon).
2. The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor).
3. The tracker does not lose a target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly.
4. The tracker is insensitive to a distracting object (Noise1) even when the “bait” is very close to the path (Noise2).

The proposed tracker is compared against several conventional trackers, each paired with a PID-like camera control module to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module is implemented such that, in the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The module compares the two successive bounding boxes as per the algorithm and decides which action to take.
The results show that the proposed solution outperforms the simulated active trackers, which soon lose their targets. The Meanshift tracker works well when there is no camera shift between consecutive frames. Both the KCF and Correlation trackers seem incapable of handling such a large camera shift, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, although it easily drifts when the object turns suddenly.
Testing in the UE environment is tabulated in Table 5.
[[File:table5.PNG|400px|center]]
1. Comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained only with the target Stefani, revealing that it does not overfit to a specialized appearance.
2. The active tracker performs well when changing the path (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path.
3. When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track objects robustly (the EL value is comparable to that in the previous environments), which supports its generalization potential.
4. In most cases, the proposed tracker outperforms the simulated active trackers, or achieves comparable results when it is not the best. The results of the simulated active trackers also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the results of TLD).

Real world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained on UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part. The vertical axis indicates the size of the target, i.e., the area of the ground truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows, 1) when the target resides on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center; 2) when the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

A video of the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

=Conclusion=
In the paper, an end-to-end active tracker via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, saving the effort of human labeling and of trial and error in the real world. It shows good generalization to unseen environments. The tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, which is object tracking. The ViZDoom virtual environment, though used for training and testing, seems to have little or no generalization ability to real-world scenarios, and its maps are very simple. The ViZDoom test results under changes in the environmental parameters look positive, but the relatively simple nature of the environment needs to be considered when looking at these results. Also, when the floor is replaced by the ceiling, the tracker performs worst among the cases in the table, which seems to indicate that the model is somewhat over-fitted to the floor and ceiling. The tracker trained in the UE environment is tested against simulated active trackers, and the results show that the proposed solution performs better. However, since those trackers are simulated using a camera control algorithm written for this specific comparison, further testing is required for benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which are inadequate for deployment in the real world because the tracker cannot adapt to different moving speeds of the target. It is also believed that training the tracker in the UE simulator alone is sufficient for a successful real-world deployment. It is better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination condition of the scene, the trajectory of the target as well as the speed of the target.
The results on the real-world videos show a positive result towards the generalization ability of the models in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.

=Future Work=
The authors have since extended this paper in several ways. They successfully deployed the tracker on a real robot and enhanced the system to deal with the virtual-to-real gap [1]. Specifically, 1) more advanced environment augmentation techniques are proposed to boost the environment diversity, which improves the ability to transfer to the real world. 2) A more appropriate action space than the one in the conference paper is developed, and a continuous action space for active tracking is investigated. 3) A mapping from the neural network prediction to the robot control signal is established so as to deliver end-to-end tracking on the robot.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

[2] Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

[3] Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

[4] Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.

[5] Hare, Sam, Golodetz, Stuart, Saffari, Amir, Vineet, Vibhav, Cheng, Ming-Ming, Hicks, Stephen L, and Torr, Philip HS. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.

[6] Kalal, Zdenek, Mikolajczyk, Krystian, and Matas, Jiri. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.

[7] Wang, Naiyan and Yeung, Dit-Yan. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pp. 809–817, 2013.

[8] Wang, Lijun, Ouyang, Wanli, Wang, Xiaogang, and Lu, Huchuan. Stct: Sequentially training convolutional networks for visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 1373–1381, 2016.

End to end Active Object Tracking via Reinforcement Learning

2018-10-29T21:54:31Z

H454chen:

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, so there is no need to handle camera control during tracking. Although passive tracking is very useful and much of the existing work addresses it, it is inapplicable in settings such as tracking performed by a drone or by a mobile robot with a mounted camera.
Active tracking, on the other hand, involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune a pipeline with two separate subtasks. Tracking may require considerable human effort for bounding box labeling, and camera control is non-trivial and can incur many expensive trial-and-error attempts in the real world.

To address these challenges, the paper presents an end-to-end active tracking solution via deep reinforcement learning. More specifically, a ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action, and then the environment returns the updated state (the next visual frame). A3C, a modern reinforcement learning algorithm, is adopted to train the agent, where a customized reward function is designed to encourage the agent to follow the object closely.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to check the generalization ability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

As with state-of-the-art models, if the action module and the object tracking module are completely separate, it is extremely difficult to train either one, because it is impossible to know which of them causes the error observed at the end of the episode. At a high level, both modules serve the same function, as both aim for efficient navigation. So it makes sense to have a joint module that contains both the observation and the action-taking submodules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with common practice in deep reinforcement learning, where the CNNs used to extract features in Atari games are combined with the Q-networks (in the case of DQN). These CNNs are trained concurrently with the feed-forward Q-networks, where the error function is the difference between the observed Q value and the target Q value.
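
For reference, the DQN error described above is commonly written as a squared temporal-difference loss. The notation here is an assumption and does not come from the summarized paper: <math>\theta</math> denotes the online network parameters, <math>\theta^{-}</math> the target network parameters, and <math>\gamma</math> the discount factor.

\[L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right]\]

Because the gradient of this loss flows through both the Q head and the convolutional feature extractor, the two are trained jointly, which is the property the intuition above relies on.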

=Related Work=

In the domain of object tracking, there are both active and passive approaches. The following summarizes the advanced passive object tracking approaches:

1) Subspace learning was adopted to update the appearance model of an object.

:Formerly, object tracking algorithms employ a fixed appearance model. Consequently, they often perform poorly when the target object changes in appearance or illumination. To overcome this problem, Ross et al. 2008 introduce a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking [2].

2) Multiple instance learning was employed to track an object.

:Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. So, Babenko et al. 2009 introduce a novel algorithm that updates its appearance model using a “bag” of positive and negative examples. Subsequently, they show that tracking algorithms using weaker classifiers can still obtain superior performance [3].

3) Correlation filter based object tracking has achieved success in real-time object tracking.

:Correlation filter based object tracking algorithms attempt to “model the appearance of an object using filters”. At each frame, a small tracking window representing the target object is produced, and the tracker will correlate the windows over the image sequences, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called Minimum Output Sum of Squared Error (MOSSE) filter [4].

4) Structured output prediction was used to constrain object tracking, avoiding the conversion of positions to labels of training samples.

:Hare et al. 2016 argue that the “sliding-window” approach used by popular object tracking algorithms is flawed because “the objective of the classifier (predicting labels for sliding-windows) is decoupled from the objective of the tracker (estimating object position).” Instead, they introduce a novel algorithm that uses “a kernelized structured output support vector machine (SVM) to avoid the need for intermediate classification”. Subsequently, they show the approach outperforms traditional trackers in various benchmarks [5].

5) Tracking, learning, and Detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

6) Deep learning models like stacked autoencoder have been used to learn good representations for object tracking.

For the active approaches, camera control and object tracking were considered as separate components. These approaches are difficult to tune. This paper tackles object tracking and camera control simultaneously in an end to end manner and is easy to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games like Go and Atari games. They have also been used in computer vision tasks like object localization, region proposal, and visual tracking. All of these advances pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which has never been tried before.

=Approach=
Virtual tracking scenes are generated for both training and testing. The Asynchronous Actor-Critic Agents (A3C) model is used to train the tracker. The RGB screen frame from the first-person perspective is chosen as the state. The tracker observes a visual state and takes one action from the following set of 6 actions.

\[A = \{\text{turn-left},\ \text{turn-right},\ \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward},\ \text{move-forward},\ \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the updated screen frame as well as the current reward.
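
The observe-act-receive loop described above can be sketched as follows. This is a minimal illustration only: <code>StubEnv</code>, <code>random_policy</code>, and <code>run_episode</code> are hypothetical names, the stub environment returns placeholder frames and zero rewards, and nothing here uses the ViZDoom or UE APIs.

```python
import random
from enum import Enum


class Action(Enum):
    """The six discrete actions listed in the action set A."""
    TURN_LEFT = 0
    TURN_RIGHT = 1
    TURN_LEFT_AND_MOVE_FORWARD = 2
    TURN_RIGHT_AND_MOVE_FORWARD = 3
    MOVE_FORWARD = 4
    NO_OP = 5


class StubEnv:
    """Hypothetical stand-in for a tracking environment (not ViZDoom or UE)."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self._frame()

    def step(self, action):
        # A real environment would render the next first-person frame and
        # compute the tracking reward; both are placeholders here.
        self.t += 1
        reward = 0.0
        done = self.t >= self.horizon
        return self._frame(), reward, done

    def _frame(self):
        # Placeholder "screen frame": a 2x2 image of black RGB pixels.
        return [[(0, 0, 0)] * 2 for _ in range(2)]


def random_policy(frame, rng):
    """Stand-in for the learned policy: picks one of the six actions uniformly."""
    return rng.choice(list(Action))


def run_episode(env, policy, seed=0):
    """Collect the trace of (state, action, reward) triplets for one episode."""
    rng = random.Random(seed)
    state = env.reset()
    trace = []
    done = False
    while not done:
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        trace.append((state, action, reward))
        state = next_state
    return trace
```

Calling <code>run_episode(StubEnv(horizon=5), random_policy)</code> yields a list of five triplets. In the paper's setting the random policy would be replaced by the trained network.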
==Tracking Scenarios==
The following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016; ViZ) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives from the environment a state and a reward at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), composed of an object (a monster) and background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by the ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to follow the object closely.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine and has a broad influence in the game industry. It provides realistic scenarios which can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study, which provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.
==A3C Algorithm==
This paper employs the Asynchronous Actor-Critic Agents (A3C) algorithm for training the tracker.
At time step t, <math>s_{t} </math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by A, of size K = |A|. An action <math>a_{t} </math> ∈ A is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R} </math>, according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step t+1 is subject to a certain but unknown state transition function <math> s_{t+1} = f(s_{t}, a_{t}) </math>, governed by the environment.
A trace consisting of a sequence of triplets can then be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R} </math> denotes the expected accumulated reward in the future given state <math>s_{t}</math> (referred to as the critic). The policy function <math> \pi(\cdot)</math> and the value function <math>V(\cdot)</math> are then jointly modeled by a neural network, and are rewritten as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math> with parameters <math>\theta</math> and <math>{\theta}'</math>, respectively. The parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression.
[[File:equation12.PNG|500px|center]]
Here, <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H(\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.
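
The discounted sum <math>R_{t}</math> defined above can be computed with a short backward recursion. The sketch below is illustrative: the function names are hypothetical, and the <code>advantage</code> helper assumes the usual A3C practice of comparing <math>R_{t}</math> with the critic's estimate <math>V(s_{t})</math>, which the update equations in the figure are not shown explicitly to use.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{t+T-1} gamma^(t'-t) * r_{t'}.

    `rewards` holds [r_t, r_{t+1}, ..., r_{t+T-1}].
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must satisfy 0 < gamma <= 1")
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def advantage(rewards, gamma, value_estimate):
    """Assumed A3C-style advantage: R_t minus the critic's estimate V(s_t)."""
    return discounted_return(rewards, gamma) - value_estimate
```

For example, with rewards 1, 1, 1 and <math>\gamma = 0.5</math> the return is 1 + 0.5 + 0.25 = 1.75, and with <math>\gamma = 1</math> it is the plain sum 3.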

==Network Architecture==
The tracker is a ConvNet-LSTM neural network as shown in Fig. 2, where the architecture specification is given in the following table. FC6 and FC1 correspond to the 6-action policy <math>\pi (\cdot|s_{t})</math> and the value <math>V (s_{t})</math>, respectively. The screen is resized to an 84 × 84 × 3 RGB image as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function utilizes a two-dimensional local coordinate system (S). The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is where the agent is. System S is parallel to the floor. The object’s local coordinates (x, y) and orientation a are expressed with respect to system S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
Where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent with distance d and exhibits no rotation.
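
Since the reward formula appears only as an image, the sketch below uses an assumed functional form, <code>A - (sqrt(x^2 + (y - d)^2) / c + lam * |a|)</code>, chosen because it matches the description above: the reward equals A when the object is directly in front of the agent at distance d with no rotation, and it decreases otherwise. The default parameter values are hypothetical placeholders, not the settings used in the paper.

```python
import math


def tracking_reward(x, y, a, A=1.0, c=200.0, d=100.0, lam=0.5):
    """Assumed reward form; see the lead-in for the caveats.

    (x, y) is the object's position in the agent's local system S,
    a is the object's orientation relative to S, and A, c, d, lam are
    the positive tuning parameters (the defaults are placeholders).
    """
    # Distance between the object and the ideal spot (0, d) in front of the agent.
    distance_error = math.sqrt(x ** 2 + (y - d) ** 2)
    return A - (distance_error / c + lam * abs(a))
```

With these placeholder values, <code>tracking_reward(0, 100, 0)</code> returns the maximum 1.0, while any lateral offset, distance error, or rotation lowers the reward.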
Environment Augmentation: To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments. For ViZDoom, (x, y, a) defines the system state. For augmentation, the initial system state is perturbed N times by editing the map with the ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. In addition, flipping the screen frame left-right (and, accordingly, the left-right actions) is allowed. As a result, 2N environments are obtained from one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode.
For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, random background objects are chosen. Every episode starts from the position where the agent failed in the previous episode. This makes the environment and starting point different from episode to episode, so the variations of the environment during training are augmented.
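
The ViZDoom augmentation scheme (N perturbed initial states, each with and without a left-right flip, giving 2N environments) can be sketched as below. This is only an illustration under stated assumptions: the paper perturbs the state by editing the map with an ACS script, whereas this sketch draws uniform random offsets; <code>build_environments</code>, <code>sample_environment</code>, and the jitter ranges are hypothetical.

```python
import random

# Flipping the frame left-right swaps the left and right actions.
FLIPPED_ACTION = {
    "turn-left": "turn-right",
    "turn-right": "turn-left",
    "turn-left-and-move-forward": "turn-right-and-move-forward",
    "turn-right-and-move-forward": "turn-left-and-move-forward",
    "move-forward": "move-forward",
    "no-op": "no-op",
}


def flip_frame(frame):
    """Mirror a frame left-right; a frame is a list of pixel rows."""
    return [list(reversed(row)) for row in frame]


def build_environments(base_state, n, rng, pos_jitter=50.0, angle_jitter=0.5):
    """Return 2N environment descriptions derived from one state (x, y, a)."""
    x, y, a = base_state
    envs = []
    for _ in range(n):
        # Hypothetical perturbation of the initial position and orientation.
        state = (
            x + rng.uniform(-pos_jitter, pos_jitter),
            y + rng.uniform(-pos_jitter, pos_jitter),
            a + rng.uniform(-angle_jitter, angle_jitter),
        )
        envs.append({"initial_state": state, "flipped": False})
        envs.append({"initial_state": state, "flipped": True})
    return envs


def sample_environment(envs, rng):
    """Pick one environment uniformly at the start of an episode."""
    return rng.choice(envs)
```

When a sampled environment has <code>flipped</code> set, each observed frame would pass through <code>flip_frame</code> and each chosen action through <code>FLIPPED_ACTION</code> before being sent to the underlying environment. With N = 21 this yields 42 environments.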
=Experimental Results=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, a training map as in Fig. 4, left column, is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in Fig. 4, middle and right columns. In all maps, the path of the target is pre-specified, indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path. Instead, it sometimes randomly moves in a “zig-zag” way during the course, which is a built-in game engine behavior. This poses an additional difficulty to the tracking problem.
For UE, an environment named Square, with random invisible background objects, and a target named Stefani walking along a fixed path are generated for training. For testing, another four environments named Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2) are made. As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set as -450 and the maximum length as 3000, respectively.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is like Precision in the conventional tracking literature. An AR that is too small leads to termination of the episode because it essentially means a failure of tracking. EL roughly measures the duration of good tracking and is analogous to the metric Successfully Tracked Frames in conventional tracking applications. The theoretical maximum for both AR and EL is 3000 when letting A = 1.0 in the reward function (because of the termination criterion).
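
The termination rule and the two metrics can be sketched together. The threshold of -450 and the maximum length of 3000 come from the setup above; the function name and the exact order in which the two conditions are checked are assumptions made for illustration.

```python
REWARD_THRESHOLD = -450.0   # episode ends once the accumulated reward drops below this
MAX_EPISODE_LENGTH = 3000   # episode ends after this many steps


def evaluate_episode(rewards,
                     reward_threshold=REWARD_THRESHOLD,
                     max_length=MAX_EPISODE_LENGTH):
    """Return (AR, EL) for an iterable of per-step rewards."""
    accumulated_reward = 0.0
    episode_length = 0
    for r in rewards:
        accumulated_reward += r
        episode_length += 1
        # Stop on tracking failure (AR too low) or when the step budget is used up.
        if accumulated_reward < reward_threshold or episode_length >= max_length:
            break
    return accumulated_reward, episode_length
```

A tracker that earns the maximum reward A = 1.0 at every step reaches AR = 3000 and EL = 3000, which is the theoretical maximum stated above; a tracker that earns -1.0 at every step is cut off after 451 steps.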

=Results=
Two training protocols were followed, namely RandomizedEnv (with augmentation) and SingleEnv (without the augmentation technique). However, the paper mostly reports results for RandomizedEnv.
Only one table gives results from SingleEnv training, and it shows that SingleEnv performs worse than RandomizedEnv training. The variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results on the testing environments are reported in Tab. 2.
[[File:msm_table2.PNG|400px|center]]
Following are the findings from the testing results:
1. The tracker generalizes well in the case of target appearance changing (Zombie, Cacodemon).
2. The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor).
3. The tracker does not lose a target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly.
4. The tracker is insensitive to a distracting object (Noise1) even when the “bait” is very close to the path (Noise2).

The proposed tracker is compared against several conventional trackers, each combined with a PID-like module for camera control to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module is implemented such that, in the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The two consecutive bounding boxes are then compared as per the algorithm, and an action decision is made.
The results show that the proposed solution outperforms the simulated active trackers, which lose their targets quickly. The Meanshift tracker works well when there is no camera shift between consecutive frames. Both the KCF and Correlation trackers seem incapable of handling such a large camera shift, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, but it easily drifts when the object turns suddenly.
Testing in the UE environment is tabulated in Table 5.
[[File:table5.PNG|400px|center]]
1. Comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained only with the target Stefani, revealing that it does not overfit to a specialized appearance.
2. The active tracker performs well when the path changes (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path.
3. When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in previous environments (the AR value drops), but it can still track objects robustly (with an EL value comparable to previous environments), indicating its strong generalization potential.
4. In most cases, the proposed tracker outperforms the simulated active tracker or achieves comparable results if it is not the best. The results of the simulated active tracker also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the results of TLD).

Real world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained on UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part. The vertical axis indicates the size of the target, i.e., the area of the ground truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows, 1) when the target resides on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center; 2) when the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

Video Link to the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

=Conclusion=
In the paper, an end-to-end active tracker via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, which saves the effort of human labeling and of trial and error in the real world. It shows good generalization to unseen environments, and the tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, which is object tracking. The ViZDoom virtual environment, though used for training and testing, seems to have little or no generalization ability to real-world scenarios, and its maps are very simple. The ViZDoom test results under changes in the environmental parameters look positive, but the relatively simple nature of the environment needs to be considered when looking at these results. Also, when the floor is replaced by the ceiling, the tracker performs worst among the cases in the table, which seems to indicate that the model is somewhat over-fitted to the floor and ceiling. The tracker trained in the UE environment is tested against simulated active trackers, and the results show that the proposed solution performs better. However, since those trackers are simulated using a camera control algorithm written for this specific comparison, further testing is required for benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which are inadequate for deployment in the real world because the tracker cannot adapt to different moving speeds of the target. It is also believed that training the tracker in the UE simulator alone is sufficient for a successful real-world deployment. It is better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination condition of the scene, the trajectory of the target as well as the speed of the target.
The results on the real-world videos show a positive result towards the generalization ability of the models in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.

=Future Work=
The authors have since extended this paper in several ways. They successfully deployed the tracker on a real robot and enhanced the system to deal with the virtual-to-real gap [1]. Specifically, 1) more advanced environment augmentation techniques are proposed to boost the environment diversity, which improves the ability to transfer to the real world. 2) A more appropriate action space than the one in the conference paper is developed, and a continuous action space for active tracking is investigated. 3) A mapping from the neural network prediction to the robot control signal is established so as to deliver end-to-end tracking on the robot.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

[2] Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

[3] Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

[4] Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.

[5] Hare, Sam, Golodetz, Stuart, Saffari, Amir, Vineet, Vibhav, Cheng, Ming-Ming, Hicks, Stephen L, and Torr, Philip HS. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.

End to end Active Object Tracking via Reinforcement Learning

2018-10-28T05:48:49Z

H454chen:

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, so there is no need to handle camera control during tracking. Although passive tracking is very useful and much of the existing work addresses it, it is inapplicable in settings such as tracking performed by a drone or by a mobile robot with a mounted camera.
Active tracking, on the other hand, involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune a pipeline with two separate subtasks. Tracking may require considerable human effort for bounding box labeling, and camera control is non-trivial and can incur many expensive trial-and-error attempts in the real world.

To address these challenges, the paper presents an end-to-end active tracking solution via deep reinforcement learning. More specifically, a ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action, and then the environment returns the updated state (the next visual frame). A3C, a modern reinforcement learning algorithm, is adopted to train the agent, where a customized reward function is designed to encourage the agent to follow the object closely.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to check the generalization ability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

As with state-of-the-art models, if the action module and the object tracking module are completely separate, it is extremely difficult to train either one, because it is impossible to know which of them causes the error observed at the end of the episode. At a high level, both modules serve the same function, as both aim for efficient navigation. So it makes sense to have a joint module that contains both the observation and the action-taking submodules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with common practice in deep reinforcement learning, where the CNNs used to extract features in Atari games are combined with the Q-networks (in the case of DQN). These CNNs are trained concurrently with the feed-forward Q-networks, where the error function is the difference between the observed Q value and the target Q value.

=Related Work=

In the domain of object tracking, there are both active and passive approaches. The advanced passive object tracking approaches are summarized below:

1) Subspace learning was adopted to update the appearance model of an object.

:Formerly, object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. 2008 introduce a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking [2].

2) Multiple instance learning was employed to track an object.

:Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. So, Babenko et al. 2009 introduce a novel algorithm that updates its appearance model using a “bag” of positive and negative examples. Subsequently, they argue tracking algorithms using weaker classifiers can still obtain superior performance [3].

3) Correlation filter based object tracking has achieved success in real-time object tracking.

:Correlation filter based object tracking algorithms attempt to “model the appearance of an object using filters”. At each frame, a small tracking window representing the target object is produced, and the tracker correlates the windows over the image sequence, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called the Minimum Output Sum of Squared Error (MOSSE) filter [4].

4) Structured output prediction was used to constrain object tracking, avoiding the conversion of positions to labels of training samples.

5) Tracking, learning, and detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

6) Deep learning models such as stacked autoencoders have been used to learn good representations for object tracking.

In the active approaches, camera control and object tracking have been treated as separate components, which makes these approaches difficult to tune. This paper tackles object tracking and camera control simultaneously in an end-to-end manner, which makes the system easier to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games such as Go and Atari. They have also been used in computer vision tasks like object localization, region proposal, and visual tracking. All of these advancements pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which had not been attempted before.

=Approach=
Virtual tracking scenes are generated for both training and testing. The Asynchronous Advantage Actor-Critic (A3C) algorithm is used to train the tracker. The RGB screen frame from the first-person perspective is chosen as the state. The tracker observes a visual state and takes one action from the following set of six actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the updated screen frame as well as the current reward.
==Tracking Scenarios==
The following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016; ViZ) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives a state and a reward from the environment at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), each composed of an object (a monster) and a background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed with an ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to follow the object closely.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To address this, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine with broad influence in the game industry, and it provides realistic scenarios that can mimic real-world scenes. UnrealCV (Qiu et al., 2017), which provides convenient APIs, is employed in this study along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments built on UE.
==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step <math>t</math>, <math>s_{t}</math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by <math>A</math>, of size <math>K = |A|</math>. An action <math>a_{t} \in A</math> is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R}</math>, according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step <math>t+1</math> is subject to a certain but unknown state transition function <math>s_{t+1} = f(s_{t}, a_{t})</math>, governed by the environment.
A trace consisting of a sequence of triplets can then be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
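The interaction loop that produces such a trace can be sketched in a few lines of Python. This is only an illustration under stated assumptions: <code>ToyEnv</code>, its integer state, and its reward rule are hypothetical stand-ins for the simulator, and <code>uniform_policy</code> stands in for the learned policy. Only the six action names come from the text.

```python
import random

ACTIONS = ["turn-left", "turn-right", "turn-left-and-move-forward",
           "turn-right-and-move-forward", "move-forward", "no-op"]


class ToyEnv:
    """Hypothetical stand-in for the simulator: the state is a step counter."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # r_t = g(s_t): a made-up rule so the trace has numbers in it.
        reward = 1.0 if self.state % 2 == 0 else 0.0
        # s_{t+1} = f(s_t, a_t): here the state simply advances by one.
        self.state += 1
        return self.state, reward


def uniform_policy(state):
    """Placeholder for pi(.|s_t): one probability per action (K values)."""
    return [1.0 / len(ACTIONS)] * len(ACTIONS)


def collect_trace(env, policy, num_steps, rng):
    """Roll out the policy and record the triplets (s_t, a_t, r_t)."""
    trace = []
    state = env.state
    for _ in range(num_steps):
        probs = policy(state)
        action = rng.choices(ACTIONS, weights=probs, k=1)[0]  # a_t ~ pi(.|s_t)
        next_state, reward = env.step(action)
        trace.append((state, action, reward))
        state = next_state
    return trace


trace = collect_trace(ToyEnv(), uniform_policy, num_steps=5, rng=random.Random(0))
```

In a real setup the policy would be the network's softmax output and the environment would be ViZDoom or UE; the structure of the loop stays the same.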
Meanwhile, <math>V(s_{t}) \in \mathbb{R}</math> denotes the expected accumulated future reward given state <math>s_{t}</math> (referred to as the critic). The policy function <math>\pi(\cdot)</math> and the value function <math>V(\cdot)</math> are then jointly modeled by a neural network. They are rewritten as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math>, with parameters <math>\theta</math> and <math>{\theta}'</math>, respectively. The parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression.
[[File:equation12.PNG|500px|center]]
where <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a discount factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H(\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.
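As a small worked example of the discounted sum <math>R_{t}</math>, the following Python sketch evaluates it directly from a list of rewards. The function name <code>discounted_return</code> and the choice to truncate the sum when the trace ends early are assumptions made for illustration.

```python
def discounted_return(rewards, t, horizon, gamma):
    """R_t = sum_{t'=t}^{t+T-1} gamma^(t'-t) * r_{t'}, with T = horizon."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must satisfy 0 < gamma <= 1")
    window = rewards[t:t + horizon]  # shorter than T if the trace ends early
    return sum(gamma ** k * r for k, r in enumerate(window))


rewards = [1.0, 1.0, 1.0, 1.0]
# With gamma = 0.5 and T = 3: 1 + 0.5 + 0.25 = 1.75
example = discounted_return(rewards, t=0, horizon=3, gamma=0.5)
```

With <math>\gamma = 1</math> the same call simply sums the next <math>T</math> rewards.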

==Network Architecture==
The tracker is a ConvNet-LSTM neural network, as shown in Fig. 2, with the architecture specification given in the following table. The FC6 and FC1 layers correspond to the 6-action policy <math>\pi(\cdot|s_{t})</math> and the value <math>V(s_{t})</math>, respectively. The screen is resized to an 84 × 84 × 3 RGB image as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function uses a two-dimensional local coordinate system, denoted S. The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is at the agent’s position, and S is parallel to the floor. The object’s local coordinates (x, y) and orientation a are expressed with respect to S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
Where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent with distance d and exhibits no rotation.
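Because the reward equation appears here only as an image, the Python sketch below uses one form that is consistent with the description: a penalty on the distance from the ideal position (0, d), scaled by c, plus a rotation penalty λ|a|, both subtracted from A. The exact expression, the function name <code>tracking_reward</code>, and the default parameter values are assumptions, not taken from the paper.

```python
import math


def tracking_reward(x, y, a, A=1.0, c=200.0, d=100.0, lam=0.5):
    """Assumed form: r = A - (sqrt(x^2 + (y - d)^2) / c + lam * |a|).

    (x, y) is the object's position in the agent's local frame S and a is
    its orientation; A, c, d and lam are the positive tuning parameters.
    The default values are illustrative only.
    """
    distance_penalty = math.sqrt(x ** 2 + (y - d) ** 2) / c
    rotation_penalty = lam * abs(a)
    return A - (distance_penalty + rotation_penalty)


# Object directly in front of the agent at distance d, with no rotation:
# both penalties vanish, so the reward equals A.
best = tracking_reward(x=0.0, y=100.0, a=0.0)
```

Under this assumed form, any sideways offset, distance error, or rotation lowers the reward below A, which matches the stated behaviour of the maximum.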
Environment Augmentation: To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments. For ViZDoom, (x, y, a) defines the system state. For augmentation, the initial system state is perturbed N times by editing the map with an ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. In addition, the screen frame may be flipped left-right (with the left and right actions swapped accordingly). As a result, 2N environments are obtained from one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode.
For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, background objects are randomly chosen and hidden. Every episode starts from the position where the agent failed in the previous episode. This makes the environment and the starting point differ from episode to episode, which augments the variation seen during training.
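The ViZDoom augmentation (N perturbed initial states, each with and without a left-right flip) can be illustrated with a short Python sketch. The data layout is an assumption: a frame is a plain list of pixel rows, and an environment is a pair of an initial state and a flip flag. The initial states listed are hypothetical values.

```python
import random

# When the frame is mirrored, left and right actions must be swapped too.
FLIPPED_ACTION = {
    "turn-left": "turn-right",
    "turn-right": "turn-left",
    "turn-left-and-move-forward": "turn-right-and-move-forward",
    "turn-right-and-move-forward": "turn-left-and-move-forward",
    "move-forward": "move-forward",
    "no-op": "no-op",
}


def flip_frame(frame):
    """Mirror a frame left-right; a frame is a list of pixel rows."""
    return [list(reversed(row)) for row in frame]


def build_environments(initial_states):
    """Pair each initial state (x, y, a) with flip off/on: 2N in total."""
    return [(state, flipped)
            for state in initial_states
            for flipped in (False, True)]


def sample_environment(environments, rng):
    """Pick one environment at random at the start of an episode."""
    return rng.choice(environments)


# N = 3 hypothetical perturbed initial states give 2N = 6 environments.
envs = build_environments([(0, 100, 0.0), (20, 90, 0.1), (-20, 110, -0.1)])
chosen = sample_environment(envs, random.Random(0))
```

With N = 21, the same construction yields the 42 training environments mentioned in the experimental setup.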
=Experimental Results=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, a training map as in Fig. 4 (left column) is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled during training. For testing, 9 maps are made, some of which are shown in Fig. 4 (middle and right columns). In all maps, the path of the target is pre-specified, as indicated by the blue lines. However, the object does not strictly follow the planned path. Instead, it sometimes moves randomly in a “zig-zag” way along the course, which is a built-in game engine behavior. This adds difficulty to the tracking problem.
For UE, a training environment named Square is generated, with randomly hidden background objects and a target named Stefani walking along a fixed path. For testing, four more environments are made: Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2). As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set to -450 and the maximum length to 3000.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature. An AR that is too small leads to termination of the episode, because it essentially means a tracking failure. EL roughly measures the duration of good tracking and is analogous to the Successfully Tracked Frames metric in conventional tracking applications. The theoretical maximum for both AR and EL is 3000 when A = 1.0 in the reward function (because of the termination criterion).
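The termination rule and the two metrics can be made concrete with a minimal Python sketch. The function name <code>run_episode</code> and the constant per-step rewards in the two examples are assumptions; the threshold of -450 and the maximum length of 3000 are the values stated above.

```python
def run_episode(step_rewards, reward_threshold=-450.0, max_length=3000):
    """Return (AR, EL) for a stream of per-step rewards.

    The episode stops when the accumulated reward drops below the
    threshold or when the episode length reaches the maximum.
    """
    accumulated_reward = 0.0
    episode_length = 0
    for reward in step_rewards:
        accumulated_reward += reward
        episode_length += 1
        if accumulated_reward < reward_threshold or episode_length >= max_length:
            break
    return accumulated_reward, episode_length


# Perfect tracking with A = 1.0: every step earns the maximum reward,
# so the episode runs to the length cap and AR = EL = 3000.
perfect = run_episode(1.0 for _ in range(5000))

# Hypothetical failure case: a constant reward of -1.0 per step ends the
# episode as soon as the accumulated reward falls below -450.
failed = run_episode(-1.0 for _ in range(5000))
```

The first case shows why 3000 is the theoretical maximum for both metrics when A = 1.0: the length cap bounds EL, and each step contributes at most A to AR.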

=Results=
Two training protocols were followed: RandomizedEnv (with the environment augmentation technique) and SingleEnv (without it). The paper mainly reports results for RandomizedEnv.
Only one table gives the results of SingleEnv training, and it shows that SingleEnv performs worse than RandomizedEnv. The variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results on the testing environments are reported in Tab. 2.
[[File:msm_table2.PNG|400px|center]]
The findings from the testing results are as follows:
1. The tracker generalizes well in the case of target appearance changing (Zombie, Cacodemon).
2. The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor).
3. The tracker does not lose a target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly.
4. The tracker is insensitive to a distracting object (Noise1) even when the “bait” is very close to the path (Noise2).

The proposed tracker is compared against several conventional trackers, each paired with a PID-like module for camera control to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module works as follows. In the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The algorithm then compares two consecutive bounding boxes and decides on an action.
The results show that the proposed solution outperforms the simulated active trackers, which lose their targets quickly. The Meanshift tracker works well when there is no camera shift between consecutive frames. Neither the KCF nor the Correlation tracker seems capable of handling such a large camera shift, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, but it drifts easily when the object turns suddenly.
The test results in the UE environment are tabulated in Table 5.
[[File:table5.PNG|400px|center]]
1. The comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained only with the target Stefani, revealing that it does not overfit to a specialized appearance.
2. The active tracker performs well when the path is changed (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path.
3. When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track objects robustly (the EL value is comparable to that in the previous environments), indicating good generalization potential.
4. In most cases, the proposed tracker outperforms the simulated active trackers, or achieves comparable results when it is not the best. The results of the simulated active trackers also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the results of TLD).

Real-world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained in the UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips, named Woman and Sphere. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part. The vertical axis indicates the size of the target, i.e., the area of the ground-truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows: 1) when the target is on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center; 2) when the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

A video of the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

=Conclusion=
In the paper, an end-to-end active tracker trained via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, which saves the effort of human labeling and of trial and error in the real world. It shows good generalization to unseen environments, and its tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, which is object tracking. The ViZDoom virtual environment, though used for training and testing, seems to have little or no ability to generalize to real-world scenarios, and its maps are very simple. The comparison presented in the paper for ViZDoom testing under changed environmental parameters looks positive, but the relatively simple nature of the environment needs to be considered when looking at these results. Also, when the floor is replaced by the ceiling, the tracker performs worst among the cases in the table, which seems to indicate that the model is somewhat over-fitted to the floor and ceiling parameters. The tracker trained in the UE environment is tested against simulated trackers, and the results show that the proposed solution performs better. However, since these trackers are simulated using a camera control algorithm written for this specific comparison, further testing is required for benchmarking. Real-world challenges such as intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing how well the model generalizes to real-world scenarios. For example, the current action space includes only six discrete actions, which are inadequate for real-world deployment because the tracker cannot adapt to different moving speeds of the target. It is also doubtful that training the tracker in the UE simulator alone is sufficient for a successful real-world deployment. It would be better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination of the scene, and the trajectory and speed of the target.
The results on the real-world videos are a positive sign for the generalization ability of the model in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.

=Future Work=
The authors extended this work in several ways in a follow-up paper [1]. They deployed the tracker on a real robot and enhanced the system to deal with the virtual-to-real gap. Specifically: 1) more advanced environment augmentation techniques are proposed to boost environment diversity, which improves transferability to the real world; 2) a more appropriate action space than that of the conference paper is developed, and a continuous action space for active tracking is investigated; 3) a mapping from the neural network prediction to the robot control signal is established so as to deliver end-to-end tracking.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

[2] Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

[3] Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

[4] Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.

End to end Active Object Tracking via Reinforcement Learning

2018-10-28T05:46:50Z

H454chen:

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, so there is no need to handle camera control during tracking. Although passive tracking is very useful and most existing work addresses it, it is inapplicable to settings such as tracking performed by a mobile robot with a mounted camera or by a drone.
Active tracking, on the other hand, involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune a pipeline with two separate subtasks. Tracking may require considerable human effort for bounding box labeling, and camera control is non-trivial and can incur expensive trial and error in the real world.

To address these challenges, the paper presents an end-to-end active tracking solution based on deep reinforcement learning. More specifically, a ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In it, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action; the environment then returns the updated state (the next visual frame). A3C, a modern reinforcement learning algorithm, is adopted to train the agent, with a customized reward function designed to encourage the agent to follow the object closely.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to check how well the model generalizes. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

When the action module and the object tracking module are completely separate, as they are in state-of-the-art models, it is extremely difficult to train either one, because it is impossible to know which of them is causing the error observed at the end of the episode. At a high level, both modules serve the same function, since both aim at efficient navigation. It therefore makes sense to have a joint module that contains both the observation and the action-taking submodules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with common practice in deep reinforcement learning, where the CNNs used to extract features (for example, in Atari games) are combined with the Q-networks (in the case of DQN). These CNNs are trained concurrently with the feed-forward Q-networks, where the error function is the difference between the observed Q-value and the target Q-value.

=Related Work=

In the domain of object tracking, there are both active and passive approaches. The advanced passive object tracking approaches are summarized below:

1) Subspace learning was adopted to update the appearance model of an object.

:Formerly, object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. 2008 introduce a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking.

2) Multiple instance learning was employed to track an object.

:Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. So, Babenko et al. 2009 introduce a novel algorithm that updates its appearance model using a “bag” of positive and negative examples. Subsequently, they argue tracking algorithms using weaker classifiers can still obtain superior performance.

3) Correlation filter based object tracking has achieved success in real-time object tracking.

:Correlation filter based object tracking algorithms attempt to “model the appearance of an object using filters”. At each frame, a small tracking window representing the target object is produced, and the tracker correlates the windows over the image sequence, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called the Minimum Output Sum of Squared Error (MOSSE) filter.

4) Structured output prediction was used to constrain object tracking, avoiding the conversion of positions to labels of training samples.

5) Tracking, learning, and detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

6) Deep learning models such as stacked autoencoders have been used to learn good representations for object tracking.

In the active approaches, camera control and object tracking have been treated as separate components, which makes these approaches difficult to tune. This paper tackles object tracking and camera control simultaneously in an end-to-end manner, which makes the system easier to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games such as Go and Atari. They have also been used in computer vision tasks like object localization, region proposal, and visual tracking. All of these advancements pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which had not been attempted before.

=Approach=
Virtual tracking scenes are generated for both training and testing. The Asynchronous Advantage Actor-Critic (A3C) algorithm is used to train the tracker. The RGB screen frame from the first-person perspective is chosen as the state. The tracker observes a visual state and takes one action from the following set of six actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the updated screen frame as well as the current reward.
==Tracking Scenarios==
The following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016; ViZ) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives a state and a reward from the environment at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), each composed of an object (a monster) and a background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed with an ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to follow the object closely.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To address this, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine with broad influence in the game industry, and it provides realistic scenarios that can mimic real-world scenes. UnrealCV (Qiu et al., 2017), which provides convenient APIs, is employed in this study along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments built on UE.
==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step <math>t</math>, <math>s_{t}</math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by <math>A</math>, of size <math>K = |A|</math>. An action <math>a_{t} \in A</math> is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R}</math>, according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step <math>t+1</math> is subject to a certain but unknown state transition function <math>s_{t+1} = f(s_{t}, a_{t})</math>, governed by the environment.
A trace consisting of a sequence of triplets can then be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R}</math> denotes the expected accumulated future reward given state <math>s_{t}</math> (referred to as the critic). The policy function <math>\pi(\cdot)</math> and the value function <math>V(\cdot)</math> are then jointly modeled by a neural network. They are rewritten as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math>, with parameters <math>\theta</math> and <math>{\theta}'</math>, respectively. The parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression.
[[File:equation12.PNG|500px|center]]
where <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a discount factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H(\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.

==Network Architecture==
The tracker is a ConvNet-LSTM neural network, as shown in Fig. 2, with the architecture specification given in the following table. The FC6 and FC1 layers correspond to the 6-action policy <math>\pi(\cdot|s_{t})</math> and the value <math>V(s_{t})</math>, respectively. The screen is resized to an 84 × 84 × 3 RGB image as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function uses a two-dimensional local coordinate system, denoted S. The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is at the agent’s position, and S is parallel to the floor. The object’s local coordinates (x, y) and orientation a are expressed with respect to S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
Where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent with distance d and exhibits no rotation.
Environment Augmentation: To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments. For ViZDoom, (x, y, a) defines the system state. For augmentation, the initial system state is perturbed N times by editing the map with an ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. In addition, the screen frame may be flipped left-right (with the left and right actions swapped accordingly). As a result, 2N environments are obtained from one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode.
For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, background objects are randomly chosen and hidden. Every episode starts from the position where the agent failed in the previous episode. This makes the environment and the starting point differ from episode to episode, which augments the variation seen during training.
=Experimental Results=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, the training map shown in the left column of Fig. 4 is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in the middle and right columns of Fig. 4. In all maps, the path of the target is pre-specified, as indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path. Instead, it sometimes moves randomly in a “zig-zag” way along the course, which is a built-in game engine behavior. This poses an additional difficulty for the tracking problem.
For UE, a training environment named Square is generated, with randomly chosen background objects made invisible and a target named Stefani walking along a fixed path. For testing, four further environments are made: Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2). As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set to -450 and the maximum length to 3000.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature. An AR that is too small leads to termination of the episode because it essentially means a tracking failure. EL roughly measures the duration of good tracking and is analogous to the Successfully Tracked Frames metric in conventional tracking applications. Because of the termination criterion, the theoretical maximum for both AR and EL is 3000 when A = 1.0 in the reward function.

=Results=
Two training protocols were followed: RandomizedEnv (with the augmentation technique) and SingleEnv (without it). The paper reports mainly the RandomizedEnv results; only one table gives results for SingleEnv training, and it shows that SingleEnv performs worse than RandomizedEnv. The variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results on the testing environments are reported in Tab. 2.
[[File:msm_table2.PNG|400px|center]]
The findings from the testing results are as follows:
<ol>
<li> The tracker generalizes well when the target appearance changes (Zombie, Cacodemon). </li>
<li> The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor). </li>
<li> The tracker does not lose the target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly. </li>
<li> The tracker is insensitive to a distracting object (Noise1), even when the “bait” is very close to the path (Noise2). </li>
</ol>

The proposed tracker is compared against several conventional trackers, each paired with a PID-like camera control module to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module is implemented such that, in the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The module compares the two successive bounding boxes as per the algorithm and decides on an action.
The results show that the proposed solution outperforms the simulated active trackers, which soon lose their targets. The Meanshift tracker works well when there is no camera shift between consecutive frames. Both the KCF and Correlation trackers seem unable to handle such large camera shifts, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, but it drifts easily when the object turns suddenly.
Testing in the UE environment is tabulated in Table 5.
[[File:table5.PNG|400px|center]]
<ol>
<li> Comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained only with the target Stefani, revealing that it does not overfit to a specialized appearance. </li>
<li> The active tracker performs well when the path changes (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path. </li>
<li> When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track objects robustly (the EL value is comparable to that in the previous environments), suggesting good generalization potential. </li>
<li> In most cases, the proposed tracker outperforms the simulated active trackers or achieves comparable results when it is not the best. The results of the simulated active trackers also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the results of TLD). </li>
</ol>

Real world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained on UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part of the image. The vertical axis indicates the size of the target, i.e., the area of the ground truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows, 1) when the target is on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center. 2) When the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

Video Link to the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

=Conclusion=
In the paper, an end-to-end active tracker trained via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, saving the effort of human labeling and of trial and error in the real world. It shows good generalization to unseen environments, and the tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the object tracking task. The ViZDoom virtual environment, though used for training and testing, seems to offer little or no generalization to real-world scenarios, and its maps are very simple. The ViZDoom test results under changed environmental parameters look positive, but the relative simplicity of the environment needs to be considered when interpreting them. Also, when the floor is replaced by the ceiling, the tracker performs worst among the cases in the table, which seems to indicate that the model is somewhat over-fitted to the floor and ceiling appearance. The tracker trained in the UE environment is tested against simulated trackers, and the results show that the proposed solution performs better. However, since those trackers are simulated using a camera control algorithm written for this specific comparison, further testing is required for benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which is inadequate for real-world deployment because the tracker cannot adapt to different moving speeds of the target. It is also believed that training the tracker in the UE simulator alone is not sufficient for a successful real-world deployment: it is better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination condition of the scene, the trajectory of the target, and the speed of the target.
The results on the real-world videos show a positive result towards the generalization ability of the models in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.

=Future Work=
The authors extended this work in a follow-up paper [1], in which the tracker is deployed on a real robot and the system is enhanced to deal with the virtual-to-real gap. Specifically, 1) more advanced environment augmentation techniques are proposed to boost environment diversity, which improves transferability to the real world. 2) A more appropriate action space than that of the conference paper is developed, and a continuous action space for active tracking is investigated. 3) A mapping from the neural network prediction to the robot control signal is established so as to deliver end-to-end tracking.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.

End to end Active Object Tracking via Reinforcement Learning

2018-10-28T05:25:15Z

H454chen:

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, so there is no need to handle camera control during tracking. Although passive tracking is very useful and has been the focus of much existing work, it is inapplicable in settings such as tracking performed by a camera-mounted mobile robot or by a drone.
Active tracking, on the other hand, involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune a pipeline with two separate subtasks. Tracking may require considerable human effort for bounding-box labeling, and camera control is non-trivial and can incur many expensive trial-and-error runs in the real world.

To address these challenges, the paper presents an end-to-end active tracking solution based on deep reinforcement learning. More specifically, a ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action, and the environment then returns the updated state (the next visual frame). A3C, a modern reinforcement learning algorithm, is adopted to train the agent, with a customized reward function designed to encourage the agent to follow the object closely.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to check the generalization ability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

In state-of-the-art models, the action module and the object tracking module are completely separate, which makes it extremely difficult to train either one: it is hard to know which module causes the error observed at the end of an episode. At a high level, both modules serve the same function, since both aim for efficient navigation. So it makes sense to have a joint module that contains both the observation and the action-taking sub-modules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with common practice in deep reinforcement learning, where the CNNs used to extract features for Atari games are combined with the Q-networks (in the case of DQN). These CNNs are trained concurrently with the feed-forward Q-networks, where the error function is the difference between the observed Q-values and the target Q-values.

=Related Work=

In the domain of object tracking, there are both active and passive approaches. The advanced passive object tracking approaches are summarized below:

1) Subspace learning was adopted to update the appearance model of an object.

:Formerly, object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. (2008) introduced a tracking method that incrementally adapts the appearance model according to new observations made during tracking.

2) Multiple instance learning was employed to track an object.

:Many studies have shown that a tracking algorithm can achieve better performance by employing an adaptive appearance model capable of separating an object from its background. However, the discriminative classifiers in those models are often difficult to update. So, Babenko et al. introduced an algorithm that updates its appearance model using a “bag” of positive and negative examples. They further argue that tracking algorithms using weaker classifiers can still obtain superior performance.

3) Correlation filter based object tracking has achieved success in real-time object tracking.

4) Structured output prediction was used to constrain object tracking and to avoid converting positions to labels of training samples.

5) Tracking, learning, and Detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

6) Deep learning models like stacked autoencoder have been used to learn good representations for object tracking.

For the active approaches, camera control and object tracking were considered as separate components. These approaches are difficult to tune. This paper tackles object tracking and camera control simultaneously in an end to end manner and is easy to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games like Go and Atari games. They have also been used in computer vision tasks like object localization, region proposal, and visual tracking. All of these advances pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which, according to the paper, had not been attempted before.

=Approach=
Virtual tracking scenes are generated for both training and testing. The Asynchronous Advantage Actor-Critic (A3C) algorithm is used to train the tracker. The RGB screen frame from the first-person perspective is chosen as the state. The tracker observes a visual state and takes one action from the following set of 6 actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the updated screen frame as well as the current reward.
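
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: <code>toy_env_step</code>, <code>reward_fn</code> and <code>uniform_policy</code> are hypothetical stand-ins for the game engine, the reward function and the trained network, and the integer "state" (a horizontal offset of the target) replaces the RGB frame used in the paper.

```python
import random

# The six discrete actions listed above.
ACTIONS = [
    "turn-left",
    "turn-right",
    "turn-left-and-move-forward",
    "turn-right-and-move-forward",
    "move-forward",
    "no-op",
]


def reward_fn(state):
    """Hypothetical reward: best (zero) when the target is centered."""
    return -abs(state)


def toy_env_step(state, action_index):
    """Hypothetical environment: returns (next_state, reward for this step)."""
    reward = reward_fn(state)
    name = ACTIONS[action_index]
    if name.startswith("turn-right"):
        next_state = state - 1  # turning right shifts the target left in view
    elif name.startswith("turn-left"):
        next_state = state + 1
    else:
        next_state = state
    return next_state, reward


def uniform_policy(state):
    """Hypothetical policy: equal probability for every action."""
    return [1.0 / len(ACTIONS)] * len(ACTIONS)


def collect_trace(policy, env_step, initial_state, num_steps, rng):
    """Roll out the policy and record (state, action, reward) triplets."""
    trace = []
    state = initial_state
    for _ in range(num_steps):
        probs = policy(state)
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        next_state, reward = env_step(state, action)
        trace.append((state, action, reward))
        state = next_state
    return trace


trace = collect_trace(uniform_policy, toy_env_step, 3, 5, random.Random(0))
```

In the actual system the policy would be the ConvNet-LSTM network and the environment would be ViZDoom or UE; only the shape of the loop carries over.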
==Tracking Scenarios==
The following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives a state and a reward from the environment at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), composed of an object (a monster) and a background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by the ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to follow the object closely.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine and has a broad influence in the game industry. It provides realistic scenarios which can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study, which provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.
==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step t, <math>s_{t}</math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by A and has size K = |A|. An action, <math>a_{t} \in A</math>, is drawn from a policy distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R}</math>, according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step t+1 is subject to a certain but unknown state transition function <math>s_{t+1} = f(s_{t}, a_{t})</math>, governed by the environment.
A trace consisting of a sequence of triplets can then be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R}</math> denotes the expected accumulated future reward given state <math>s_{t}</math> (referred to as the critic). The policy function <math>\pi(\cdot)</math> and the value function <math>V(\cdot)</math> are jointly modeled by a neural network and are rewritten as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math>, with parameters <math>\theta</math> and <math>{\theta}'</math> respectively. The parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression.
[[File:equation12.PNG|500px|center]]
where <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H (\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.
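
As a small worked example, the discounted sum <math>R_{t}</math> defined above can be computed directly from a list of rewards. The function name <code>discounted_return</code> and the sample rewards are illustrative, and truncating the window at the end of the reward list is an assumption made here for convenience, not something the paper specifies.

```python
def discounted_return(rewards, t, T, gamma):
    """Compute R_t = sum over t' in [t, t+T-1] of gamma^(t'-t) * r_{t'}.

    Assumption: if fewer than T rewards remain after t, the sum simply
    stops at the end of the list.
    """
    window = rewards[t:t + T]
    return sum((gamma ** k) * r for k, r in enumerate(window))


# Illustrative rewards; with gamma = 0.5 and T = 3 starting at t = 0:
# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
rewards = [1.0, 0.0, 2.0, 1.0]
example_return = discounted_return(rewards, t=0, T=3, gamma=0.5)
```

With <math>\gamma = 1</math> the function reduces to a plain sum of the next <math>T</math> rewards.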

==Network Architecture==
The tracker is a ConvNet-LSTM neural network, as shown in Fig. 2; the architecture specification is given in the following table. The FC6 and FC1 layers correspond to the 6-action policy <math>\pi (\cdot|s_{t})</math> and the value <math>V (s_{t})</math>, respectively. The screen is resized to an 84 × 84 × 3 RGB image as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function utilizes a two-dimensional local coordinate system (S). The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is where the agent is. System S is parallel to the floor. The object’s local coordinates (x, y) and orientation a are measured with respect to system S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
Where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent with distance d and exhibits no rotation.
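
The behaviour described above can be checked numerically with a short sketch. The exact formula is given in the image above; the form used here, A minus a distance penalty scaled by c and a rotation penalty weighted by λ, is an assumption that is consistent with the description, and the default parameter values are hypothetical placeholders rather than the paper's settings.

```python
import math


def tracking_reward(x, y, a, A=1.0, c=200.0, d=400.0, lam=0.5):
    """Assumed reward form: A - (sqrt(x^2 + (y - d)^2) / c + lam * |a|).

    (x, y) are the object's local coordinates in system S and a is its
    orientation. The defaults for A, c, d and lam are hypothetical.
    """
    distance_penalty = math.hypot(x, y - d) / c
    rotation_penalty = lam * abs(a)
    return A - (distance_penalty + rotation_penalty)


# Object exactly at distance d in front of the agent, no rotation:
best = tracking_reward(0.0, 400.0, 0.0)
# Same position, but the object is rotated by 1 radian:
rotated = tracking_reward(0.0, 400.0, 1.0)
# Object 200 units too far away, no rotation:
too_far = tracking_reward(0.0, 600.0, 0.0)
```

Under this assumed form the reward equals A only at (x, y, a) = (0, d, 0) and decreases as the object moves away from that pose or rotates.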
Environment Augmentation: To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments. For ViZDoom, (x, y, a) defines the system state. For augmentation, the initial system state is perturbed N times by editing the map with the ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. In addition, the screen frame may be flipped left-right (with the left and right actions swapped accordingly). As a result, 2N environments are obtained from one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode.
For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, randomly chosen background objects are hidden. Every episode starts from the position where the agent failed in the previous episode. This makes the environment and the starting point differ from episode to episode, which increases the variation seen during training.
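
The ViZDoom augmentation (N perturbed initial states, each optionally mirrored) can be sketched as follows. This is an illustrative sketch only: frames are represented as nested lists rather than engine buffers, and the dictionary layout and helper names (<code>build_environments</code>, <code>flip_frame</code>, <code>FLIPPED_ACTION</code>) are hypothetical.

```python
import random

# When the frame is mirrored, left and right actions must be swapped.
FLIPPED_ACTION = {
    "turn-left": "turn-right",
    "turn-right": "turn-left",
    "turn-left-and-move-forward": "turn-right-and-move-forward",
    "turn-right-and-move-forward": "turn-left-and-move-forward",
    "move-forward": "move-forward",
    "no-op": "no-op",
}


def flip_frame(frame):
    """Mirror a frame left-right; a frame is a list of pixel rows."""
    return [list(reversed(row)) for row in frame]


def build_environments(initial_states):
    """Turn N perturbed initial states (x, y, a) into 2N environments."""
    environments = []
    for state in initial_states:
        environments.append({"initial_state": state, "flipped": False})
        environments.append({"initial_state": state, "flipped": True})
    return environments


def sample_environment(environments, rng):
    """Pick one environment at random at the start of an episode."""
    return rng.choice(environments)


# Hypothetical perturbed initial states; N = 21 as in the experiments.
states = [(float(i), 100.0, 0.0) for i in range(21)]
environments = build_environments(states)
chosen = sample_environment(environments, random.Random(0))
```

In a flipped environment the agent would observe <code>flip_frame(frame)</code>, and the action it selects would be passed through <code>FLIPPED_ACTION</code> before being sent to the engine.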
=Experimental Results=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, the training map shown in the left column of Fig. 4 is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in the middle and right columns of Fig. 4. In all maps, the path of the target is pre-specified, as indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path. Instead, it sometimes moves randomly in a “zig-zag” way along the course, which is a built-in game engine behavior. This poses an additional difficulty for the tracking problem.
For UE, a training environment named Square is generated, with randomly chosen background objects made invisible and a target named Stefani walking along a fixed path. For testing, four further environments are made: Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2). As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set to -450 and the maximum length to 3000.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature. An AR that is too small leads to termination of the episode because it essentially means a tracking failure. EL roughly measures the duration of good tracking and is analogous to the Successfully Tracked Frames metric in conventional tracking applications. Because of the termination criterion, the theoretical maximum for both AR and EL is 3000 when A = 1.0 in the reward function.
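
The termination criterion and the two metrics can be made concrete with a minimal sketch. The function name <code>run_episode</code> is illustrative, and the per-step rewards are supplied as a plain list rather than coming from a simulator.

```python
def run_episode(rewards, reward_threshold=-450.0, max_length=3000):
    """Return (AR, EL) for a sequence of per-step rewards.

    The episode stops as soon as the accumulated reward drops below
    reward_threshold or the number of steps reaches max_length.
    """
    accumulated_reward = 0.0
    episode_length = 0
    for r in rewards:
        accumulated_reward += r
        episode_length += 1
        if accumulated_reward < reward_threshold:
            break
        if episode_length >= max_length:
            break
    return accumulated_reward, episode_length


# Perfect tracking with A = 1.0: every step earns the maximum reward.
perfect_ar, perfect_el = run_episode([1.0] * 5000)
# Persistent failure: the threshold is crossed on the fifth step.
failed_ar, failed_el = run_episode([-100.0] * 10)
```

The first call illustrates why both metrics are capped at 3000 when A = 1.0: the length limit ends the episode after 3000 steps, each worth at most 1.0.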

=Results=
Two training protocols were followed: RandomizedEnv (with the augmentation technique) and SingleEnv (without it). The paper reports mainly the RandomizedEnv results; only one table gives results for SingleEnv training, and it shows that SingleEnv performs worse than RandomizedEnv. The variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results on the testing environments are reported in Tab. 2.
[[File:msm_table2.PNG|400px|center]]
The findings from the testing results are as follows:
<ol>
<li> The tracker generalizes well when the target appearance changes (Zombie, Cacodemon). </li>
<li> The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor). </li>
<li> The tracker does not lose the target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly. </li>
<li> The tracker is insensitive to a distracting object (Noise1), even when the “bait” is very close to the path (Noise2). </li>
</ol>

The proposed tracker is compared against several conventional trackers, each paired with a PID-like camera control module to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module is implemented such that, in the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The module compares the two successive bounding boxes as per the algorithm and decides on an action.
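
To make the idea of such a camera control module concrete, here is a simple rule-based sketch. It is not the algorithm used in the paper: the decision rules, the thresholds <code>center_tolerance</code> and <code>shrink_ratio</code>, and the (x, y, width, height) box format are all assumptions chosen for illustration, and the thresholding below is a simplification rather than a real PID controller.

```python
def choose_action(reference_box, current_box,
                  center_tolerance=10.0, shrink_ratio=0.8):
    """Pick one of the six actions by comparing two bounding boxes.

    Boxes are (x, y, width, height) with (x, y) the top-left corner.
    Hypothetical rules: turn towards the side the box center moved to,
    and move forward when the box has shrunk (target got farther away).
    """
    ref_x, _, ref_w, ref_h = reference_box
    cur_x, _, cur_w, cur_h = current_box

    dx = (cur_x + cur_w / 2.0) - (ref_x + ref_w / 2.0)
    area_ratio = (cur_w * cur_h) / float(ref_w * ref_h)

    turn = None
    if dx > center_tolerance:
        turn = "turn-right"
    elif dx < -center_tolerance:
        turn = "turn-left"
    move_forward = area_ratio < shrink_ratio

    if turn and move_forward:
        return turn + "-and-move-forward"
    if turn:
        return turn
    if move_forward:
        return "move-forward"
    return "no-op"


reference = (100, 50, 40, 80)
same_action = choose_action(reference, (100, 50, 40, 80))
right_action = choose_action(reference, (140, 50, 40, 80))
far_left_action = choose_action(reference, (70, 50, 20, 40))
```

A hand-written controller like this has to be tuned separately for each passive tracker it is paired with, which is the kind of difficulty the end-to-end approach is meant to avoid.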
The results show that the proposed solution outperforms the simulated active trackers, which soon lose their targets. The Meanshift tracker works well when there is no camera shift between consecutive frames. Both the KCF and Correlation trackers seem unable to handle such large camera shifts, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, but it drifts easily when the object turns suddenly.
Testing in the UE environment is tabulated in Table 5.
[[File:table5.PNG|400px|center]]
<ol>
<li> Comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained only with the target Stefani, revealing that it does not overfit to a specialized appearance. </li>
<li> The active tracker performs well when the path changes (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path. </li>
<li> When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track objects robustly (the EL value is comparable to that in the previous environments), suggesting good generalization potential. </li>
<li> In most cases, the proposed tracker outperforms the simulated active trackers or achieves comparable results when it is not the best. The results of the simulated active trackers also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the results of TLD). </li>
</ol>

Real world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained on UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part of the image. The vertical axis indicates the size of the target, i.e., the area of the ground truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows, 1) when the target is on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center. 2) When the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

Video Link to the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

=Conclusion=
In the paper, an end-to-end active tracker trained via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, saving the effort of human labeling and of trial and error in the real world. It shows good generalization to unseen environments, and the tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the object tracking task. The ViZDoom virtual environment, though used for training and testing, seems to offer little or no generalization to real-world scenarios, and its maps are very simple. The ViZDoom test results under changed environmental parameters look positive, but the relative simplicity of the environment needs to be considered when interpreting them. Also, when the floor is replaced by the ceiling, the tracker performs worst among the cases in the table, which seems to indicate that the model is somewhat over-fitted to the floor and ceiling appearance. The tracker trained in the UE environment is tested against simulated trackers, and the results show that the proposed solution performs better. However, since those trackers are simulated using a camera control algorithm written for this specific comparison, further testing is required for benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which is inadequate for real-world deployment because the tracker cannot adapt to different moving speeds of the target. It is also believed that training the tracker in the UE simulator alone is not sufficient for a successful real-world deployment: it is better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination condition of the scene, the trajectory of the target, and the speed of the target.
The results on the real-world videos show a positive result towards the generalization ability of the models in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.

=Future Work=
The authors extended this work in several ways in a follow-up paper [1]. Most notably, they successfully deployed the tracker on a real robot. They also enhanced the system to deal with the virtual-to-real gap. Specifically, 1) more advanced environment augmentation techniques are proposed to boost environment diversity, which improves the ability to transfer to the real world; 2) a more appropriate action space than that of the conference paper is developed, and the use of a continuous action space for active tracking is investigated; 3) a mapping from the neural network prediction to the robot control signal is established so as to deliver end-to-end tracking successfully.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

End to end Active Object Tracking via Reinforcement Learning

2018-10-28T05:10:43Z

H454chen:

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame.
The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking. <ol> <li>Passive tracking</li> <li> Active tracking </li> </ol>

[[File:active_tracking_pipeline.PNG|500px|center]]

Passive tracking assumes that the object of interest is always in the image scene, so there is no need to handle camera control during tracking. Although passive tracking is very useful and is the focus of much of the existing work, it is inapplicable in settings such as tracking performed by a camera-mounted mobile robot or by a drone.
On the other hand, active tracking involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune a pipeline with two separate subtasks. Tracking may require considerable human effort for bounding box labeling, and camera control is non-trivial and can incur expensive trial and error in the real world.

To address these challenges, the paper presents an end-to-end active tracking solution via deep reinforcement learning. More specifically, a ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action, and then the environment returns the updated state (the next visual frame). A3C, a modern reinforcement learning algorithm, is adopted to train the agent, with a customized reward function designed to encourage the agent to follow the object closely.
An environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to check the generalization ability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].

=Intuition=

In state-of-the-art pipelines where the action module and the object tracking module are completely separate, it is extremely difficult to train either one, because it is impossible to know which module caused the error observed at the end of the episode. At a high level, both modules serve the same function, as both aim at efficient navigation. It therefore makes sense to have a joint module that contains both the observation and the action-taking submodules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with the common practice in deep reinforcement learning, where the CNNs used to extract features from Atari games are combined with the Q-networks (in the case of DQN). These CNNs are trained concurrently with the feed-forward Q-network layers, where the error function is the difference between the predicted Q-values and the target Q-values.

=Related Work=

In the domain of object tracking, there are both active and passive approaches. Advanced passive object tracking approaches are summarized below:

1) Subspace learning was adopted to update the appearance model of an object.

Earlier object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. (2008) introduced a tracking method that incrementally adapts the appearance model according to new observations made during tracking.

2) Multiple instance learning was employed to track an object.

3) Correlation filter based object tracking has achieved success in real-time object tracking.

4) Structured output prediction was used to constrain object tracking, which avoids converting positions to labels of training samples.

5) Tracking, learning, and Detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

6) Deep learning models like stacked autoencoder have been used to learn good representations for object tracking.

For the active approaches, camera control and object tracking were considered as separate components. These approaches are difficult to tune. This paper tackles object tracking and camera control simultaneously in an end to end manner and is easy to tune.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games such as Go and Atari games. They have also been used in computer vision tasks such as object localization, region proposal, and visual tracking. All of these advances pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which had not been attempted before.

=Approach=
Virtual tracking scenes are generated for both training and testing. The Asynchronous Advantage Actor-Critic (A3C) algorithm is used to train the tracker. The RGB screen frame from the first-person perspective is chosen as the state. The tracker observes a visual state and takes one action from the following set of 6 actions.

\[A = \{\text{turn-left},\ \text{turn-right},\ \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward},\ \text{move-forward},\ \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the updated screen frame as well as the current reward.
==Tracking Scenarios==
The following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives a state and a reward from the environment at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), composed of an object (a monster) and background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by the ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to follow the object closely.
[[File:fig4.PNG|500px|center]]

===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine with a broad influence in the game industry. It provides realistic scenarios which can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study, which provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.
==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step t, <math>s_{t} </math> denotes the observed state corresponding to the raw RGB frame. The action set is denoted by A, of size K = |A|. An action <math>a_{t} </math> ∈ A is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K} \] This is referred to as the actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R} </math> according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step t+1 is subject to a certain but unknown state transition function <math> s_{t+1} = f(s_{t}, a_{t}) </math>, governed by the environment.
A trace consisting of a sequence of triplets can then be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R} </math> denotes the expected accumulated reward in the future given state <math>s_{t}</math> (referred to as the critic). The policy function <math> \pi(\cdot)</math> and the value function <math>V(\cdot)</math> are jointly modeled by a neural network and are rewritten as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math> with parameters <math>\theta</math> and <math>{\theta}'</math>, respectively. The parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression.
[[File:equation12.PNG|500px|center]]
where <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards up to <math>T</math> time steps with a factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H(\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularizer factor.
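
The discounted return <math>R_{t}</math> defined above can be computed as in the following minimal Python sketch. The function name, the reward list, and the value of <math>\gamma</math> are illustrative assumptions rather than details from the paper, and the sketch covers only the return, not the full A3C update.

```python
def discounted_return(rewards, t, T, gamma):
    """R_t = sum_{t'=t}^{t+T-1} gamma^(t'-t) * r_{t'} (hypothetical helper)."""
    # Stop early if the trace ends before time step t + T - 1.
    end = min(t + T, len(rewards))
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, end))


# Illustrative trace of rewards r_0..r_3 (made-up numbers).
rewards = [1.0, 0.5, -0.2, 0.8]
print(discounted_return(rewards, t=0, T=3, gamma=0.9))
```

With <math>\gamma = 1</math> the return reduces to a plain sum of the next <math>T</math> rewards, while smaller values weight near-term rewards more heavily.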

==Network Architecture==
The tracker is a ConvNet-LSTM neural network as shown in Fig. 2, with the architecture specification given in the following table. FC6 and FC1 correspond to the 6-action policy <math>\pi (\cdot|s_{t})</math> and the value <math>V (s_{t})</math>, respectively. The screen is resized to 84 × 84 × 3 RGB images as the network input.
[[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
==Reward Function==
The reward function utilizes a two-dimensional local coordinate system (S). The x-axis points from the agent’s left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent’s front. The origin is where the agent is. System S is parallel to the floor. The object’s local coordinate (x,y) and orientation a are expressed with regard to system S.
The reward function is defined as follows.
[[File:reward_function.PNG|300px|center]]
Where A>0, c>0, d>0 and λ>0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent with distance d and exhibits no rotation.
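
The reward equation is available here only as an image, so the following Python sketch assumes one functional form that matches the description above: the reward equals A when the object is at (x, y) = (0, d) with a = 0, and it decreases with the distance from that point and with the rotation. The exact form, the function name, and all default parameter values are assumptions for illustration.

```python
import math


def tracking_reward(x, y, a, A=1.0, c=200.0, d=200.0, lam=0.5):
    """Assumed reward form: A - (sqrt(x^2 + (y - d)^2) / c + lam * |a|)."""
    # Distance of the object from the ideal spot (0, d) in front of the agent.
    position_error = math.sqrt(x ** 2 + (y - d) ** 2)
    return A - (position_error / c + lam * abs(a))


# Object directly in front of the agent at distance d, no rotation.
print(tracking_reward(0.0, 200.0, 0.0))
# Object off to the side and rotated: the reward is lower.
print(tracking_reward(100.0, 200.0, 0.5))
```

In this assumed form, c scales the distance penalty and λ (here `lam`) scales the rotation penalty.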
Environment Augmentation: To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments. For ViZDoom, (x,y,a) defines the system state. For augmentation, the initial system state is perturbed N times by editing the map with the ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. In addition, the screen frame may be flipped left-right (with the left and right actions swapped accordingly). As a result, 2N environments are obtained out of one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode.
For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, some background objects are randomly chosen and hidden. Every episode starts from the position where the agent failed in the previous episode. This makes the environment and starting point differ from episode to episode, which augments the variations of the environment during training.
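
The ViZDoom augmentation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the paper perturbs the map with ACS script, whereas this sketch draws random offsets; the function names, the dictionary layout, and the perturbation ranges are all hypothetical.

```python
import random

# Flipping the frame left-right also swaps the left and right actions.
FLIPPED_ACTION = {
    "turn-left": "turn-right",
    "turn-right": "turn-left",
    "turn-left-and-move-forward": "turn-right-and-move-forward",
    "turn-right-and-move-forward": "turn-left-and-move-forward",
    "move-forward": "move-forward",
    "no-op": "no-op",
}


def augment(base_state, n, rng, max_shift=50.0, max_turn=0.5):
    """Return 2N environment configs from one initial (x, y, a) state."""
    x, y, a = base_state
    envs = []
    for _ in range(n):
        # Perturb the initial position and orientation (ranges are made up).
        state = (
            x + rng.uniform(-max_shift, max_shift),
            y + rng.uniform(-max_shift, max_shift),
            a + rng.uniform(-max_turn, max_turn),
        )
        # Each perturbed state yields an unflipped and a flipped environment.
        envs.append({"init_state": state, "flip": False})
        envs.append({"init_state": state, "flip": True})
    return envs


def agent_action_to_env(action, env):
    """Map the agent's action to the unflipped map when frames are flipped."""
    return FLIPPED_ACTION[action] if env["flip"] else action


rng = random.Random(0)
envs = augment((0.0, 200.0, 0.0), n=21, rng=rng)
print(len(envs))  # 42
env = rng.choice(envs)  # one environment is sampled at the start of an episode
```

With N = 21 the sketch yields 42 environments, matching the count used in the experiments.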
=Experimental Results=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, the training map shown in Fig. 4, left column, is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled during training. For testing, 9 maps are made, some of which are shown in Fig. 4, middle and right columns. In all maps, the path of the target is pre-specified, as indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path. Instead, it sometimes randomly moves in a “zig-zag” way along the course, which is a built-in game engine behavior. This poses an additional difficulty for the tracking problem.
For UE, a training environment named Square is generated, with randomly chosen background objects made invisible and a target named Stefani walking along a fixed path. For testing, another four environments named Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2) are made. As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.
For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set to -450 and the maximum length to 3000.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature. An AR that is too small leads to termination of the episode because it essentially means that tracking has failed. EL roughly measures the duration of good tracking and is analogous to the metric Successfully Tracked Frames in conventional tracking applications. The theoretical maximum for both AR and EL is 3000 when letting A = 1.0 in the reward function (because of the termination criterion).
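
The termination rule and the two metrics can be illustrated with a short Python sketch. The function name and the synthetic reward streams are assumptions; the threshold of -450 and the maximum length of 3000 are the values stated above.

```python
def evaluate_episode(rewards, reward_threshold=-450.0, max_length=3000):
    """Return (AR, EL) for a stream of per-step rewards."""
    accumulated_reward = 0.0
    episode_length = 0
    for r in rewards:
        accumulated_reward += r
        episode_length += 1
        # Terminate on tracking failure or when the length cap is reached.
        if accumulated_reward < reward_threshold or episode_length >= max_length:
            break
    return accumulated_reward, episode_length


# Perfect tracking with A = 1.0: every step earns the maximum reward.
print(evaluate_episode([1.0] * 5000))   # (3000.0, 3000)
# Persistent failure: the episode stops once AR drops below -450.
print(evaluate_episode([-1.0] * 5000))  # (-451.0, 451)
```

The first call shows why 3000 is the theoretical maximum for both AR and EL when A = 1.0.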

=Results=
Two training protocols were followed, namely RandomizedEnv (with the augmentation technique) and SingleEnv (without it). Most of the results reported in the paper are for RandomizedEnv.
Only one table gives results for SingleEnv training, and it shows that SingleEnv performs worse than RandomizedEnv. The variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results for the testing environments are reported in Tab. 2.
[[File:msm_table2.PNG|400px|center]]
The testing results lead to the following findings:
<ol>
<li> The tracker generalizes well when the target appearance changes (Zombie, Cacodemon). </li>
<li> The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor). </li>
<li> The tracker does not lose a target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly. </li>
<li> The tracker is insensitive to a distracting object (Noise1), even when the “bait” is very close to the path (Noise2). </li>
</ol>

The proposed tracker is compared against several conventional trackers, each combined with a PID-like module for camera control to simulate active tracking. The results are displayed in Tab. 3.

[[File:table3.PNG|400px|center]]

The camera control module is implemented such that, in the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The algorithm compares the two consecutive bounding boxes and decides on an action.
The results show that the proposed solution outperforms the simulated active trackers, which lose their targets quickly. The Meanshift tracker works well when there is no camera shift between consecutive frames. Neither the KCF nor the Correlation tracker seems capable of handling such a large camera shift, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably well in the active case, but it easily drifts when the object turns suddenly.
The test results in the UE environment are tabulated in Table 5.
[[File:table5.PNG|400px|center]]
<ol>
<li> The comparison between S1SP1 and S1MP1 shows that the tracker generalizes well to a different target even though the model is trained with target Stefani, revealing that it does not overfit to a specialized appearance. </li>
<li> The active tracker performs well when the path changes (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path. </li>
<li> When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track objects robustly (the EL value is comparable to that in the previous environments), which supports its generalization potential. </li>
<li> In most cases, the proposed tracker outperforms the simulated active trackers or achieves comparable results when it is not the best. The results of the simulated active trackers also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the results of TLD). </li>
</ol>

Real-world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained in the UE environment is tested on a few videos from the VOT dataset.

[[File:fig7.PNG|400px|center]]

Fig. 7 shows the output actions for two video clips named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part. The vertical axis indicates the size of the target, i.e., the area of the ground truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows, 1) when the target resides on the right (left) side, the tracker tends to turn right (left), trying to move the camera to “pull” the target to the center; 2) when the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

Video Link to the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]

Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]

=Conclusion=
In the paper, an end-to-end active tracker trained via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, which saves the effort of human labeling and of trial and error in the real world. It generalizes well to unseen environments, and the tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network has been adopted. Environment augmentation has been proposed for training the network. The tracker trained using environment augmentation performs better than the one trained without it. This is true in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, object tracking. The ViZDoom virtual environment, though used for training and testing, seems to offer little or no generalization to real-world scenarios. The ViZDoom maps themselves are very simple. The comparison presented in the paper for the ViZDoom testing with changes in the environmental parameters looks positive, but the relatively simple nature of the environment needs to be considered while looking at these results. Also, when the floor is replaced by the ceiling, the tracker performs worst in comparison to the other cases in the table, which seems to indicate that the floor and ceiling parameters are somewhat over-fitted in the model. The tracker trained in the UE environment is tested against simulated trackers. The results show that the proposed solution performs better than the simulated trackers. However, since the trackers are simulated using the camera control algorithm written for this specific comparison, further testing is required for benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered while discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which are inadequate for deployment in the real world because the tracker cannot adapt to different moving speeds of the target. It is also believed that training the tracker in the UE simulator alone is sufficient for a successful real-world deployment. It is better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination condition of the scene, the trajectory of the target, and the speed of the target.
The results on the real-world videos are a positive indication of the models' generalization ability in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.

=Future Work=
The authors extended this work in several ways in a follow-up paper [1]. Most notably, they successfully deployed the tracker on a real robot. They also enhanced the system to deal with the virtual-to-real gap. Specifically, 1) more advanced environment augmentation techniques are proposed to boost environment diversity, which improves the ability to transfer to the real world; 2) a more appropriate action space than that of the conference paper is developed, and the use of a continuous action space for active tracking is investigated; 3) a mapping from the neural network prediction to the robot control signal is established so as to deliver end-to-end tracking successfully.

=References=
[https://arxiv.org/pdf/1808.03405.pdf 1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning”.

Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming-Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

DeepVO Towards end to end visual odometry with deep RNN

2018-10-26T07:44:17Z

H454chen:

== Introduction ==
Visual Odometry (VO) is a computer vision technique for estimating an object’s position and orientation from camera images. It is an important technique commonly used for “pose estimation and robot localization”, with notable applications on the Mars Exploration Rovers and autonomous vehicles [x1] [x2]. While the research field of VO is broad, this paper focuses on monocular visual odometry. In particular, the authors examine prominent VO methods and argue that mainstream geometry-based monocular VO methods should be complemented with deep learning approaches. The paper then proposes a novel deep learning-based end-to-end VO algorithm and empirically demonstrates its viability.

== Related Work ==

Visual odometry algorithms can be grouped into two main categories. The first comprises the conventional methods, which are based on established principles of geometry. Specifically, an object’s position and orientation are obtained by identifying reference points and calculating how those points change over the image sequence. Algorithms in this category can be divided into two sub-categories, which differ in how they select reference points. Sparse feature-based methods establish reference points using salient image features, such as corners and edges [8], whereas direct methods make use of the whole image and consider every pixel as a reference point [11]. Furthermore, semi-direct methods that combine both approaches have recently been gaining popularity [16].

Today, most state-of-the-art VO algorithms belong to the geometry family. However, they have significant limitations. For example, direct methods assume “photometric consistency” [11], whereas sparse feature methods are prone to “drifting” because of outliers and noise. As a result, the paper argues that geometry-based methods are difficult to engineer and calibrate, which limits their practicality. Figure 1 illustrates the general architecture of geometry-based algorithms and outlines the processing stages it requires, such as Camera Calibration, Feature Detection, Feature Matching (tracking), Outlier Rejection, Motion Estimation, Scale Estimation, and Local Optimization (bundle adjustment).

[[File:DeepVO_Figure_1.png]]

Figure 1. Architectures of the conventional geometry-based monocular VO method.

The second category of VO algorithms is based on learning. Namely, they try to learn an object’s “motion model” from labeled optical flows. Initially, these models were trained using classic machine learning techniques such as KNN [15], Gaussian Processes [16], and Support Vector Machines [17]. However, these models handled “highly non-linear and high dimensional” inputs inefficiently, which caused them to perform poorly in comparison with geometry-based methods. For this reason, deep learning-based approaches now dominate research in this field and are producing many promising results. For example, CNN-based models can now recognize places based on appearance [18] and detect direction and velocity from stereo inputs [20]. Moreover, a deep learning model even achieves “robust VO with blurred and under-exposed images [21]”. While these successes are encouraging, the authors observe that a CNN-based architecture is “incapable of modeling sequential information”. Instead, they propose to use an RNN to tackle this problem.

== End-to-End Visual odometry through RCNN ==

=== Architecture Overview ===
An end-to-end monocular VO model is proposed that uses a deep Recurrent Convolutional Neural Network (RCNN). Figure 2 depicts the end-to-end model, which comprises three main stages. First, the model takes a monocular video as input and pre-processes the image sequence by “subtracting the mean RGB values” from each frame. Then, consecutive image frames are stacked to form tensors, which become the inputs for the CNN stage. The purpose of the CNN stage is to extract salient features from the input images. The structure of the CNN is inspired by FlowNet [24] and is designed to model optical flow. Details of the CNN structure are shown in Table 1. Using the CNN features as input, the RNN stage tries to estimate the temporal and sequential relations among the input features. The RNN is composed of two Long Short-Term Memory (LSTM) networks, which allow the network to make predictions based on long-term and short-term dependencies. Figure 3 illustrates the structure. In this way, the RCNN architecture allows for end-to-end pose estimation at each time step.

[[File:DeepVO_Figure_2.png]]

Figure 2. Architectures of the proposed RCNN based monocular VO system.

[[File:DeepVO_Table_1.png]]

Table 1. CNN structure

[[File:DeepVO_Figure_3.png]]

Figure 3. Folded and unfolded LSTMs and its internal structure.
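
The pre-processing stage described in the architecture overview can be sketched with NumPy as follows. This is a rough illustration under stated assumptions: the mean is computed over the given clip, consecutive frames are stacked along the channel axis, and the function name and array shapes are hypothetical rather than taken from the paper.

```python
import numpy as np


def preprocess(frames):
    """frames: array of shape (T, H, W, 3). Returns (T-1, H, W, 6) tensors."""
    frames = frames.astype(np.float32)
    # Subtract the mean RGB value (one mean per channel) from each frame.
    mean_rgb = frames.mean(axis=(0, 1, 2), keepdims=True)
    centered = frames - mean_rgb
    # Stack each frame with its successor along the channel axis.
    return np.concatenate([centered[:-1], centered[1:]], axis=-1)


# A tiny random clip of 5 frames, 4 x 6 pixels each (made-up data).
video = np.random.default_rng(0).integers(0, 256, size=(5, 4, 6, 3))
pairs = preprocess(video)
print(pairs.shape)  # (4, 4, 6, 6)
```

Each resulting tensor pairs frame t with frame t+1, so a clip of T frames yields T-1 inputs for the CNN stage.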

=== Training and Optimisation ===
The proposed RCNN model can be represented as a conditional probability of poses given an image sequence: <math>p(Y_{t}|X_{t}) = p(y_{1},\ldots,y_{t}|x_{1},\ldots,x_{t})</math>. Given that this probability function is expressed as a deep RCNN, the problem can be interpreted as finding the network parameters (weights) that minimize the loss function between the actual and predicted poses, where the loss function is chosen to be the Mean Square Error (MSE) over all positions and orientations.
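
An MSE loss over positions and orientations can be sketched in plain Python as below. The pose layout (a position triple and an orientation triple per time step), the function name, and the weighting factor `kappa` are assumptions for illustration; with `kappa = 1.0` the function is a plain sum of squared errors averaged over time steps.

```python
def pose_mse(pred, truth, kappa=1.0):
    """Mean squared pose error; each pose is (position, orientation)."""
    total = 0.0
    for (p_hat, phi_hat), (p, phi) in zip(pred, truth):
        pos_err = sum((u - v) ** 2 for u, v in zip(p_hat, p))
        ori_err = sum((u - v) ** 2 for u, v in zip(phi_hat, phi))
        # kappa balances orientation against position (assumed weighting).
        total += pos_err + kappa * ori_err
    return total / len(pred)


# Two made-up time steps: the first is predicted exactly, the second is not.
pred = [((1.0, 0.0, 0.0), (0.0, 0.0, 0.0)), ((2.0, 0.0, 0.0), (0.1, 0.0, 0.0))]
truth = [((1.0, 0.0, 0.0), (0.0, 0.0, 0.0)), ((2.0, 0.0, 1.0), (0.0, 0.0, 0.0))]
print(pose_mse(pred, truth))
```

A weighting factor of this kind is useful because position and orientation errors are measured in different units.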

== Experiments and Results ==
The paper evaluated the proposed RCNN VO model by comparing it empirically with the open-source VO library of LIBVISO2 [7], which is a well-known geometry based model. The comparison is carried out using the KITTI VO/SLAM benchmark [3]. In total, the KITTI VO/SLAM benchmark contains 22 image sequences, 11 of which are labeled with ground truths. Two separate experiments are performed.

1. Quantitative analysis is performed using only the labeled image sequences. Namely, 4 of those image sequences were used for training and the others for testing. Table 2 and Figure 6 outline the results, and they show that the proposed RCNN model performs consistently better than the monocular VISO2_M model. However, it performs worse than the stereo VISO2_S model.

[[File:DeepVO_Table_2.png]]

[[File:DeepVO_Figure_6.png]]

2. The generalizability of the proposed RCNN model in a new environment is evaluated using unlabeled image sequences. Figure 8 outlines the result, and it shows that the proposed model is able to generalize better than the monocular VISO2_M model and performs roughly the same as the stereo VISO2_S model.

[[File:DeepVO_Figure_8.png]]

== Conclusions ==
The paper concludes that the proposed RCNN VO model is a viable approach. However, it is not expected to replace the classic geometry-based approach.

== Critiques and Discussions ==

== References ==
[1] S. Wang, R. Clark, H. Wen and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 2043-2050.

[2] M. Maimone, Y. Cheng, and L. Matthies, "Two years of Visual Odometry on the Mars Exploration Rovers," Journal of Field Robotics. 24 (3): 169–186, 2007.

[3] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[7] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3D reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011.

[8] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.

[11] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2320–2327.

[15] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch, “Memory-based learning for visual odometry,” in Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 47–52.

[16] V. Guizilini and F. Ramos, “Semi-parametric learning for visual odometry,” The International Journal of Robotics Research, vol. 32, no. 5, pp. 526–546, 2013.

[17] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci, “Evaluation of non-geometric methods for visual odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014.

[18] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” in Proceedings of Robotics: Science and Systems (RSS), 2015.

[20] A. Kendall, M. Grimes, and R. Cipolla, “Convolutional networks for real-time 6-DoF camera relocalization,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.

[21] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring representation learning with CNNs for frame-to-frame ego-motion estimation,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp.18–25, 2016.

[24] A. Dosovitskiy, P. Fischer, E. Ilg, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox et al., “Flownet: Learning optical flow with convolutional networks,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, pp. 2758–2766.

DeepVO Towards end to end visual odometry with deep RNN

2018-10-26T07:38:55Z

H454chen:

DeepVO Towards end to end visual odometry with deep RNN

2018-10-26T07:35:29Z

H454chen:

DeepVO Towards end to end visual odometry with deep RNN

2018-10-26T07:34:41Z

H454chen:

File:DeepVO Figure 8.png

2018-10-26T07:33:19Z

H454chen:

File:DeepVO Figure 6.png

2018-10-26T07:33:06Z

H454chen:

File:DeepVO Table 2.png

2018-10-26T07:32:54Z

H454chen: