<p>stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series (revision of 2018-12-12T01:55:26Z by Msminhas: Technical)</p>
<hr />
<div>This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].<br />
<br />
=Introduction=<br />
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:<br />
# A hedge fund's proprietary dataset of over 2 million quotes for a credit derivative index, <br />
# An artificially generated noisy autoregressive series, <br />
# A UCI household electricity consumption dataset. <br />
<br />
This paper focuses on multivariate, noisy time series, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained from different sources (e.g. financial news, an investment bank, a financial analyst) asynchronously, and each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than one with fewer clients, which means the significance of each past observation may depend on other factors that change in time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance motivates combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The proposed model is also inspired by the gating mechanism that has been successful in RNNs and Highway Networks.<br />
<br />
Time series forecasting is focused on modeling the predictors of future values of time series given their past. As in many cases the relationship between past and future observations is not deterministic, this amounts to expressing the conditional probability distribution as a function of the past observations:<br />
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div><br />
This forecasting problem has been approached almost independently by econometrics and machine learning communities. In this paper, the authors focus on modeling the predictors of future values of time series given their past values. <br />
<br />
The reasons that financial time series are particularly challenging:<br />
* Low signal-to-noise ratio and heavy-tailed distributions.<br />
* Being observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments in time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.<br />
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients). <br />
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.<br />
<br />
The predictability of financial datasets remains an open problem and is discussed in various publications [2].<br />
<br />
[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]<br />
<br />
The paper also provides empirical evidence that the proposed model, which combines linear models with deep learning models, can perform better than purely deep learning models such as CNNs, LSTMs and Phased LSTMs.<br />
<br />
=Related Work=<br />
===Time series forecasting===<br />
In recent proceedings of the main machine learning venues (ICML, NIPS, AISTATS, UAI), time series are often forecast using Gaussian processes[3,4], especially irregularly sampled time series[5]. Though the econometrics and machine learning approaches are still largely independent, combined models have started to appear, for example the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to obtain such a combined model.<br />
<br />
Although deep neural networks have been applied in many fields and produced satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate. The LSTM architecture has three "gates": the input gate, the forget gate, and the output gate. It performs well in practice because it allows the RNN to take into account events that happened a long time ago. Traditional RNN architectures are heavily influenced by recent events, whereas the LSTM's learned gates control what is written to, kept in, and read from its memory cell.<br />
<br />
In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, a household electricity consumption dataset, and real financial data from the credit default swap market with some inefficiencies.<br />
<br />
====AR Model====<br />
<br />
An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. It is a representation of a type of random process and is used to describe certain time-varying processes. The AR model specifies that the output variable depends linearly on its own previous values and on a stochastic term which is imperfectly predictable. Thus the model is in the form of a stochastic difference equation. For a p-th order model (relating the current state to the p last states), the equation is:<br />
<br />
<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]<br />
<br />
Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.<br />
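<br />
As a minimal sketch of the AR(p) equation above (not taken from the paper or its repository), the snippet below simulates an AR(2) series and recovers its coefficients by ordinary least squares. The function names <code>simulate_ar</code> and <code>fit_ar</code>, the Gaussian noise, and the zero initial conditions are assumptions made for illustration.<br />

```python
import numpy as np

def simulate_ar(phi, c=0.0, noise_std=1.0, n=500, seed=0):
    """Simulate X_t = c + sum_i phi[i-1] * X_{t-i} + eps_t (Gaussian noise assumed)."""
    rng = np.random.default_rng(seed)
    p = len(phi)
    x = np.zeros(n + p)  # first p entries are zero initial conditions (assumption)
    for t in range(p, n + p):
        # x[t-p:t][::-1] is (X_{t-1}, ..., X_{t-p}), matching phi_1..phi_p
        x[t] = c + np.dot(phi, x[t - p:t][::-1]) + rng.normal(0.0, noise_std)
    return x[p:]

def fit_ar(x, p):
    """Estimate (c, phi_1..phi_p) by ordinary least squares."""
    # Column i holds the lag-(i+1) values aligned with the targets x[p:]
    lags = np.column_stack([x[p - i - 1:len(x) - i - 1] for i in range(p)])
    design = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(design, x[p:], rcond=None)
    return coef[0], coef[1:]

series = simulate_ar(phi=np.array([0.5, -0.2]), c=1.0, n=5000)
c_hat, phi_hat = fit_ar(series, p=2)
```

With enough samples, <code>phi_hat</code> should land close to the true coefficients; the experiments in the paper use a stationary AR(10) base series instead of this toy AR(2).<br />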
<br />
===Gating and weighting mechanisms===<br />
Gating mechanisms for neural networks can help overcome the problem of vanishing gradients and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid non-linearity that controls the amount of output passed to the next layer. Different compositions of functions of this type have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].<br />
<br />
The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks, and it is most closely related to the softmax gating used in MuFuRu (Multi-Function Recurrent Unit)[12], i.e.<br />
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRu) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs. <br />
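<br />
The two gating forms above can be sketched in NumPy as follows. This is an illustrative sketch only: the array shapes (L candidates of dimension d stacked as an (L, d) array) and the function names are assumptions, and in a real network the candidates and logits would be produced by learned layers.<br />

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid_gate(candidate, gate_logits):
    """f(x) = c(x) * sigmoid(gate logits), element-wise."""
    return candidate * (1.0 / (1.0 + np.exp(-gate_logits)))

def softmax_gate(candidates, logits):
    """f(x) = sum_l p^l(x) * f^l(x), with p = softmax over the L candidates.

    candidates, logits: arrays of shape (L, d).
    """
    p = softmax(logits, axis=0)
    return (p * candidates).sum(axis=0)

candidates = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # L = 3, d = 2
logits = np.zeros((3, 2))  # equal logits give uniform weights of 1/3
mixed = softmax_gate(candidates, logits)
```

With equal logits the softmax gate reduces to a plain average of the candidates; unequal logits shift the weight toward particular candidates.<br />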
<br />
This idea is also successfully used in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. The difference is that these functions are modelled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.<br />
<br />
=Motivation=<br />
There are mainly five motivations that are stated in the paper by the authors:<br />
#The forecasting problem has been approached almost independently by the econometrics and machine learning communities. Unlike in machine learning, research in econometrics focuses more on explaining variables than on improving out-of-sample prediction power. Econometric models tend to 'over-fit' on financial time series: their parameters are unstable and they perform poorly on out-of-sample prediction.<br />
#It is difficult for learning algorithms to deal with time series data where the observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that is able to handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.<br />
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series of higher order with more past observations, the expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general may not allow a closed-form expression.<br />
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents separate dimensions as a single one, with dimension and duration indicators as additional features (Figure 2(b)).<br />
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>, one may expect that an LSTM would memorize the input values in each step and weight them at the output according to the duration, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.<br />
<br />
[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to many more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]<br />
<br />
=Model Architecture=<br />
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional future values of a subset of elements of <math>x_n</math>:<br />
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div><br />
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.<br />
<br />
Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>. <br />
<br />
The estimator of <math>y_n</math> can be expressed as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m},</math></div><br />
i.e. the estimate is the sum of the columns of the matrix in brackets. Here<br />
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks. <br />
#* <math>S</math> is a fully convolutional network, i.e. it is composed of convolutional layers only. <br />
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math> <br />
#** <math> W \in \mathbb{R}^{d_I \times M}</math> <br />
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.<br />
#<math>\sigma</math> is a normalized activation function applied independently to each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math><br />
#* for any <math>a_{i} \in \mathbb{R}^{M}</math><br />
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.<br />
# <math>\otimes</math> is element-wise matrix multiplication (also known as the Hadamard product).<br />
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix A.<br />
<br />
Since <math>\sum_{m=1}^M W_{\cdot,m}=W\cdot(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S_{\cdot,m}=S\cdot(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S_{\cdot,m}(\textbf{x}_n^{-M}))</math></div><br />
This is the proposed network, the Significance-Offset Convolutional Neural Network; <math>\text{off}</math> and <math>S</math> in the equation correspond to the Offset and Significance in its name, respectively.<br />
Figure 3 shows the scheme of the network.<br />
<br />
[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]<br />
<br />
The form of <math>\hat{y}_n</math> ensures the separation of three components: the temporal dependence (captured by the weights <math>W_m</math>), the local significance of observations <math>S</math>, and the predictors <math>\text{off}(x_{n-m})</math>. <math>S</math>, as a convolutional network, is determined by its filters, which capture local dependencies and are independent of the relative position in time, while the predictors <math>\text{off}(x_{n-m})</math> are completely independent of position in time. Through the offset network, each past observation provides an adjusted single regressor for the target variable. This adjustment is important because, under asynchronous sampling, consecutive values of x come from different signals and might be heterogeneous. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed up in an autoregressive manner.<br />
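<br />
The weighting step of the estimator can be sketched in NumPy as below. This is a sketch under stated assumptions, not the authors' implementation: the offset and significance networks are replaced by hypothetical linear maps <code>A</code> and <code>B</code> (the paper uses a multilayer perceptron and a fully convolutional network), and a row-wise softmax is used as one possible choice of normalized activation <math>\sigma</math>.<br />

```python
import numpy as np

def row_softmax(a):
    """Normalize each row so that it sums to 1 (one choice of normalized sigma)."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def socnn_combine(x_window, target_idx, W, offset_fn, significance_fn):
    """x_window: (d, M) array whose column m holds x_{n-m}.

    Returns y_hat of shape (d_I,).
    """
    x_target = x_window[target_idx, :]                # x^I_{n-m}, shape (d_I, M)
    predictors = offset_fn(x_window) + x_target       # off(x_{n-m}) + x^I_{n-m}
    weights = row_softmax(significance_fn(x_window))  # sigma(S(x)), rows sum to 1
    return (W * predictors * weights).sum(axis=1)     # sum over the M columns

rng = np.random.default_rng(0)
d, M, target_idx = 4, 6, [0, 2]
A = rng.normal(size=(len(target_idx), d))  # hypothetical linear offset map
B = rng.normal(size=(len(target_idx), d))  # hypothetical linear significance map
x_window = rng.normal(size=(d, M))
W = np.ones((len(target_idx), M))

y_hat = socnn_combine(x_window, target_idx, W,
                      offset_fn=lambda x: A @ x,
                      significance_fn=lambda x: B @ x)
```

With zero offsets, constant significance and <code>W</code> set to ones, the estimate reduces to a plain average of the past target values, which illustrates how the architecture generalizes a simple autoregressive average.<br />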
<br />
===Relation to asynchronous data===<br />
One common problem with time series is that the durations between consecutive observations vary. The paper states two ways to address this problem:<br />
#Data preprocessing: aligning the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach tends to lose information or to enlarge the size of the dataset and the model complexity.<br />
#Additional features: treating the duration or time of the observations as additional features. This is the core of SOCNN, as shown in Figure 2(b).<br />
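<br />
The second option can be sketched as follows: observations from several sources are merged into one series whose rows hold the value, a one-hot source indicator, and the duration since the previous observation, giving K + 2 columns for K sources. The function name, the tuple layout of the input, and the choice of 0 for the first duration are assumptions made for this illustration.<br />

```python
import numpy as np

def to_async_features(observations, n_sources):
    """observations: list of (timestamp, source_index, value) tuples, any order.

    Returns an array of shape (N, n_sources + 2) with columns
    [value, one-hot source indicator..., duration since previous observation].
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    rows = np.zeros((len(ordered), n_sources + 2))
    prev_time = ordered[0][0]  # first duration is defined as 0 (an assumption)
    for i, (time, source, value) in enumerate(ordered):
        rows[i, 0] = value
        rows[i, 1 + source] = 1.0
        rows[i, -1] = time - prev_time
        prev_time = time
    return rows

quotes = [(0.0, 0, 100.0), (2.5, 1, 101.0), (1.0, 1, 100.5), (4.0, 0, 100.8)]
features = to_async_features(quotes, n_sources=2)
```

No observation is duplicated or interpolated here; the timing information is carried entirely by the duration column.<br />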
<br />
===Loss function===<br />
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math><br />
<br />
The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason, an auxiliary loss function is used, which equals the mean squared error of such intermediate predictions:<br />
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div><br />
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:<br />
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div><br />
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.<br />
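<br />
A small worked example of the total loss, written as a NumPy sketch. The function name <code>total_loss</code> and the toy numbers are assumptions for illustration; <code>predictors</code> stands for the matrix whose column m is <math>\text{off}(x_{n-m}) + x_{n-m}^I</math>.<br />

```python
import numpy as np

def total_loss(y_hat, y, predictors, alpha=0.1):
    """L_tot = ||y_hat - y||^2 + alpha * (1/M) * sum_m ||predictors[:, m] - y||^2.

    predictors: (d_I, M) array of off(x_{n-m}) + x^I_{n-m}.
    """
    l2 = np.sum((y_hat - y) ** 2)
    # Squared error of each intermediate prediction, averaged over the M columns
    aux = np.mean(np.sum((predictors - y[:, None]) ** 2, axis=0))
    return l2 + alpha * aux

y = np.array([1.0, 2.0])
y_hat = np.array([1.5, 2.0])
predictors = np.array([[1.0, 2.0], [2.0, 4.0]])  # d_I = 2, M = 2
loss = total_loss(y_hat, y, predictors, alpha=0.1)
```

Here the L2 term is 0.25 and the auxiliary term is 2.5, so the total is 0.25 + 0.1 × 2.5 = 0.5. Setting <code>alpha=0</code> recovers the plain L2 loss.<br />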
<br />
=Experiments=<br />
The paper evaluated the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance was compared with that of a simple CNN, single- and multi-layer LSTMs, Phased LSTM and a 25-layer ResNet. Apart from the evaluation of the SOCNN architecture, the paper also discussed the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].<br />
<br />
==Datasets==<br />
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>, so that there is a synchronous and an asynchronous series for each value of K. Note that a series with K sources is (K + 1)-dimensional in the synchronous case and (K + 2)-dimensional in the asynchronous case. The base series in all processes was a stationary AR(10) series. Although that series has a true order of 10, in the experimental setting the input data included the past 60 observations. The rationale behind this is twofold: not only is the data observed at irregular random times, but in real-life problems the order of the model is also unknown.<br />
<br />
Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of the 7 features, while durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.<br />
<br />
Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources, i.e. different market participants such as analysts and banks. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and the duration from the previous quote. For each source and direction, the goal is to predict the next quoted price from this given source and direction considering the last 60 quotes.<br />
<br />
[[File:async.png | 520px|center|]]<br />
<br />
==Training details==<br />
They applied grid search on some hyperparameters in order to assess the significance of the model's components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they use 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.<br />
<br />
They chose LeakyReLU as activation function for all networks:<br />
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0.1x & \text{otherwise} \end{cases}</math></div><br />
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.<br />
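<br />
The LeakyReLU activation defined above is a one-liner in NumPy; the function name and the <code>slope</code> parameter are illustrative, with the default 0.1 taken from the formula.<br />

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Return x where x >= 0 and slope * x otherwise."""
    return np.where(x >= 0, x, slope * x)

values = leaky_relu(np.array([-2.0, 0.0, 3.0]))
```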
<br />
Table 1 presents the configuration of network hyperparameters used in the comparison.<br />
<br />
[[File:Junyi4.png | 520px|center|]]<br />
<br />
===Network Training===<br />
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with a ratio of 3 to 1. The remaining 20% of the data was used as a test set.<br />
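<br />
One way to read the split described above is sketched below: the first 80% of timesteps are shuffled and divided 3 to 1 into training and validation indices, and the last 20% are held out for testing. The function name and the exact rounding are assumptions; the paper's repository may implement the split differently.<br />

```python
import numpy as np

def split_indices(n_steps, seed=0):
    """Return (train, val, test) index arrays for a series of n_steps timesteps."""
    rng = np.random.default_rng(seed)
    cutoff = (4 * n_steps) // 5          # first 80% of timesteps
    shuffled = rng.permutation(cutoff)   # random assignment within the first 80%
    n_train = (3 * cutoff) // 4          # 3:1 train/validation ratio
    train = np.sort(shuffled[:n_train])
    val = np.sort(shuffled[n_train:])
    test = np.arange(cutoff, n_steps)    # last 20%, strictly later in time
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
```

Note that only the test set is separated chronologically; training and validation samples are interleaved in time under this reading.<br />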
<br />
All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.<br />
<br />
They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation. <br />
<br />
At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.<br />
<br />
Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]<br />
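<br />
The normalized uniform (Glorot) initialization draws weights uniformly from <math>[-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}]</math>. A minimal NumPy sketch for a dense layer is given below; the function name and the layer sizes are illustrative, and Keras provides this scheme as a built-in initializer.<br />

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

weights = glorot_uniform(64, 32)  # limit = sqrt(6 / 96) = 0.25
```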
<br />
The authors carried out the experiments using TensorFlow and Keras and used different hardware to optimize the model for different datasets. The models for the artificial and electricity data were optimized on one NVIDIA K20 GPU, while those for the quotes data used only an Intel Core i7-6700 CPU.<br />
<br />
==Results==<br />
Table 2 shows the results for all datasets.<br />
[[File:Junyi5.png | 800px|center|]]<br />
We can see that SOCNN outperforms the other models on all asynchronous artificial datasets as well as on the electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results. Phased LSTM and ResNet performed very poorly on the artificial asynchronous datasets and the quotes dataset, respectively. Notice that having more than one layer in the offset network would have a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on an asynchronous dataset; see Table 3. However, for other datasets, its impact was negligible. This makes it hard to justify the introduction of the auxiliary loss function <math>L^{aux}</math>.<br />
<br />
Also, relying on an artificial dataset for experimental results is not good practice here. This is essentially an application paper, and such a dataset makes results hard to reproduce and cannot support the performance claims of the model.<br />
<br />
[[File:Junyi6.png | 480px|center|]]<br />
In general, SOCNN has a significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.<br />
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]<br />
<br />
Finally, the authors test the robustness of the proposed SOCNN model by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The result is shown in Figure 6.<br />
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]<br />
From Figure 6, the purple lines and green lines seem to stay at the same position in the training and testing process. SOCNN and single-layer LSTM are the most robust and least prone to overfitting compared to the other networks.<br />
<br />
=Conclusion and Discussion=<br />
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. This new architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series. <br />
<br />
The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension, which needs further empirical study, is to consider not just <math>1 \times 1</math> convolutional kernels in the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.<br />
<br />
=Critiques=<br />
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.<br />
#The quote data cannot be accessed as they are proprietary. Also, only two datasets are available.<br />
#The 'Significance' network was described as critical to the model in the paper, but the authors did not show how the performance of SOCNN varies with respect to the significance network.<br />
#The transform of the original data to asynchronous data is not clear.<br />
#The experiments on the main application are not reproducible because the data is proprietary.<br />
#The way that train and test data were split is unclear. This could be important in the case of the financial data set.<br />
#Although the auxiliary loss function was mentioned as an important part, its advantages were not made clear in the paper. It would be better if the paper described its effectiveness in a little more detail. It helped achieve a more stable test error throughout training in many cases. <br />
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.<br />
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.<br />
#The paper uses financial/economic data as one of its test datasets. Instead of comparing against neural network models such as CNNs, which are known to work badly on time series data, it would be much better if the authors compared against well-known econometric time series models such as GARCH and VAR.<br />
#The paper does not specify in detail how the training and testing sets are separated, which is quite important in time series problems. Moreover, a rolling or online learning scheme should be used in the comparison, since these are standard in time series prediction tasks.<br />
<br />
=References=<br />
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994. <br />
<br />
[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.<br />
<br />
[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using Gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.<br />
<br />
[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.<br />
<br />
[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.<br />
<br />
[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.<br />
<br />
[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.<br />
<br />
[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.<br />
<br />
[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.<br />
<br />
[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.<br />
<br />
[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.<br />
<br />
[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.<br />
<br />
[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.<br />
<br />
[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.</div>
<p>Visual Reinforcement Learning with Imagined Goals (revision of 2018-12-12T01:47:18Z by Msminhas: Technical)</p>
<hr />
<div>Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]<br />
<br />
=Introduction and Motivation=<br />
<br />
Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus are able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent could also set its own goals and learn from its environment.<br />
<br />
In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract (self-generated) goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.<br />
<br />
The algorithm proposed by the authors is summarized below. A variational autoencoder (VAE) (left) learns a latent representation of images gathered during training time (center). These latent variables are used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).<br />
<br />
[[File: WF_Sec_11Nov25_01.png |center| 800px]]<br />
<br />
=Related Work =<br />
<br />
Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviors such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their modelling assumptions limit their suitability for training general-purpose robots. Some previous works, such as Levine et al. [11], proposed time-varying models which require episodic setups and are thus hard to generalize to non-episodic and continuous learning scenarios. Other works, such as Pinto et al. [12], proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training but do not learn goal-conditioned skills, and the authors' experiments indicate that this technique is difficult to extend to the goal-conditioned setting with image inputs. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment; appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to directly learn one or both of two simpler quantities, state/action values or policies, which can achieve the same optimal behavior without estimating or using a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state; see [https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly].) There are currently no examples of model-free reinforcement learning being used to train policies on real-world robotic systems without ground-truth information.<br />
<br />
In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting them to states seen along the sampled trajectory as was done in earlier works. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions. This approach allows a single transition tuple to be converted into potentially infinitely many valid training examples. <br />
<br />
Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state in the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.<br />
<br />
=Goal-Conditioned Reinforcement Learning=<br />
<br />
The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math> that, given a state <math>s_t</math> and goal <math>g</math> (desired state), dictates the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is one which maximizes the expected return <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> is the reward for performing action <math>a_i</math> in the current state <math>s_i</math> and arriving in the next state <math>s_{i+1}</math>, and <math>\gamma</math> is a discount factor which determines the relative importance given to rewards at different times. <br />
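<br />
As a small illustration of the return defined above, the following sketch computes the discounted sum for a single sampled trajectory; the expectation in <math>R_t</math> would be approximated by averaging over many such trajectories. The function name and the toy reward list are assumptions made for this example only.<br />

```python
def discounted_return(rewards, gamma):
    """Return sum_{i=t}^{T} gamma^(i-t) * r_i for a list of rewards starting at step t."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Toy rewards r_t, r_{t+1}, r_{t+2} (hypothetical values).
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

A smaller <math>\gamma</math> shrinks the contribution of later rewards, which is the sense in which it sets the relative importance of rewards at different times.<br />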
<br />
In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Suppose we let an autonomous agent explore an environment with a random policy. After executing each action, the state observations before and after the action are collected and stored. All state observations are images. For training, the agent can randomly select starting states and goal images from the set of state observations.<br />
<br />
Moreover, if we aim to accomplish a variety of tasks, we can construct a goal-conditioned policy and reward, and optimize the expected return with respect to a goal distribution <br />
<br />
<center><math>E_{g \sim G}[E_{r_i,s_i \sim E, a_i \sim \pi}[R_0]]</math></center><br />
<br />
where <math>G</math> is the set of goals and the reward is also a function of <math>g</math>.<br />
<br />
Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that a chosen value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to the goal state.<br />
<br />
[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]<br />
<br />
In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization<br />
<br />
<div align="center"><br />
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math><br />
</div><br />
<br />
which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.<br />
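<br />
The policy extraction step can be sketched as follows for the simplest case of a finite action set. The toy Q-function, the 1-D state, and the action list are hypothetical; with continuous actions the maximization is usually approximated rather than enumerated.<br />

```python
def greedy_policy(q_function, state, goal, actions):
    """Pick the action with the largest Q(s, a, g) among a finite action set."""
    return max(actions, key=lambda a: q_function(state, a, goal))


# Hypothetical 1-D example: states, goals and actions are numbers, and
# Q is the negative distance to the goal after taking the action.
def toy_q(state, action, goal):
    return -abs((state + action) - goal)


print(greedy_policy(toy_q, state=0, goal=3, actions=[-1, 0, 1]))  # 1
```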
<br />
The reason why Q-learning is popular is that it can be trained in an off-policy manner. Therefore, the only things a Q-function needs are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:<br />
<br />
[[File:ql.png|center|600px]]<br />
<br />
From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:<br />
<br />
<div align="center"><br />
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math><br />
</div><br />
<br />
where <math>\overline{w}</math> is treated as a constant, i.e. the target term is not differentiated when minimizing with respect to <math>w</math>.<br />
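<br />
A minimal sketch of this error for a single sample is given below, assuming a finite action set so that the maximum can be enumerated. The callables <code>q</code> and <code>q_target</code> stand in for <math>Q_w</math> and <math>Q_{\overline{w}}</math>; they and the toy values are illustrative assumptions.<br />

```python
def bellman_error(q, q_target, transition, actions, gamma):
    """0.5 * (Q_w(s,a,g) - (r + gamma * max_a' Q_wbar(s',a',g)))^2 for one sample."""
    s, a, s_next, g, r = transition
    # The target is computed with the frozen copy q_target and treated as a constant.
    target = r + gamma * max(q_target(s_next, a_next, g) for a_next in actions)
    return 0.5 * (q(s, a, g) - target) ** 2


# Hypothetical 1-D Q-function: negative distance to the goal after the action.
def toy_q(state, action, goal):
    return -abs((state + action) - goal)


# Transition (s, a, s', g, r) with made-up numbers.
error = bellman_error(toy_q, toy_q, (0, 1, 1, 3, -2.0), actions=[-1, 0, 1], gamma=0.9)
print(error)  # approximately 0.405
```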
<br />
The main drawback in this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data were available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data, <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated given a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified like so:<br />
<br />
[[File:qlr.png|center|600px]]<br />
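<br />
The relabelling idea can be sketched in a few lines: one stored <math>(s,a,s')</math> tuple is paired with several freshly sampled goals, and the reward is recomputed for each. The goal sampler, the reward function, and the choice to evaluate the reward on the next state are assumptions made for this toy example.<br />

```python
import random


def relabel(transition, sample_goal, reward_fn, num_goals, rng):
    """Turn one (s, a, s') tuple into several (s, a, s', g, r) training samples."""
    s, a, s_next = transition
    samples = []
    for _ in range(num_goals):
        g = sample_goal(rng)      # draw a new goal from p(g)
        r = reward_fn(s_next, g)  # recompute the reward for that goal (assumed to use s')
        samples.append((s, a, s_next, g, r))
    return samples


# Hypothetical 1-D setup: goals are integers in [0, 4], reward is negative distance.
rng = random.Random(0)
batch = relabel(
    (0, 1, 1),
    sample_goal=lambda r: r.randint(0, 4),
    reward_fn=lambda s, g: -abs(s - g),
    num_goals=3,
    rng=rng,
)
for sample in batch:
    print(sample)
```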
<br />
This goal resampling makes it possible to learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) access to a reward function and (2) access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns, as the task of generating goal images is fairly intensive.<br />
<br />
For one, a reward based on the distance between raw images assumes that this distance yields semantically useful information. But images are noisy, and a large amount of the information in an image may be unrelated to the object of interest. Thus, the distance between two images may not correlate with their semantic distance.<br />
<br />
Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.<br />
<br />
Retroactively generating goals is also explored in tabular domains in [15] and in continuous domains in [14] using hindsight experience replay (HER). However, HER is limited to sampling goals seen along a trajectory, which greatly limits the number and diversity of goals with which one can relabel a given transition.<br />
<br />
=Variational Autoencoder=<br />
Variational autoencoders can learn structured latent representations of high-dimensional data. A VAE contains an encoder <math>q_\phi</math> and a decoder <math>p_\psi</math>. The former maps states to latent distributions, while the latter maps latents to distributions over states. These two are jointly trained to maximize:<br />
<br />
<math>L(\psi,\phi;s^{(i)}) = -\beta D_{KL}(q_\phi(z|s^{(i)}) \,||\, p(z)) + E_{q_\phi(z|s^{(i)})}[\log p_\psi(s^{(i)}|z)]</math><br />
<br />
where <math>p(z)</math> is a prior distribution, chosen to be a unit Gaussian, <math>D_{KL}</math> is the Kullback-Leibler divergence, and <math>\beta</math> is a hyper-parameter that balances the two terms.<br />
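<br />
To make the two terms concrete, the sketch below evaluates a single-sample estimate of this objective. It assumes a diagonal Gaussian encoder (so the KL term has a closed form) and a unit-variance Gaussian decoder; both choices, and all function names, are assumptions for illustration rather than details taken from the paper.<br />

```python
import math


def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(s, mean):
    """log N(s; mean, I), i.e. a unit-variance Gaussian decoder (assumed)."""
    squared_error = sum((si - mi) ** 2 for si, mi in zip(s, mean))
    return -0.5 * squared_error - 0.5 * len(s) * math.log(2.0 * math.pi)


def beta_vae_objective(s, mu, log_var, decoded_mean, beta):
    """Single-sample estimate; decoded_mean is the decoder output for one z ~ q(z|s)."""
    return -beta * kl_to_unit_gaussian(mu, log_var) + gaussian_log_likelihood(s, decoded_mean)


# Encoder output equal to the prior gives zero KL penalty.
print(kl_to_unit_gaussian([0.0, 0.0], [0.0, 0.0]))  # 0.0
# A unit shift of the mean in one dimension costs 0.5 nats.
print(kl_to_unit_gaussian([1.0], [0.0]))  # 0.5
```

Larger values of <math>\beta</math> weight the KL term more heavily and pull the encoder distribution closer to the prior.<br />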
<br />
This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q-functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.<br />
<br />
[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and uses distances in that latent space for reward. [9]]]<br />
<br />
Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.<br />
<br />
This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons. First, it provides a mechanism for an agent to set its own goals: the agent simply samples a value for the latent variable from the generative model and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE is trained on real images, meaningful latent goals can be sampled from the latent variable prior. This helps the agent set its own goals and practice towards them if no goal is provided at test time.<br />
<br />
[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]<br />
<br />
The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.<br />
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.<br />
=Goal-Conditioned Policies with Unsupervised Representation Learning=<br />
Devising practical goal-conditioned value functions requires choosing a suitable goal representation. In the absence of domain-specific knowledge and instrumentation, one choice is to set the goal space G to be the same as the state observation space S. However, when the state is high-dimensional, learning a goal-conditioned Q-function and policy becomes exceedingly difficult. One challenging problem with end-to-end approaches for visual RL tasks is that the resulting policy needs to learn both perception and control. Training the goal-conditioned value function also requires defining a goal-conditioned reward.<br />
<br />
The authors' method jointly addresses a number of problems that arise when working with high-dimensional inputs such as images: sample-efficient learning, reward specification, and automated goal-setting. These problems are addressed by learning a latent embedding using a <math>\beta</math>-VAE. This latent space is then used to represent the goal and state, and to retroactively relabel data with latent goals sampled from the VAE prior to improve sample efficiency. <br />
=Algorithm=<br />
[[File:algorithm1.png|center|thumb|600px|]]<br />
<br />
Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows alternative exploration policies to be used, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and fine-tune it over the course of training. The VAE latent space is used to condition the policy on a goal sampled from the latent model. The VAE is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample goals from the prior and compute rewards in the latent space using the equation <br />
<br />
<center><math display="inline"> r(s, g) = - || z - z_g ||_A \propto \sqrt{\log e_{\Phi}(z_g | s)} </math></center><br />
<br />
This equation is derived from the equation below, which reflects the choice to use the negative Mahalanobis distance in the latent space as the reward:<br />
<br />
<center><math display="inline"> r(s, g) = - || e(s) - e(g) ||_A = - || z - z_g ||_A </math></center><br />
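<br />
As a rough sketch of this reward, the code below computes <math>-||z - z_g||_A</math> with <math>||v||_A = \sqrt{v^T A v}</math>. The latent vectors and the matrix <code>A</code> are made-up values; with <code>A</code> set to the identity the reward reduces to the negative Euclidean distance.<br />

```python
import math


def mahalanobis_norm(v, A):
    """sqrt(v^T A v) for a symmetric positive semi-definite matrix A (list of rows)."""
    Av = [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]
    return math.sqrt(sum(v_i * av_i for v_i, av_i in zip(v, Av)))


def latent_reward(z, z_goal, A):
    """Negative Mahalanobis distance between the encoded state and the encoded goal."""
    diff = [zi - gi for zi, gi in zip(z, z_goal)]
    return -mahalanobis_norm(diff, A)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(latent_reward([0.0, 0.0], [3.0, 4.0], identity))  # -5.0
```

The reward is zero only when the encoded state coincides with the latent goal and becomes more negative as the two move apart.<br />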
<br />
=Experiments=<br />
<br />
The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.<br />
<br />
The figure below shows the performance of the different algorithms on these tasks, which use a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocities. The plots show the distance to the goal state as a function of simulation steps. The Oracle, as a baseline, was given true object location information, as opposed to visual pixel information.<br />
<br />
[[File:WF_Sec_11Nov_25_02.png|1000px]]<br />
<br />
<br />
They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies are also varied, with the other components of the algorithm fixed, to study the effect of the relabeling strategy. In this experiment, RIG, which uses an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.<br />
<br />
Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:<br />
<br />
[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]<br />
<br />
The method for reaching only needs 10,000 samples and an hour of real-world interactions.<br />
<br />
They also used RIG to train a policy to push objects to target locations:<br />
<br />
[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]<br />
<br />
The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true position during training, they report test episode returns in terms of the VAE latent distance reward. As learning proceeds, RIG makes steady progress at optimizing the latent distance.<br />
<br />
=Conclusion & Future Work=<br />
<br />
In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. a procedure that is not only goal-oriented, but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to creating general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A recent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them, abbreviated ARC for actionable representation for control.<br />
<br />
=Critique=<br />
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.<br />
<br />
2. From the videos, one major concern is that the position of the robotic arm is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images reconstructed from the latent space are too blurry to be used as goal images. It would be worthwhile to investigate this in the future. It would also be worthwhile to investigate a method with multiple data sources, in which the agent is trained to choose the source with more complete information. <br />
<br />
3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with their relative positions in the goal images. The same situation is also observed in the variable-object experiment. We may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to real-world problems.<br />
<br />
4. The instability mentioned in #2 is even more apparent in the multi-object scenario and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).<br />
<br />
=References=<br />
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.<br />
<br />
2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.<br />
<br />
3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.<br />
<br />
4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.<br />
<br />
5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. International Conference on Machine Learning (ICML), 2017.<br />
<br />
6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning Networks. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.<br />
<br />
8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.<br />
<br />
9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/<br />
<br />
10. https://arxiv.org/pdf/1811.07819.pdf<br />
<br />
11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.<br />
<br />
12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.<br />
<br />
13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.<br />
<br />
14. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob Mcgrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017.<br />
<br />
15. L P Kaelbling. Learning to achieve goals. In IJCAI-93. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, volume vol.2, pages 1094 – 8, 1993.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias&diff=42403Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias2018-12-12T01:29:46Z<p>Msminhas: Technical</p>
<hr />
<div>==Introduction==<br />
<br />
The use of data-driven approaches in robotics has increased in the last decade. Instead of using hand-designed models, these data-driven approaches work on large-scale datasets and learn appropriate policies that map from high-dimensional observations to actions. Since collecting data using an actual robot in real-time is very expensive, most of the data-driven approaches in robotics use simulators in order to collect simulated data. The concern here is whether these approaches are robust enough to domain shift to be used on real-world data. It is an undeniable fact that there is a wide reality gap between simulators and the real world.<br />
<br />
This has motivated the robotics community to increase their efforts in collecting real-world physical interaction data for a variety of tasks. This effort has been accelerated by the declining costs of hardware. This approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau fast. Furthermore, robotic action data has not led to gains comparable to those seen in other areas such as computer vision and natural language processing. As the paper claims, the solution for all of these obstacles is using “real data”. Current robotic datasets lack diversity of environment. Learning-based approaches need to move out of simulators and labs and into real environments such as real homes so that they can learn from real datasets. <br />
<br />
Collecting real-world data is made difficult by a number of problems. First, there is a need for cheap and compact robots to collect data in homes, but current industrial robots (e.g. Sawyer and Baxter) are too expensive. Secondly, cheap robots are not accurate enough to collect reliable data. Also, there is a lack of constant supervision for data collection in homes. Finally, there is a circular dependency problem in home robotics: real-world data is needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges, in addition to some other external factors, will likely result in noisy data collection. This paper presents a first systematic effort to collect a dataset inside homes. In accomplishing this goal, the authors: <br />
<br />
1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes<br />
<br />
2. Collect training data in 6 different homes and testing data in 3 homes<br />
<br />
3. Propose a method for modelling the noise in the labelled data<br />
<br />
4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation<br />
<br />
[[File:aa1.PNG|600px|thumb|center|]]<br />
<br />
==Overview==<br />
<br />
This paper emphasizes the importance of diversifying the data for robotic learning in order to achieve better generalization, focusing on the task of grasping. A diverse dataset also allows for removing biases in the data. The paper argues that even for simple tasks like grasping, datasets which are collected in labs suffer from strong biases such as simple backgrounds and identical environment dynamics. Hence, models learned from them do not generalize and do not work well on real data.<br />
<br />
Collecting large-scale data inside a large number of homes in the future would require a low-cost robot. For this reason, the authors introduced a customized mobile manipulator. They used a Dobot Magician, a robotic arm, mounted on a Kobuki, a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. They also added an Intel R200 RGBD camera to the robot at a height of 1m above the ground. An onboard laptop with an Intel Core i5 processor performs all the processing. The whole system can run for 1.5 hours on a single charge.<br />
<br />
There is a trade-off: a low-cost robot comes at the price of less accurate control. The low-cost robot, which is built from cheaper components than expensive setups such as Baxter and Sawyer, suffers from higher calibration and execution errors. This means that the dataset collected with this approach is diverse and large, but it has noisy labels. To illustrate, consider the robot attempting a grasp at location <math> {(x, y)}</math>. Since there is noise in the execution, the robot may perform this action at location <math> {(x + \delta_{x}, y+ \delta_{y})}</math>, which would assign the success or failure label of this action to the wrong place. Therefore, to solve the problem, they used an approach to learn from noisy data. They modeled noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.<br />
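<br />
One way to read "noise as a latent variable" is to marginalize the success prediction over the possible execution offsets <math>(\delta_x, \delta_y)</math>. The sketch below does this for a small discrete set of offsets; the offset grid, the function names, and the toy probabilities are all assumptions for illustration and are not taken from the paper's implementation. The two probability sources play the roles of the noise network and the grasp network mentioned above.<br />

```python
def marginal_success(grasp_prob, noise_prob, x, y, offsets):
    """Sum over offsets of P(offset) * P(success | grasp executed at the shifted location)."""
    return sum(noise_prob[(dx, dy)] * grasp_prob(x + dx, y + dy) for dx, dy in offsets)


# Hypothetical noise distribution: the arm lands on target 75% of the time
# and one cell to the right 25% of the time.
offsets = [(0, 0), (1, 0)]
noise_prob = {(0, 0): 0.75, (1, 0): 0.25}


# Hypothetical grasp model: a grasp only succeeds at location (3, 2).
def grasp_prob(x, y):
    return 1.0 if (x, y) == (3, 2) else 0.0


# Commanding (2, 2) still succeeds whenever the execution drifts to (3, 2).
print(marginal_success(grasp_prob, noise_prob, 2, 2, offsets))  # 0.25
```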
<br />
==Learning on low-cost robot data==<br />
<br />
This paper uses a patch grasping framework in its proposed architecture. Also, as mentioned before, datasets collected by inaccurate and cheap robots are prone to noisy labels. The noise in the labels can be caused by hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear and tear, etc. The following sections explain the different parts of the architecture, which aims to disentangle the noise between the low-cost robot’s actual and commanded executions.<br />
<br />
===Grasping Formulation===<br />
<br />
The architecture addresses planar grasping: all objects are grasped at the same height and perpendicular to the ground (i.e., with a fixed end-effector pitch). The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degree of freedom (roll of the end-effector). For the purpose of comparison, they used a model which does not predict <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math>. The angle of grasp <math> {\theta}</math> is then predicted from these patches. Also, in order to allow multi-modal predictions, the angle <math> {\theta}</math> is discretized into steps <math> {\theta_{D}}</math>. <br />
<br />
Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label g. Then, the image <math> {I}</math> and the angle <math> {\theta}</math> are converted to image patch <math> {I_{P}}</math> and angle <math> {\theta_{D}}</math>. Then, to minimize the classification error, a binary cross entropy loss is used which minimizes the error between the predicted and ground truth label <math> g </math>. A convolutional neural network with weight initialization from pre-training on Imagenet is used for this formulation.<br />
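<br />
The conversion from <math> {\theta}</math> to <math> {\theta_{D}}</math> and the loss on the label <math> g </math> can be sketched as follows. This is only an illustrative sketch: the number of bins (18), the wrap into [0, 180) degrees, and the choice to apply the loss only to the executed angle bin are assumptions that the summary does not state, and the function names are hypothetical.<br />
<br />
```python
import math

NUM_ANGLE_BINS = 18  # assumption: the summary does not give the number of bins


def discretize_angle(theta_deg, num_bins=NUM_ANGLE_BINS):
    """Map a grasp angle in degrees to a bin index (theta_D).

    Assumes a two-fingered grasp looks the same after a 180-degree turn,
    so the angle is wrapped into [0, 180) before binning.
    """
    wrapped = theta_deg % 180.0
    bin_width = 180.0 / num_bins
    # Clamp to guard against floating-point wrap-around at the upper edge.
    return min(int(wrapped // bin_width), num_bins - 1)


def grasp_loss(success_probs, theta_bin, g, eps=1e-12):
    """Binary cross entropy on the executed angle bin only (an assumption).

    success_probs: predicted success probability per angle bin for one patch.
    g: ground-truth label, 1 for a successful grasp and 0 for a failure.
    """
    p = min(max(success_probs[theta_bin], eps), 1.0 - eps)
    return -(g * math.log(p) + (1 - g) * math.log(1.0 - p))


theta_bin = discretize_angle(95.0)
loss = grasp_loss([0.5] * NUM_ANGLE_BINS, theta_bin, g=1)
```
<br />
With an uninformative prediction of 0.5 for every bin, the loss equals <math>\log 2</math> regardless of the label, which is a convenient sanity check for such a loss.<br />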
<br />
(Note: On Cross Entropy:<br />
<br />
If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool. This is optimal, in that we can't encode the symbols using fewer bits on average.<br />
In contrast, cross entropy is the number of bits we'll need if we encode symbols from <math>y</math> using the wrong tool <math> {\hat y}</math>. This consists of encoding the <math> i</math>-th symbol using <math> {\log(\frac{1}{{\hat y_i}})}</math> bits instead of <math> {\log(\frac{1}{{ y_i}})}</math> bits. We of course still take the expected value with respect to the true distribution <math>y</math>, since it is the distribution that truly generates the symbols:<br />
<br />
\begin{align}<br />
H(y,\hat y) = \sum_i{y_i\log{\frac{1}{\hat y_i}}}<br />
\end{align}<br />
<br />
Cross entropy is never smaller than entropy; encoding symbols according to the wrong distribution <math> {\hat y}</math> makes us use more bits on average. The only exception is the case where <math>y</math> and <math> {\hat y}</math> are equal, in which entropy and cross entropy coincide.)<br />
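<br />
As a small worked example of the note above, the following sketch computes entropy and cross entropy in bits for a hand-picked pair of distributions. The distributions and function names are illustrative choices, not taken from the paper.<br />
<br />
```python
import math


def entropy(y):
    """Expected number of bits when symbols from y are encoded with y itself."""
    return sum(p * math.log2(1.0 / p) for p in y if p > 0)


def cross_entropy(y, y_hat):
    """Expected number of bits when symbols from y are encoded with y_hat."""
    return sum(p * math.log2(1.0 / q) for p, q in zip(y, y_hat) if p > 0)


y = [0.5, 0.25, 0.25]      # true distribution (illustrative)
y_hat = [0.25, 0.25, 0.5]  # mismatched coding distribution (illustrative)

h = entropy(y)
h_cross = cross_entropy(y, y_hat)
h_same = cross_entropy(y, y)
```
<br />
Here the entropy is 1.5 bits and the cross entropy is 1.75 bits, so the mismatched code costs an extra 0.25 bits per symbol on average, while encoding with the true distribution recovers the entropy.<br />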
<br />
===Modeling noise as latent variable===<br />
<br />
In order to tackle the problem of inaccurate position control and calibration caused by the cheap robot, they found a structure in the noise which depends on the robot and its design. They modeled this structure as a latent variable and decoupled it during training. The approach is shown in Figure 2: <br />
<br />
<br />
[[File:aa2.PNG|600px|thumb|center|]]<br />
<br />
The conventional approach models the grasp success probability for a given image patch at a given angle. The environment variables that can introduce noise into the system are generally treated as insignificant, due to the high accuracy of expensive, commercial robots. However, in the low-cost setting with multiple robots collecting data in parallel, this noise becomes an important consideration for learning. <br />
<br />
The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}; \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.<br />
<br />
The conditional probability of grasping at a noisy image patch <math>I_P</math> for this model is computed by:<br />
<br />
<br />
\[ { P(g|I_{P},\theta_{D}, \mathcal{R} ) = \sum_{\widehat{I_P} \in \mathcal{P}} P(g|z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} | \theta_{D},I_P,\mathcal{R})} \]<br />
<br />
<br />
Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R})</math> represents the noise which can be caused by the <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g|z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> represents the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN and NMN.<br />
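<br />
The marginalization above is a weighted sum of the GPN outputs, with the NMN distribution as the weights. A minimal sketch, assuming both networks have already produced plain Python lists of probabilities (the function name and the numbers are hypothetical):<br />
<br />
```python
def marginal_grasp_probability(gpn_probs, nmn_probs):
    """Combine GPN and NMN outputs into one grasp success probability.

    gpn_probs[k]: P(g = 1 | z = k-th candidate patch, theta_D)
    nmn_probs[k]: P(z = k-th candidate patch | R); must sum to 1
    """
    if len(gpn_probs) != len(nmn_probs):
        raise ValueError("one NMN weight is needed per candidate patch")
    if abs(sum(nmn_probs) - 1.0) > 1e-6:
        raise ValueError("NMN output must be a probability distribution")
    return sum(g * n for g, n in zip(gpn_probs, nmn_probs))


# Three candidate patches for brevity (illustrative numbers only).
gpn_probs = [0.9, 0.5, 0.1]
nmn_probs = [0.5, 0.25, 0.25]
p_success = marginal_grasp_probability(gpn_probs, nmn_probs)
```
<br />
If the NMN puts all of its mass on the commanded patch, the result reduces to the GPN prediction for that patch, which corresponds to the noise-free case.<br />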
<br />
===Learning the latent noise model===<br />
<br />
This section concerns what the inputs to the NMN should be and how the network can be trained. The authors assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math> given the global information <math>\mathcal{R}</math>, i.e. <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>. Apart from the patch <math> I_{P} </math> and grasp information <math>(x, y, θ)</math>, they use information such as the image of the entire scene, the ID of the robot and the raw pixel location of the grasp. They argue that the image of the full scene could contain essential information about the system, such as the location of the camera relative to the ground, which may change over the lifetime of the robot. The identification number of the robot might give cues about errors specific to a particular piece of hardware. Finally, the raw pixel location of execution contains calibration-specific information: calibration error is coupled with pixel location, because a least-squares fit is used to compute the calibration parameters.<br />
<br />
They used direct optimization to learn both NMN and GPN with noisy labels. However, explicit labels are not available to train NMN but the latent variable <math>z</math> can be estimated using a technique such as Expectation-Maximization. The entire image of the scene and the environment information are the inputs of the NMN, as well as robot ID and raw-pixel grasp location. The output of the NMN is the probability distribution of the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied to the marginalized output of these two networks and the true grasp label <math>g</math>.<br />
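<br />
The sketch below illustrates why the NMN can be trained without explicit labels: the loss is computed on the marginalized output, so it changes when the NMN shifts probability mass between candidate patches. It is a toy illustration with hand-picked numbers, not the authors' implementation; the function names and values are hypothetical.<br />
<br />
```python
import math


def softmax(logits):
    """Convert NMN scores into a probability distribution over patches."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def robust_grasp_loss(gpn_probs, nmn_logits, g, eps=1e-12):
    """Binary cross entropy between label g and the marginalized prediction."""
    nmn_probs = softmax(nmn_logits)
    p = sum(a * b for a, b in zip(gpn_probs, nmn_probs))
    p = min(max(p, eps), 1.0 - eps)
    return -(g * math.log(p) + (1 - g) * math.log(1.0 - p))


gpn_probs = [0.9, 0.2, 0.2]  # GPN is confident only about the first patch
loss_uniform = robust_grasp_loss(gpn_probs, [0.0, 0.0, 0.0], g=1)
loss_shifted = robust_grasp_loss(gpn_probs, [2.0, 0.0, 0.0], g=1)
```
<br />
For a successful grasp, moving NMN mass toward the patch that the GPN rates as graspable lowers the loss, so gradient descent on this loss gives the NMN a training signal even though <math>z</math> is never observed.<br />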
<br />
===Training details===<br />
<br />
They implemented their model in PyTorch and fine-tuned a pretrained ResNet-18 model. For the NMN, they concatenated the 512-dimensional ResNet feature with a one-hot vector of the robot ID and the raw pixel location of the grasp. This passes through a series of three fully connected layers and a SoftMax layer to convert the correct patch predictions to a probability distribution. The inputs of the GPN are the original noisy patch plus 8 other patches equidistant from the original one. The angle predictions for all the patches are passed through a sigmoid activation at the end to obtain the grasp success probability for a specific patch at a specific angle.<br />
<br />
The training of the network takes place in two stages. It starts with training only GPN over 5 epochs of the data. Then, the NMN and the marginalization operator are added to the model. So, they train NMN and GPN simultaneously in an end-to-end fashion for the other 25 epochs.<br />
<br />
This two-stage approach is crucial for effective training of their networks, without which NMN trivially selects the same patch irrespective of the input. The optimizer used for training is Adam [16].<br />
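<br />
The two-stage schedule can be written down as a small helper that reports which networks are updated in a given epoch. Only the epoch counts (5 and 25) come from the text above; the function and constant names are hypothetical.<br />
<br />
```python
GPN_ONLY_EPOCHS = 5  # stage 1: train the GPN alone
JOINT_EPOCHS = 25    # stage 2: train the GPN and NMN end to end


def trainable_modules(epoch):
    """Return the names of the networks updated in a zero-indexed epoch."""
    total = GPN_ONLY_EPOCHS + JOINT_EPOCHS
    if not 0 <= epoch < total:
        raise ValueError(f"epoch must be in [0, {total})")
    if epoch < GPN_ONLY_EPOCHS:
        return ("GPN",)
    return ("GPN", "NMN")


schedule = [trainable_modules(e) for e in range(GPN_ONLY_EPOCHS + JOINT_EPOCHS)]
```
<br />
In a PyTorch training loop, such a helper could decide whether the NMN and the marginalization step are included in the forward pass for the current epoch.<br />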
<br />
==Results==<br />
<br />
In the results part of the paper, they show that collecting dataset in homes is essential for generalizing learning from unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.<br />
<br />
They collected data about planar grasping. This was done in parallel using multiple robots in 6 different homes, as shown in Figure 3. Since the input data were unstructured, they used an object detector (tiny-YOLO), which suits the LCA's limited memory and computational capabilities. Once an object location was detected, the object class information was discarded and only the 2D location was used to sample a grasp. The grasp location in 3D was computed using PointCloud data. They scattered different objects in homes within a 2m area to prevent collision of the robot with obstacles and let the robot move randomly and grasp objects. Finally, they collected a dataset with 28K grasp results.<br />
<br />
[[File:aa3.PNG|600px|thumb|center|]]<br />
<br />
To evaluate their approach in a more quantitative way, they used three test settings:<br />
<br />
- The first one is binary classification on held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of grasping, given a location and an angle. Using binary classification allows for testing many models without running them on real robots. They collected two held-out datasets using the LCA in the lab and in homes, and a dataset for the Baxter robot.<br />
<br />
- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this test measures the generalization of the model.<br />
<br />
- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running the model on the Sawyer robot which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.<br />
<br />
They used baselines for both their home-collected data and their Robust-Grasp model. Two datasets serve as data baselines: the dataset collected with the Baxter robot in the lab (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).<br />
They compared their Robust-Grasp model with the noise independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) for a strong real-world grasping baseline.<br />
<br />
===Experiment 1: Performance on held-out data===<br />
<br />
Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA has a good performance on both lab data and home environment.<br />
<br />
[[File:aa4.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 2: Performance on Real LCA Robot===<br />
<br />
In Table 2, the performance of the Home-LCA is compared against a pre-trained DexNet and the model trained on the Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be explained by possible noise in the depth images caused by natural light. DexNet, which requires high-quality depth sensing, cannot perform well in these scenarios. Because the LCA uses cheap commodity RGBD cameras, the noise in the depth images is not a matter of concern, as the model has no expectation of high-quality sensing.<br />
<br />
[[File:aa5.PNG|600px|thumb|center|]]<br />
<br />
===Experiment 3: Performance on Real Sawyer===<br />
<br />
To compare the performance of the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they used Lab-Baxter for benchmarking, which is an accurate and better-calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to that of several recent learning-to-grasp papers; however, this model was trained and tested in different environments. The Robust-Grasp model also outperforms Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 show that the corrections depend on both the pixel location of the noisy grasp and the robot.<br />
<br />
[[File:aa6.PNG|600px|thumb|center|]]<br />
<br />
[[File:aa7.PNG|600px|thumb|center|]]<br />
<br />
==Related work==<br />
<br />
Over the last few years, interest in scaling up robot learning with large-scale datasets has increased, and many papers have been published in this area. A hand-annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large-scale datasets for grasping. The work mentioned above used high-cost hardware and data labeling mechanisms. There were also many papers that worked on other robotic tasks like material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes; they all used lab data.<br />
<br />
Furthermore, since grasping is one of the basic problems in robotics, there were some efforts to improve grasping. Classical approaches focused on physics-based issues of grasping and required 3D models of the objects. However, recent works focused on data-driven approaches which learn from visual observations to grasp objects. Simulation and real-world robots are both required for large-scale data collection. A versatile grasping model was proposed to achieve a 90% performance for a bin-picking task. The point here is that they usually require high-quality depth as input which seems to be a barrier for practical use of robots in real environments. High-quality depth sensing means a high cost to implement in hardware and thus is a barrier for practical use.<br />
<br />
Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, little research has used low-cost robots. One example is learning to stack multiple blocks using a cheap, inaccurate robot. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether learning approaches are used in them alongside mapping and planning.<br />
<br />
Learning from noisy inputs is another challenge specifically in computer vision. A controversial question which is often raised in this area is whether learning from noise can improve the performance. Some works show it could have bad effects on the performance; however, some other works find it valuable when the noise is independent or statistically dependent on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.<br />
<br />
==Conclusion==<br />
<br />
All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach by using a mobile manipulator which is a lot cheaper than the existing industrial robots and costs under 3K USD. They collected a dataset of 28K grasps in six different homes. In order to solve the problem of noisy labels which were caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with home dataset showed 43.7% improvement over the models trained with lab data. Their framework performed 33% better than a baseline DexNet model, which struggled with the typically poor depth sensing in common household environments with a lot of natural light. Their results also showed that their model can improve the grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.<br />
<br />
==Critiques==<br />
<br />
This paper does not contain a significant algorithmic contribution. The authors are just combining a large number of data engineering techniques for the robot learning problem. The authors claim that they have obtained 43.7% more accuracy than baseline models, but it does not seem to be a fair comparison, as the data collection happened in simulated settings in the lab for other methods, whereas the authors use the home dataset. The authors should also have discussed safety issues when training robots in real environments as opposed to simulated environments like labs. The authors are encouraging other researchers to look outside the labs, but are not discussing the critical safety issues in this approach.<br />
<br />
Another strange finding is that the paper mentions that they "follow a model architecture similar to [Pinto and Gupta [4]]"; however, the proposed model is, in fact, a fine-tuned ResNet-18 architecture. Pinto and Gupta implement a version similar to AlexNet, as shown below in Figure 5.<br />
<br />
[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]<br />
<br />
<br />
The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that in order to handle the noise in the dataset, they can model the noise as a latent variable and their model can improve the performance of grasping. Although learning from noisy data and achieving a good performance is valuable, it is better that they test their noise modeling network for other robots as well. Since their noise modelling network takes robot information as an input, it would be a good idea to generalize it by testing it using different inaccurate robots to ensure that it would perform well.<br />
<br />
They did not report other aspects of their comparison; for example, they could have reported their training time compared to other models or the size of the other datasets.<br />
<br />
==References==<br />
<br />
#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.<br />
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537,2017.<br />
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor-critic for image-based robot learning." Robotics Science and Systems, 2018.<br />
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509.06825.<br />
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.<br />
#Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.<br />
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.<br />
#Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/abs/1606.07419<br />
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.<br />
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.<br />
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.<br />
#Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. RSS, 2011.<br />
#David F Nettleton, Albert Orriols-Puig, and Albert Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial intelligence review, 33(4):275–306, 2010.<br />
#Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014.<br />
#Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.<br />
#Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Zero-Shot_Visual_Imitation&diff=42402Zero-Shot Visual Imitation2018-12-12T01:13:17Z<p>Msminhas: Technical</p>
<hr />
<div>This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018. <br />
<br />
==Introduction==<br />
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach the desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time, which the agent will form into observation-action pairs to then distill into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable, especially since new tasks need a new set of demonstrations for the robot to learn from. In this paper, an alternative paradigm is pursued wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss.<br />
Videos, models, and more details are available at https://pathak22.github.io/zeroshot-imitation/.<br />
<br />
===Paper Overview===<br />
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images, instead of observation-action pairs. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.<br />
<br />
[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Multi-step GSP with previous action history; (c) Multi-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Multi-step GSP with forward consistency loss proposed in this work.]]<br />
<br />
This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'', and it is learned in a self-supervised way by re-labeling states that the agent visited as goals and the actions the agent took as prediction targets. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.<br />
<br />
A major challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'', which essentially says that reaching the goal is more important than how it is reached. First, a forward model that predicts the next observation from the given action and current observation is learned. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss does not inadvertently penalize actions that are ''consistent'' with the ground-truth action, even though the actions are not exactly the same (but lead to the same next state). <br />
<br />
As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.<br />
<br />
Of course, when introducing something like the forward-consistent loss, issues related to the number of steps needed to reach a certain goal become of interest, since different goals require different numbers of steps. To address this, the paper pairs the GSP with a goal recognizer (as an optimizer) that determines whether the goal has been satisfied with respect to some metrics. Figure 1 shows various GSPs, with diagram (d) showing the forward-consistent loss proposed in this paper.<br />
<br />
The paper refers to this method as zero-shot, as the agent never has access to expert actions regardless of being in the training or task demonstration phase. This is different from one-shot imitation learning, where agents have full knowledge of actions and expert demos during the training phase. The agent learns to imitate instead of learning by imitation. The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation, and a series of navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.<br />
<br />
===Related Work===<br />
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.<br />
<br />
Imitation Learning: The two main threads are behavioral cloning and inverse reinforcement learning. Recent work in imitation learning requires access to expert actions, which the approach in this paper does not need.<br />
<br />
Visual Demonstration: Several papers focused on relaxing this supervision to visual observations alone, and end-to-end learning improved the results.<br />
<br />
Forward/Inverse Dynamics and Consistency: Forward dynamics models have been learned for planning actions, but there is no consistent optimizer between the forward and inverse dynamics.<br />
<br />
Goal Conditioning: In this paper, systems work from high-dimensional visual inputs instead of knowledge of the true states and do not use a task reward during training.<br />
<br />
==Learning to Imitate Without Expert Supervision==<br />
<br />
In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described. <br />
<br />
Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment. This exploration data is used to learn the GSP policy.<br />
<br />
<br />
<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div><br />
<br />
<br />
The learned GSP policy (<math display="inline">π</math>) takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs a sequence of actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> starting from the current observation <math display="inline">x_i</math>. The states (observations) <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and need not be consecutive. Given the start and stop states, the number of actions <math display="inline">K</math> is also known. <math display="inline">π</math> can be thought of as a deep network with parameters <math display="inline">θ_π</math>. <br />
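<br />
The re-labeling of exploration data into training samples for the GSP can be sketched as follows. This is a minimal illustration under the assumption that a trajectory is stored as two Python lists; the function name and the <code>max_steps</code> cap are hypothetical and not taken from the paper.<br />
<br />
```python
def relabel_trajectory(observations, actions, max_steps):
    """Turn one exploration trajectory into (x_i, x_g, action sequence) samples.

    observations: [x_1, ..., x_T]
    actions:      [a_1, ..., a_{T-1}], where a_t leads from x_t to x_{t+1}
    max_steps:    largest number of actions K between start and goal
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("expected exactly one more observation than actions")
    samples = []
    for i in range(len(observations) - 1):
        for k in range(1, max_steps + 1):
            j = i + k
            if j >= len(observations):
                break
            # The visited state x_j is treated as the goal, and the actions
            # actually taken between x_i and x_j become the prediction targets.
            samples.append((observations[i], observations[j], actions[i:j]))
    return samples


observations = ["x1", "x2", "x3", "x4"]
actions = ["a1", "a2", "a3"]
samples = relabel_trajectory(observations, actions, max_steps=2)
```
<br />
No expert labels are involved: every start/goal pair and its target action sequence comes from the agent's own exploration, and <math display="inline">K</math> is known for each sample because it is the length of the sliced action sequence.<br />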
<br />
At test time, the expert demonstrates a task from which the agent captures a sequence of observations. This set of images is denoted by <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math>. The sequence needs to have at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses its learned policy to start from initial state <math display="inline">x_0</math> and generate actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to follow the observations in <math display="inline">D</math>.<br />
<br />
The agent does not have access to the sequence of actions performed by the expert. Hence, it must use the observations to determine if it has reached the goal. A separate ''goal recognizer'' network is needed to ascertain whether the current observation is close to the current goal. This is because multiple actions might be required to get close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and, if not, the agent executes <math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. Then, after getting sufficiently close to <math display="inline">x_1^d</math>, the agent sets <math display="inline">x_2^d</math> as the goal and executes actions. This process is repeated for each image in <math display="inline">D</math> until the final goal is reached.<br />
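<br />
The test-time procedure can be sketched as a loop over the demonstration images. The toy environment below (integer positions on a line, a hand-written policy and an exact-match goal recognizer) is a stand-in for the learned networks and the real robot, and the <code>max_steps_per_goal</code> cap is an added assumption; all names are hypothetical.<br />
<br />
```python
def follow_demonstration(x0, demo, policy, goal_reached, step,
                         max_steps_per_goal=50):
    """Chase each demonstration observation in turn and return visited states."""
    x = x0
    visited = [x]
    for goal in demo:
        steps = 0
        # Keep acting until the goal recognizer accepts the current observation.
        while not goal_reached(x, goal):
            if steps >= max_steps_per_goal:
                break  # give up on this sub-goal and move on to the next one
            action = policy(x, goal)
            x = step(x, action)
            visited.append(x)
            steps += 1
    return visited


# Toy stand-ins: states are integers, actions are +1 or -1.
toy_policy = lambda x, goal: 1 if goal > x else -1
toy_goal_reached = lambda x, goal: x == goal
toy_step = lambda x, action: x + action

visited = follow_demonstration(0, [2, 1, 4], toy_policy, toy_goal_reached, toy_step)
```
<br />
In the toy run the agent passes through the states 0, 1, 2, 1, 2, 3, 4, switching to the next demonstration image each time the goal recognizer fires.<br />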
<br />
===Learning the Goal-Conditioned Skill Policy (GSP)===<br />
<br />
In this section, the one-step version of the GSP policy is described first. Next, it is extended to the multi-step version. <br />
<br />
A one-step trajectory can be described as <math display="inline">(x_t, a_t, x_{t+1})</math>. Given <math display="inline">(x_t, x_{t+1})</math> the GSP policy estimates an action, <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math>. During training, cross-entropy loss is used to learn GSP parameters <math display="inline">θ_π</math>:<br />
<br />
<br />
<div style="text-align: center;"><math>L(a_t, \hat{a}_t) = -p(a_t|x_t, x_{t+1}) \log( \hat{a}_t)</math></div><br />
<br />
<br />
<math display="inline">a_t</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted actions respectively. The conditional distribution <math display="inline">p</math> is not readily available, so it needs to be empirically approximated using the data. In a standard deep learning problem it is common to assume <math display="inline">p</math> is a delta function at <math display="inline">a_t</math>; given a specific input, the network outputs a single output. However, in this problem multiple actions can lead to the same output. Multiple outputs given a single input can be modeled using a variational auto-encoder. However, the authors use a different approach, explained in sections 2.2-2.4 of the paper and summarized in the following sections.<br />
<br />
===Forward Consistency Loss===<br />
<br />
To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (the observation from executing the action predicted by GSP <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients (for actions that result in the same next observation) and aid the learning process. This is what is denoted as ''forward consistency loss''.<br />
<br />
To operationalize the forward consistency loss, we need a differentiable "forward dynamics" model that can reliably predict the results of an action. The forward dynamics <math display="inline">f</math> are learned from the data by another model. Given an observation and the action performed, <math display="inline">f</math> predicts the next observation, <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math>, so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground-truth transitions while the second term ensures consistency with the GSP network. In summary, the loss function is given below:<br />
<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f}{\min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div><br />
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div><br />
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div><br />
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div><br />
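To make the three terms of this objective concrete, the following sketch evaluates them on a toy problem in which two different actions lead to the same next state. Everything here is an illustrative assumption: the helper names, the toy dynamics <math>f(x, a) = x + a^2</math>, and the squared action error, which stands in for the cross-entropy action loss. The sketch only evaluates the objective and performs no gradient-based learning.<br />

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance ||u - v||_2^2."""
    return float(sum((a - b) ** 2 for a, b in zip(u, v)))


def loss_terms(
    x_t: Vector,
    a_t: float,
    x_next: Vector,
    policy: Callable[[Vector, Vector], float],
    forward_model: Callable[[Vector, float], Vector],
    action_loss: Callable[[float, float], float],
) -> Tuple[float, float, float]:
    """Return (model-fit term, forward-consistency term, action term)."""
    a_hat = policy(x_t, x_next)        # a_hat = pi(x_t, x_{t+1})
    x_tilde = forward_model(x_t, a_t)  # prediction under the true action
    x_hat = forward_model(x_t, a_hat)  # prediction under the predicted action
    return (
        sq_dist(x_next, x_tilde),
        sq_dist(x_next, x_hat),
        action_loss(a_t, a_hat),
    )


def total_loss(terms: Tuple[float, float, float], lam: float) -> float:
    """Combine the terms as fit + lambda * consistency + action."""
    fit, consistency, action = terms
    return fit + lam * consistency + action


def toy_dynamics(x: Vector, a: float) -> Vector:
    # Toy assumption: actions +a and -a produce the same next state.
    return [x[0] + a * a]


def squared_action_error(a: float, a_hat: float) -> float:
    # Stand-in for the cross-entropy action loss.
    return (a - a_hat) ** 2


x_t, a_t, x_next = [1.0], 2.0, [5.0]

# The predicted action -2 differs from the true action +2 but reaches the same state.
same_outcome = loss_terms(x_t, a_t, x_next, lambda x, g: -2.0,
                          toy_dynamics, squared_action_error)
# The predicted action +1 is closer to the true action but reaches a different state.
wrong_outcome = loss_terms(x_t, a_t, x_next, lambda x, g: 1.0,
                           toy_dynamics, squared_action_error)
print(same_outcome, wrong_outcome)
```

In this toy case the forward-consistency term is zero for the alternative action that reaches the correct next state, while the direct action loss penalizes it heavily. This is the multi-modality issue that the forward consistency loss is meant to address.<br />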
<br />
Past works have shown that learning forward dynamics in the feature space as opposed to raw observation space is more robust. This paper incorporates this by making the GSP predict feature representations denoted <math>\phi(x_t), \phi(x_{t+1})</math> rather than the input space. <br />
<br />
Learning the two models <math>θ_π,θ_f</math> simultaneously from scratch can cause noisier gradient updates. This is addressed by pre-training the forward model with the first term and GSP separately by blocking gradient flow. Fine-tuning is then done with <math>θ_π,θ_f</math> jointly. <br />
<br />
The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{\min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div><br />
<br />
<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div><br />
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div><br />
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div><br />
<br />
<br />
The forward consistency loss is computed at each time step <math>t</math> and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network that takes as input the current state, the goal states, the actions at the previous time step and the internal hidden representation <math> h_{t-1}</math>, and outputs the actions to take.<br />
<br />
===Goal Recognizer===<br />
<br />
The goal recognizer network determines whether the current goal has been reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition is posed as a binary classification problem: given an observation <math>x_i</math> and a goal <math>x_g</math>, the network infers whether <math>x_i</math> is close to <math>x_g</math>. Because there is no expert supervision of goals, goal observations are drawn at random from the agent's own experience, which also guarantees that they are feasible. Additionally, a maximum number of iterations is imposed to prevent the sequence of actions from getting too long.<br />
<br />
The goal recognizer was trained on data from the agent's random exploration. Pseudo-goal states were sampled from the visited states, and all observations within a few timesteps of these were considered positive examples (close to the goal). The goal classifier was trained using the standard cross-entropy loss. <br />
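A possible way to build such training pairs from one exploration trajectory is sketched below. The function name, the <code>window</code> parameter and the choice to label every other observation as a negative example are assumptions made for illustration; the text above only specifies how positives are chosen.<br />

```python
import random
from typing import List, Tuple


def make_goal_pairs(
    trajectory_len: int,
    num_goals: int,
    window: int,
    rng: random.Random,
) -> List[Tuple[int, int, int]]:
    """Return (observation index, goal index, label) triples.

    label is 1 when the observation lies within `window` timesteps of the
    sampled pseudo-goal, and 0 otherwise (assumed negative examples).
    """
    pairs = []
    for _ in range(num_goals):
        g = rng.randrange(trajectory_len)  # pseudo-goal drawn from visited states
        for i in range(trajectory_len):
            label = 1 if abs(i - g) <= window else 0
            pairs.append((i, g, label))
    return pairs


pairs = make_goal_pairs(trajectory_len=10, num_goals=2, window=1,
                        rng=random.Random(0))
positives = [p for p in pairs if p[2] == 1]
print(len(pairs), len(positives))
```

The resulting triples would then index into the stored observations to form the inputs of a binary classifier trained with cross-entropy loss.<br />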
<br />
The authors found that training a separate goal recognition network outperformed simply adding a 'stop' action to the action space of the policy network.<br />
<br />
===Ablations and Baselines===<br />
<br />
To summarize, the GSP formulation is composed of (a) a recurrent variable-length skill policy network, (b) explicit encoding of the previous action in the recurrence, (c) a goal recognizer, (d) the forward consistency loss function, and (e) learning forward dynamics in the feature space instead of raw observation space. <br />
<br />
To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section: <br />
<br />
# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM. <br />
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup. <br />
# '''GSP-NoPrevAction-NoFwdConst''' refers to the paper's recurrent GSP without previous action history and without the forward consistency loss. <br />
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without the forward consistency objective. <br />
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.<br />
# '''GSP''' refers to the complete method with all the components.<br />
<br />
==Experiments==<br />
<br />
The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.<br />
<br />
===Rope Manipulation===<br />
<br />
Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.<br />
<br />
In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration, the robot picked up the rope at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper. <br />
<br />
For this experiment, the Baxter robot is set up exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. In testing, the robot was only provided with images of intermediate states of the rope, and not the actions taken by the human trainer. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.<br />
<br />
The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, the direction of displacement and the magnitude of displacement. All models were optimized using Adam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.<br />
<br />
The approach of this paper is compared to Nair et al. (2017), who ran similar experiments using an inverse model. The results in Figure 3 show that GSP achieves a lower TPS-RPM error on the 'S' shape construction, and that for knot-tying, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.<br />
<br />
[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The<br />
robotics system setup. (b) The sequence of human demonstration images provided by the human<br />
during inference for the task of knot-tying (top row), and the sequences of observation states reached<br />
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human<br />
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’<br />
shape. Our agent is able to successfully imitate the demonstration.]]<br />
<br />
[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]<br />
<br />
===Navigation in Indoor Office Environments===<br />
In this experiment, the robot was shown a single image or multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided in the appendix of the paper). The robot was then placed on an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.<br />
<br />
The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with a learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.<br />
<br />
Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fares much better than all previous baselines.<br />
<br />
[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image<br />
(top-left). Since the initial and goal image has no overlap, the robot first explores the environment<br />
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42<br />
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and<br />
such exploratory behavior naturally emerged from the self-supervised learning.]]<br />
<br />
[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image<br />
of goal in an unseen environment. Each column represents a different run of our system for a<br />
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given<br />
a successful run but reaches the goal successfully at a much higher rate.]]<br />
<br />
Figure 5 and Table 2 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. This was required when the end goal was far away from the robot, such as in another room. It is worth noting that zero-shot visual imitation is robust to a changing environment, since not every frame needs to match the demonstrated frame. This is achieved by providing sparse landmarks.<br />
<br />
[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of<br />
images (top row). The TurtleBot is positioned in a manner such that the first image in the demonstration<br />
has no overlap with its current observation. Even under this condition, the robot is able to move closer<br />
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration<br />
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint<br />
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from<br />
WayPoint-1.]]<br />
<br />
[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in<br />
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three<br />
runs of two different demonstrations. Results show that our method outperforms the baselines. Note<br />
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions<br />
and neither model succeeded. Detailed results are available in the supplementary materials.]]<br />
<br />
===3D Navigation in VizDoom===<br />
<br />
To round off the experiments, a VizDoom simulation environment was used to test the GSP. VizDoom is a popular Doom-based reinforcement learning testbed. It allows agents to play the game Doom using only a screen buffer. It is a 3D simulation environment that is traditionally considered to be harder than 2D domains like Atari. The goal was to measure the robustness of each method with proper error bars, the role of initial self-supervised data collection, and the quantitative difference between modeling the forward consistency loss in feature space and in raw visual space. <br />
<br />
Data were collected using two methods: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.<br />
<br />
Table 3 shows the results of the VizDoom experiments. The authors report the median of the maximum distance reached by the agent in following the given sequence of demonstration images. The maximum distance reached is the distance of the farthest landmark point that the agent reaches contiguously. Additionally, the ratio of the number of steps taken by the agent to reach the landmark to the number of steps shown in the human demonstrations is also reported. The key takeaway is that the data collected via curiosity seems to improve the final imitation performance across all methods.<br />
<br />
[[File:8-Table3.png | 650px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual<br />
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for<br />
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]<br />
<br />
==Discussion==<br />
<br />
This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.<br />
<br />
A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending it to third-person demonstrations may yield a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen. <br />
<br />
The expert demonstrations are also purely imitated; that is, the agent does not learn from the demonstrations. Future work could look into learning from the demonstrations so as to enrich the agent's exploration techniques.<br />
<br />
This work used a sequence of images to provide a demonstration but the work, in general, does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work. Future work would also explore how multiple tasks can be combined into a single model, where different tasks might come from different contexts. Finally, it would be exciting to explore explicit handling of domain shift in future work, so as to handle large differences in embodiment and learn skills directly from videos of human demonstrators obtained, for example, from the Internet.<br />
<br />
==Critique==<br />
1. The paper is well written and easy to understand. In addition, the experimental evaluations are promising. Also, the proposed method is novel and interesting, so it could be used as an alternative to pure RL. <br />
<br />
2. In the paper, the authors didn't mention clearly why zero-shot imitation instead of a trained reinforcement learning model should be used. So, they need to provide more details about this issue.<br />
<br />
3. It is notable that the experimental evaluations are performed on real robots. However, the scalability of the method is not demonstrated: it is unclear how to extend it to higher-dimensional action spaces and whether doing so would be expensive.<br />
<br />
4. I think having another test where the goal is fixed and the robot remains in its original position would show some interesting insight. Having the obstacles move around would also be possible to integrate into the test.<br />
<br />
==References==<br />
<br />
[1] D.Pathak, P.Mahmoudieh, G.Luo, P.Agrawal, D.Chen, Y.Shentu, E.Shelhamer, J.Malik, A.A.Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.<br />
<br />
[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning<br />
from demonstration. Robotics and autonomous systems, 2009.<br />
<br />
[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood<br />
Cliffs, NJ, 1977.<br />
<br />
[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke<br />
by poking: Experiential learning of intuitive physics. NIPS, 2016.<br />
<br />
[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination<br />
for robotic grasping with large-scale data collection. In ISER, 2016.<br />
<br />
[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and<br />
700 robot hours. ICRA, 2016.<br />
<br />
[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey<br />
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.<br />
ICRA, 2017.<br />
<br />
[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration<br />
by self-supervised prediction. In ICML, 2017.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Co-Teaching&diff=42401Co-Teaching2018-12-12T01:02:27Z<p>Msminhas: Technical</p>
<hr />
<div>=Introduction=<br />
==Title of Paper==<br />
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels<br />
==Contributions==<br />
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed approach, named ‘co-teaching’, maintains two networks simultaneously, trains them on selected clean instances, and avoids estimating the noise transition matrix. In addition, the deep networks are trained using stochastic optimization with momentum, and nonlinear deep networks can memorize clean data first, which makes them robust. The experiments are conducted on noisy versions of the MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness of deep learning models trained by the co-teaching approach is much superior to state-of-the-art baselines.<br />
<br />
==Terminology==<br />
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data. <br />
<br />
Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels. This can result in false positives.<br />
<br />
=Intuition=<br />
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually when learning on a batch of noisy data, only the error from the network itself is transferred back to facilitate learning. But in the case of Co-teaching, the two networks are able to filter different types of errors, and the filtered error flows back both to the network itself and to the other network. As a result, the two models learn together, each from itself and from its partner network.<br />
<br />
=Motivation=<br />
The paper draws motivation from two key facts:<br />
* That many data collection processes yield noisy labels.
* That deep neural networks have a high capacity to overfit to noisy labels.
Because of these facts, it is challenging to train deep networks to be robust with noisy labels. <br />
=Related Works=<br />
<br />
1. Statistical learning methods: Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into 3 strands: surrogate loss, noise rate estimation, and probabilistic modelling. In the surrogate loss category, one work proposes an unbiased estimator to provide a noise-corrected loss. Another work presented a robust non-convex loss, which is a special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of the ROC curve. In the probabilistic modelling category, a two-coin model has been proposed to handle noisy labels from multiple annotators. <br />
<br />
2. Deep learning methods: There are also deep learning approaches that can be used to approach data with noisy labels. One work proposed a unified framework to distill knowledge from clean labels and knowledge graphs. Another work trained a label cleaning network by a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously. <br />
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper adds a crowd layer after the output layer for noisy labels from multiple annotators. <br />
<br />
3. Learning-to-teach methods: This is another approach to the problem. These methods consist of a teacher network and a student network. The teacher network selects more informative instances for better training of the student network. Most works did not account for noisy labels, with the exception of MentorNet, which applied the idea to data with noisy labels.<br />
<br />
=Co-Teaching Algorithm=<br />
<br />
[[File:Co-Teaching_Algorithm.png|600px|center]]<br />
<br />
The idea as shown in the algorithm above is to train two deep networks simultaneously. In each mini-batch using mini-batch gradient descent, each network selects its small-loss instances as useful knowledge and then teaches these useful instances to the peer network. <math>R(T)</math> governs the percentage of small-loss instances to be used in updating the parameters of each network.<br />
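The selection-and-exchange step can be sketched as follows for a single mini-batch. This is a simplified illustration that operates on per-instance loss values only; the function names are hypothetical, and the gradient updates themselves are omitted.<br />

```python
from typing import List, Sequence, Tuple


def small_loss_indices(losses: Sequence[float], keep_rate: float) -> List[int]:
    """Indices of the keep_rate fraction of instances with the smallest loss."""
    num_keep = int(keep_rate * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep])


def co_teaching_selection(
    losses_f: Sequence[float],
    losses_g: Sequence[float],
    keep_rate: float,
) -> Tuple[List[int], List[int]]:
    """Return (instances used to update f, instances used to update g)."""
    picked_by_f = small_loss_indices(losses_f, keep_rate)
    picked_by_g = small_loss_indices(losses_g, keep_rate)
    # Cross-update: each network is trained on the instances its peer selected.
    return picked_by_g, picked_by_f


# Per-instance losses of the two networks on the same mini-batch of 4 instances.
losses_f = [0.1, 2.5, 0.3, 0.2]
losses_g = [0.2, 0.1, 3.0, 0.4]
update_f, update_g = co_teaching_selection(losses_f, losses_g, keep_rate=0.5)
print(update_f, update_g)
```

Here <code>keep_rate</code> plays the role of <math>R(T)</math>: with a value of 0.5, each network keeps the half of the mini-batch on which its own loss is smallest and passes those instances to its peer.<br />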
<br />
=Summary of Experiment=<br />
==Proposed Method==<br />
The proposed co-teaching method trains two networks simultaneously, and samples instances with small loss at each mini batch as useful knowledge. The sample of small-loss instances is then taught to the peer network for updating the parameters.<br />
[[File:Co-Teaching Fig 1.png|600px|center]] <br />
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate). <br />
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself. To summarize, as the error from one network will not be directly transferred back to itself, the authors expect that the Co-teaching method will be able to deal with heavier noise compared with the self-evolving one.<br />
<br />
==Dataset Corruption==<br />
The datasets used in this paper include MNIST, CIFAR-10 and CIFAR-100. A summary of these datasets is shown below. <br />
<br />
[[File:co_teaching_data.png|600px|center]] <br />
<br />
To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transition matrix <math>Q</math>, where <math>Q_{ij} = Pr(\widetilde{y} = j|y = i)</math> given that noisy <math>\widetilde{y}</math> is flipped from clean <math>y</math>. Two methods are used for generating such noise transition matrices: pair flipping and symmetry. <br />
[[File:Co-Teaching Fig 2.png|600px|center]] <br />
Three noise conditions are simulated for comparing co-teaching with baseline methods.<br />
<br />
Note: Corruption of Dataset here means randomly choosing a wrong label instead of the target label by applying noise. <br />
<br />
{| class="wikitable"<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Method<br />
|width="100pt"|Noise Rate<br />
|width="700pt"|Rationale<br />
|-<br />
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels. <br />
|-<br />
| Symmetry || 50% || Half of the instances have noisy labels. Labels have a constant probability of being corrupted. Further rationale can be found at [1].<br />
|-<br />
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario. <br />
|}<br />
|}<br />
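The two corruption schemes can be sketched as follows. The matrix layouts are assumptions made for illustration: pair flipping sends a label to the next class (wrapping around at the last class) with probability <math>\epsilon</math>, and symmetric noise spreads <math>\epsilon</math> evenly over the other classes. All function names are hypothetical.<br />

```python
import random
from typing import List, Sequence

Matrix = List[List[float]]


def pair_flip_matrix(num_classes: int, eps: float) -> Matrix:
    """Q[i][i] = 1 - eps and Q[i][i+1] = eps (assumed wrap-around at the end)."""
    q = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        q[i][i] = 1.0 - eps
        q[i][(i + 1) % num_classes] = eps
    return q


def symmetric_matrix(num_classes: int, eps: float) -> Matrix:
    """Q[i][i] = 1 - eps; the remaining eps is split evenly over other classes."""
    off_diagonal = eps / (num_classes - 1)
    return [
        [1.0 - eps if i == j else off_diagonal for j in range(num_classes)]
        for i in range(num_classes)
    ]


def corrupt(labels: Sequence[int], q: Matrix, rng: random.Random) -> List[int]:
    """Sample a noisy label for each clean label y using row Q[y]."""
    classes = range(len(q))
    return [rng.choices(classes, weights=q[y])[0] for y in labels]


pair_q = pair_flip_matrix(3, 0.45)
sym_q = symmetric_matrix(5, 0.5)
noisy = corrupt([0, 1, 2, 2, 1, 0], pair_q, random.Random(0))
print(noisy)
```

Each row of <math>Q</math> sums to one, so row <math>i</math> can be used directly as the sampling distribution for the noisy label of an instance whose clean label is <math>i</math>.<br />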
<br />
==Baseline Comparisons==<br />
The co-teaching method is compared with several baseline approaches, which have varying:<br />
* proficiency in dealing with a large number of classes,
* ability to resist heavy noise,
* need to combine with specific network architectures, and
* need to be pretrained. <br />
<br />
[[File:Co-Teaching Fig 3.png|600px|center]] <br />
===Bootstrap===<br />
The general idea behind bootstrapping is to dynamically change (correct) noisy labels during training. The training target is derived from both the original label and the predicted class: the final label is a convex combination of the two. The weighting of the prediction is increased over time to account for the model itself improving. This procedure needs to be finely tuned to prevent it from rampantly changing correct labels before the model becomes accurate [2].<br />
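The convex combination can be written in a few lines. The function name and the weight <code>beta</code> on the original (possibly noisy) label are illustrative assumptions; lowering <code>beta</code> over the course of training corresponds to trusting the model's prediction more.<br />

```python
from typing import List, Sequence


def bootstrap_target(
    noisy_one_hot: Sequence[float],
    predicted_probs: Sequence[float],
    beta: float,
) -> List[float]:
    """Blend the given label with the prediction: beta * y + (1 - beta) * p."""
    return [beta * y + (1.0 - beta) * p
            for y, p in zip(noisy_one_hot, predicted_probs)]


# A two-class example: the given label says class 1, the model is undecided.
target = bootstrap_target([0.0, 1.0], [0.5, 0.5], beta=0.5)
print(target)
```

With <code>beta = 1</code> the target is the original label, and with <code>beta = 0</code> it is the model's own prediction.<br />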
<br />
===S-Model===<br />
Using an additional softmax layer to model the noise transition matrix [3].<br />
===F-Correction===<br />
Correcting the prediction by using a noise transition matrix which is estimated by a standard network [4].<br />
===Decoupling===<br />
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].<br />
===MentorNet===<br />
A mentor network weights the probability of data instances being clean/noisy in order to train the student network on cleaner instances [6].<br />
<br />
As shown in the table above, the Co-teaching method has several advantages: it does not rely on any specific network architecture, it can deal with a large number of classes, and it is more robust to noise. In addition, it can be trained from scratch. This makes co-teaching more appealing for practical usage.<br />
<br />
==Implementation Details==<br />
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. The networks also utilize dropout and batch normalization. <br />
<br />
[[File: Co-Teaching Table 3.png|center]] <br />
=Results and Discussion=<br />
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows. <br />
==MNIST==<br />
The results of testing on the MNIST dataset are shown below. The Symmetry-20% case can be taken as a near-baseline; all methods perform well. However, under the Symmetry-50% case, all methods except MentorNet and Co-Teaching drop below 90% accuracy. Under the Pair-45% case, all methods except MentorNet and Co-Teaching drop below 60%. Under both high-noise conditions, the Co-Teaching method produces the highest accuracy. Similar patterns can be seen in the two additional sets of test results, though the specific accuracy values are different. Co-Teaching performs best under the high-noise situations.
<br />
The images labelled 'Figure 3' show test accuracy with respect to epoch of the various algorithms. Many algorithms show evidence of over-fitting or being influenced by noisy data, after reaching initial high accuracy. MentorNet and Co-Teaching experience this less than other methods, and Co-Teaching generally achieves higher accuracy than MentorNet. <br />
<br />
Robustness to noise, which plays an important role in the evaluation, is evident in the plots: the proposed method is better than or comparable to the other methods.<br />
<br />
[[File:Co-Teaching Table 4.png|550px|center]]<br />
<br />
[[File:Co-Teaching Graphs MNIST.PNG|center]]<br />
<br />
==CIFAR10==<br />
The observations here are consistent with those for the MNIST dataset.<br />
[[File:Co-Teaching Table 5.png|550px|center]] <br />
<br />
[[File:Co-Teaching Graphs CIFAR10.PNG|center]]<br />
==CIFAR100==<br />
[[File:Co-Teaching Table 6.png|550px|center]] <br />
<br />
[[File: Co-Teaching Graphs CIFAR100.PNG|center]]<br />
==Choice of R(T) and <math> \tau</math>==<br />
The authors followed a few principles when choosing R(T) and <math> \tau</math>. At the beginning, R(T)=1, meaning that no instances need to be dropped. Parameters can safely be updated in the early stage using the whole noisy dataset, since deep neural networks do not memorize the noisy data at that point. More instances need to be dropped at the later stages, however, because the model would eventually try to fit the noisy data.<br />
<br />
<math> R(T) = 1 - \tau \cdot \min\{T^{c}/T_{k}, 1\} </math> with <math> \tau=\epsilon </math>, where <math> \epsilon </math> is the noise level.<br />
In this case, the authors consider <math> c \in \{0.5, 1, 2\} </math>. Table 7 shows that the test accuracy is stable across these choices.<br />
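The schedule above can be sketched in a few lines. This is only an illustration of the formula as written on this page: the function name <code>keep_rate</code> and the values <code>tau=0.2</code> and <code>t_k=10</code> are assumptions chosen for the example.<br />

```python
def keep_rate(epoch, tau, t_k, c=1.0):
    """Fraction of instances kept at a given epoch: 1 - tau * min(T^c / T_k, 1)."""
    return 1.0 - tau * min(epoch ** c / t_k, 1.0)

# Illustrative settings (assumed): 20% noise level, ramp finished after 10 epochs.
schedule = [keep_rate(t, tau=0.2, t_k=10) for t in range(0, 21, 5)]
```

With these assumed settings, all instances are kept at epoch 0, the kept fraction shrinks linearly until epoch 10, and it then stays at <math>1-\tau</math>.<br />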
<br />
[[File: Co-Teaching Table 7.png|550px|center]] <br />
<br />
For <math> \tau</math>, the authors consider <math> \tau \in \{0.5, 0.75, 1, 1.25, 1.5\}\epsilon</math>. Table 8 shows that the performance can be improved by dropping more instances.<br />
[[File: Co-Teaching Table 8.png|550px|center]]<br />
<br />
=Conclusions=<br />
The main goal of the paper is to introduce the “Co-teaching” learning paradigm, which trains two deep neural networks simultaneously to avoid learning from noisy labels. Experiments are performed on several datasets such as MNIST, CIFAR-10, and CIFAR-100. The performance varied depending on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios (pair-45% and symmetry-50%), the co-teaching method outperforms the baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).<br />
<br />
=Future Work=<br />
For future work, the paper can be extended in the following ways. First, the Co-teaching paradigm can be adapted to train deep models under other forms of weak supervision, e.g. positive and unlabeled data. Second, theoretical guarantees for Co-teaching can be investigated. The current approach also seems to have potential applications in eliminating noisy labels or data from biomedical signals, for example EEG data. This is important because EEG data are generally collected according to an experimental protocol and under controlled lab conditions. When data are collected in this way, segments are labelled according to the protocol even when the underlying brain process does not correspond to that label, so they can be labelled incorrectly. Such cases of wrong labelling need to be eliminated from the training process, and this is one scenario where co-teaching could possibly be applied. The method also seems to have potential applications for data collected via crowd-sourcing, or for the same data labelled by multiple human subjects. Further, there is no analysis of the generalization performance of deep learning with noisy labels, which can also be studied in the future.<br />
<br />
=Critique=<br />
The paper evaluates the performance considering the complexity of computations and implementations of the algorithms. Co-teaching methodology seems an interesting idea but can possibly become tricky to implement. Technically, such complexity can negatively impact the performance of the algorithm. <br />
==Lack of Task Diversity==<br />
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality. <br />
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==<br />
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm. <br />
==Lack of Theoretical Development (Mentioned in conclusion)==<br />
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly.<br />
<br />
=References=<br />
[1] B. Van Rooyen, A. Menon, and B. Williamson. Learning with symmetric label noise: The<br />
importance of being unhinged. In NIPS, 2015.<br />
<br />
[2] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural<br />
networks on noisy labels with bootstrapping. In ICLR, 2015.<br />
<br />
[3] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer.<br />
In ICLR, 2017.<br />
<br />
[4] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to<br />
label noise: A loss correction approach. In CVPR, 2017.<br />
<br />
[5] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In<br />
NIPS, 2017.<br />
<br />
[6] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum<br />
for very deep neural networks on corrupted labels. In ICML, 2018.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=42400stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-12-12T00:50:21Z<p>Msminhas: Technical</p>
<hr />
<div>== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Nets (CNN): These are networks with layered structures that conform to the shape of inputs rather than vector-based features and consistently obtain high accuracies in the classification of images and objects. Researchers continue to focus on CNN to improve their performances. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some of the pooling methods, including max pooling and average pooling, are deterministic. Deterministic pooling methods are efficient and simple, but can hinder the potential for optimal network learning. In contrast, mixed pooling and stochastic pooling use a probabilistic approach, which can address some problems of deterministic methods. The neighborhood approach is used in all the mentioned pooling methods due to its simplicity and efficiency. Nevertheless, the approach can cause edge halos, blurring, and aliasing which need to be minimized. This paper introduces wavelet pooling, which uses a second-level wavelet decomposition to subsample features. The nearest neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second-level decomposition and discards the first-level subbands to reduce feature dimensions. This method is compared against other state-of-the-art pooling methods on benchmark classification datasets: MNIST, CIFAR-10, SVHN and KDEF.<br />
<br />
For further information on wavelets, follow this link to MathWorks' [https://www.mathworks.com/videos/understanding-wavelets-part-1-what-are-wavelets-121279.html Understanding Wavelets] video series.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling that provides a more structural method of reducing feature dimensions. The authors note that max pooling tends to over-fit and that average pooling smooths out or 'dilutes' details in features.<br />
<br />
Pooling is often introduced within networks to ensure local invariance and to prevent overfitting to small translational shifts within an image. Despite their effectiveness, traditional pooling methods such as max pooling introduce this translational invariance by discarding information, using methods analogous to nearest neighbour interpolation. With the hope of providing a more organic way of pooling, the authors leverage all of the information within the cells fed into a pooling operation, so that the resulting reduced-dimension features contain information from all of the input cells via various dot products.<br />
<br />
== History ==<br />
<br />
Several pooling methods have been introduced over time and are referenced in this study:<br />
* Manual subsampling in 1979<br />
* Max pooling in 1992<br />
* Mixed pooling in 2014<br />
* Pooling methods with probabilistic approaches in 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. These pooling methods reduce input data dimensionality by taking the maximum value or the average value of specific areas and condense them into one single value. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (happens if the main details have less intensity than the insignificant details) and also commonly '''over-fits''' the training data. The max-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (this happens when much of the region has values far lower than the significant details). Average pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at<br />
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 1 gives a quick visual example of max and average pooling:<br />
<br />
[[File: pooling.png| 700px|center]]<br />
<br />
Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
<br />
<br />
'''How do researchers try to combat these issues?'''<br />
By using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': In general, when facing a new problem in which one would want to use a CNN, it is not intuitively known whether average or max pooling should be preferred. Notably, both techniques have significant drawbacks. Average pooling forces the network to consider low-magnitude (and possibly irrelevant) information in constructing representations, while max pooling can force the network to ignore fundamental differences between neighboring groups of pixels. To counteract this, mixed pooling probabilistically decides which of the two to use during training / testing. It should be noted that, for training, it is only probabilistic in the forward pass. During back-propagation the network uses the method chosen in the forward pass. Mixed pooling can be applied in 3 different ways.<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda \cdot \max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda) \cdot \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>\lambda</math> is a random value 0 or 1, indicating max or average pooling for a particular region/feature/layer.<br />
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l ~ \text{where } ~ l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
with probability of activations within each region defined as follows:<br />
<br />
\begin{align}<br />
p_{pq} = \dfrac{a_{pq}}{\sum_{(p,q) \in R_{ij}} a_{pq}}<br />
\end{align}<br />
<br />
The figure below describes the process of stochastic pooling. The panel on the left shows the activations of a given region, and the corresponding probabilities are shown in the center. The activation with the highest probability is the most likely to be selected by the pooling method, but any activation can be selected. In this case, the midrange activation with a probability of 13% is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
<br />
3. '''Top-k activation pooling''': picks the top-k activations in every pooling region, so that as much information as possible can pass through the subsampling gates. It is used together with max pooling: after max pooling, the top-k activations are picked, summed up, and the summation is constrained by a constant in order to further improve the representation capability. <br />
Details in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
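The max, average, mixed and stochastic pooling definitions above can be sketched for a single 2D feature map as follows. This is a minimal NumPy illustration rather than any paper's implementation: all function names are hypothetical, pooling regions are assumed to be non-overlapping 2x2 blocks, mixed pooling draws one <math>\lambda</math> for the whole map (the "all features within a layer" variant), and stochastic pooling assumes non-negative activations and returns 0 for an all-zero region.<br />

```python
import numpy as np

def regions(a, size=2):
    """Split an (H, W) map into non-overlapping size x size regions, flattened on the last axis."""
    h, w = a.shape
    blocks = a.reshape(h // size, size, w // size, size).swapaxes(1, 2)
    return blocks.reshape(h // size, w // size, size * size)

def max_pool(a, size=2):
    return regions(a, size).max(axis=-1)

def avg_pool(a, size=2):
    return regions(a, size).mean(axis=-1)

def mixed_pool(a, lam, size=2):
    """lam = 1 selects max pooling, lam = 0 selects average pooling."""
    return lam * max_pool(a, size) + (1 - lam) * avg_pool(a, size)

def stochastic_pool(a, rng, size=2):
    """Sample one activation per region with probability proportional to its value."""
    r = regions(a, size)
    out = np.zeros(r.shape[:2])
    for i in range(r.shape[0]):
        for j in range(r.shape[1]):
            total = r[i, j].sum()
            if total > 0:  # all-zero regions stay at 0 (assumption of this sketch)
                out[i, j] = rng.choice(r[i, j], p=r[i, j] / total)
    return out

# Illustrative 4x4 feature map with non-negative activations.
a = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 6.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
rng = np.random.default_rng(0)
pooled_max = max_pool(a)
pooled_avg = avg_pool(a)
lam = int(rng.integers(0, 2))  # random 0 or 1
pooled_mixed = mixed_pool(a, lam)
pooled_stochastic = stochastic_pool(a, rng)
```

In the top-right region only one activation is non-zero, so stochastic pooling always returns it, whereas in the top-left region any of the four activations may be chosen, with larger ones being more likely.<br />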
<br />
'''Wavelets and Wavelet Transform'''<br />
<br />
A wavelet series is a representation of a square-integrable function by a certain orthonormal series generated by a wavelet. The fundamental idea of wavelet transforms is that the transformation should allow only changes in time extension, but not shape. This is achieved by choosing suitable basis functions. Changes in the time extension are expected to conform to the corresponding analysis frequency of the basis function. <br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency, or in time (or presumably, image location). For example, a sine wave will be useful to detect signals with its own frequency, but cannot detect where along the sine wave this alignment of signals is occurring. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
<br />
The following images show the result of applying a wavelet transform to an image for denoising:<br />
<br />
[[File: Noise Wavelet.jpg| 700px]] [[File: Denoised Wavelet.jpg| 700px]]<br />
<br />
Images were taken from [https://en.wikipedia.org/wiki/Discrete_wavelet_transform#Example_in_Image_Processing here].<br />
<br />
== Proposed Method ==<br />
<br />
The previously highlighted pooling methods use neighborhoods to subsample, almost identical to nearest neighbor interpolation.<br />
<br />
The proposed pooling method uses wavelets (i.e. small waves - generally used in signal processing) to reduce the dimensions of the feature maps. They use wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression. The authors say that this organic reduction, therefore, lessens the creation of jagged edges and other artifacts that may impede correct image classification.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, and <math>W_{\varphi},W_{\psi}</math> are called the approximation and detail coefficients. <math>h_{\varphi}[-n]</math> and <math>h_{\psi}[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level.<br />
<br />
When using the FWT on images, it is applied twice (once on the rows, then again on the columns). By doing this in combination, the detail sub-bands (LH, HL, HH) at each decomposition level and the approximation sub-band (LL) for the highest decomposition level are obtained.<br />
After performing the 2nd order decomposition, the image features are reconstructed, but only using the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT) which is based on the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
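The forward pass described above can be sketched with a hand-written 2D Haar transform. This is a minimal NumPy sketch under stated assumptions, not the authors' MatConvNet code: it uses the orthonormal Haar filters, assumes a single feature map whose height and width are divisible by 4, and the function names and sub-band naming are hypothetical.<br />

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D orthonormal Haar transform on an (H, W) array with even H and W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # approximation sub-band
    lh = (a - b + c - d) / 2.0  # detail across columns
    hl = (a + b - c - d) / 2.0  # detail across rows
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild a (2H, 2W) array from four (H, W) sub-bands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def wavelet_pool(x):
    """Decompose to the 2nd level, drop the 1st-level detail sub-bands, and reconstruct."""
    ll1, _, _, _ = haar_dwt2(x)          # 1st level; its details are discarded
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)  # 2nd level
    return haar_idwt2(ll2, lh2, hl2, hh2)

feature = np.arange(64, dtype=float).reshape(8, 8)
pooled = wavelet_pool(feature)  # shape (4, 4): pooled by a factor of 2 per dimension
```

Under these assumptions, reconstructing from all second-level sub-bands recovers exactly the first-level approximation sub-band, which for the orthonormal Haar filters is twice the 2x2 block average. Other wavelet bases or normalizations would not reduce to this simple form.<br />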
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands are up-sampled by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5 illustrates the wavelet pooling backpropagation algorithm in detail:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
<br />
== Results and Discussion ==<br />
<br />
All experiments have been performed using the MatConvNet (Vedaldi & Lenc, 2015) toolbox. Stochastic gradient descent has been used for training. For the proposed method, the Haar wavelet has been chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except for MNIST use a network loosely based on Zeiler's network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013), and the Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SVHN (Dropout only), to examine how these regularization techniques change the pooling results. The authors have tested the proposed method on four different datasets as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Different methods based on Max, Average, Mixed, Stochastic and Wavelet have been used at the pooling section of each architecture. Accuracy and Model Energy have been used as the metrics to evaluate the performance of the proposed methods. These have been evaluated and their performances have been compared on different data-sets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows that their proposed method outperforms all other methods. Given the small number of epochs, max pooling is the only method to start to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. For both cases, with and without dropout, the tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better rate than their proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SVHN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers. As with the previous datasets, this network is used to observe each pooling method without extra regularization.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SVHN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SVHN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images they extract from the extra training set of 531,131 images, as well as the full testing set of 26,032 images. For both cases, with and without dropout, the tables below show their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the images that need to match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
<br />
The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
<br />
<br />
* Computational Complexity:<br />
The experiments and implementations of wavelet pooling above are more of a proof of concept than an optimized method. In terms of mathematical operations, wavelet pooling is the least computationally efficient of the pooling methods mentioned above. Average pooling is the most efficient method, max pooling and mixed pooling are at a similar level to each other, and wavelet pooling is considerably more expensive to compute.<br />
<br />
== Conclusion ==<br />
<br />
They show that wavelet pooling has the potential to equal or eclipse some of the traditional methods currently utilized in CNNs. Their proposed method outperforms all others on the MNIST dataset, outperforms all but one on the CIFAR-10 and KDEF datasets, and performs within respectable ranges of the pooling methods that outdo it on the SVHN dataset. The addition of dropout and batch normalization shows their proposed method's response to network regularization. Like the non-dropout cases, it outperforms all but one on both the CIFAR-10 and KDEF datasets and performs within respectable ranges of the pooling methods that outdo it on the SVHN dataset.<br />
<br />
The authors' results confirm previous studies showing that no one pooling method is superior, but that some perform better than others depending on the dataset and network structure (Boureau et al., 2010; Lee et al., 2016). Furthermore, many networks alternate between different pooling methods to maximize the effectiveness of each method. [1]<br />
<br />
Future work and improvements in this area could be to vary the wavelet basis to explore which basis performs best for the pooling. Altering the upsampling and downsampling factors in the decomposition and reconstruction can lead to better image feature reductions outside of the 2x2 scale. Retaining the subbands that are currently discarded for use in backpropagation could lead to higher accuracies and fewer errors. Improving the FWT method used could greatly increase computational efficiency. Finally, analyzing the structural similarity (SSIM) of wavelet pooling versus other methods could further demonstrate the viability of the authors' approach. [1]<br />
<br />
== Suggested Future work ==<br />
<br />
The upsampling and downsampling factors in decomposition and reconstruction need to be changed to achieve more feature reduction.<br />
The subbands that are currently discarded should be kept to achieve higher accuracies. To achieve higher computational efficiency, the FWT method needs to be improved.<br />
<br />
== Critiques and Suggestions ==<br />
*The backpropagation process, which could be a strong point of the study, is not described in enough detail compared to existing methods.<br />
* The study centres on wavelet decomposition, yet the reasons for choosing Haar as the mother wavelet and for the number of decomposition levels are not described and are only mentioned as future work. <br />
* At the beginning, the study mentions that pooling methods do not receive the attention they should. In the end, the results show that the choice of pooling method depends on the dataset, and the authors mention trial and error as a reasonable approach to choosing one. In my point of view, the authors have not really focused on providing a pooling method that effectively improves on the current situation. At the least, trying to extract a clearer pattern relating the results to the dataset structure would be helpful.<br />
* The origin of average pooling, which is one of the main pooling algorithms used for comparison, is not even referenced in the introduction.<br />
* Combination of the wavelet, Max and Average pooling can be an interesting option to investigate more on this topic; both in a row(Max/Avg after wavelet pooling) and combined like mix pooling option.<br />
* While the current datasets express the performance of the proposed method in an appropriate way, it could be a good idea to evaluate the method using some larger datasets. Maybe it helps to understand whether the size of a dataset can affect the overfitting behavior of max pooling which is mentioned in the paper.<br />
* Adding asymptotic notations to the computational complexity of the proposed algorithm would be meaningful, particularly since the given results are for a single/fixed input size (one image in forward propagation) and consequently are not generalizable. <br />
* They could have considered comparing against the Fast Fourier Transform (FFT). A non-wavelet transform seems to be an obvious candidate for comparison.<br />
* Going beyond the 2x2 pooling window would have further supported their method.<br />
* ([[https://openreview.net/forum?id=rkhlb8lCZ]]) The experiments are largely conducted with very small scale datasets. As a result, I am not sure if they are representative enough to show the performance difference between different pooling methods.<br />
* ([[https://openreview.net/forum?id=rkhlb8lCZ]]) No comparison to non-wavelet methods. For example, one obvious comparison would have been to look at using a DCT or FFT transform where the output would discard high-frequency components (this can get very close to the wavelet idea!). Also, this critique might provides us with some interesting research directions since DCT or FFT transforms as pooling are not throughly studied yet.<br />
* Also, convolutional neural network are not only used in image related tasks. Evaluating the efficiency of wavelet pooling in convolutional neural network applied to natural languages or other applicable areas will be interesting. Such experiments shall also show if such approach can be generalized. <br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.<br />
<br />
<br />
== Revisions == <br />
<br />
*Two reviewers really liked the paper, and one of them rated it among the top 15% of papers at the conference, which supports the novelty and potential of the idea. Another reviewer, however, believed that it was not good enough to be accepted; the main reason given for rejection was the linear nature of the wavelet transform (a point that was not convincingly argued). <br />
<br />
*The main concern of two of the reviewers was the size of the datasets used to test the method, and the authors have mentioned future work on bigger datasets.<br />
<br />
*The computational cost section was not included in the paper initially and was added in response to one reviewer's concern. The other reviewers did not raise this point, so unfortunately there are no comments on it from them. However, the explanation of the inefficient implementation seemed to satisfy that reviewer, and the paper was accepted. <br />
<br />
[https://openreview.net/forum?id=rkhlb8lCZ Revisions]<br />
<br />
At the end, if you are interested in implementing the method, they are willing to share their code but after making it efficient. So, maybe there will be another paper regarding less computational cost on larger datasets with a publishable code.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pixels_to_Graphs_by_Associative_Embedding&diff=42399Pixels to Graphs by Associative Embedding2018-12-12T00:39:46Z<p>Msminhas: Technical</p>
<hr />
<div>== Introduction == <br />
<br />
Extracting semantics from images is one of the main goals of computer vision. Recent years have seen rapid progress in the classification and localization of objects [7, 24, 10]. But a bag of labeled and localized objects is an impoverished representation of image semantics: it tells us what and where the objects are (“person” and “car”), but does not tell us about their relations and interactions (“person next to car”). A necessary step is thus to not only detect objects but to identify the relations between them. An explicit representation of this semantics is referred to as a scene graph where we represent objects grounded in the scene as vertices and the relationships between them as edges. [1]<br />
<br />
End-to-end training of convolutional networks has proven to be a highly effective strategy for image understanding tasks. It is therefore natural to ask whether the same strategy would be viable for predicting graphs from pixels. Existing approaches, however, tend to break the problem down into more manageable steps. For example, one might run an object detection system to propose all of the objects in the scene, then isolate individual pairs of objects to identify the relationships between them. This breakdown often restricts the visual features used in later steps and limits reasoning over the full graph and over the full contents of the image. [1]<br />
<br />
The paper presents a novel approach to generating a scene graph. A scene graph, as it relates to an image, is a graph with a vertex that represents each object identified in the image and an edge that represents relationships between the objects. <br />
<br />
An example of a scene graph:<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Scene Graph.PNG]]</div><br />
<br />
Current state-of-the-art techniques break down the construction of scene graphs by first identifying objects and then predicting the edges for any given pair of identified objects. This limits reasoning over the full graph. In contrast, this paper introduces an architecture that defines the entire graph directly from the image, enabling the network to reason across the entirety of the image to understand relationships, as opposed to only predicting relationships using object labels. <br />
<br />
A key concern, given that the new architecture produces both vertices (objects) and edges (relationships), is connecting the two. Specifically, the output of the network is some set of relationships E, and some set of vertices V. The network needs to also output the “source” and “destination” of each relationship so that the final graph can be formed. In the image above, for example, the network would also need to tell us that “holding” comes from “person” and goes to “Frisbee”. To do this, the paper uses associative embeddings. Specifically, the network outputs a particular “embedding vector” for each vertex, as well as a “source embedding” and “destination embedding” for each relationship. A final post-processing step finds the vertex embedding closest to each of the source/destination embeddings of each relationship and in this way assigns the edges to pairs of vertices.<br />
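<br />
The post-processing step described above can be sketched as a nearest-neighbour lookup in embedding space. This is only an illustrative sketch: the function names, the 2-D toy embeddings, and the use of Euclidean distance are assumptions made for the example, not details taken from the authors' code.<br />

```python
import math

def nearest_vertex(embedding, vertex_embeddings):
    """Return the index of the vertex embedding closest (Euclidean) to `embedding`."""
    return min(range(len(vertex_embeddings)),
               key=lambda i: math.dist(embedding, vertex_embeddings[i]))

def assign_edges(vertex_embeddings, relationships):
    """Attach each relationship to its source and destination vertex.

    relationships: list of (label, source_embedding, destination_embedding).
    Returns a list of (source_index, label, destination_index).
    """
    edges = []
    for label, src_emb, dst_emb in relationships:
        src = nearest_vertex(src_emb, vertex_embeddings)
        dst = nearest_vertex(dst_emb, vertex_embeddings)
        edges.append((src, label, dst))
    return edges

# Toy data (hypothetical): two detected objects and one detected relationship.
vertices = ["person", "frisbee"]
vertex_embeddings = [(0.0, 0.0), (10.0, 10.0)]
relationships = [("holding", (0.5, -0.3), (9.2, 10.4))]

edges = assign_edges(vertex_embeddings, relationships)
named = [(vertices[s], label, vertices[d]) for s, label, d in edges]
print(named)
```

Because the lookup only compares distances, the absolute values of the embeddings do not matter, only how they relate to one another.<br />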
<br />
== Previous Work == <br />
=== Relationship detection===<br />
<br />
Relationship detection aims to correctly determine the relationships between pairs of objects and to ground them in the image with accurate object bounding boxes. Visual relationship detection has attracted considerable attention recently. Since the task itself is open-ended and challenging, it has led to a variety of approaches and solutions.<br />
<br />
In the field of relationship detection, the following are the existing state of the art advances:<br />
<br />
1) Framing the task of identifying objects using localization from referential expressions, detection of human-object interactions, or the more general tasks of Visual Relationship Detection (VRD) and scene graph generation. <br />
<br />
2) Visual relationship detection methods like message passing RNNs and predicting over triplets of bounding boxes. <br />
<br />
=== Associative Embedding ===<br />
<br />
Associative embeddings are used in various contexts. For example, they are used to measure the similarity between pairs of images. Recently, vector embeddings have been used to group together body joints for multi-person pose estimation. These are referred to as associative embeddings since supervision does not require the network to output a particular vector value, and instead uses the distances between pairs of embeddings to calculate a loss. What matters is not the exact value of a vector but how it relates to the other embeddings produced by the network.<br />
<br />
<br />
In the field of associative embedding, the following are some interesting applications: <br />
<br />
1) Vector embeddings to group together body joints for multi-person pose estimation. <br />
<br />
2) Vector embeddings to detect body joints of the various people in an image.<br />
<br />
<br />
Reference Figure from the paper "Associative embedding: End-to-end learning for joint detection and grouping."<br />
<br />
[[File:Oct30_associative_embedding_appendix_fig2.jpg | center]]<br />
<br />
== Pixels To Graphs == <br />
The goal of the paper is to construct a graph from a set of pixels, and in particular a graph grounded in the space of these pixels. This means that, in addition to identifying the vertices of the graph, we want to know their precise locations. A vertex, in this case, can refer to any object of interest in the scene, including people, cars, clothing, and buildings. The relationships between these objects are then captured by the edges of the graph. These relationships may include verbs (eating, riding), spatial relations (on the left of, behind), and comparisons (smaller than, same color as).<br />
<br />
Formally, we consider a directed graph <math>G = (V, E)</math>. A given vertex <math>v_i \in V</math> is grounded at a location <math>(x_i, y_i)</math> and defined by its class and bounding box. Each edge <math>e_i \in E</math> takes the form <math>e_i = (v_s, v_t, r_i)</math>, defining a relationship of type <math>r_i</math> from <math>v_s</math> to <math>v_t</math>. We train a network to explicitly define <math>V</math> and <math>E</math>. This training is done end-to-end on a single network, allowing the network to reason fully over the image and all possible components of the graph when making its predictions.<br />
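<br />
As a concrete illustration of this definition, a grounded scene graph could be stored as follows. The class names, field names, and the bounding-box layout are hypothetical choices for this sketch and do not come from the paper.<br />

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Vertex:
    label: str                        # object class
    location: Tuple[int, int]         # grounding pixel (x_i, y_i)
    box: Tuple[int, int, int, int]    # assumed layout: (x_min, y_min, x_max, y_max)

@dataclass
class Edge:
    source: int      # index of v_s in the vertex list
    target: int      # index of v_t in the vertex list
    relation: str    # relationship type r_i

@dataclass
class SceneGraph:
    vertices: List[Vertex]
    edges: List[Edge]

    def triples(self):
        """Return (subject, predicate, object) label tuples, one per edge."""
        return [(self.vertices[e.source].label, e.relation,
                 self.vertices[e.target].label) for e in self.edges]

# Toy example (made-up coordinates).
graph = SceneGraph(
    vertices=[Vertex("person", (40, 60), (10, 20, 70, 100)),
              Vertex("frisbee", (80, 30), (70, 20, 90, 40))],
    edges=[Edge(source=0, target=1, relation="holding")],
)
print(graph.triples())
```

Storing edges as index pairs keeps the graph directed: swapping `source` and `target` yields a different relationship.<br />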
<br />
== The Architecture: == <br />
: '''1. Detecting Graph Elements'''<br />
<br />
Given an image of dimensions h x w, a stacked hourglass (Appendix 2) is used to generate an h x w x f representation of the image. It should be noted that the output resolution (a fixed design choice rather than a trained parameter) needs to fulfill certain criteria. Specifically, the resolution must be large enough to minimize the number of pixels with multiple detections while also being small enough to ensure that each 1 x 1 x f vector still contains the information needed for subsequent inference.<br />
<br />
A 1x1 convolution and a sigmoid activation are applied to this result to generate a heat map (one for objects and one for relationships, using separately learned convolutions). The value at a given pixel can be interpreted as the likelihood of a detection at that particular pixel in the original image. <br />
<br />
In order to claim that there is an element at some pixel, we need a likelihood threshold: if a given pixel in the map has a value >= the threshold, we claim that there is an element at that pixel. The heat map is supervised with a binary cross-entropy loss on its final values, and the thresholded result gives the set of candidate detections. Values with likelihoods greater than the threshold <math>\hat{p}</math> are considered element detections. <br />
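<br />
The thresholding step can be sketched in a few lines. The example below assumes a tiny 2 x 2 map of pre-activation scores and a threshold of 0.5; both are illustrative values, since the summary does not state the threshold actually used.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def detect(logits, threshold=0.5):
    """Return (row, col, likelihood) for every pixel whose likelihood >= threshold."""
    detections = []
    for r, row in enumerate(logits):
        for c, z in enumerate(row):
            p = sigmoid(z)
            if p >= threshold:
                detections.append((r, c, p))
    return detections

def binary_cross_entropy(p, target):
    """Per-pixel loss used to supervise the heat map (target is 0 or 1)."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

# Toy 2 x 2 output of the 1x1 convolution, before the sigmoid (made-up values).
logits = [[-4.0, 2.0],
          [0.0, -1.0]]
found = detect(logits, threshold=0.5)
print([(r, c) for r, c, _ in found])
```

Each detected pixel then indexes into the h x w x f feature tensor to fetch the 1 x 1 x f vector used in the next step.<br />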
<br />
Finally, for each element that we detected, we extract the 1 x 1 x f feature vector. This is then used as an input to a set of Feed Forward Neural Networks (FFNNs), where we have a separate network for each characteristic of interest, and for each network, there's one hidden layer with f nodes. The object class and relationship (edges) could be supervised by softmax loss. Furthermore, in order to predict the bounding box of the object, we can use the approach proposed by the Faster-RCNN model[3]. The following image summarizes the process.<br />
<br />
<br />
[[File:Extraction Process.PNG|center|900px]]<br />
<br />
:'''2. Connecting Elements with Associative Embeddings'''<br />
As explained earlier, to construct the scene graph, we need to know the source and destination of each edge. This is done through associative embeddings. <br />
<br />
First, let us define an embedding <math>h_i \in \mathbb{R}^d</math> produced for some vertex <math>i</math>, and let us assume that we have <math>n</math> object detections in a particular image. Now, define <math>h_{ik}</math>, for <math>k = 1, \ldots, K_i</math> (where <math>K_i</math> is the number of edges in the graph that connect to vertex <math>i</math>), as the embedding associated with the <math>k</math>-th edge that touches vertex <math>i</math>. We define two loss functions on these sets.<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 1.PNG]]</div><br />
<br />
The goal of <math>L_{pull}</math> is to minimize the squared differences between the embedding of a given vertex and the embedding of an edge that references said vertex.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 2.PNG]]</div><br />
<br />
On the other hand, minimizing <math>L_{push}</math> pushes the embeddings of different vertices as far apart as possible. The further apart they are, the lower the output of the max becomes, until eventually it reaches 0. Here, <math>m</math> is a constant margin. In the paper, the values used were m = 8 and d = 8 (that is, 8D embeddings). Combining these two loss functions (and weighting them equally) accomplishes the task of predicting embeddings such that vertices are differentiated, while the embedding of an edge is most similar to that of the vertex it references.<br />
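<br />
A minimal sketch of the two losses, written from the verbal description above. The exact normalization is shown only in the figures, so averaging the pull term over all vertex/edge pairs and summing the push term over all vertex pairs are assumptions of this example, as are the function names.<br />

```python
import math

def pull_loss(vertex_emb, edge_embs):
    """Mean squared distance between each vertex embedding h_i and the
    embeddings h_ik of the edges that reference it.

    edge_embs[i] is the list of edge embeddings that reference vertex i.
    """
    total, count = 0.0, 0
    for h_i, refs in zip(vertex_emb, edge_embs):
        for h_ik in refs:
            total += sum((a - b) ** 2 for a, b in zip(h_i, h_ik))
            count += 1
    return total / count if count else 0.0

def push_loss(vertex_emb, margin):
    """Hinge penalty on every pair of vertex embeddings closer than `margin`."""
    total = 0.0
    n = len(vertex_emb)
    for i in range(n):
        for j in range(i + 1, n):
            total += max(0.0, margin - math.dist(vertex_emb[i], vertex_emb[j]))
    return total

# Toy 2-D embeddings (the paper uses d = 8 and m = 8).
vertex_emb = [(0.0, 0.0), (3.0, 4.0)]
edge_embs = [[(1.0, 0.0)], [(3.0, 4.0), (3.0, 2.0)]]

l_pull = pull_loss(vertex_emb, edge_embs)       # (1 + 0 + 4) / 3
l_push = push_loss(vertex_emb, margin=8.0)      # max(0, 8 - 5)
print(l_pull, l_push, l_pull + l_push)
```

Once two vertex embeddings are at least the margin apart, their pair contributes nothing to the push term, so the loss does not force embeddings to grow without bound.<br />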
<br />
:'''3. Support for Overlapping Detections'''<br />
An obvious concern is how the network would operate if there were more than one detection (be it object or relationship) at a given pixel. For example, detections of “shirt” and “person” may be centered at the exact same pixel. To account for this, the architecture is modified to allow for “slots” at each pixel. Specifically, <math>s_o</math> object detections are allowed at a particular pixel, while <math>s_r</math> relationship detections are allowed at a given pixel. <br />
<br />
In order to allow for this, some changes are required after the feature extraction step. Specifically, we now use the 1x1xf vector as the input for <math>s_o</math> (or <math>s_r</math>) different sets of 4 FFNNs, where the output (of the first three) is as shown in figure 2, and the final FFNN outputs the probability of a detection existing in that particular slot, at that particular pixel. This new network is trained exclusively on whether or not a detection has been made in that slot and, in prediction, is used to determine the number of slots to output at a given pixel. It is critical to note that these <math>s_o</math> (or <math>s_r</math>) sets of FFNNs share absolutely no weights, and each is trained for detection in its assigned slot.<br />
<br />
It is important to note that this implies a change in the training procedure. We now have <math>s_o</math> (or <math>s_r</math>) different predictions (be it class, or class + bounding box) that we need to match with our set of ground-truth detections at a given pixel. Without this step, we would not be able to assign a value to the error for that sample. To do this, we build a one-hot encoded vector of the ground-truth class and bounding box anchor (the reference vector) for each ground-truth detection and match these vectors with the <math>s_o</math> (or <math>s_r</math>) outputs provided at a given pixel. The Hungarian method is used to find the best matching between the outputs and the reference vectors while ensuring we do not assign the same detection to multiple slots.<br />
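<br />
The matching step can be illustrated with a small example. Note that the sketch below does not implement the Hungarian method itself: for the handful of slots used here, a brute-force search over permutations finds the same optimal one-to-one assignment and keeps the example dependency-free. The cost values are made up; in practice the cost would measure the mismatch between a slot's output and a reference vector.<br />

```python
from itertools import permutations

def match_slots(cost):
    """Find the cheapest one-to-one assignment of ground-truth detections to slots.

    cost[g][s] is the cost of assigning ground-truth detection g to slot s.
    Returns (assignment, total_cost), where assignment[g] is the slot for g.
    Assumes there are at least as many slots as ground-truth detections.
    """
    if not cost:
        return (), 0.0
    n_truth, n_slots = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    # Brute force stand-in for the Hungarian method (fine for 3 or 6 slots).
    for slots in permutations(range(n_slots), n_truth):
        total = sum(cost[g][s] for g, s in enumerate(slots))
        if total < best_cost:
            best, best_cost = slots, total
    return best, best_cost

# Two ground-truth detections at one pixel, three slots (hypothetical costs).
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.3, 0.7]]
assignment, total = match_slots(cost)
print(assignment, total)
```

Here the first ground-truth detection is matched to slot 1 and the second to slot 0; the unmatched slot is trained to report that it holds no detection.<br />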
<br />
==Results==<br />
A quick note on notation: R@50 indicates what percentage of ground-truth subject-predicate-object tuples appeared in a proposal of 50 such tuples. Since R@100 offers more possibilities, it will always be at least as high as R@50. A value of 6.7, for example, indicates that 6.7% of the ground-truth tuples appeared in the proposals of the network. <br />
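<br />
The metric can be made concrete with a short sketch. It is a simplification: tuples are compared by label only, whereas the benchmark evaluation also checks that the predicted boxes overlap the ground truth. The function name and the toy tuples are illustrative.<br />

```python
def recall_at_k(ground_truth, ranked_proposals, k):
    """Fraction of ground-truth (subject, predicate, object) tuples that
    appear among the top-k proposals."""
    if not ground_truth:
        return 0.0
    top_k = set(ranked_proposals[:k])
    hits = sum(1 for t in ground_truth if t in top_k)
    return hits / len(ground_truth)

ground_truth = [("person", "holding", "frisbee"),
                ("person", "wearing", "shirt")]
# Proposals sorted from most to least confident (made-up ranking).
ranked_proposals = [("person", "holding", "frisbee"),
                    ("dog", "on", "grass"),
                    ("person", "wearing", "shirt")]

r_at_1 = recall_at_k(ground_truth, ranked_proposals, k=1)
r_at_3 = recall_at_k(ground_truth, ranked_proposals, k=3)
print(r_at_1, r_at_3)
```

Increasing k can only add tuples to the top-k set, which is why recall never decreases as k grows.<br />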
<br />
The authors tested the network against two other architectures designed to develop a semantic understanding of images. For this, they used the Visual Genome dataset, with <math>s_o = 3</math> and <math>s_r = 6</math>. Overall, the new architecture vastly outperformed past models. The results were as follows:<br />
<br />
The table can be interpreted as follows:<br />
<br />
[[File:Results Table.PNG|center|600px]]<br />
<br />
::'''SGGen (no RPN)''': The accuracy of the scene graph proposed for a given image, without the use of a Region Proposal Network. No class predictions are provided.<br />
::'''SGGen (with RPN)''': Same as above, except that the output of the Region Proposal Network is used to enhance the input for a given image. No class predictions are provided.<br />
::'''SGCls''': Ground-truth object bounding boxes are provided. The network is asked to classify them and determine relationships.<br />
::'''PredCls''': As above, except the classes are also provided. The only goal is to predict relationships.<br />
<br />
Further analysis into the accuracy, when looking at predicates individually, shows that the architecture is very sensitive to over-represented relationship predicates.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Results - Part 2.PNG]]</div><br />
<br />
As shown in Figure 5, for many ground-truth predicates (those that do not appear often in the ground truth), the network does poorly. Even when allowed to propose 100 tuples, the network does not offer the predicate. Figure 4 simply observes the fact that certain sets of relationship predicates appear predominantly in a subset of slots. No general explanation has been offered for this behavior.<br />
<br />
== Conclusion ==<br />
In conclusion, the paper offers a novel approach that enables the extraction of image semantics while perpetually reasoning over the entire context of the image. Associative embeddings are used to connect object and predicate relationships, and parallel “slots” allow for multiple detections in one pixel. While this approach offers noticeable improvements in accuracy, it is clear that work needs to be done to account for the non-uniform distributions of relationships in the dataset.<br />
<br />
<br />
== Critiques ==<br />
<br />
The paper's contributions towards patterning unordered network outputs and using associative embeddings for connecting vertices and edges are commendable. However, it should be noted that this paper is only an incremental improvement over existing well-studied architectures like the hourglass architecture. The modifications are not sufficiently supported by mathematical reasoning. The authors say that they make a slight modification to the hourglass design, double the number of features, and weight all the losses equally. No scientific justification for why this is needed is given. Also, the choice of 3 and 6 for <math display = "inline"> s_o</math> and <math display = "inline"> s_r</math> is not clearly motivated, as the authors leave out a fraction of the cases. I am not sure if the changes made are truly a critical advance, as the experiments are conducted only on a single dataset and no generalizability arguments are made by the authors. The methods might therefore work well only for this dataset, and the changes may pertain only to it. The theoretical analysis in the paper comes directly from the hourglass literature and cannot be counted as novel.<br />
The paper could have identified the effect of each modification by analyzing the structure of the network being presented. However, a detailed mathematical and structural analysis of each modification is lacking.<br />
<br />
== Appendices ==<br />
<br />
'''Appendix 1: Sample Outputs'''<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Sample Pixel Graph Outputs.PNG]]</div><br />
<br />
'''Appendix 2: Stacked Hourglass Architecture'''<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Hourglass.PNG]]</div><br />
<br />
Although this goes beyond the focus of the paper, I would like to add a brief overview of the stacked hourglass architecture used to generate the heat map. This architecture is unique in that it allows cyclical top-down, bottom-up inference and recombination of features. While most architectures focus on optimizing the bottom-up portion (reducing dimensionality), the stacked-hourglass gives the network more flexibility in how it generates a representation by allowing it to learn a series of down-sampling / up-sampling steps.<br />
<br />
When you downsample and then upsample, a high amount of information is potentially lost on the upsampled reconstruction. Using the naive approach, this often results in poor reconstruction. This problem is accentuated when we stack multiple layers of downsampling and upsampling in the stacked hourglass architecture. To alleviate this issue, we add skip layers. Skip layers essentially allow earlier layers to send outputs into multiple later layers. The added information from the earlier layers ensures that the reconstructed embedding doesn't have its dimensionality reduced too much.<br />
<br />
[[File:skip+layers+Max+fusion+made+learning+difficult+due+to+gradient+switching..jpg|center|900px]]<br />
<br />
== References ==<br />
1. Alejandro Newell and Jia Deng, “Pixels to Graphs by Associative Embedding,” in NIPS, 2017<br />
<br />
2. Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation. ECCV, 2016<br />
<br />
3. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS, pages 91–99, 2015.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=conditional_neural_process&diff=42398conditional neural process2018-12-11T23:29:17Z<p>Msminhas: Editorial</p>
<hr />
<div>== Motivation ==<br />
<br />
Deep neural networks are good at function approximation, yet they are typically trained from scratch for each new function. Bayesian methods, such as Gaussian Processes (GPs), instead exploit prior knowledge to quickly infer the shape of a new function at test time. Yet GPs are computationally expensive, and it can be hard to design appropriate priors. Hence the authors propose a family of neural models, called Conditional Neural Processes (CNPs), that combine the benefits of both.<br />
<br />
== Introduction ==<br />
<br />
To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive. <br />
<br />
The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.<br />
<br />
== Model ==<br />
<br />
=== Stochastic Processes ===<br />
<br />
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math>f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.<br />
<br />
Let training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0} ^{n-1}</math>, and test set be <math display="inline"> T = \{x_i, y_i\}_{i = n} ^ {n + m - 1} \subset X</math> of unlabelled points.<br />
<br />
Let <math>P</math> be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, <math>P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. Our task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>. <br />
<br />
The authors give a good example: consider a random 1-dimensional function <math>f \sim P</math> defined on the real line (i.e., <math>X := \mathbb{R}</math>, <math>Y := \mathbb{R}</math>). <math>O</math> would constitute <math>n</math> observations of <math>f</math>’s value <math>y_i</math> at different locations <math>x_i</math> on the real line. Given these observations, we are interested in predicting <math>f</math>’s value at new locations on the real line. <br />
<br />
A common assumption made on <math>P</math> is that all function evaluations of <math display="inline"> f </math> are jointly Gaussian distributed. This class of random functions is called Gaussian Processes (GPs). The stochastic process framework allows a model to be data efficient. However, it is hard to specify appropriate priors, and stochastic processes are computationally expensive, scaling poorly with <math>n</math> and <math>m</math>. GPs are one example, with a running time of <math>O((n+m)^3)</math>.<br />
<br />
[[File:001.jpg|300px|center]]<br />
<br />
=== Conditional Neural Process ===<br />
<br />
Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.<br />
<br />
A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>, given a set of observations <math display="inline">O</math>. As for a stochastic process, the authors assume that <math display="inline">Q_{\theta}</math> is invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>: <math display="inline">Q_\theta(f(T) | O, T)= Q_\theta(f(T') | O, T')=Q_\theta(f(T) | O', T) </math> when <math> O', T'</math> are permutations of <math display="inline">O</math> and <math display="inline">T </math>. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure, which is the easiest way to ensure a valid stochastic process. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>. Moreover, this framework can be extended to non-factored distributions.<br />
<br />
In detail, the following architecture is used.<br />
<br />
<math display="inline">r_i = h_\theta(x_i, y_i)</math> &forall; <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math><br />
<br />
<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math><br />
<br />
<math display="inline">\Phi_i = g_\theta(x_i, r)</math> &forall; <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math><br />
<br />
Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observations with minimal overhead.<br />
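<br />
The three steps above can be sketched as follows. The encoder and decoder here are fixed toy functions standing in for the neural networks <math display="inline">h_\theta</math> and <math display="inline">g_\theta</math>, and the element-wise mean is just one possible choice of commutative operation; all names are assumptions of this example.<br />

```python
def encoder(x, y):
    """Toy stand-in for h_theta: maps one observation to a vector in R^2."""
    return (x * y, y)

def aggregate(reps):
    """Commutative aggregation (element-wise mean) of the r_i into a single r."""
    n = len(reps)
    return tuple(sum(r[d] for r in reps) / n for d in range(len(reps[0])))

def decoder(x, r):
    """Toy stand-in for g_theta: returns Phi = (mean, std) for target x."""
    mean = r[1] + 0.1 * x * r[0]
    std = 1.0
    return (mean, std)

def cnp_predict(observations, targets):
    """Encode n observations once (O(n)), then decode m targets (O(m))."""
    r = aggregate([encoder(x, y) for x, y in observations])
    return [decoder(x, r) for x in targets]

observations = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
targets = [0.0, 1.5]

predictions = cnp_predict(observations, targets)
shuffled = cnp_predict(list(reversed(observations)), targets)
print(predictions)
print(predictions == shuffled)
```

Because the observations enter the decoder only through the aggregated vector, reordering them leaves the predictions unchanged, and a new observation can be folded into a running mean without re-encoding the old ones.<br />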
<br />
We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution P given a set of observations. The authors let <math display="inline"> f \sim P</math>, <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and <math display="inline">N \sim \text{uniform}[0, \ldots, n-1]</math>. The subset <math display="inline"> O_N = \{(x_i, y_i)\}_{i = 0} ^{N}</math>, that is, the first N elements of <math display="inline">O</math>, serves as the conditioning set. The negative conditional log probability is given by<br />
\[\mathcal{L}(\theta)=-\mathbb{E}_{f \sim P}[\mathbb{E}_{N}[\log Q_\theta(\{y_i\}_{i = 0} ^{n-1}|O_{N}, \{x_i\}_{i = 0} ^{n-1})]]\]<br />
Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>. <br />
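<br />
A single Monte Carlo sample of this loss can be sketched as below, assuming a factored Gaussian form for <math display="inline">Q_\theta</math>. The predictor is a toy placeholder (it predicts the mean of the observed values with unit variance) rather than a trained CNP, and the conditioning set follows the index range in the formula, i.e. indices 0 through N.<br />

```python
import math
import random

def gaussian_log_prob(y, mean, std):
    """Log density of y under a Gaussian with the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (y - mean) ** 2 / (2 * std ** 2)

def cnp_loss_sample(xs, ys, predict, rng):
    """One Monte Carlo sample of the loss for a single sampled function f.

    predict(observations, targets) must return one (mean, std) pair per target.
    """
    n = len(xs)
    N = rng.randrange(n)                          # N ~ uniform{0, ..., n-1}
    context = list(zip(xs[:N + 1], ys[:N + 1]))   # O_N: indices 0..N
    preds = predict(context, xs)                  # score all n points
    return -sum(gaussian_log_prob(y, m, s) for y, (m, s) in zip(ys, preds))

def mean_predictor(observations, targets):
    """Toy placeholder for Q_theta: predicts the mean observed y with std 1."""
    mean = sum(y for _, y in observations) / len(observations)
    return [(mean, 1.0) for _ in targets]

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 1.0, 1.0, 1.0]          # a constant function, for a checkable result
loss = cnp_loss_sample(xs, ys, mean_predictor, random.Random(0))
print(loss)
```

In training, this quantity would be averaged over many sampled functions and values of N, and its gradient with respect to the network parameters would drive the update.<br />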
<br />
This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.<br />
<br />
In summary,<br />
<br />
1. A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions <math display="inline">f \sim P</math>.<br />
<br />
2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.<br />
<br />
3. A CNP is scalable, achieving a running time complexity of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math> observations.<br />
<br />
== Related Work ==<br />
<br />
===Gaussian Process Framework===<br />
<br />
A Gaussian Process (GP) is a non-parametric method used extensively for regression and classification problems in the machine learning community. A GP is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.<br />
A standard approach is to model data as <math>y = m(X, \varphi) + \epsilon</math>,<br />
where <math>m</math> is the mean function with parameter vector <math>\varphi</math>, and <math>\epsilon</math> represents independent and identically distributed (i.i.d.) Gaussian noise: <math>\epsilon \sim N(0,\sigma^2)</math>.<br />
<br />
For more info on Gaussian Process Framework:<br />
[https://arxiv.org/abs/1506.07304 A Gaussian process framework for modeling instrumental systematics: application to transmission spectroscopy]<br />
<br />
Several papers attempt to address various issues with GPs. These include:<br />
* Using sparse GPs to aid in scaling (Snelson & Ghahramani, 2006)<br />
* Using Deep GPs to achieve more expressiveness (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017)<br />
* Using neural networks to learn more expressive kernels (Wilson et al., 2016)<br />
<br />
A Python resource for Gaussian Process Framework implementation: [https://github.com/SheffieldML/GPy Gaussian Process Framework in Python]<br />
<br />
The goal of this paper is to combine ideas from standard neural networks with Gaussian processes in order to overcome the drawbacks of both. Bayesian techniques work better with less data, but complex Bayesian networks become intractable on even moderately sized datasets. NNs, on the other hand, cannot make use of prior knowledge and often have to be retrained from scratch. Without sufficient data, they also perform poorly. Combining both frameworks, Conditional Neural Processes learn the kernels of the Gaussian Process through neural networks and use these learned kernels in a GP-like framework for prediction.<br />
<br />
===Meta Learning===<br />
<br />
Meta-Learning attempts to allow neural networks to learn more generalizable functions, as opposed to only approximating one function. This can be done by learning deep generative models which can do few-shot estimations of data. This can be implemented with attention mechanisms (Reed et al., 2017) or additional memory units in a VAE model (Bornschein et al., 2017). Another successful latent variable approach is to explicitly condition on some context during inference (J. Rezende et al., 2016). Given the generative nature of these models they are usually applied to image generation tasks, but models that include a conditioning class-variable can be used for classification as well. Recently meta-learning has also been applied to a wide range of tasks like RL (Wang et al., 2016; Finn et al., 2017) or program induction (Devlin et al., 2017).<br />
<br />
Classification is another common task in meta-learning. Few-shot classification algorithms usually rely on some distance metric in feature space to compare target images and the observations (Koch et al., 2015; Santoro et al., 2016). Matching networks (MNs) (Vinyals et al., 2016; Bartunov & Vetrov, 2016) are closely related to CNPs. In their case, features of samples are compared with target features using an attention kernel. At a higher level, one can interpret this model as a CNP where the aggregator is just the concatenation over all input samples and the decoder <math>g</math> contains an explicitly defined distance kernel. In this sense, matching networks are closer to GPs than to CNPs, since they require the specification of a distance kernel that CNPs learn from the data instead. In addition, as MNs carry out all-to-all comparisons they scale with <math>O(n \times m)</math>, although they can be modified to have the same complexity of <math>O(n + m)</math> as CNPs (Snell et al., 2017).<br />
<br />
Another area of meta-learning is neural architecture search. It requires three things to be defined: the search space, the search strategy, and the performance evaluation strategy. It is currently one of the most popular trends in meta-learning. The idea is to define a search space and let an algorithm decide which architecture and hyperparameters are best for a particular task. Since evaluating a candidate network is expensive (it must first be trained), a well-designed performance evaluation strategy is needed to reduce the computational cost.<br />
<br />
A model that is conceptually very similar to CNPs (and in particular the latent variable version) is the “neural statistician” (Edwards & Storkey, 2016) and the related variational homoencoder (Hewitt et al., 2018). As with the other generative models, the neural statistician learns to estimate the density of the observed data but does not allow for targeted sampling at what we have been referring to as input positions <math>x_i</math>. Instead, one can only generate i.i.d. samples from the estimated density. Finally, the latent variant of the Conditional Neural Process can also be seen as an approximated amortized version of Bayesian DL (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017). For example, Gal & Ghahramani (2016) develop a theoretical framework casting dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes. Their theory extracts information from existing models and provides tools to model uncertainty.<br />
<br />
== Experimental Result I: Function Regression ==<br />
<br />
The first example is a classical 1D regression task that is commonly used as a baseline for GPs. <br />
The authors generated two different datasets that consisted of functions<br />
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with<br />
different kernel parameters. At every training step, they sampled a curve from the GP, selected<br />
a subset of n points as observations, and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation<br />
<math display="inline">r = \frac{1}{n} \sum_i r_i</math><br />
, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer<br />
MLP. The decoder outputs a Gaussian mean and variance for the target outputs. The model is trained to maximize the log-likelihood of the target points using the Adam optimizer. <br />
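<br />
The pipeline described above (encode each observation, average the representations, decode a Gaussian per target) can be sketched in plain Python. This is an untrained illustration under stated assumptions: the weights are random, the representation is 8-dimensional rather than 128-dimensional to keep the example small, and the softplus transform with a 0.01 floor on the variance is a hypothetical choice. All function names are illustrative and are not the authors' code.<br />

```python
import math
import random

def init_mlp(sizes, rng):
    """Random weights for an MLP with the given layer sizes (untrained)."""
    return [([[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp(params, v):
    """Apply the MLP: ReLU on hidden layers, linear output layer."""
    for k, (W, b) in enumerate(params):
        v = [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]
        if k < len(params) - 1:
            v = [max(0.0, x) for x in v]
    return v

def softplus(a):
    """Numerically stable log(1 + exp(a)); keeps the variance positive."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def cnp_predict(enc, dec, context, x_targets):
    """Encode each (x_i, y_i), average into r, decode (mean, variance) per target."""
    reps = [mlp(enc, [x, y]) for x, y in context]
    r = [sum(col) / len(reps) for col in zip(*reps)]   # r = (1/n) * sum_i r_i
    preds = []
    for x_t in x_targets:
        mu, raw = mlp(dec, r + [x_t])                  # r concatenated with x_t
        preds.append((mu, 0.01 + softplus(raw)))       # variance floor is an assumption
    return preds

def gaussian_nll(preds, y_targets):
    """Negative log-likelihood of the targets under the factorised Gaussians."""
    return sum(0.5 * math.log(2 * math.pi * var) + (y - mu) ** 2 / (2 * var)
               for (mu, var), y in zip(preds, y_targets))

rng = random.Random(0)
encoder = init_mlp([2, 16, 16, 8], rng)              # three-layer encoder h
decoder = init_mlp([8 + 1, 16, 16, 16, 16, 2], rng)  # five-layer decoder g
context = [(-1.0, 0.3), (0.0, 0.1), (1.0, -0.4)]
preds = cnp_predict(encoder, decoder, context, [-0.5, 0.5])
loss = gaussian_nll(preds, [0.2, -0.1])              # quantity minimised in training
```

Because the aggregation is a mean, reordering the context points leaves the prediction unchanged up to floating-point rounding, and the cost grows linearly in the number of context and target points.<br />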
<br />
Two examples of the regression results obtained for each<br />
of the datasets are shown in the following figure.<br />
<br />
[[File:007.jpg|300px|center]]<br />
<br />
They compared the model to the predictions generated by a GP with the correct<br />
hyperparameters, which constitutes an upper bound on the model's<br />
performance. Although the prediction generated by the GP<br />
is smoother than the CNP's prediction both for the mean<br />
and variance, the model is able to learn to regress from a few<br />
context points for both the fixed kernels and switching kernels.<br />
As the number of context points grows, the accuracy<br />
of the model improves and the approximated uncertainty<br />
of the model decreases. Crucially, the model learns<br />
to estimate its own uncertainty given the observations very<br />
accurately. Although it does not match the GP exactly, it provides a good approximation<br />
that increases in accuracy as the number of context points<br />
increases.<br />
Furthermore, the model achieves similarly good performance<br />
on the switching kernel task. This type of regression task<br />
is not trivial for GPs, whereas for CNPs one only has to<br />
change the dataset used for training.<br />
<br />
== Experimental Result II: Image Completion for Digits ==<br />
<br />
[[File:002.jpg|600px|center]]<br />
<br />
They also tested CNP on the MNIST dataset and used the test<br />
set to evaluate its performance. As shown in the above figure, the<br />
model learns to make good predictions of the underlying<br />
digit even for a small number of context points. Crucially,<br />
when conditioned only on one non-informative context point, the model’s prediction corresponds<br />
to the average over all MNIST digits. As the number<br />
of context points increases, the predictions become more<br />
similar to the underlying ground truth. This demonstrates<br />
the model’s capacity to extract dataset-specific prior knowledge.<br />
It is worth mentioning that even with a complete set<br />
of observations, the model does not achieve pixel-perfect<br />
reconstruction, as there is a bottleneck at the representation<br />
level.<br />
<br />
Since this implementation of CNP returns factored outputs,<br />
the best prediction it can produce given limited context<br />
information is to average over all possible predictions that<br />
agree with the context. An alternative to this is to add<br />
latent variables in the model such that they can be sampled<br />
conditioned on the context to produce predictions with high<br />
probability in the data distribution. <br />
<br />
In order to generate a coherent sample,<br />
the authors compute the representation r from the observations,<br />
which parametrizes a Gaussian distribution over the latents z.<br />
z is then sampled once and used to generate the predictions<br />
for all targets. To get a different coherent sample, they draw a<br />
new sample from the latents z and run the decoder again for<br />
all targets.<br />
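<br />
A minimal sketch of this sampling procedure follows. The way r is split into a mean and a standard deviation, and the polynomial stand-in for the decoder, are assumptions made only so that the example runs; the point it illustrates is that z is drawn once and then shared by all targets.<br />

```python
import random

def latent_params(r):
    """Hypothetical mapping from the representation r to the mean and std of z."""
    half = len(r) // 2
    mu = r[:half]
    sigma = [0.1 + abs(v) for v in r[half:]]   # assumed form; keeps the std positive
    return mu, sigma

def decode(z, x_target):
    """Stand-in decoder g(z, x): a polynomial in x with coefficients z."""
    return sum(z_k * (x_target ** k) for k, z_k in enumerate(z))

def coherent_sample(r, x_targets, rng):
    """Sample z once, then reuse it for every target so the outputs are coherent."""
    mu, sigma = latent_params(r)
    z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return [decode(z, x) for x in x_targets]

r = [0.5, -0.2, 0.3, 0.8]            # representation computed from the observations
x_targets = [0.0, 0.5, 1.0]
sample_a = coherent_sample(r, x_targets, random.Random(1))
sample_b = coherent_sample(r, x_targets, random.Random(2))  # a different coherent sample
```

Drawing a fresh z for each target instead would give independent outputs per target and lose the coherence across the sample.<br />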
<br />
<br />
An important aspect of the model is its ability to estimate<br />
the uncertainty of the prediction. As shown in the bottom<br />
row of the above figure, as they added more observations, the variance<br />
shifts from being almost uniformly spread over the digit<br />
positions to being localized around areas that are specific<br />
to the underlying digit, specifically its edges. Being able to<br />
model the uncertainty given some context can be helpful for<br />
many tasks. One example is active exploration, where the<br />
model has a choice over where to observe.<br />
They tested this by<br />
comparing the predictions of CNP when the observations<br />
are chosen according to uncertainty (i.e. the pixel with the highest variance at each step), versus random pixels. This method is a very simple way of doing active<br />
exploration, but it already produces better prediction results<br />
than selecting the conditioning points at random.<br />
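<br />
The uncertainty-driven selection can be sketched as a greedy choice of the unobserved pixel with the largest predicted variance. The variance values below are made up for illustration, and <code>next_observation</code> is a hypothetical helper name; in an actual loop the variances would be recomputed by the model after every new observation.<br />

```python
def next_observation(variances, observed):
    """Return the index of the unobserved pixel with the largest predicted variance."""
    candidates = [i for i in range(len(variances)) if i not in observed]
    return max(candidates, key=lambda i: variances[i])

# Hypothetical per-pixel predictive variances for a 6-pixel "image".
variances = [0.02, 0.40, 0.10, 0.35, 0.05, 0.40]
observed = {1}                      # pixel 1 has already been revealed
choice = next_observation(variances, observed)   # picks pixel 5
```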
<br />
== Experimental Result III: Image Completion for Faces ==<br />
<br />
<br />
[[File:003.jpg|400px|center]]<br />
<br />
<br />
They also applied CNP to CelebA, a dataset of images of<br />
celebrity faces and reported performance obtained on the<br />
test set.<br />
<br />
As shown in the above figure, the model is able to capture<br />
the complex shapes and colors of this dataset, with predictions<br />
conditioned on less than 10% of the pixels being<br />
already close to the ground truth. As before, given few context<br />
points the model averages over all possible faces, but as<br />
the number of context pairs increases the predictions capture<br />
image-specific details like face orientation and facial<br />
expression. Furthermore, as the number of context points<br />
increases the variance is shifted towards the edges in the<br />
image.<br />
<br />
[[File:004.jpg|400px|center]]<br />
<br />
An important aspect of CNPs demonstrated in the above figure is<br />
their flexibility not only in the number of observations and<br />
targets they receive but also with regard to their input values.<br />
It is interesting to compare this property to GPs on one hand,<br />
and to trained generative models (van den Oord et al., 2016;<br />
Gregor et al., 2015) on the other hand.<br />
The first type of flexibility can be seen when conditioning on<br />
subsets that the model has not encountered during training.<br />
Consider conditioning the model on one half of the image,<br />
for example. This forces the model to not only predict the pixel<br />
values according to some stationary smoothness property of<br />
the images, but also according to global spatial properties,<br />
e.g. symmetry and the relative location of different parts of<br />
faces. As seen in the first row of the figure, CNPs are able to<br />
capture those properties. A GP with a stationary kernel cannot<br />
capture this, and in the absence of observations would<br />
revert to its mean (the mean itself can be non-stationary but<br />
usually, this would not be enough to capture the interesting<br />
properties).<br />
<br />
In addition, the model is flexible with regard to the target<br />
input values. This means, e.g., that the model can be queried<br />
at resolutions it has not seen during training. The authors take a<br />
model that has only been trained using pixel coordinates of<br />
a specific resolution and predict at test time subpixel values<br />
for targets between the original coordinates. As shown in<br />
Figure 5 of the paper, with one forward pass the model can be queried at<br />
different resolutions. While GPs also exhibit this type of<br />
flexibility, it is not the case for trained generative models,<br />
which can only predict values for the pixel coordinates on<br />
which they were trained. In this sense, CNPs capture the best<br />
of both worlds: they are flexible with regard to the conditioning<br />
and prediction task and have the capacity to extract domain<br />
knowledge from a training set.<br />
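<br />
Querying between the original coordinates only requires a denser set of target inputs. The sketch below assumes that pixel positions are passed to the model as coordinates normalised to [0, 1]; that convention and the helper name <code>pixel_grid</code> are assumptions for illustration.<br />

```python
def pixel_grid(resolution):
    """Normalised (row, col) coordinates in [0, 1] for a square image."""
    last = resolution - 1
    return [(i / last, j / last) for i in range(resolution) for j in range(resolution)]

train_targets = pixel_grid(32)   # coordinates seen during training
test_targets = pixel_grid(63)    # denser grid: adds one point between each pair of neighbours
```

Every coordinate of the 32-pixel grid also appears in the 63-pixel grid, so the extra targets lie strictly between positions the model was trained on.<br />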
<br />
[[File:010.jpg|400px|center]]<br />
<br />
<br />
They compared CNPs quantitatively to two related models:<br />
kNNs and GPs. As shown in the above table, CNPs outperform<br />
both when the number of context points is small (empirically<br />
when half of the image or less is provided as context).<br />
When the majority of the image is given as context, exact<br />
methods like GPs and kNN will perform better. From the table<br />
we can also see that the order in which the context points<br />
are provided is less important for CNPs, since providing the<br />
context points in order from top to bottom still results in<br />
good performance. Both insights point to the fact that CNPs<br />
learn a data-specific ‘prior’ that will generate good samples<br />
even when the number of context points is very small.<br />
<br />
== Experimental Result IV: Classification ==<br />
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. The authors used 1,200 randomly selected classes as their training set and the remainder as the testing data set.<br />
<br />
Additionally, to apply data augmentation the authors cropped the image from 32 × 32 to 28 × 28, applied small random<br />
translations and rotations to the inputs, and also increased<br />
the number of classes by rotating every character by 90<br />
degrees and defining that to be a new class. They generated<br />
the labels for an N-way classification task by choosing N<br />
random classes at each training step and arbitrarily assigning<br />
the labels <math>0, \ldots, N - 1</math> to each.<br />
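<br />
The label generation for an N-way task can be sketched as follows. The function name and the use of integer class identifiers are assumptions for illustration.<br />

```python
import random

def sample_n_way_task(class_ids, n_way, rng):
    """Pick n_way distinct classes and map them to arbitrary labels 0..n_way-1."""
    chosen = rng.sample(class_ids, n_way)
    return {cls: label for label, cls in enumerate(chosen)}

rng = random.Random(0)
class_ids = list(range(1200))               # 1,200 training classes, as in the text
task = sample_n_way_task(class_ids, 5, rng)  # maps 5 class ids to labels 0..4
```

Because the assignment is redrawn at every training step, a label carries no meaning across tasks and the model has to rely on the context examples.<br />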
<br />
<br />
[[File:008.jpg|400px|center]]<br />
<br />
Given that the input points are images, they modified the architecture<br />
of the encoder h to include convolution layers as<br />
mentioned in section 2. In addition, they only aggregated over<br />
inputs of the same class by using the information provided<br />
by the input label. The aggregated class-specific representations<br />
are then concatenated to form the final representation.<br />
Given that both the size of the class-specific representations<br />
and the number of classes is constant, the size of the final<br />
representation is still constant and thus the <math>O(n + m)</math><br />
runtime still holds.<br />
The results of the classification are summarized in the table above.<br />
CNPs achieve higher accuracy than models that are significantly<br />
more complex (like MANN). While CNPs do not<br />
beat the state of the art for one-shot classification, their accuracy<br />
values are comparable. Crucially, they reach those values<br />
using a significantly simpler architecture (three convolutional<br />
layers for the encoder and a three-layer MLP for the<br />
decoder) and with a lower runtime of <math>O(n + m)</math> at test time<br />
as opposed to <math>O(nm)</math>.<br />
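<br />
The class-wise aggregation can be sketched as a single pass over the encoded inputs followed by a concatenation. The helper name and the toy two-dimensional representations are assumptions; returning zeros for a class without examples is also an assumption made so that the function is total.<br />

```python
def aggregate_by_class(representations, labels, n_classes):
    """Mean-pool representations per class, then concatenate in label order."""
    dim = len(representations[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for rep, label in zip(representations, labels):   # one pass over n inputs: O(n)
        counts[label] += 1
        for j, value in enumerate(rep):
            sums[label][j] += value
    final = []
    for class_sum, count in zip(sums, counts):
        final.extend(v / count if count else 0.0 for v in class_sum)
    return final                                      # fixed size: n_classes * dim

reps = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]   # toy encoder outputs
labels = [0, 0, 1]
final_rep = aggregate_by_class(reps, labels, n_classes=2)   # [2.0, 3.0, 10.0, 20.0]
```

The final representation has a fixed length regardless of how many context images are supplied, which is why each of the m targets can be classified without revisiting the n observations.<br />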
<br />
== Conclusion ==<br />
<br />
The paper introduced Conditional Neural Processes,<br />
a model that is both flexible at test time and has the<br />
capacity to extract prior knowledge from training data.<br />
<br />
The authors demonstrated its ability to perform a variety of tasks<br />
including regression, classification and image completion.<br />
The paper compared CNPs to Gaussian Processes on one hand, and<br />
deep learning methods on the other, and also discussed the<br />
relation to meta-learning and few-shot learning.<br />
It is important to note that the specific CNP implementations<br />
described here are just simple proofs-of-concept and can<br />
be substantially extended, e.g. by including more elaborate<br />
architectures in line with modern deep learning advances.<br />
To summarize, this work can be seen as a step towards learning<br />
high-level abstractions, one of the grand challenges of<br />
contemporary machine learning. Functions learned by most<br />
conventional deep learning models are tied to a specific, constrained<br />
statistical context at any stage of training. A trained<br />
CNP is more general, in that it encapsulates the high-level<br />
statistics of a family of functions. As such it constitutes a<br />
high-level abstraction that can be reused for multiple tasks.<br />
In future work, they are going to explore how far these models can<br />
help in tackling the many key machine learning problems<br />
that seem to hinge on abstraction, such as transfer learning,<br />
meta-learning, and data efficiency.<br />
<br />
== Critiques ==<br />
<br />
This paper introduces a method for reducing the computational complexity of the better-known Gaussian Process model, but the authors report a complexity of <math>O(n + m)</math>, which is almost the same order as that of an RBF kernel GP. With respect to performance across the sequence of tasks, the authors have not made metric comparisons to GP methods to prove the superiority of their approach.<br />
<br />
It appears that the proposed model is effective in making accurate predictions using lower quality inputs, for example a dataset with fewer data points or an image with fewer pixels. However, it is not clear whether the proposed algorithm can be trained with a smaller amount of input data.<br />
<br />
== Other Sources ==<br />
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]<br />
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]<br />
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]<br />
<br />
== Reference ==<br />
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative<br />
models with generative matching networks. arXiv<br />
preprint arXiv:1612.02192, 2016.<br />
<br />
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,<br />
D. Weight uncertainty in neural networks. arXiv preprint<br />
arXiv:1505.05424, 2015.<br />
<br />
Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.<br />
Variational memory addressing in generative models. In<br />
Advances in Neural Information Processing Systems, pp.<br />
3923–3932, 2017.<br />
<br />
Damianou, A. and Lawrence, N. Deep gaussian processes.<br />
In Artificial Intelligence and Statistics, pp. 207–215,<br />
2013.<br />
<br />
Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and<br />
Kohli, P. Neural program meta-induction. In Advances in<br />
Neural Information Processing Systems, pp. 2077–2085,<br />
2017.<br />
<br />
Edwards, H. and Storkey, A. Towards a neural statistician.<br />
2016.<br />
<br />
Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning<br />
for fast adaptation of deep networks. arXiv<br />
preprint arXiv:1703.03400, 2017.<br />
<br />
Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:<br />
Representing model uncertainty in deep learning.<br />
In international conference on machine learning, pp.<br />
1050–1059, 2016.<br />
<br />
Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards<br />
deep symbolic reinforcement learning. arXiv preprint<br />
arXiv:1609.05518, 2016.<br />
<br />
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and<br />
Wierstra, D. Draw: A recurrent neural network for image<br />
generation. arXiv preprint arXiv:1502.04623, 2015.<br />
<br />
Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The<br />
variational homoencoder: Learning to infer high-capacity<br />
generative models from few examples. 2018.<br />
<br />
J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,<br />
et al. One-shot generalization in deep generative models.<br />
In International Conference on Machine Learning, pp.<br />
1521–1529, 2016.<br />
<br />
Kingma, D. P. and Ba, J. Adam: A method for stochastic<br />
optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
Kingma, D. P. and Welling, M. Auto-encoding variational<br />
bayes. arXiv preprint arXiv:1312.6114, 2013.<br />
<br />
Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural<br />
networks for one-shot image recognition. In ICML Deep<br />
Learning Workshop, volume 2, 2015.<br />
<br />
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.<br />
Human-level concept learning through probabilistic program<br />
induction. Science, 350(6266):1332–1338, 2015.<br />
<br />
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,<br />
S. J. Building machines that learn and think like<br />
people. Behavioral and Brain Sciences, 40, 2017.<br />
<br />
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based<br />
learning applied to document recognition. Proceedings<br />
of the IEEE, 86(11):2278–2324, 1998.<br />
<br />
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face<br />
attributes in the wild. In Proceedings of International<br />
Conference on Computer Vision (ICCV), December 2015.<br />
<br />
Louizos, C. and Welling, M. Multiplicative normalizing<br />
flows for variational bayesian neural networks. arXiv<br />
preprint arXiv:1703.01961, 2017.<br />
<br />
Louizos, C., Ullrich, K., and Welling, M. Bayesian compression<br />
for deep learning. In Advances in Neural Information<br />
Processing Systems, pp. 3290–3300, 2017.<br />
<br />
Rasmussen, C. E. and Williams, C. K. Gaussian processes<br />
in machine learning. In Advanced lectures on machine<br />
learning, pp. 63–71. Springer, 2004.<br />
<br />
Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,<br />
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot<br />
autoregressive density estimation: Towards learning to<br />
learn distributions. 2017.<br />
<br />
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic<br />
backpropagation and approximate inference in deep generative<br />
models. arXiv preprint arXiv:1401.4082, 2014.<br />
<br />
Salimbeni, H. and Deisenroth, M. Doubly stochastic variational<br />
inference for deep gaussian processes. In Advances<br />
in Neural Information Processing Systems, pp.<br />
4591–4602, 2017.<br />
<br />
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and<br />
Lillicrap, T. One-shot learning with memory-augmented<br />
neural networks. arXiv preprint arXiv:1605.06065, 2016.<br />
<br />
Snell, J., Swersky, K., and Zemel, R. Prototypical networks<br />
for few-shot learning. In Advances in Neural Information<br />
Processing Systems, pp. 4080–4090, 2017.<br />
<br />
Snelson, E. and Ghahramani, Z. Sparse gaussian processes<br />
using pseudo-inputs. In Advances in neural information<br />
processing systems, pp. 1257–1264, 2006.<br />
<br />
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,<br />
O., Graves, A., et al. Conditional image generation with<br />
pixelcnn decoders. In Advances in Neural Information<br />
Processing Systems, pp. 4790–4798, 2016.<br />
<br />
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.<br />
Matching networks for one shot learning. In Advances in<br />
Neural Information Processing Systems, pp. 3630–3638,<br />
2016.<br />
<br />
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,<br />
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and<br />
Botvinick, M. Learning to reinforcement learn. arXiv<br />
preprint arXiv:1611.05763, 2016.<br />
<br />
Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.<br />
Deep kernel learning. In Artificial Intelligence and Statistics,<br />
pp. 370–378, 2016.<br />
<br />
</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Navigate_in_Cities_Without_a_Map&diff=42397Learning to Navigate in Cities Without a Map2018-12-11T23:08:49Z<p>Msminhas: Technical,Editorial</p>
<hr />
<div>Paper: <br />
[https://arxiv.org/pdf/1804.00168.pdf Learning to Navigate in Cities Without a Map]<br />
A video of the paper is available [https://sites.google.com/view/streetlearn here].<br />
<br />
== Introduction ==<br />
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.<br />
<br />
1. Building an explicit map<br />
<br />
2. Planning and acting using that map. <br />
<br />
In this article, based on the fact that humans can learn to navigate through cities without using any special tool such as maps or GPS, the authors propose new methods to show that a neural network agent can do the same thing using visual observations. To do so, they design an interactive environment using Google StreetView images and a dual pathway agent architecture. As shown in Figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate from visual input with deep reinforcement learning (RL) has been shown to be successful in some domains such as games and simulated environments, it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation, which is why it became the subject of investigation in this paper.<br />
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (neither the map nor the current position is used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]<br />
<br />
==Contribution==<br />
This paper has made the following contributions:<br />
<br />
1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.<br />
<br />
2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.<br />
<br />
3. Leveraging a recurrent neural architecture. This not only makes navigation through a city possible, but also makes the model scalable to navigation in new cities. The architecture supports both locale-specific learning and general, transferable navigation behaviour. The authors achieved this by separating out a recurrent neural pathway, which receives and interprets the current goal and also encapsulates and memorizes features of a single region.<br />
<br />
4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.<br />
<br />
The authors demonstrate that their proposed method can provide a mechanism for transferring knowledge to new cities. As with humans, when the agent visits a new city, the expectation is that it has to learn a new set of landmarks, but does not have to re-learn its visual representations or its behaviours (e.g., zooming forward along streets or turning at intersections). Therefore, using the MultiCity architecture, the paper trains first on a number of cities, then freezes both the policy network and the visual convolutional network and trains only a new locale-specific pathway on a new city. This approach enables the agent to acquire new knowledge without forgetting what it has already learned, similarly to the progressive neural networks architecture.<br />
<br />
==Related Work==<br />
<br />
1. Localization from real-world imagery. For example, in (Weyand et al., 2016) a CNN achieved excellent results on the geolocation task. Other works improve on this by exploiting spatiotemporal continuity, estimating camera pose, or estimating depth from pixels. These methods rely on supervised training with ground-truth labels, which is not possible in every environment. This paper differs in that it does not use supervised training with ground-truth labels and it includes planning as a goal.<br />
<br />
2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Other studies used text descriptions to incorporate goal instructions. Researchers have also developed realistic, higher-fidelity environment simulations, but these still lack diversity. In contrast to many related papers in this area, this paper makes use of real-world data. The data is diverse and visually realistic, but it does not contain dynamic elements, and the street topology cannot be regenerated or altered.<br />
<br />
3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an RL agent with external memory that represents a global map; other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. An explicit neural mapper and navigation planner with joint training has also been used. Among all these works, the target-driven visual navigation approach with a goal-conditional policy is most closely related to this paper's method.<br />
<br />
4. To make simulations resemble reality, researchers have developed higher-fidelity simulated environments (Dosovitskiy et al., 2017; Kolve et al., 2017; Shah et al., 2018; Wu et al., 2018). However, in spite of the photo-realism, the inherent problems of simulated environments pertain to the limited diversity of the environments and the idealistic cleanliness of the observations.<br />
<br />
==Environment==<br />
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London are chosen (Figure 2) that contain between 7,000 and 65,500 nodes<br />
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m, and cover a range of up to<br />
5km, without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.<br />
<br />
[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]<br />
<br />
==Agent Interface and the Courier Task==<br />
In an RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. A first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84 × 84 pixels. The action space consists of 5 movements: “slow” rotate left or right (±22.5°), “fast” rotate left or right (±67.5°), or move forward (implemented as a ''noop'' in the case where this is not a viable action). The most central edge is chosen if there are multiple edges in the agent's viewing cone.<br />
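<br />
The five-action interface can be sketched as a lookup of rotation angles applied to the agent's heading. The action names and the sign convention (left as negative, right as positive) are assumptions for illustration; the forward action leaves the heading unchanged and would move the agent along the most central edge when one exists.<br />

```python
# Rotation in degrees associated with each of the five discrete actions.
ACTIONS = {
    "slow_left": -22.5,
    "slow_right": 22.5,
    "fast_left": -67.5,
    "fast_right": 67.5,
    "forward": 0.0,      # translation only; a no-op when no edge is in view
}

def apply_rotation(heading_deg, action):
    """Return the new heading in [0, 360) after taking the given action."""
    return (heading_deg + ACTIONS[action]) % 360.0

heading = apply_rotation(350.0, "slow_right")   # wraps around to 12.5
```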
<br />
There are many ways to specify the goal to the agent. In this paper, the current goal is represented in terms of its proximity to a set <math>L</math> of fixed landmarks <math>L=\{(Lat_k, Long_k)\}_k</math>, which are specified using the latitude and longitude coordinate system. Given the distances <math>\{d_{t,k}^g\}_k</math> from the goal to each landmark <math>k</math>, the <math>i</math>-th entry of the goal vector is <math>g_{t,i}=\tfrac{\exp(-\alpha d_{t,i}^g)}{\sum_k \exp(-\alpha d_{t,k}^g)}</math>, with <math>\alpha=0.002</math> (Figure 3).<br />
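<br />
As a minimal sketch of the formula above, the goal vector is a softmax over the negatively scaled goal-to-landmark distances. The function name <code>goal_vector</code> and the example distances are illustrative assumptions, not taken from the paper's code.<br />

```python
import math


def goal_vector(distances_m, alpha=0.002):
    """Softmax over -alpha * distance for each landmark.

    distances_m: distances (in metres) from the goal to each landmark.
    alpha: scaling constant; 0.002 is the value quoted in the text.
    """
    scores = [math.exp(-alpha * d) for d in distances_m]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical distances from one goal to four landmarks.
g = goal_vector([100.0, 500.0, 1500.0, 3000.0])
```

Nearby landmarks receive the largest entries, and the entries sum to one, so the vector describes the goal relative to the landmarks rather than in raw map coordinates.<br />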
<br />
[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]<br />
<br />
This form of representation has several advantages: <br />
<br />
1. It could easily be extended to new environments.<br />
<br />
2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.<br />
<br />
3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.<br />
<br />
In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. Each episode consists of 1000 agent steps, and the agent starts from a random place with a random orientation. When the agent gets within 100 meters of the goal, the next goal is randomly chosen. The reward for reaching a goal is proportional to the length of the shortest path between the agent and the goal at the time the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.<br />
<br />
==Methods==<br />
<br />
===Goal-dependent Actor-Critic Reinforcement Learning===<br />
In this paper, the learning problem is formulated as a Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} \rightarrow \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. The expected return is defined as the sum of discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. The expected return from a state <math>s_t</math> also depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>: <br />
<br />
\begin{align}<br />
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)<br />
\end{align}<br />
<br />
The value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math>, starting from state <math>s_t</math> with goal <math>g_t</math>:<br />
<br />
\begin{align}<br />
V^{\pi}(s,g)=E[R_t]=E[\sum_{k=0}^{\infty}\gamma^k r_{t+k}|s_t=s, g_t=g]
\end{align}<br />
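<br />
To make the definition concrete, the sketch below computes the discounted return for a finite reward sequence and averages it over sampled episodes, which gives a simple Monte Carlo estimate of the value. The function names and the toy reward sequences are assumptions for illustration only; the paper trains a learned value function instead.<br />

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * r_{t+k} for a finite list of rewards."""
    total = 0.0
    # Work backwards so each step is R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def monte_carlo_value(episodes, gamma):
    """Average the discounted return over sampled episodes."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Two hypothetical reward sequences collected from the same state and goal.
episodes = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
value_estimate = monte_carlo_value(episodes, gamma=0.5)
```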
<br />
Also, an architecture with multiple pathways is designed to support two types of learning that are required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene, the agent needs to remember unique features of the scene, which then help the agent to organize and remember the scenes.<br />
<br />
===Architectures===<br />
<br />
[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]<br />
<br />
The authors use neural networks to parameterize the policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the convolutional network is fed, together with the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>, to a Long Short-Term Memory (LSTM) network.<br />
<br />
Three different architectures are described below.<br />
<br />
The '''GoalNav''' architecture (Fig. 4a) consists of a convolutional architecture and a policy LSTM. The goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.<br />
<br />
The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information. <br />
<br />
The '''MultiCityNav''' architecture (Fig. 4c) is an extension of CityNav for learning in different cities. This is done by connecting one goal LSTM per city in parallel, so that each goal LSTM encapsulates locale-specific features. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities, so only new goal LSTMs need to be trained for new cities.<br />
<br />
In this paper, the authors use IMPALA [1] to train the agents because IMPALA can get similar performance to A3C [2].<br />
<br />
===Prior on agent training: IMPALA and A3C===<br />
<br />
IMPALA (Importance Weighted Actor-Learner Architecture) is an actor-critic implementation of deep reinforcement learning that decouples acting from learning. IMPALA results in a comparable performance to A3C (Google DeepMind's previous algorithm: Asynchronous Advantage Actor-Critic) on a single-city task, but it has been shown to handle multi-task learning better than A3C. The authors use 256 actors for CityNav and 512 actors for MultiCityNav, with batch sizes of 256 or 512 respectively, and sequences are unrolled to length 50.<br />
<br />
===Curriculum Learning===<br />
In curriculum learning, the model is first trained on simple examples. As soon as the model learns those examples, more complex and difficult examples are fed to the model. In this paper, this approach is used to teach the agent to navigate to more distant destinations. The courier task suffers from a common problem of RL tasks, namely sparse rewards (similar to Montezuma’s Revenge). To overcome this problem, a natural curriculum scheme is defined in which each new goal is sampled within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan).<br />
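<br />
The two-phase scheme can be sketched as a schedule for the maximum goal distance. The text only says the range is "gradually increased", so the linear growth used here, the function name, and the step counts are assumptions for illustration.<br />

```python
def max_goal_distance(step, phase1_steps, phase2_steps,
                      start_m=500.0, full_m=5000.0):
    """Maximum distance (metres) at which a new goal may be sampled.

    Phase 1: goals stay within start_m of the agent.
    Phase 2: the limit grows (linearly, by assumption) up to full_m.
    """
    if step < phase1_steps:
        return start_m
    progress = min(1.0, (step - phase1_steps) / phase2_steps)
    return start_m + progress * (full_m - start_m)


# Hypothetical schedule: 100 steps of phase 1, then 100 steps of phase 2.
early = max_goal_distance(0, 100, 100)
halfway = max_goal_distance(150, 100, 100)
late = max_goal_distance(1000, 100, 100)
```

Here <code>early</code> stays at the 500m limit, <code>halfway</code> lies between the two limits, and <code>late</code> has reached the full range.<br />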
<br />
Curriculum learning was first introduced by Bengio et al. in 2009. It can be seen as a continuation method for non-convex optimization, and it can speed up training by presenting easier examples before harder ones. One example of curriculum learning outside this paper is outlined below:<br />
<br />
1. We aim to classify shapes within the following three classes: triangles, ellipses, and rectangles. We can create a curriculum by first starting with a simplified dataset that consists of only special cases of these three classes: equilateral triangles, circles, and squares. By first training on these special cases, and then introducing the full dataset, we can allow the algorithm to converge more quickly towards a local minimum before providing "harder" examples. Feeding only these specialized examples also serves as a method to make the classes fall on more distinct manifold locations; with less overlap, these networks will perform better when noise is later added as well.<br />
<br />
==Results==<br />
In this section, the performance of the proposed architectures on the courier task is shown.<br />
<br />
[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum; we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way as in London) are comparable and the agent achieved mean goal reward 426.]]<br />
<br />
It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:<br />
<br />
1. Goal Navigation agent.<br />
<br />
2. City Navigation Agent.<br />
<br />
3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. Removing the skip connection is needed to regularise the interface between the goal LSTM and the policy LSTM in the multi-city transfer scenario.<br />
<br />
Also, a lower bound (Heuristic) and an upper bound (Oracle) on the performance are considered. As stated in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search." As is clear from Figure 5, the CityNav architecture attains a higher performance and is more stable than the simpler GoalNav agent.<br />
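<br />
The Oracle's use of breadth-first search can be sketched on a small adjacency-list graph. The function name and the toy graph are assumptions for illustration; the real street graphs have thousands of nodes, and this sketch minimises the number of edges rather than metric distance.<br />

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical street graph with five panorama nodes.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
path = bfs_shortest_path(toy_graph, "A", "E")
```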
<br />
The trajectories of the trained agent over two 1000-step episodes and the value function of the agent during navigation to a destination are shown in Figure 6.<br />
<br />
[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]<br />
<br />
Figure 7 shows that the agent successfully learns a navigation policy for reaching St Paul’s Cathedral in London and Washington Square in New York.<br />
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach a goal (Washington Square in New York or St Paul’s Cathedral in London) from 100 start locations vs. the straight-line distance to the goal in meters. One agent step corresponds to a forward movement of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]<br />
<br />
The authors mask 25% of the possible goals and train on the remaining ones in order to investigate the generalisation capability of a trained agent. At test time the agent is evaluated only on its ability to reach goals in the held-out areas. During training the agent is still able to traverse these areas; it just never samples a goal there. The CityNav agent is trained for 1B steps, and then the weights of the agent are frozen and its performance is evaluated on held-out areas for 100M steps. Experiments showed decreasing performance of the agents as the held-out area size increased. It was observed that while the agent misses more goal destinations on larger held-out grids, it still manages to travel half the distance to the goal within a similar time. This suggests that the agent has an approximate held-out goal representation that enables it to head towards the goal until it gets close, at which point the representation is no longer useful for the final approach.<br />
[[File:fff8.png|600px|center]]<br />
<br />
A critical test for this article is to transfer the model to new cities by learning a new set of landmarks, but without re-learning the visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities, after which both the policy LSTM and the convolutional encoder are frozen. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities increases its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Using the skip connection is useful when training the model in a single city. However, it is not useful in multi-city transfer.<br />
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]<br />
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 9). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]<br />
<br />
The article also investigates giving early rewards before the agent reaches the goal and adding random rewards (coins) to encourage exploration. Figure 11a suggests that coins by themselves are ineffective, as the task does not benefit from wide exploration. Based on the results, the authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to assess the importance of goal conditioning, a Goal-less CityNav agent is trained by removing the inputs <math>g_t</math>. The poor performance of this agent is clear in Figure 11b. Furthermore, as is clear from Figure 11b, reducing the density of the landmarks to 50%, 25%, and 12.5% does not reduce the performance much. Finally, some alternatives for the goal representation are investigated:<br />
<br />
a) Latitude and longitude scalar coordinates normalized to be between 0 and 1. This is based on the region which the agent navigates.<br />
<br />
b) Binned representation. <br />
<br />
The latitude and longitude scalar goal representations perform the best. However, since the all-landmarks representation performs well while remaining independent of the coordinate system, the authors use this representation as the canonical one.<br />
<br />
[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]<br />
<br />
==Conclusion==<br />
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented through the use of Google StreetView for its photographic content and worldwide coverage. Furthermore, the authors discussed a new courier task and a multi-city neural network agent architecture that is transferable to new cities. A successful navigation architecture is presented which relies on integration of general policies with locale-specific knowledge.<br />
<br />
==Future Works==<br />
The paper uses static Google Street View images. However, this means that there is more information that can be extracted from the images beyond the route. Even though it is not the central focus of the paper, it would be extremely useful to incorporate such information for effective route-building or planning.<br />
<br />
[[File:picture1.png|400px|thumb|center|Figure 12. Learning curves of the CityNav agent (2LSTM+Skip+HD) on NYU, comparing different ablations, all the way down to GoalNav (LSTM). 2LSTM architectures have a global pathway LSTM and a policy LSTM with optional Skip connection between the convnet and the policy LSTM. HD is the heading prediction auxiliary task.]]<br />
<br />
==Critique==<br />
1. It is not clear how this model is applicable to the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modeling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.<br />
<br />
2. This paper uses only static Google Street View images as its primary source of data. But the authors should at least complement this with other dynamic data, like traffic and road blockage information, for a realistic model of navigation in the world. Also, it is quite understandable not to use maps, but it is not clear why they have not used GPS to know their position and maybe even to build up a map. This can be something useful in an emergency or even for investigating places that are not known or that cannot be accessed. The resulting map could be easily compared with the real one and could also be used in training to achieve higher performance. Availability should not be a serious problem: if they are simulating a real city and the Google images are available, why should GPS not be? What is the intuition? At least, a complementary description of this could be helpful.<br />
<br />
3. The 'Transfer in Multi-City Experiments' results could be strengthened significantly via cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.<br />
<br />
4. The proposed navigation model could be limited by its reliance on pre-defined landmarks, which appear to be strategically placed and evenly spread across each city. This could limit the agent's deployability to new cities.<br />
<br />
==Reference==<br />
[1] Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih, Volodymyr, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain, Legg, Shane, and Kavukcuoglu, Koray. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.<br />
<br />
[2] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=42395Countering Adversarial Images Using Input Transformations2018-12-11T22:50:11Z<p>Msminhas: Technical</p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al.) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate the Kerckhoffs principle, under which a defense should remain effective even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigated the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
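<br />
Of the transformations listed above, bit-depth reduction is the simplest to sketch: each pixel value in <math>[0,1]</math> is snapped to one of a few evenly spaced levels. The function name, the flat list-of-floats image format, and the example values are assumptions for illustration and are not taken from the paper's code.<br />

```python
def reduce_bit_depth(pixels, bits):
    """Quantise pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return [round(p * levels) / levels for p in pixels]


# Hypothetical clean pixels and a slightly perturbed copy.
clean = [0.20, 0.60, 0.90]
perturbed = [0.22, 0.58, 0.91]

clean_q = reduce_bit_depth(clean, bits=2)
perturbed_q = reduce_bit_depth(perturbed, bits=2)
```

In this toy case both inputs quantise to the same values, so the small perturbation is removed. This only holds when the perturbation does not push a pixel across a quantisation boundary, which is one reason such defenses are not a guarantee.<br />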
<br />
These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
The authors in this paper try to focus on increasing the effectiveness of model-agnostic defense strategies through approaches that:<br />
# remove the adversarial perturbations from input images,<br />
# maintain sufficient information in input images to correctly classify them,<br />
# and are still effective in situations where the adversary has information about the defense strategy being used.<br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them. The authors' best defenses eliminate 60% of gray-box attacks and 90% of black-box attacks by four major attack methods that perturb pixel values by 8% on average.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images. This is done by randomly nullifying features within images. Tramer et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training process with adversarial images constructed not only from the model being trained but also from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images. Chen et al. [7] introduce an advanced denoising algorithm with GAN-based noise modeling in order to improve the blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the input noisy images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples where the unknown noise-generating algorithm can be leveraged.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''': The model architecture and parameters are public, but the adversary is unaware of the defense strategy being used.<br />
<br />
'''Black Box Attack''': Consider a weak adversary with access to the DNN output only. The adversary has no knowledge of the architectural choices made to design the DNN, which include the number, type, and size of layers, nor of the training data used to learn the DNN’s parameters. Such attacks are referred to as black box, where adversaries need not know internal details of a system to compromise it [18].<br />
<br />
An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
An example of a non-targeted adversarial attack is shown below [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:non-targeted O.JPG| 600px|center]]<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in such a way that the image will be classified as a ''target'' class by the network.<br />
<br />
An example of a targeted adversarial attack is shown below [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:Targeted O.JPG| 600px|center]]<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example <math> h(x') </math> equal to the prediction on the corresponding clean example <math> h(x) </math>.<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math>, such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. In the best case scenario, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but usually the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is used.<br />
<br />
From a set of <math>N</math> clean images <math>[x_{1}, \ldots, x_{N}]</math>, an adversarial attack aims to generate <math>N</math> images <math>[x'_{1}, \ldots, x'_{N}]</math>, such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
<center><math><br />
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h(x'_n)],
</math></center><br />
<br />
which is the proportion of predictions that were altered by an attack.<br />
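<br />
As a small illustration of this definition, the success rate can be computed directly from the predicted labels on the clean and adversarial images. The function name and the label lists below are hypothetical.<br />

```python
def attack_success_rate(clean_preds, adv_preds):
    """Fraction of images whose predicted label changes under attack."""
    if len(clean_preds) != len(adv_preds):
        raise ValueError("prediction lists must have the same length")
    if not clean_preds:
        raise ValueError("at least one prediction is required")
    changed = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return changed / len(clean_preds)


# Hypothetical predictions h(x_n) and h(x'_n) for four images.
rate = attack_success_rate([0, 1, 2, 3], [0, 2, 2, 1])
```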
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, is low.<br />
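<br />
The normalized L2-dissimilarity can be sketched in the same way. Images are represented here as flat lists of pixel values, which is an assumption made to keep the example free of dependencies; the function names and the toy image are also illustrative.<br />

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def normalized_l2_dissimilarity(clean_images, adv_images):
    """Mean over images of ||x - x'||_2 / ||x||_2."""
    ratios = []
    for x, x_adv in zip(clean_images, adv_images):
        diff = [a - b for a, b in zip(x, x_adv)]
        ratios.append(l2_norm(diff) / l2_norm(x))
    return sum(ratios) / len(ratios)


# One hypothetical two-pixel image and its perturbed version.
score = normalized_l2_dissimilarity([[3.0, 4.0]], [[3.0, 4.5]])
```

For this toy pair the perturbation has norm 0.5 and the image has norm 5, so the score is approximately 0.1.<br />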
<br />
In most practical settings, an adversary does not have direct access to the model <math>h(·)</math> and has to do a black-box attack. <br />
<br />
However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), thus allowing efficient black-box attack. <br />
<br />
As a result, the authors investigate both the black-box and a more difficult gray-box attack setting, in which the adversary has access to the model architecture and the model parameters but is unaware of the defence strategy that is being used.<br />
<br />
A defence is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. In this study, the authors focus on image transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.<br />
<br />
==Adversarial Attacks==<br />
<br />
Although the exact effect that adversarial examples have on the network is unknown, the Deep Learning book by Goodfellow et al. states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution, and thus have thousands of pixels (millions for HD images). An epsilon-ball perturbation, when the dimensionality is in the thousands or millions, greatly affects the cost function (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and Iterative FGSM are very straightforward, they greatly influence the network under a white box attack. <br />
<br />
The following four attacks are studied in the paper:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let <math>x</math> be a source input with true label <math>y</math>, and let <math>l(.,.)</math> be the differentiable loss function used to train the classifier <math>h(.)</math>. Then the corresponding adversarial example is given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where <math>M</math> is the number of iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math><br />
<br />
where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.<br />
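
The two update rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the classifier is assumed to be a binary logistic model with hypothetical weights <code>w</code> and bias <code>b</code>, so that the input gradient of the cross-entropy loss has the closed form (p - y)·w. Setting <code>num_iters=1</code> gives FGSM; larger values give I-FGSM.<br />

```python
import math


def input_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss w.r.t. x for a logistic model (assumed)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-score))
    return [(p - y) * wi for wi in w]


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def i_fgsm(x, y, w, b, eps, num_iters=1):
    """Apply the FGSM update num_iters times, clipping to [0, 1] after each step."""
    x_adv = list(x)
    for _ in range(num_iters):
        grad = input_gradient(x_adv, y, w, b)
        x_adv = [min(1.0, max(0.0, xi + eps * sign(g)))
                 for xi, g in zip(x_adv, grad)]
    return x_adv


def predict(x, w, b):
    """Predicted class (0 or 1) of the toy logistic model."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


# Hypothetical toy model and input; label 1 is the true class.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.3, 0.2], 1

x_fgsm = i_fgsm(x, y, w, b, eps=0.1)                   # single step (FGSM)
x_ifgsm = i_fgsm(x, y, w, b, eps=0.02, num_iters=10)   # ten small steps (I-FGSM)
print(predict(x, w, b), predict(x_fgsm, w, b), predict(x_ifgsm, w, b))
```

In this toy setting both perturbed inputs flip the model's prediction, and no coordinate moves by more than <code>eps * num_iters</code>, which is the Chebyshev bound mentioned above.<br />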
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects <math>x</math> onto a linearization of the decision boundary defined by the binary classifier <math>h(.)</math> for <math>M</math> iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
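
For intuition, when the classifier is exactly affine, f(x) = w·x + b with the class given by the sign of f(x), the projection onto the decision boundary has a closed form and a single step suffices. The sketch below illustrates only this special case. The weights are hypothetical, the small <code>overshoot</code> factor (used to push the point just across the boundary) is an assumption of this sketch, and value clipping is omitted.<br />

```python
import math


def score(x, w, b):
    """Affine decision function f(x) = w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def deepfool_affine(x, w, b, overshoot=0.02):
    """Project x onto the hyperplane w.x + b = 0 and step slightly past it."""
    f = score(x, w, b)
    norm_sq = sum(wi * wi for wi in w)
    # Minimal L2 perturbation that reaches the boundary: r = -(f / ||w||^2) * w
    r = [-(f / norm_sq) * wi for wi in w]
    return [xi + (1.0 + overshoot) * ri for xi, ri in zip(x, r)]


w, b = [3.0, 4.0], -1.0   # hypothetical affine classifier
x = [1.0, 1.0]
x_adv = deepfool_affine(x, w, b)
perturbation = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, x_adv)))
print(score(x, w, b), score(x_adv, w, b), perturbation)
```

For a non-linear network the same step would be repeated on a fresh linearization at each of the M iterations.<br />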
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to the following unconstrained optimization problem:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
<br />
The figure below shows adversarial images and the corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks described above.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image Cropping and Rescaling''' alters the spatial positioning of the adversarial perturbation, which is important in making attacks successful. In this study, images are cropped and re-scaled at training time as part of data augmentation. At test time, the predictions over random image crops are averaged.<br />
<br />
'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. During the bit reduction the input and output are on the same numerical scale. To reduce to <math>i</math>-bit depth, the input value is multiplied by <math>2^{i}-1</math> and then rounded to the nearest integer. The integers are then scaled back to the original range by dividing by <math>2^{i}-1</math>. The integer rounding operation reduces the information capacity of the representation from 8 bits to <math>i</math> bits. Images are reduced to 3 bits in the experiment.<br />
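
The multiply, round, and rescale steps can be written as a one-line function. The sketch below assumes pixel values are floats in [0, 1] stored in a flat list; the function name and the sample values are hypothetical.<br />

```python
def reduce_bit_depth(pixels, bits):
    """Quantize pixel values in [0, 1] to 2**bits levels on the same scale."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]


# Two pixels that differ by a small perturbation collapse to the same level.
clean = [0.40, 0.90]
perturbed = [0.42, 0.88]
print(reduce_bit_depth(clean, bits=3))
print(reduce_bit_depth(perturbed, bits=3))
```

With 3 bits only eight distinct output values remain, so perturbations that do not push a pixel across a rounding threshold are removed entirely.<br />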
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016) [6]''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.<br />
<br />
'''Total Variance Minimization (Rudin et al.) [9]''':<br />
This combines pixel dropout with total variance minimization. This approach randomly selects a small set of pixels, and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, the authors first select a random set of pixels by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is maintained when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image <math>z</math> that is similar to the (perturbed) input image <math>x</math> for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image <math>z</math>, so TV minimization encourages the removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for <math>z</math> straightforward. In the paper, <math>p = 2</math>, and a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) is employed to perform total variance minimization efficiently.<br />
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed, the negative effect of these changes is expected to be limited.<br />
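
To make the objective concrete, the sketch below evaluates it for a tiny single-channel image. It does not solve the minimization (the paper uses a split Bregman solver). The exact formulas appear only as images on this page, so the forms used here are assumptions: the reconstruction error is the L2 norm of the difference over the kept pixels, and the total variation sums the p-norms of differences between adjacent rows and adjacent columns. All function names are hypothetical.<br />

```python
import math
import random


def total_variation(z, p=2):
    """Sum of p-norms of differences between adjacent rows and adjacent columns."""
    h, w = len(z), len(z[0])

    def p_norm(v):
        return sum(abs(a) ** p for a in v) ** (1.0 / p)

    rows = sum(p_norm([z[i][j] - z[i - 1][j] for j in range(w)]) for i in range(1, h))
    cols = sum(p_norm([z[i][j] - z[i][j - 1] for i in range(h)]) for j in range(1, w))
    return rows + cols


def sample_keep_mask(h, w, keep_prob, rng):
    """Pixel dropout: 1 keeps a pixel, 0 drops it (Bernoulli draw per pixel)."""
    return [[1 if rng.random() < keep_prob else 0 for _ in range(w)] for _ in range(h)]


def tvm_objective(z, x, keep, lam, p=2):
    """Reconstruction error on the kept pixels plus lam times the total variation."""
    err = math.sqrt(sum(keep[i][j] * (z[i][j] - x[i][j]) ** 2
                        for i in range(len(x)) for j in range(len(x[0]))))
    return err + lam * total_variation(z, p)


# Flat 3x3 image with one perturbed pixel in the centre.
x = [[0.5, 0.5, 0.5], [0.5, 0.6, 0.5], [0.5, 0.5, 0.5]]
flat = [[0.5] * 3 for _ in range(3)]
keep = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]   # fixed mask that drops the centre pixel

print(tvm_objective(x, x, keep, lam=0.03))     # keeps the perturbation
print(tvm_objective(flat, x, keep, lam=0.03))  # flat reconstruction
print(sample_keep_mask(3, 3, keep_prob=0.5, rng=random.Random(0)))
```

When the perturbed pixel happens to be dropped, the flat reconstruction matches every kept pixel and has zero total variation, so it attains the lower objective value; this is the mechanism by which small, localized perturbations are removed.<br />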
<br />
[[File:tvx.png]]<br />
<br />
The figure above represents an illustration of total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2 - dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.<br />
<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary: the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
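
The patch-selection step can be sketched as follows. This is a deliberately simplified illustration, not the full algorithm: patches are placed on a non-overlapping grid, so the minimum graph cut over overlapping boundaries is omitted, and the image is a single-channel list of rows whose sides are multiples of the patch size. The function names and the tiny patch database are hypothetical.<br />

```python
import random


def patch_distance(a, b):
    """Squared Euclidean distance between two patches in pixel space."""
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def extract_patch(img, top, left, size):
    """Copy the size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]


def quilt(img, database, size, k, rng):
    """Replace each grid patch by one of its k nearest clean database patches."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            patch = extract_patch(img, top, left, size)
            nearest = sorted(database, key=lambda d: patch_distance(patch, d))[:k]
            chosen = rng.choice(nearest)          # uniform choice among k neighbours
            for di in range(size):
                out[top + di][left:left + size] = chosen[di]
    return out


# Hypothetical database of "clean" 2x2 patches and a slightly perturbed 2x4 image.
database = [[[v, v], [v, v]] for v in (0.0, 0.5, 1.0)]
adversarial = [[0.05, 0.0, 0.9, 1.0],
               [0.0, 0.02, 1.0, 0.97]]
print(quilt(adversarial, database, size=2, k=1, rng=random.Random(0)))
```

Every pixel of the output comes from the clean database, which is the property the defense relies on; choosing <code>k</code> greater than 1 adds the randomness discussed above.<br />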
<br />
If we take a look at the effect of image quilting in the above figure, although interpretation of these images is more complicated due to the quantization errors that image quilting introduces, we can still observe that the absolute differences between quilted original and the quilted adversarial image appear to be smaller in non-homogeneous regions of the image. Based on this observation the authors suggest that TV minimization and image quilting lead to inherently different defenses.<br />
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, the adversary can read the model architecture and parameters but does not know the defense strategy; the defenses are applied to the adversarial input images before they reach the convolutional network. In the black-box setting, the attacked convolutional network is replaced by a network trained on transformed images, to which the adversary has no access. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the normalized L2-dissimilarity of each attack was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses have been fixed in all the experiments. Specifically the pixel dropout probability was set to <math>p</math>=0.5 and regularization parameter of total variation minimizer <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows how the setup differs between the experiments. The network is either trained on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3.<br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble (with an ensemble size of 30) gives the best accuracy, around 40-60 percent; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases, although quilting does reduce accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before feeding the images to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to the images. The classification accuracy on the same adversarial images as in the previous case is shown in the figure below. (The adversary cannot use this trained model to generate new images; hence this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) were studied along with an ensemble of defenses, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table indicate that Inception-v4 performs best, possibly because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all of the attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy could be found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but does not have access to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the previous black-box experiment (image transformation at training and test time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variation minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiment are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramer et al. [2]. Ensemble training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The comparison uses the model released by Tramer et al. [2]: an Inception-Resnet-v2 trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using the image cropping, total variance minimization, and image quilting defenses. Two differences in assumptions should be noted: unlike ensemble adversarial training, the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble training and the pre-processing techniques discussed in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variation Minimization and Image Quilting, possess both properties. However, it may still be possible to train a network to act as an approximation to the non-differentiable transformation. <br />
<br />
Image quilting involves a discrete variable that selects the patch from the database, which is a non-differentiable operation. Additionally, total variation minimization randomly selects the pixels it uses to measure reconstruction error when creating the denoised image, and image quilting selects one of the K nearest neighbors uniformly at random. This inherent randomness makes it difficult to attack the model. <br />
<br />
As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terms black-box, white-box, and gray-box attack are not clearly defined.<br />
<br />
2. White-box attacks, where the adversary has full access to the model as well as the pre-processing techniques, could have been considered.<br />
<br />
3. Though the authors did considerable work in showing the effect of four attacks on the ImageNet database, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. Authors claim that the success rate is generally measured as a function of the magnitude of perturbations, performed by the attack using the L2- dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
5. ([https://openreview.net/forum?id=SyJ7ClWCb])In the new draft of the paper, the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".<br />
<br />
This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.<br />
<br />
18. Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, 2017.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=42393Countering Adversarial Images Using Input Transformations2018-12-11T22:30:32Z<p>Msminhas: Undo revision 42392 by Msminhas (talk)</p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al.) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate Kerckhoffs's principle, which requires a system to remain secure even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigated the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
<br />
These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
The authors in this paper try to focus on increasing the effectiveness of model-agnostic defense strategies through approaches that:<br />
# remove the adversarial perturbations from input images,<br />
# maintain sufficient information in input images to correctly classify them,<br />
# and are still effective in situations where the adversary has information about the defense strategy being used.<br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them. The authors' best defenses eliminate 60% of gray-box attacks and 90% of black-box attacks by four major attack methods that perturb pixel values by 8% on average.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images. This is done by randomly nullifying features within images. Tramer et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training process with adversarial images constructed not only from their own model but also from an ensemble of other models. Their method, implemented on an Inception V2 classifier, finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring, and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images. Chen et al. [7] introduce an advanced denoising algorithm with GAN-based noise modeling in order to improve the blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the input noisy images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples where the unknown noise-generating algorithm can be leveraged.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''' : Model Architecture and parameters are public.<br />
<br />
'''Black Box Attack''': Consider a weak adversary with access to the DNN output only. The adversary has no knowledge of the architectural choices made to design the DNN, which include the number, type, and size of layers, nor of the training data used to learn the DNN’s parameters. Such attacks are referred to as black box, where adversaries need not know internal details of a system to compromise it [18].<br />
<br />
An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
The following example illustrates a non-targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:non-targeted O.JPG| 600px|center]]<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in such a way that the image will be classified as a ''target'' class by the network.<br />
<br />
The following example illustrates a targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:Targeted O.JPG| 600px|center]]<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example <math> h(x') </math> equal to the prediction on the corresponding clean example <math> h(x) </math>.<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(.)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math>, such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(·, ·)</math> and <math>\rho \geq 0</math>. In the best-case scenario, <math>d(·, ·)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but usually the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is used.<br />
<br />
From a set of N clean images <math>[{x_{1}, …, x_{N}}]</math>, an adversarial attack aims to generate <math>[{x'_{1}, …, x'_{N}}]</math> images, such that (<math>x'_{n}</math>) is an adversary of (<math>x_{n}</math>).<br />
<br />
The success rate of an attack is given as: <br />
<br />
<center><math><br />
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h({x_n}^\prime)],<br />
</math></center><br />
<br />
which is the proportion of predictions that were altered by the attack.<br />
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack has a high success rate while its normalized L2-dissimilarity, given by the above equation, remains low.<br />
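
Both quantities can be computed directly from paired clean and adversarial images. The sketch below is a minimal illustration that assumes each image is flattened into a list of floats and that the classifier's predictions are already available as lists of labels; the function names and toy values are hypothetical.<br />

```python
import math


def l2_norm(v):
    """Euclidean norm of a flattened image."""
    return math.sqrt(sum(a * a for a in v))


def success_rate(clean_preds, adv_preds):
    """Fraction of predictions that the attack changed."""
    changed = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return changed / len(clean_preds)


def normalized_l2_dissimilarity(clean_images, adv_images):
    """Mean of ||x_n - x'_n||_2 / ||x_n||_2 over all image pairs."""
    total = 0.0
    for x, x_adv in zip(clean_images, adv_images):
        diff = [a - b for a, b in zip(x, x_adv)]
        total += l2_norm(diff) / l2_norm(x)
    return total / len(clean_images)


# Toy data: two flattened "images" and hypothetical classifier outputs.
clean = [[0.6, 0.8], [1.0, 0.0]]
adv = [[0.6, 0.7], [1.0, 0.0]]
print(success_rate([3, 5], [7, 5]))
print(normalized_l2_dissimilarity(clean, adv))
```

Here one of the two predictions changes, and only the first image is perturbed, by a tenth of its own norm.<br />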
<br />
In most practical settings, an adversary does not have direct access to the model <math>h(\cdot)</math> and has to mount a black-box attack. <br />
<br />
However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), thus allowing efficient black-box attacks. <br />
<br />
As a result, the authors investigate both the black-box setting and a more difficult gray-box attack setting, in which the adversary has access to the model architecture and the model parameters but<br />
is unaware of the defense strategy that is being used.<br />
<br />
A defense is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. In this study, the authors focus on image transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.<br />
<br />
==Adversarial Attacks==<br />
<br />
Although the exact effect that adversarial examples have on the network is unknown, the Deep Learning book by Ian Goodfellow et al. states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution and thus have thousands of pixels (millions for HD images). When the dimensionality is in the thousands or millions, a perturbation within a small epsilon ball can greatly affect the cost function (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and Iterative FGSM are very straightforward, they greatly influence the network under a white-box attack. <br />
<br />
For experimental purposes, the following four attacks are studied in the paper:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let <math>x</math> be a source input with true label <math>y</math>, and let <math>l(\cdot,\cdot)</math> be the differentiable loss function used to train the classifier <math>h(\cdot)</math>. Then the corresponding adversarial example is given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
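<br />
As an illustration of the FGSM update, here is a minimal sketch on a toy model. It assumes a hypothetical binary linear classifier with labels in {-1, +1} and a logistic loss, chosen only because its gradient with respect to the input has a closed form; the paper itself attacks a ResNet-50. Clipping keeps the result inside the image space <math>[0,1]</math>.<br />

```python
import math

def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def predict(x, w, b):
    """Toy binary classifier h(x): label +1 or -1 from a linear score (assumed model)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def loss_gradient(x, y, w, b):
    """Gradient w.r.t. x of the logistic loss l(x, y) = log(1 + exp(-y * (w.x + b)))."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    scale = -y / (1.0 + math.exp(margin))
    return [scale * wi for wi in w]

def fgsm(x, y, w, b, eps):
    """One FGSM step: x' = clip(x + eps * sign(grad_x l(x, y)), 0, 1)."""
    grad = loss_gradient(x, y, w, b)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]
```

For example, with <code>w = [1.0, -1.0]</code>, <code>b = 0.0</code>, and <code>x = [0.6, 0.4]</code> labelled +1, a step with <code>eps = 0.15</code> moves each pixel by at most 0.15 and changes the predicted label to -1.<br />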
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update for <math>M</math> iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math><br />
<br />
where <math>m = 1,...,M</math>, <math>x^{(0)} = x</math>, and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
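<br />
The iteration can be sketched as a loop that repeats the signed-gradient step until the predicted label changes. This is a hedged illustration only: <code>grad_fn</code>, <code>predict_fn</code>, and the toy linear model below are hypothetical stand-ins, and <code>max_iters</code> is an assumed safety cap rather than a value from the paper.<br />

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def i_fgsm(x, y, grad_fn, predict_fn, eps, max_iters):
    """Repeat the FGSM step until the predicted label changes or max_iters is reached."""
    x_adv = list(x)
    original = predict_fn(x)
    for _ in range(max_iters):
        grad = grad_fn(x_adv, y)
        # Signed-gradient step followed by clipping to the image space [0, 1].
        x_adv = [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x_adv, grad)]
        if predict_fn(x_adv) != original:
            break
    return x_adv

# Hypothetical toy model: linear score w.x with labels in {-1, +1}.
W = [1.0, -1.0]

def toy_predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(W, x)) >= 0 else -1

def toy_grad(x, y):
    # For a loss that decreases in the margin y * (w.x), the gradient w.r.t. x
    # points along -y * w; only its sign matters for the update.
    return [-y * wi for wi in W]
```

Calling <code>i_fgsm([0.8, 0.2], 1, toy_grad, toy_predict, eps=0.1, max_iters=10)</code> takes several small steps and stops as soon as the toy classifier's label flips.<br />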
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.<br />
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects x onto a linearization of the decision boundary defined by binary classifier h(.) for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant<br />
of CW-L2 finds a solution to the following unconstrained optimization problem:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
<br />
The figure below shows adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image Cropping and Rescaling''' alters the spatial positioning of the adversarial perturbation, which is important in making attacks successful. In this study, images are cropped and re-scaled at training time as part of data augmentation. At test time, the predictions over random image crops are averaged.<br />
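<br />
The test-time averaging over random crops can be sketched as follows. This is a minimal illustration under stated assumptions: the image is a list of pixel rows, <code>prob_fn</code> is a hypothetical function that returns a class-probability vector for a crop, and the re-scaling step is omitted for brevity.<br />

```python
import random

def random_crop(image, size, rng):
    """Return a size x size crop at a random location; image is a list of pixel rows."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def crop_ensemble_predict(image, prob_fn, size, num_crops, rng):
    """Average the class-probability vectors that prob_fn returns over random crops."""
    totals = None
    for _ in range(num_crops):
        probs = prob_fn(random_crop(image, size, rng))
        if totals is None:
            totals = list(probs)
        else:
            totals = [t + p for t, p in zip(totals, probs)]
    return [t / num_crops for t in totals]
```

Passing a seeded <code>random.Random</code> instance as <code>rng</code> makes the crops reproducible; because each crop sees the perturbation at a different spatial offset, the averaged prediction depends less on its exact positioning.<br />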
<br />
'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
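<br />
A minimal sketch of such a quantization is shown below, assuming pixel intensities in <math>[0,1]</math> and rounding to the nearest of <math>2^{bits}</math> evenly spaced levels. The exact rounding rule is an assumption made for illustration, not taken from the paper.<br />

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantize intensities in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]
```

With 3 bits there are only 8 possible output values, so two pixels that differ by a small perturbation (for example 0.55 and 0.58) are mapped to the same level.<br />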
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016) [6]''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.<br />
<br />
'''Total Variance Minimization (Rudin et al.) [9]''':<br />
This combines pixel dropout with total variance minimization. This approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, we first select a random set of pixels by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; we maintain a pixel when <math>X(i, j, k) = 1</math>. Next, we use total variation minimization to construct an image z that is similar to the (perturbed) input image x for the selected<br />
set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
<br />
The total variation (TV) measures the amount of fine-scale variation in the image z, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for z straightforward. In the paper, p = 2 is used, and a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) is employed to perform total variance minimization efficiently.<br />
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed, we expect the negative effect of these changes to be limited.<br />
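<br />
To make the notion of total variation concrete, the sketch below evaluates <math>TV_p</math> for a single-channel image. It assumes the common form in which the p-norms of differences between adjacent rows and between adjacent columns are summed; the exact definition used in the paper is the one given in the formula above, and this code only evaluates the penalty rather than solving the minimization.<br />

```python
def p_norm(values, p):
    """p-norm of a flat list of numbers."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def total_variation(z, p=2):
    """TV_p of a single-channel image z (list of rows): p-norms of differences
    between adjacent rows plus p-norms of differences between adjacent columns."""
    rows, cols = len(z), len(z[0])
    tv = 0.0
    for i in range(1, rows):
        tv += p_norm([z[i][j] - z[i - 1][j] for j in range(cols)], p)
    for j in range(1, cols):
        tv += p_norm([z[i][j] - z[i][j - 1] for i in range(rows)], p)
    return tv
```

A perfectly flat image has zero total variation, while adding pixel-level noise to it increases the value, which is why penalizing TV favors reconstructions without small perturbations.<br />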
<br />
[[File:tvx.png]]<br />
<br />
The figure above illustrates total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2-dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.<br />
<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches from the database at a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary, since the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
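<br />
The patch-selection step described above can be sketched as follows. This is a simplified illustration that assumes patches are flat lists of pixel values and uses squared Euclidean distance in pixel space; the grid placement and the minimum graph cuts along patch boundaries are omitted, and the function names are hypothetical.<br />

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two flat patches."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def select_clean_patch(query_patch, patch_db, k, rng):
    """Pick uniformly at random among the k database patches nearest to query_patch."""
    nearest = sorted(patch_db, key=lambda patch: squared_distance(query_patch, patch))[:k]
    return rng.choice(nearest)
```

Because the returned patch always comes from the clean database, the synthesized image never copies adversarial pixels directly, and the random choice among the K neighbors makes the transformation stochastic.<br />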
<br />
Looking at the effect of image quilting in the above figure, interpretation of these images is more complicated due to the quantization errors that image quilting introduces, but we can still observe that the absolute differences between the quilted original and the quilted adversarial image appear to be smaller in non-homogeneous regions of the image. Based on this observation, the authors suggest that TV minimization and image quilting lead to inherently different defenses.<br />
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, defenses are applied to the adversarial input images of the convolutional networks; the adversary is able to read the model architecture and parameters but does not know the defense strategy. In the black-box setting, the convolutional network is replaced by a network trained with image transformations. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that each correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the normalized L2-dissimilarity for each attack was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation <br />
<br />
The hyperparameters of the defenses have been fixed in all the experiments. Specifically, the pixel dropout probability was set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows how the setup differs between the experiments. The network is trained either on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3. <br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble gives the best accuracy, around 40-60 percent, with an ensemble size of 30; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases. However, image quilting does reduce accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before feeding the images to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to the images. The classification accuracy on the same adversarial images as in the previous experiment is shown in the figure below. (The adversary cannot use this trained model to generate new images, so this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks, ResNet-50, ResNet-101, DenseNet-169, and Inception-v4, were studied along with an ensemble of defenses, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table show that Inception-v4 performs best, which could be partly because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all four attacks. The attacks deteriorate the accuracy of the best defense (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy were obtained from ensembling different defenses, while gains of 2-3% were obtained from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the earlier experiment (BlackBox - Image Transformation at Training and Test Time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variance minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiment are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramer et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The comparison uses the model released by Tramer et al. [2]: an Inception-Resnet-v2, trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models with image cropping, total variance minimization and image quilting defenses. Two differences in assumptions should be noted: the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble training and the pre-processing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated total variance minimization and image quilting and compared them with previously proposed ideas such as image cropping and rescaling, bit depth reduction, and JPEG compression and decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely total variation minimization and image quilting, possess both properties. However, it may still be possible to train a network to approximate the non-differentiable transformation. <br />
<br />
Image quilting involves a discrete variable that selects a patch from the database, which is a non-differentiable operation.<br />
Additionally, total variation minimization randomly selects the pixels it uses to measure reconstruction<br />
error when creating the de-noised image, and image quilting selects one of the K<br />
nearest neighbors uniformly at random. This inherent randomness makes it difficult to attack the model. <br />
<br />
As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terms black-box, white-box, and gray-box attack are not precisely defined in the paper.<br />
<br />
2. White-box attacks, where the adversary has full access to the model as well as the pre-processing techniques, could have been considered.<br />
<br />
3. Although the authors did considerable work in showing the effect of four attacks on the ImageNet database, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
5. ([https://openreview.net/forum?id=SyJ7ClWCb]) In the new draft of the paper, the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".<br />
<br />
This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? Adversarial examples against the iCub humanoid. CoRR, abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.<br />
<br />
18. Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, 2017.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations&diff=42392Countering Adversarial Images Using Input Transformations2018-12-11T22:29:08Z<p>Msminhas: Technical</p>
<hr />
<div>The code for this paper is available here[https://github.com/facebookresearch/adversarial_image_defenses]<br />
<br />
==Motivation ==<br />
As the use of machine intelligence has increased, robustness has become a critical feature to guarantee the reliability of deployed machine-learning systems. However, recent research has shown that existing models are not robust to small, adversarially designed perturbations to the input. Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake. Adversarially perturbed examples have been deployed to attack image classification services (Liu et al., 2016)[11], speech recognition systems (Cisse et al., 2017a)[12], and robot vision (Melis et al., 2017)[13]. The existence of these adversarial examples has motivated proposals for approaches that increase the robustness of learning systems to such examples. In the example below (Goodfellow et al.) [17], a small perturbation is applied to the original image of a panda, changing the prediction to a gibbon.<br />
<br />
[[File:Panda.png|center]]<br />
<br />
==Introduction==<br />
The paper studies strategies that defend against adversarial example attacks on image classification systems by transforming the images before feeding them to a Convolutional Network Classifier. <br />
Generally, defenses against adversarial examples fall into two main categories:<br />
<br />
# Model-Specific – They enforce model properties such as smoothness and invariance via the learning algorithm. <br />
# Model-Agnostic – They try to remove adversarial perturbations from the input. <br />
<br />
Model-specific defense strategies make strong assumptions about expected adversarial attacks. As a result, they violate Kerckhoffs's principle, which requires a system to remain secure even when the adversary knows how it works: adversaries can circumvent model-specific defenses by simply changing how an attack is executed. This paper focuses on increasing the effectiveness of model-agnostic defense strategies. Specifically, the authors investigated the following image transformations as a means of protecting against adversarial images:<br />
<br />
# Image Cropping and Re-scaling (Graese et al, 2016). <br />
# Bit Depth Reduction (Xu et al, 2017) <br />
# JPEG Compression (Dziugaite et al, 2016) <br />
# Total Variance Minimization (Rudin et al, 1992) <br />
# Image Quilting (Efros & Freeman, 2001). <br />
## It is an image-based method for generating novel visual appearance in which a new image is synthesized by stitching together small patches of existing images. The method is used for texture synthesis, and in the original image quilting paper it was also used to perform texture transfer, thereby rendering an object with a texture taken from a different object.<br />
<br />
These image transformations have been studied against adversarial attacks such as the fast gradient sign method (Goodfellow et al., 2015), its iterative extension (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016), and the Carlini & Wagner (2017) <math>L_2</math> attack. <br />
<br />
The authors in this paper try to focus on increasing the effectiveness of model-agnostic defense strategies through approaches that:<br />
# remove the adversarial perturbations from input images,<br />
# maintain sufficient information in input images to correctly classify them,<br />
# and are still effective in situations where the adversary has information about the defense strategy being used.<br />
<br />
From their experiments, the strongest defenses are based on Total Variance Minimization and Image Quilting. These defenses are non-differentiable and inherently random, which makes it difficult for an adversary to get around them. The authors' best defenses eliminate 60% of gray-box attacks and 90% of black-box attacks by four major attack methods that perturb pixel values by 8% on average.<br />
<br />
==Previous Work==<br />
Recently, a lot of research has focused on countering adversarial threats. Wang et al. [4] proposed a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial images by randomly nullifying features within images. Tramer et al. [2] presented the state-of-the-art Ensemble Adversarial Training method, which augments the training process with adversarial images constructed not only from the model being trained but also from an ensemble of other models. Their method implemented on an Inception V2 classifier finished 1st among 70 submissions in the NIPS 2017 competition on Defenses against Adversarial Attacks. Graese et al. [3] showed how input transformations such as shifting, blurring and noise can render the majority of adversarial examples non-adversarial. Xu et al. [5] demonstrated how feature squeezing methods, such as reducing the color bit depth of each pixel and spatial smoothing, defend against attacks. Dziugaite et al. [6] studied the effect of JPG compression on adversarial images. Chen et al. [7] introduce an advanced denoising algorithm with GAN based noise modeling in order to improve the blind denoising performance in low-level vision processing. The GAN is trained to estimate the noise distribution over the input noisy images and to generate noise samples. Although meant for image processing, this method can be generalized to target adversarial examples where the unknown noise generating algorithm can be leveraged.<br />
<br />
==Terminology==<br />
<br />
'''Gray Box Attack''' : Model Architecture and parameters are public.<br />
<br />
'''Black Box Attack''': Consider a weak adversary with access to the DNN output only. The adversary has no knowledge<br />
of the architectural choices made to design the DNN, which include the number, type, and size of layers, nor of<br />
the training data used to learn the DNN’s parameters. Such attacks are referred to as black box, where adversaries need<br />
not know internal details of a system to compromise it [18].<br />
<br />
An interesting and important observation of adversarial examples is that they generally are not model or architecture specific. Adversarial examples generated for one neural network architecture will transfer very well to another architecture. In other words, if you wanted to trick a model you could create your own model and adversarial examples based off of it. Then these same adversarial examples will most probably trick the other model as well. This has huge implications as it means that it is possible to create adversarial examples for a completely black box model where we have no prior knowledge of the internal mechanics. [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]<br />
<br />
'''Non Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified incorrectly by the network.<br />
<br />
An example of a non-targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:non-targeted O.JPG| 600px|center]]<br />
<br />
'''Targeted Adversarial Attack''': The goal of the attack is to modify a source image in a way such that the image will be classified as a ''target'' class by the network.<br />
<br />
An example of a targeted adversarial attack [https://ml.berkeley.edu/blog/2018/01/10/adversarial-examples/ reference]:<br />
[[File:Targeted O.JPG| 600px|center]]<br />
<br />
'''Defense''': A defense is a strategy that aims to make the prediction on an adversarial example <math> h(x') </math> equal to the prediction on the corresponding clean example <math> h(x) </math>.<br />
<br />
== Problem Definition ==<br />
The paper discusses non-targeted adversarial attacks for image recognition systems. Given image space <math>\mathcal{X} = [0,1]^{H \times W \times C}</math>, a source image <math>x \in \mathcal{X}</math>, and a classifier <math>h(\cdot)</math>, a non-targeted adversarial example of <math>x</math> is a perturbed image <math>x'</math> such that <math>h(x) \neq h(x')</math> and <math>d(x, x') \leq \rho</math> for some dissimilarity function <math>d(\cdot, \cdot)</math> and <math>\rho \geq 0</math>. Ideally, <math>d(\cdot, \cdot)</math> measures the perceptual difference between the original image <math>x</math> and the perturbed image <math>x'</math>, but in practice the Euclidean distance (<math>||x - x'||_2</math>) or the Chebyshev distance (<math>||x - x'||_{\infty}</math>) is usually used.<br />
<br />
From a set of N clean images <math>\{x_{1}, \ldots, x_{N}\}</math>, an adversarial attack aims to generate N images <math>\{x'_{1}, \ldots, x'_{N}\}</math>, such that <math>x'_{n}</math> is an adversarial example of <math>x_{n}</math>.<br />
<br />
The success rate of an attack is given as: <br />
<br />
<center><math><br />
\frac{1}{N}\sum_{n=1}^{N}I[h(x_n) \neq h(x'_n)],<br />
</math></center><br />
<br />
which is the proportion of predictions that were altered by the attack.<br />
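<br />
As a minimal sketch of this metric (not code from the paper), the success rate can be computed from two arrays of predicted labels. The function name and the example labels below are hypothetical.<br />

```python
import numpy as np

def attack_success_rate(clean_preds, adv_preds):
    """Fraction of examples whose predicted label changed under the attack."""
    clean_preds = np.asarray(clean_preds)
    adv_preds = np.asarray(adv_preds)
    # The indicator I[h(x_n) != h(x'_n)] averaged over the N examples.
    return float(np.mean(clean_preds != adv_preds))

# Hypothetical predicted labels h(x_n) and h(x'_n) for N = 5 images.
clean = [3, 1, 4, 1, 5]
adv = [3, 7, 4, 2, 5]
rate = attack_success_rate(clean, adv)  # two of the five predictions changed
```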
<br />
The success rate is generally measured as a function of the magnitude of perturbations performed by the attack. In this paper, L2 perturbations are used and are quantified using the normalized L2-dissimilarity metric:<br />
<math> \frac{1}{N} \sum_{n=1}^N{\frac{\vert \vert x_n - x'_n \vert \vert_2}{\vert \vert x_n \vert \vert_2}} </math><br />
<br />
A strong adversarial attack achieves a high success rate while keeping its normalized L2-dissimilarity, as given by the above equation, small.<br />
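<br />
The normalized L2-dissimilarity can be sketched in a few lines of NumPy. This is an illustration under the assumption that images are stored as arrays of shape (N, H, W, C) and that no clean image is all zeros; the function name is hypothetical.<br />

```python
import numpy as np

def normalized_l2_dissimilarity(clean, adv):
    """Mean over images of ||x_n - x'_n||_2 / ||x_n||_2."""
    clean = np.asarray(clean, dtype=np.float64)
    adv = np.asarray(adv, dtype=np.float64)
    # Flatten every image to a vector so the norm runs over all pixels.
    clean = clean.reshape(clean.shape[0], -1)
    adv = adv.reshape(adv.shape[0], -1)
    num = np.linalg.norm(clean - adv, axis=1)
    den = np.linalg.norm(clean, axis=1)  # assumed non-zero
    return float(np.mean(num / den))

# Two toy 2x2 single-channel images, each perturbed by 0.1 per pixel.
clean_images = np.ones((2, 2, 2, 1))
adv_images = clean_images + 0.1
dissimilarity = normalized_l2_dissimilarity(clean_images, adv_images)
```

In this toy case every image has norm 2 and every perturbation has norm 0.2, so the value is 0.1.<br />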
<br />
In most practical settings, an adversary does not have direct access to the model <math>h(\cdot)</math> and must mount a black-box attack. <br />
<br />
However, prior work has shown successful attacks by transferring adversarial examples generated for a separately-trained model to an unknown target model (Liu et al., 2016), thus allowing efficient black-box attack. <br />
<br />
As a result, the authors investigate both the black-box setting and a more difficult gray-box setting, in which the adversary has access to the model architecture and the model parameters but is unaware of the defense strategy being used.<br />
<br />
A defense is an approach that aims to make the prediction on an adversarial example <math>h(x')</math> equal to the prediction on the corresponding clean example <math>h(x)</math>. In this study, the authors focus on image transformation defenses <math>g(x)</math> that perform prediction via <math>h(g(x'))</math>. Ideally, <math>g(\cdot)</math> is a complex, non-differentiable, and potentially stochastic function: this makes it difficult for an adversary to attack the prediction model <math>h(g(x))</math> even when the adversary knows both <math>h(\cdot)</math> and <math>g(\cdot)</math>.<br />
<br />
==Adversarial Attacks==<br />
<br />
Although the exact effect that adversarial examples have on the network is unknown, Goodfellow et al.'s Deep Learning book states that adversarial examples exploit the linearity of neural networks to perturb the cost function and force incorrect classifications. Images are often high resolution and thus have thousands of pixels (millions for HD images). When the dimensionality is on the order of thousands or millions, a perturbation within an epsilon ball greatly affects the cost function (especially if it increases the loss at every pixel). Hence, although methods such as FGSM and Iterative FGSM are very straightforward, they greatly influence the network under a white-box attack. <br />
<br />
The paper studies the following four attacks in its experiments:<br />
<br />
1. '''Fast Gradient Sign Method (FGSM; Goodfellow et al. (2015)) [17]''': Let <math>x</math> be a source input with true label <math>y</math>, and let <math>l(\cdot,\cdot)</math> be the differentiable loss function used to train the classifier <math>h(\cdot)</math>. The corresponding adversarial example is then given by:<br />
<br />
<math>x' = x + \epsilon \cdot sign(\nabla_x l(x, y))</math><br />
<br />
for some <math>\epsilon \gt 0</math> which controls the perturbation magnitude.<br />
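<br />
The FGSM update can be sketched on a toy model. The example below assumes a hypothetical logistic-regression classifier, for which the gradient of the cross-entropy loss with respect to the input has the closed form <math>(p - y) w</math>; a real attack on a deep network would obtain this gradient by backpropagation. All names and values are illustrative.<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_grad_wrt_input(x, y, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. x for p = sigmoid(w.x + b)."""
    p = sigmoid(np.dot(w, x) + b)
    return (p - y) * w

def fgsm(x, y, w, b, eps):
    """One FGSM step: move every coordinate by eps in the loss-increasing direction."""
    x_adv = x + eps * np.sign(loss_grad_wrt_input(x, y, w, b))
    # Value clipping keeps the adversarial example inside the image space [0, 1].
    return np.clip(x_adv, 0.0, 1.0)

# Hypothetical toy classifier and input (three "pixels"), true label y = 1.
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.6, 0.2, 0.5])
y = 1.0
x_adv = fgsm(x, y, w, b, eps=0.1)
```

Each coordinate moves by exactly <math>\epsilon</math>, which is why the perturbation is bounded in the Chebyshev (<math>L_\infty</math>) distance.<br />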
<br />
2. '''Iterative FGSM (I-FGSM; Kurakin et al. (2016b)) [14]''': iteratively applies the FGSM update, where M is the number of iterations. It is given as:<br />
<br />
<math>x^{(m)} = x^{(m-1)} + \epsilon \cdot sign(\nabla_{x^{(m-1)}} l(x^{(m-1)}, y))</math><br />
<br />
where <math>m = 1,...,M; x^{(0)} = x;</math> and <math>x' = x^{(M)}</math>. M is set such that <math>h(x) \neq h(x')</math>.<br />
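<br />
A minimal sketch of the iterative variant, again on a hypothetical logistic-regression classifier rather than a deep network. The loop stops as soon as the prediction flips or after a maximum number of iterations; the helper names and values are assumptions made for illustration.<br />

```python
import numpy as np

def predict(x, w, b):
    """Hard label of the toy classifier: 1 if the logit is positive, else 0."""
    return int(np.dot(w, x) + b > 0)

def input_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. x for a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    return (p - y) * w

def iterative_fgsm(x, y, w, b, eps, max_iters):
    """Repeat the FGSM step from x^(0) = x until the predicted label changes."""
    original_label = predict(x, w, b)
    x_adv = x.copy()
    for _ in range(max_iters):
        if predict(x_adv, w, b) != original_label:
            break
        x_adv = x_adv + eps * np.sign(input_gradient(x_adv, y, w, b))
        x_adv = np.clip(x_adv, 0.0, 1.0)  # stay inside the image space
    return x_adv

w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.6, 0.2, 0.5])
x_adv = iterative_fgsm(x, y=1.0, w=w, b=b, eps=0.05, max_iters=10)
```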
<br />
Both FGSM and I-FGSM work by minimizing the Chebyshev distance between the inputs and the generated adversarial examples.<br />
<br />
3. '''DeepFool (Moosavi-Dezfooli et al., 2016) [15]''': projects <math>x</math> onto a linearization of the decision boundary defined by the binary classifier <math>h(\cdot)</math> for M iterations. This can be particularly effective when a network uses ReLU activation functions. It is given as:<br />
<br />
[[File:DeepFool.PNG|400px |]]<br />
<br />
4. '''Carlini-Wagner's L2 attack (CW-L2; Carlini & Wagner (2017)) [16]''': an optimization-based attack that combines a differentiable surrogate for the model’s classification accuracy with an L2-penalty term which encourages the adversarial image to be close to the original image. Let <math>Z(x)</math> be the operation that computes the logit vector (i.e., the output before the softmax layer) for an input <math>x</math>, and <math>Z(x)_k</math> be the logit value corresponding to class <math>k</math>. The untargeted variant of CW-L2 finds a solution to the following unconstrained optimization problem:<br />
<br />
[[File:Carlini.PNG|500px |]]<br />
<br />
As mentioned earlier, the first two attacks minimize the Chebyshev distance whereas the last two attacks minimize the Euclidean distance between the inputs and the adversarial examples.<br />
<br />
All the methods described above maintain <math>x' \in \mathcal{X}</math> by performing value clipping. <br />
<br />
The figure below shows adversarial images and the corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks mentioned above.<br />
<br />
[[File:Strength.PNG|thumb|center| 600px |Figure 1: Adversarial images and corresponding perturbations at five levels of normalized L2-dissimilarity for all four attacks.]]<br />
<br />
==Defenses==<br />
A defense is a strategy that aims to make the prediction on an adversarial example equal to the prediction on the corresponding clean example. The particular structure of the adversarial perturbations <math> x-x' </math> is shown in Figure 1.<br />
Five image transformations that alter the structure of these perturbations have been studied:<br />
# Image Cropping and Re-scaling, <br />
# Bit Depth Reduction, <br />
# JPEG Compression, <br />
# Total Variance Minimization, <br />
# Image Quilting.<br />
<br />
'''Image Cropping and Rescaling''' has the effect of altering the spatial positioning of the adversarial perturbation, which is important in making attacks successful. In this study, images are cropped and re-scaled during training time as part of data augmentation. At test time, the predictions over random image crops are averaged.<br />
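<br />
The test-time averaging over random crops can be sketched as follows. This is an illustration only: the re-scaling step is omitted, and <code>toy_predict_proba</code> is a hypothetical stand-in for a real classifier that returns class probabilities for a crop of any size.<br />

```python
import numpy as np

def random_crop(image, crop_size, rng):
    """Cut a random crop_size x crop_size window out of an H x W x C image."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]

def crop_ensemble_predict(image, predict_proba, crop_size, n_crops, rng):
    """Average the class probabilities predicted on n_crops random crops."""
    probs = [predict_proba(random_crop(image, crop_size, rng)) for _ in range(n_crops)]
    return np.mean(probs, axis=0)

def toy_predict_proba(image):
    # Hypothetical two-class "classifier" that scores by mean brightness.
    m = float(image.mean())
    return np.array([1.0 - m, m])

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
avg_probs = crop_ensemble_predict(image, toy_predict_proba, crop_size=24, n_crops=30, rng=rng)
```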
<br />
'''Bit Depth Reduction (Xu et al.) [5]''' performs a simple type of quantization that can remove small (adversarial) variations in pixel values from an image. Images are reduced to 3 bits in the experiment.<br />
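<br />
A minimal sketch of bit depth reduction for pixel values in [0, 1]. The uniform rounding scheme below is an assumption about how the quantization could be done, not necessarily the exact procedure used in the paper.<br />

```python
import numpy as np

def reduce_bit_depth(image, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(image * levels) / levels

# Two nearby pixel values (0.55 and 0.58) collapse onto the same 3-bit level.
pixels = np.array([0.0, 0.55, 0.58, 1.0])
quantized = reduce_bit_depth(pixels, bits=3)
```

Small differences between neighbouring values, such as a low-magnitude adversarial perturbation, disappear whenever both values fall into the same quantization bin.<br />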
<br />
'''JPEG Compression and Decompression (Dziugaite et al., 2016)''' removes small perturbations by performing simple quantization. The authors use a quality level of 75/100 in their experiments.<br />
<br />
'''Total Variance Minimization (Rudin et al.) [9]''':<br />
This combines pixel dropout with total variance minimization. This approach randomly selects a small set of pixels and reconstructs the “simplest” image that is consistent with the selected pixels. The reconstructed image does not contain the adversarial perturbations because these perturbations tend to be small and localized. Specifically, a random set of pixels is first selected by sampling a Bernoulli random variable <math>X(i, j, k)</math> for each pixel location <math>(i, j, k)</math>; a pixel is maintained when <math>X(i, j, k) = 1</math>. Next, total variation minimization is used to construct an image <math>z</math> that is similar to the (perturbed) input image <math>x</math> for the selected set of pixels, whilst also being “simple” in terms of total variation, by solving:<br />
<br />
[[File:TV!.png|300px|]] , <br />
<br />
where <math>TV_{p}(z)</math> represents <math>L_{p}</math> total variation of '''z''' :<br />
<br />
[[File:TV2.png|500px|]]<br />
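<br />
Since the definition above is only available as an image, the following sketch assumes that <math>TV_{p}(z)</math> sums, over all channels, the <math>L_{p}</math> norms of the differences between adjacent rows and between adjacent columns. It illustrates that a flat image has zero total variation while noise increases it; the function name is hypothetical.<br />

```python
import numpy as np

def total_variation(z, p=2):
    """L_p total variation of an H x W x C image (assumed row/column difference form)."""
    row_diff = z[1:, :, :] - z[:-1, :, :]   # shape (H-1, W, C)
    col_diff = z[:, 1:, :] - z[:, :-1, :]   # shape (H, W-1, C)
    # p-norm of each row-to-row difference (taken across the width),
    # and of each column-to-column difference (taken across the height).
    row_term = np.linalg.norm(row_diff, ord=p, axis=1).sum()
    col_term = np.linalg.norm(col_diff, ord=p, axis=0).sum()
    return float(row_term + col_term)

rng = np.random.default_rng(0)
flat = np.full((4, 4, 1), 0.5)
noisy = flat + 0.05 * rng.standard_normal(flat.shape)

# A single horizontal edge: the top two rows are 0, the bottom two rows are 1.
step = np.zeros((4, 4, 1))
step[2:, :, :] = 1.0

tv_flat = total_variation(flat)
tv_noisy = total_variation(noisy)
tv_step = total_variation(step)
```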
<br />
The total variation (TV) measures the amount of fine-scale variation in the image <math>z</math>, as a result of which TV minimization encourages removal of small (adversarial) perturbations in the image. The objective function is convex in <math>z</math>, which makes solving for <math>z</math> straightforward. The paper sets p = 2 and employs a special-purpose solver based on the split Bregman method (Goldstein & Osher, 2009) to perform total variance minimization efficiently.<br />
The effectiveness of TV minimization is illustrated by the images in the middle column of the figure below: in particular, note that the adversarial perturbations that were present in the background of the non-transformed image (see bottom-left image) have nearly completely disappeared in the TV-minimized adversarial image (bottom-center image). As expected, TV minimization also changes image structure in non-homogeneous regions of the image, but as these perturbations were not adversarially designed, the negative effect of these changes is expected to be limited.<br />
<br />
[[File:tvx.png]]<br />
<br />
The figure above illustrates total variance minimization and image quilting applied to an original and an adversarial image (produced using I-FGSM with ε = 0.03, corresponding to a normalized L2-dissimilarity of 0.075). From left to right, the columns correspond to (1) no transformation, (2) total variance minimization, and (3) image quilting. From top to bottom, rows correspond to: (1) the original image, (2) the corresponding adversarial image produced by I-FGSM, and (3) the absolute difference between the two images above. Difference images were multiplied by a constant scaling factor to increase visibility.<br />
<br />
<br />
'''Image Quilting (Efros & Freeman, 2001) [8]'''<br />
Image Quilting is a non-parametric technique that synthesizes images by piecing together small patches that are taken from a database of image patches. The algorithm places appropriate patches from the database on a predefined set of grid points and computes minimum graph cuts in all overlapping boundary regions to remove edge artifacts. Image Quilting can be used to remove adversarial perturbations by constructing a patch database that only contains patches from "clean" images (without adversarial perturbations); the patches used to create the synthesized image are selected by finding the K nearest neighbors (in pixel space) of the corresponding patch from the adversarial image in the patch database, and picking one of these neighbors uniformly at random. The motivation for this defense is that the resulting image only contains pixels that were not modified by the adversary: the database of real patches is unlikely to contain the structures that appear in adversarial images.<br />
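<br />
The patch-selection step can be sketched as a nearest-neighbor lookup. The example covers only the selection of a single replacement patch; grid placement and the minimum graph cuts are omitted. The patch size, database size, and function name are hypothetical.<br />

```python
import numpy as np

def quilt_patch(adv_patch, clean_patches, k, rng):
    """Replace a patch by one of its k nearest clean patches, picked uniformly at random."""
    flat_db = clean_patches.reshape(len(clean_patches), -1)
    # Euclidean distance in pixel space from the query patch to every database patch.
    dists = np.linalg.norm(flat_db - adv_patch.reshape(-1), axis=1)
    nearest = np.argsort(dists)[:k]
    choice = rng.choice(nearest)
    return clean_patches[choice]

rng = np.random.default_rng(0)
clean_patches = rng.random((50, 5, 5, 3))           # toy database of clean 5x5 RGB patches
adv_patch = clean_patches[7] + 0.01 * rng.standard_normal((5, 5, 3))

restored = quilt_patch(adv_patch, clean_patches, k=1, rng=rng)
randomized = quilt_patch(adv_patch, clean_patches, k=5, rng=rng)
```

With <code>k=1</code> the slightly perturbed patch is mapped back to its clean source; larger values of K add the randomness that the defense relies on.<br />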
<br />
If we take a look at the effect of image quilting in the above figure, although interpretation of these images is more complicated due to the quantization errors that image quilting introduces, we can still observe that the absolute differences between quilted original and the quilted adversarial image appear to be smaller in non-homogeneous regions of the image. Based on this observation the authors suggest that TV minimization and image quilting lead to inherently different defenses.<br />
<br />
=Experiments=<br />
<br />
Five experiments were performed to test the efficacy of the defenses. The first four experiments consider gray-box and black-box attacks. In the gray-box setting, the adversary is able to read the model architecture and parameters but does not know the defense strategy; the defenses are applied to the adversarial images before they are fed to the convolutional network. In the black-box setting, the convolutional network is replaced by a network trained on transformed images, to which the adversary has no access. The final experiment compares the authors' defenses with prior work. <br />
<br />
'''Set up:'''<br />
Experiments are performed on the ImageNet image classification dataset. The dataset comprises 1.2 million training images and 50,000 test images that correspond to one of 1000 classes. The adversarial images are produced by attacking a ResNet-50 model with the attacks described in the Adversarial Attacks section above. The strength of an adversary is measured in terms of its normalized L2-dissimilarity. To produce the adversarial images, the L2-dissimilarity for each attack was controlled as follows:<br />
<br />
- FGSM. Increasing the step size <math>\epsilon</math>, increases the normalized L2-dissimilarity.<br />
<br />
- I-FGSM. We fix M=10, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- DeepFool. We fix M=5, and increase <math>\epsilon</math> to increase the normalized L2-dissimilarity.<br />
<br />
- CW-L2. We fix <math>k</math>=0 and <math>\lambda_{f}</math> =10, and multiply the resulting perturbation by a scaling factor to adjust the normalized L2-dissimilarity.<br />
<br />
The hyperparameters of the defenses were fixed in all the experiments. Specifically, the pixel dropout probability was set to <math>p</math>=0.5 and the regularization parameter of the total variation minimizer to <math>\lambda_{TV}</math>=0.03.<br />
<br />
The figure below shows how the setup differs between the experiments that follow. The network is either trained on a) regular images or b) transformed images. The different settings are marked by 8.1, 8.2 and 8.3 <br />
[[File:models3.png |center]] <br />
<br />
==GrayBox - Image Transformation at Test Time== <br />
This experiment applies a transformation to adversarial images at test time before feeding them to a ResNet-50 that was trained to classify clean images. The figure below shows the Top-1 accuracy for the five transformations. Some interesting observations from the plot: all of the image transformations partly eliminate the effects of the attack; the crop ensemble, with an ensemble size of 30, gives the best accuracy, around 40-60 percent; and the accuracy of the image quilting defense hardly deteriorates as the strength of the adversary increases, although it does reduce accuracy on non-adversarial examples.<br />
<br />
[[File:sFig4.png|center|600px |]]<br />
<br />
==BlackBox - Image Transformation at Training and Test Time==<br />
A ResNet-50 model was trained on transformed ImageNet training images. Before feeding the images to the network for training, standard data augmentation (from He et al.) along with bit depth reduction, JPEG compression, TV minimization, or image quilting was applied to the images. The classification accuracy on the same adversarial images as in the previous experiment is shown in the figure below. (The adversary cannot use this trained model to generate new adversarial images, hence this is treated as a black-box setting.) The figure shows that training convolutional neural networks on images that are transformed in the same way as at test time dramatically improves the effectiveness of all transformation defenses. Nearly 80-90% of the attacks are defended successfully, even when the L2-dissimilarity is high.<br />
<br />
<br />
[[File:sFig5.png|center|600px |]]<br />
<br />
<br />
==Blackbox - Ensembling==<br />
Four networks (ResNet-50, ResNet-101, DenseNet-169, and Inception-v4) along with an ensemble of defenses were studied, as shown in Table 1. The adversarial images are produced by attacking a ResNet-50 model. The results in the table indicate that Inception-v4 performs best. This could be because that network has a higher accuracy even in non-adversarial settings. The best ensemble of defenses achieves an accuracy of about 71% against all the other attacks. The attacks deteriorate the accuracy of the best defenses (a combination of cropping, TVM, image quilting, and model transfer) by at most 6%. Gains of 1-2% in classification accuracy could be found from ensembling different defenses, while gains of 2-3% were found from transferring attacks to different network architectures.<br />
<br />
<br />
[[File:sTab1.png|600px|thumb|center|Table 1. Top-1 classification accuracy of ensemble and model transfer defenses (columns) against four black-box attacks (rows). The four networks we use to classify images are ResNet-50 (RN50), ResNet-101 (RN101), DenseNet-169 (DN169), and Inception-v4 (Iv4). Adversarial images are generated by running attacks against the ResNet-50 model, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. Higher is better. The best defense against each attack is typeset in boldface.]]<br />
<br />
==GrayBox - Image Transformation at Training and Test Time ==<br />
In this experiment, the adversary has access to the network and its parameters (but not to the input transformations applied at test time). Novel adversarial images were generated by the four attack methods against the networks trained in the previous black-box experiment (image transformation at training and test time). The results show that bit depth reduction and JPEG compression are weak defenses in such a gray-box setting. In contrast, image cropping and rescaling, total variance minimization, and image quilting are more robust against adversarial images in this setting.<br />
The results for this experiment are shown in the figure below. Networks using these defenses classify up to 50% of images correctly.<br />
<br />
[[File:sFig6.png|center| 600px |]]<br />
<br />
==Comparison With Ensemble Adversarial Training==<br />
The results of the experiments are compared with the state-of-the-art ensemble adversarial training approach proposed by Tramèr et al. [2]. Ensemble adversarial training fits the parameters of a convolutional neural network on adversarial examples that were generated to attack an ensemble of pre-trained models. The model released by Tramèr et al. [2] is an Inception-Resnet-v2, trained on adversarial examples generated by FGSM against Inception-Resnet-v2 and Inception-v3 models. The authors compared it with their ResNet-50 models using the image cropping, total variance minimization, and image quilting defenses. Two differences in assumptions should be noted: the authors' defenses assume that the input transformation is unknown to the adversary, and they use no prior knowledge of the attacks. The results of ensemble adversarial training and the pre-processing techniques mentioned in this paper are shown in Table 2. The results show that ensemble adversarial training works better on FGSM attacks (which it uses at training time), but is outperformed by each of the transformation-based defenses on all other attacks.<br />
<br />
<br />
<br />
[[File:sTab2.png|600px|thumb|center|Table 2. Top-1 classification accuracy on images perturbed using attacks against ResNet-50 models trained on input-transformed images and an Inception-v4 model trained using ensemble adversarial training. Adversarial images are generated by running attacks against the models, aiming for an average normalized <math>L_2</math>-dissimilarity of 0.06. The best defense against each attack is typeset in boldface.]]<br />
<br />
=Discussion/Conclusions=<br />
The paper proposed reasonable approaches to countering adversarial images. The authors evaluated Total Variance Minimization and Image Quilting and compared them with previously proposed ideas such as Image Cropping and Rescaling, Bit Depth Reduction, and JPEG Compression and Decompression on the challenging ImageNet dataset.<br />
Previous work by Wang et al. [10] shows that a strong input defense should be non-differentiable and randomized. Two of the defenses, namely Total Variance Minimization and Image Quilting, possess both properties. However, it may still be possible to train a network to approximate the non-differentiable transformation. <br />
<br />
Image quilting involves a discrete variable that selects a patch from the database, which is a non-differentiable operation. Additionally, total variation minimization randomly selects the pixels it uses to measure the reconstruction error when creating the denoised image, and image quilting picks one of the K nearest neighbors uniformly at random. This inherent randomness makes it difficult to attack the model. <br />
<br />
As future work, the authors suggest applying the same techniques to other domains such as speech recognition and image segmentation. For example, in speech recognition, total variance minimization can be used to remove perturbations from waveforms, and "spectrogram quilting" techniques that reconstruct a spectrogram could be developed. The proposed input-transformation defenses can also be combined with ensemble adversarial training by Tramèr et al. [2] to study new attack methods.<br />
<br />
=Critiques=<br />
1. The terminology of black-box, white-box, and gray-box attacks is not precisely defined.<br />
<br />
2. White-box attacks, where the adversary has full access to the model as well as the pre-processing techniques, could have been considered.<br />
<br />
3. Though the authors did considerable work in showing the effect of four attacks on the ImageNet database, much stronger attacks (Madry et al.) [7] could have been evaluated.<br />
<br />
4. The authors claim that the success rate is generally measured as a function of the magnitude of the perturbations performed by the attack, using the L2-dissimilarity, but the claim is not supported by any references. None of the previous work has used these metrics.<br />
<br />
5. ([https://openreview.net/forum?id=SyJ7ClWCb]) In the new draft of the paper, the authors add the sentence "our defenses assume that part of the defense strategy (viz., the input transformation) is unknown to the adversary".<br />
<br />
This is a completely unreasonable assumption. Any algorithm which hopes to be secure must allow the adversary to, at the very least, understand what the defense is that's being used. Consider a world where the defense here is implemented in practice: any attacker in the world could just go look up the paper, read the description of the algorithm, and know how it works.<br />
<br />
=References=<br />
<br />
1. Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering Adversarial Images Using Input Transformations.<br />
<br />
2. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel, Ensemble Adversarial Training: Attacks and defenses.<br />
<br />
3. Abigail Graese, Andras Rozsa, and Terrance E. Boult. Assessing threat of adversarial examples of deep neural networks. CoRR, abs/1610.04256, 2016. <br />
<br />
4. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. CoRR, abs/1610.01239, 2016a.<br />
<br />
5. Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. CoRR, abs/1704.01155, 2017. <br />
<br />
6. Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.<br />
<br />
7. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083v3.<br />
<br />
8. Alexei Efros and William Freeman. Image quilting for texture synthesis and transfer. In Proc. SIGGRAPH, pp. 341–346, 2001.<br />
<br />
9. Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.<br />
<br />
10. Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G. Ororbia II, Xinyu Xing, C. Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. CoRR, abs/1612.01401, 2016b.<br />
<br />
11. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.<br />
<br />
12. Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. CoRR, abs/1707.05373, 2017 <br />
<br />
13. Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. CoRR,abs/1708.06939, 2017.<br />
<br />
14. Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016b.<br />
<br />
15. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582, 2016.<br />
<br />
16. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.<br />
<br />
17. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.<br />
<br />
18. Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, 2017.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Attend_and_Predict:_Understanding_Gene_Regulation_by_Selective_Attention_on_Chromatin&diff=42388Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin2018-12-11T00:52:07Z<p>Msminhas: Editorial</p>
<hr />
<div>This page contains a summary of the paper [https://arxiv.org/abs/1708.00339 "Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin."] by Singh, Ritambhara, et al. It was published at the Advances in Neural Information Processing Systems (NIPS) in 2017. The code for this paper is shared here[https://qdata.github.io/deep4biomed-web/].<br />
<br />
<br />
= Background =<br />
<br />
Gene regulation is the process of controlling which genes in a cell's DNA are turned 'on' (expressed) or 'off' (not expressed). By this process, a functional product such as a protein is created. Even though all the cells of a multicellular organism (e.g., humans) contain the same DNA, different types of cells in that organism may express very different sets of genes. As a result, each cell type has distinct functionality. In other words, how a cell operates depends upon the genes expressed in that cell. Many factors, including ‘chromatin modification marks’, influence which genes are abundant in that cell.<br />
<br />
The function of chromatin is to efficiently wrap DNA around bead-like structures of histones into a condensed volume that fits into the nucleus of a cell, and to protect the DNA structure and sequence during cell division and replication. Different chemical modifications in the histones of the chromatin, known as histone marks, change the spatial arrangement of the condensed DNA structure, which in turn affects the expression of genes in the histone mark’s neighboring region. Histone marks can promote (obstruct) the turning on of a gene by making the gene region accessible (restricted). This section of the DNA, where histone marks can potentially have an impact, is known as the DNA flanking region or ‘gene region’, which is considered to cover 10k base pairs centered at the transcription start site (TSS) (i.e., 5k base pairs in each direction). Unlike genetic mutations, histone modifications are reversible [1]. Therefore, understanding the influence of histone marks in determining gene regulation can assist in developing drugs for genetic diseases.<br />
<br />
= Introduction = <br />
<br />
The revolution in genomic technologies now enables us to profile genome-wide chromatin mark signals. Therefore, biologists can now measure gene expression and chromatin signals of the ‘gene region’ for different cell types covering the whole human genome. The Roadmap Epigenome Project (REMC, publicly available) [2] recently released 2,804 genome-wide datasets of 100 separate “normal” (not diseased) human cells/tissues, among which 166 datasets are gene expression reads and the rest are signal reads of various histone marks. The goal is to understand which histone marks are the most important and how they interact in gene regulation for each cell type.<br />
<br />
Signal reads for histone marks are high-dimensional and spatially structured. The influence of a histone modification mark can be anywhere in the gene region (covering 10k base pairs centered around the Transcription Start Site of each gene). It is important to understand how the impact of the mark on gene expression varies over the gene region, in other words, how histone signals over the gene region impact gene expression. There are different types of histone marks in human chromatin that can have an influence on gene regulation. Researchers have found five standard histone proteins. These five histone proteins can be altered in different combinations with different chemical modifications, resulting in a large number of distinct histone modification marks. Different histone modification marks can act as a module, interacting with each other to influence gene expression.<br />
<br />
<br />
This paper proposes an attention-based deep learning model to find how these chromatin factors (histone modification marks) contribute to the gene expression of a particular cell. AttentiveChrome [3] utilizes a hierarchy of multiple LSTMs to discover interactions between the signals of each histone mark and to learn dependencies among the marks in expressing a gene. The authors included two levels of soft attention mechanism: (1) to attend to the most relevant signals of a histone mark, and (2) to attend to the important marks and their interactions. In this context, ''attention'' refers to weighing the importance of different items differently.<br />
<br />
== Main Contributions ==<br />
The contributions of this work can be summarized as follows:<br />
<br />
* More accurate predictions than the state-of-the-art baselines. This is measured using datasets from REMC on 56 different cell types.<br />
* Better interpretation than the state-of-the-art methods for visualizing deep learning models. The authors compute the correlation of the attention scores of the model with the mark signal from REMC. <br />
* As in previous applications of attention models, which indirectly hint at the parts of the input that the model deemed important, AttentiveChrome can explain its decisions by hinting at “what” and “where” it has focused.<br />
* This is the first time that the attention based deep learning approach is applied to a problem in molecular biology.<br />
* Ability to deal with highly modular inputs<br />
<br />
= Previous Works = <br />
<br />
Machine learning algorithms to classify gene expression from histone modification signals have been surveyed by [15]. These algorithms vary from linear regression, support vector machine, and random forests to rule-based learning, and CNNs. To accommodate the spatially structured, high dimensional input data (histone modification signals) these studies applied different feature selection strategies. The preceding research study, DeepChrome [4], by the authors incorporated the best position selection strategy. The positions that are highly correlated to the gene expression are considered as the best positions. This model can learn the relationship between the histone marks. This CNN based DeepChrome model outperforms all the previous works. However, these approaches either (1) failed to model the spatial dependencies among the marks, or (2) required additional feature analysis. Only AttentiveChrome is reported to satisfy all of the eight desirable metrics of a model.<br />
<br />
= AttentiveChrome: Model Formulation =<br />
<br />
The authors propose an end-to-end architecture that can simultaneously attend and predict. The method uses recurrent neural networks (RNNs) composed of LSTM units to model the sequential spatial dependencies of the gene regions and to predict the gene expression level from the histone modification signals. The embedding vector <math> h_t </math> output by an LSTM module encodes the learned representation of the feature dependencies from time step 0 to <math> t </math>. For this task, each bin position of the gene region is treated as a time step.<br />
<br />
The proposed AttentiveChrome framework contains the following five modules:<br />
<br />
* Bin-level LSTM encoder encoding the bin positions of the gene region (one for each HM mark)<br />
* Bin-level <math> \alpha </math>-Attention across all bin positions (one for each HM mark)<br />
* HM-level LSTM encoder (one encoder encoding all HM marks)<br />
* HM-level <math> \beta </math>-Attention among all HM marks (one)<br />
* The final classification module<br />
<br />
Figure 1 (Supplementary Figure 2) presents an overview of the proposed AttentiveChrome framework.<br />
<br />
<br />
[[File:supplemntary_figure_2.png|thumb|center| 800px |Figure 1: Overview of all five modules of the proposed AttentiveChrome framework]]<br />
<br />
<br />
<br />
== Input and Output ==<br />
<br />
Each dataset contains the gene expression labels and the histone signal reads for one specific cell type. The authors evaluated AttentiveChrome on 56 different cell types. For each histone mark, there is a feature/input vector containing the signal reads surrounding the gene’s TSS position (the gene region). The label of this input vector denotes the expression of the specific gene. This study uses binary labels, where <math> +1 </math> denotes that the gene is expressed (on) and <math> -1 </math> denotes that the gene is not expressed (off). Each histone mark has one feature vector for each gene. The authors reuse the feature inputs and outputs of their previous work DeepChrome [4]. The input feature is represented by a matrix <math> \textbf{X} </math> of size <math> M \times T </math>, where <math> M </math> is the number of HM marks considered in the input, and <math> T </math> is the number of bin positions used to represent the gene region. The <math> j^{th} </math> row of the matrix <math> \textbf{X} </math>, <math> x^j</math>, represents the sequentially structured signals from the <math> j^{th} </math> HM mark, where <math> j\in \{1, \cdots, M\} </math>. Therefore, <math> x^j_t</math> in the matrix <math> \textbf{X} </math> represents the value from the <math> t^{th}</math> bin belonging to the <math> j^{th} </math> HM mark, where <math> t\in \{1, \cdots, T\} </math>. If the training set contains <math>N_{tr} </math> labeled pairs, the <math> n^{th} </math> pair is specified as <math>( X^n, y^n)</math>, where <math> X^n </math> is a matrix of size <math> M \times T </math>, <math> y^n \in \{ -1, +1 \} </math> is the binary label, and <math> n \in \{ 1, \cdots, N_{tr} \} </math>.<br />
<br />
Figure 2 (see also Figure 1(a) and 1(b)) shows the input features and the output of AttentiveChrome for a particular gene (one sample).<br />
<br />
[[File:input-output-attentivechrome.png|center|thumb| 700px | Figure 2: Input and Output of the AttentiveChrome model]]<br />
<br />
== Bin-Level Encoder (one LSTM for each HM) ==<br />
The sequentially ordered elements (each element is a bin position) of the gene region of the <math> n^{th} </math> gene are represented by the <math> j^{th} </math> row vector <math> x^j </math>. The authors treat each bin position as a time step for the LSTM. This study uses a bidirectional LSTM to model the overall dependencies among the <math> T </math> bin positions in the gene region. The bidirectional LSTM contains two LSTMs:<br />
* A forward LSTM, <math> \overrightarrow{LSTM_j} </math>, to model <math> x^j </math> from <math> x_1^j </math> to <math> x_T^j </math>, which outputs the embedding vector <math> \overrightarrow{h^j_t} </math>, of size <math> d </math>, for each bin <math> t </math><br />
* A reverse LSTM, <math> \overleftarrow{LSTM_j} </math>, to model <math> x^j </math> from <math> x_T^j </math> to <math> x_1^j </math>, which outputs the embedding vector <math> \overleftarrow{h^j_t} </math>, of size <math> d </math>, for each bin <math> t </math><br />
<br />
The final output of this layer, the embedding vector <math> h^j_t </math> at the <math> t^{th} </math> bin for the <math> j^{th} </math> HM, is obtained by concatenating the two vectors from both directions and therefore has size <math> 2d </math>: <math> h^j_t = [ \overrightarrow{h^j_t}, \overleftarrow{h^j_t}]</math>. Because these LSTM-based HM encoders are trained together with the final classification module, they learn to embed each HM mark by drawing out the dependencies among its bins. Figure 1(c) illustrates the module for <math> j=2 </math>.<br />
<br />
== Bin-Level <math> \alpha</math>-attention ==<br />
<br />
Each bin contributes differently to the encoding of the entire <math> j^{th} </math> mark. To automatically and adaptively highlight the most important bins for prediction, a soft attention weight vector <math> \alpha^j </math> of size <math> T </math> is learned for each <math> j </math>. To calculate the soft weight <math> \alpha^j_t </math> for each <math> t </math>, the embedding vectors <math> \{h^j_1, \cdots, h^j_T \} </math> of all the bins are used. The following equation is used:<br />
<br />
<center><math> \alpha^j_t = \frac{\exp(\textbf{W}_b h^j_t)}{\sum_{i=1}^T{\exp(\textbf{W}_b h^j_i)}} </math></center><br />
<br />
<br />
<math> \alpha^j_t</math> is a scalar computed from the embedding vectors <math>h^j</math> of all bins. The parameter <math> \textbf{W}_b </math> is initialized randomly and learned jointly with the other model parameters. Once the importance weight of each bin position is available, the <math> j^{th} </math> HM mark can be represented by <math> m^j = \sum_{t=1}^T{\alpha^j_t \times h^j_t}</math>. Here, <math> h^j_t</math> is the embedding vector and <math> \alpha^j_t </math> is the importance weight of the <math> t^{th} </math> bin in the representation of the <math> j^{th} </math> HM mark. Intuitively, <math> \textbf{W}_b </math> learns the context of the task, such as the cell type. Figure 1(d) shows this module for <math> HM_2 </math>.<br />
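<br />
The bin-level attention above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name <code>soft_attention</code>, the toy embeddings and the toy context vector are hypothetical, and in the real model the embeddings come from the bidirectional LSTM while <math> \textbf{W}_b </math> is learned.<br />

```python
import math

def soft_attention(embeddings, context):
    """Return (weights, summary) for one HM mark.

    embeddings: list of T vectors h_t (lists of floats), one per bin.
    context:    context vector W_b, same length as each embedding.
    """
    # Unnormalised score of each bin: the dot product W_b . h_t
    scores = [sum(w * x for w, x in zip(context, h_t)) for h_t in embeddings]
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the softmax result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # m = sum_t alpha_t * h_t, computed component by component.
    dim = len(embeddings[0])
    summary = [sum(a * h_t[k] for a, h_t in zip(alphas, embeddings))
               for k in range(dim)]
    return alphas, summary

# Toy example (made-up values): T = 3 bins, embedding size 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_b = [0.5, -0.5]
alphas, m = soft_attention(h, W_b)
```

The same routine applies unchanged to the HM-level attention if the bin embeddings are replaced by the HM embeddings and the bin-level context vector by the HM-level one.<br />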
<br />
== HM-level Encoder (one LSTM) ==<br />
<br />
Studies have observed that HMs work cooperatively to provoke or subdue gene expression [5]. The HM-level encoder (not shown in Figure 1) uses one bidirectional LSTM to capture this relationship between the HMs. Because the authors did not find that any specific ordering of the HMs mattered, a random ordering is assumed in order to formulate the sequential dependency. The representation <math> m^j </math> of the <math> j^{th} </math> HM, <math> HM_j </math>, which is computed by the bin-level attention layer, is the input to this step. This set-based encoder outputs an embedding vector <math> s^j </math> of size <math> d' </math>, which is the encoding for the <math> j^{th} </math> HM.<br />
<br />
<math> s^j = [ \overrightarrow{LSTM_s}(m^j), \overleftarrow{LSTM_s}(m^j) ] </math><br />
<br />
The dependencies between <math> j^{th} </math> HM and the other HM marks are encoded in <math> s^j </math>, whereas <math> m^j </math> from the previous step encodes the bin dependencies of the <math> j^{th} </math> HM.<br />
<br />
[[File:table1.png|center|thumb| 700px | Table 1: Comparison of previous studies for the task of quantifying gene expression using histone modification marks. AttentiveChrome is the only model that exhibits all 8 desirable properties.]]<br />
<br />
== HM-Level <math> \beta</math>-attention ==<br />
This second soft attention level (Figure 1(e)) finds the important HM marks for classifying a gene’s expression by learning the importance weights, <math> \beta_j </math>, for each <math> HM_j </math>, where <math> j \in \{ 1, \cdots, M \} </math>. The equation is <br />
<br />
<math> \beta^j = \frac{\exp(\textbf{W}_s s^j)}{\sum_{i=1}^M{\exp(\textbf{W}_s s^i)}} </math><br />
<br />
The HM-level context parameter <math> \textbf{W}_s </math> is trained jointly in the process. Intuitively <math> \textbf{W}_s </math> learns how the HMs are significant for a cell type. Finally the entire gene region is encoded in a hidden representation <math> \textbf{v} </math>, using the weighted sum of the embedding of all HM marks. <br />
<br />
<br />
<math> \textbf{v} = \sum_{j=1}^{M}{\beta^j \times s^j}</math><br />
<br />
== End-to-end training ==<br />
<br />
The embedding vector <math> \textbf{v} </math> is fed to a simple classification module, <math> f(\textbf{v}) = </math>softmax<math> (\textbf{W}_c\textbf{v}+b_c) </math>, where <math> \textbf{W}_c </math>, and <math> b_c </math> are learnable parameters. The output is the probability of gene expression being high (expressed) or low (suppressed).<br />
The whole model including the attention modules is differentiable. Thus backpropagation can perform end-to-end learning trivially. The negative log-likelihood loss function is minimized in the learning.<br />
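<br />
As a rough illustration of the classification module and the loss, the following plain-Python sketch evaluates <math> f(\textbf{v}) </math> and the negative log-likelihood for one gene. The weights, the bias, the toy vector and the convention that index 0 means "off" and index 1 means "on" are all assumptions made for this example; they are not taken from the paper.<br />

```python
import math

def softmax(logits):
    # Shift by the maximum logit for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(v, W_c, b_c):
    """f(v) = softmax(W_c v + b_c); W_c holds one row of weights per class."""
    logits = [sum(w * x for w, x in zip(row, v)) + b
              for row, b in zip(W_c, b_c)]
    return softmax(logits)

def nll_loss(probs, target_index):
    """Negative log-likelihood of the true class for a single example."""
    return -math.log(probs[target_index])

# Toy values (hypothetical): gene embedding v of size 3, two classes.
v = [0.2, -0.1, 0.4]
W_c = [[0.0, 0.0, 0.0],   # class 0: gene "off" (assumed index convention)
       [1.0, 0.0, 1.0]]   # class 1: gene "on"
b_c = [0.0, 0.0]
probs = classify(v, W_c, b_c)
loss_if_on = nll_loss(probs, 1)
loss_if_off = nll_loss(probs, 0)
```

Here the model assigns the higher probability to "on", so the loss is smaller when the true label is "on" than when it is "off"; training adjusts all parameters, including the attention context vectors, to reduce this loss.<br />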
<br />
= Experimental Settings =<br />
<br />
This work uses the REMC dataset, and AttentiveChrome is evaluated on 56 different cell types. As in DeepChrome, this study considers the following five core HM marks (<math> M=5 </math>), because these marks are uniformly profiled across all 56 cell types in the REMC study.<br />
<br />
[[File:HM.png|center|thumb| 700px | Table 1: Five core HM marks and their attributes considered in this paper]]<br />
<br />
<br />
<br />
For a gene region, 10k base pairs centred at the TSS site (5k bp in each direction) are taken into account. These 10k base pairs are divided into <math> T=100 </math> bins, each bin consisting of 100 contiguous bp. Therefore, for each gene in a particular cell type, the input matrix is of size <math> 5 \times 100 </math>. The gene expression labels are normalized and discretized into binary labels. The sample dataset is divided into three equal-sized folds for training, validation, and testing.<br />
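<br />
The binning described above can be sketched as follows. This is a simplified illustration, not the authors' preprocessing pipeline: the helper names, the dictionary layout of the reads, the coordinates and the use of raw counts per bin are assumptions made for the example.<br />

```python
def bin_index(position, tss, flank=5000, bin_size=100):
    """Map a genomic position to a bin index, or None if outside the region.

    The region covers [tss - flank, tss + flank), i.e. 10k bp by default,
    which gives 2 * flank / bin_size = 100 bins of 100 bp each.
    """
    offset = position - (tss - flank)
    if offset < 0 or offset >= 2 * flank:
        return None
    return offset // bin_size

def build_input_matrix(reads_by_mark, tss, marks, flank=5000, bin_size=100):
    """Count reads per bin for each mark, giving an M x T matrix (list of rows)."""
    n_bins = (2 * flank) // bin_size
    matrix = [[0] * n_bins for _ in marks]
    for row, mark in enumerate(marks):
        for position in reads_by_mark.get(mark, []):
            idx = bin_index(position, tss, flank, bin_size)
            if idx is not None:
                matrix[row][idx] += 1
    return matrix

# Hypothetical read positions for one gene whose TSS is at coordinate 100000.
marks = ["H_prom", "H_enhc", "H_struct", "H_reprA", "H_reprB"]
reads = {"H_prom": [99950, 100010, 100020], "H_reprA": [95000, 120000]}
X = build_input_matrix(reads, tss=100000, marks=marks)
```

Each row of <code>X</code> then plays the role of one <math> x^j </math>, and reads that fall outside the 10k bp window are simply ignored.<br />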
<br />
== Model Variations and Two Baselines ==<br />
To evaluate the performance of the proposed model, the authors considered an RNN method (a direct LSTM without any attention) and their prior work DeepChrome as baselines. The results obtained from multiple variations of the AttentiveChrome model are compared with these baselines. The authors considered five variants of AttentiveChrome during performance evaluation:<br />
<br />
* LSTM-Attn: one LSTM with attention on the input matrix (does not consider the modular nature of HM marks)<br />
* CNN-Attn: DeepChrome [4] with one attention mechanism incorporated. <br />
* LSTM-<math>\alpha , \beta</math>: the proposed architecture.<br />
* CNN-<math>\alpha , \beta</math>: the LSTM module of the proposed architecture is replaced with a CNN. This variation includes two attention mechanisms: the first is an <math>\alpha</math>-attention on top of a CNN module per HM mark, and the second is a <math>\beta</math>-attention used to combine the HMs.<br />
* LSTM-<math>\alpha</math>: one LSTM and <math>\alpha</math>-attention per HM mark.<br />
<br />
== Hyperparameters ==<br />
<br />
For all the variants of AttentiveChrome, the bin-level LSTM embedding size <math> d</math> is set to 32, and the HM-level LSTM embedding size <math>d'</math> is set to 16. Because the LSTMs are bidirectional, the embedding vectors <math> h_t</math> and <math>s^j</math> have sizes 64 and 32, respectively. The sizes of the context vectors are set accordingly.<br />
<br />
= Performance Evaluation =<br />
<br />
== AUC Scores ==<br />
<br />
This study summarizes AUC scores across all 56 cell types on the test set to compare the methods.<br />
<br />
[[File:AUC.JPG|center|thumb| 700px | Table 2: AUC score performances for different variations of AttentiveChrome and baselines]]<br />
<br />
Overall the LSTM-attention models perform better than the DeepChrome (CNN-based) and LSTM baselines. The authors argue that the proposed AttentiveChrome model is a good choice because of its interpretability, even though the performance improvement from DeepChrome is insignificant.<br />
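<br />
For readers unfamiliar with the metric, the AUC can be computed directly from its ranking interpretation: the probability that a randomly chosen expressed gene receives a higher score than a randomly chosen non-expressed gene. The sketch below uses this pairwise definition on made-up labels and scores; the function name and the toy data are assumptions for illustration, and this quadratic-time version is meant for clarity rather than efficiency.<br />

```python
def auc_score(labels, scores):
    """AUC via pairwise comparison; labels are +1 (on) or -1 (off).

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == -1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: predicted probability of "on" for four genes.
labels = [+1, -1, +1, -1]
scores = [0.9, 0.4, 0.35, 0.1]
auc = auc_score(labels, scores)  # 3 of the 4 positive/negative pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of expressed over non-expressed genes.<br />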
<br />
== Evaluation of Attention Scores for Interpretation ==<br />
<br />
To understand whether the model is focusing on the right regions, the authors use additional study results from the REMC database. To validate the bin attention, signal data of a new histone mark, H3K27ac, referred to as <math>H_{active}</math> in this article, is taken from the REMC database. This particular histone mark is known to mark active regions when the gene is expressed (ON). Genome-wide reads of this HM mark are available for three important cell types: stem cell (H1-hESC), blood cell (GM12878), and leukemia cell (K562). This HM mark is used only to analyze the visualization results and is not used in the learning phase. The authors discuss the performance of both attention mechanisms in this section. <br />
<br />
=== Correlation of Importance Weight of <math>H_{prom}</math> with <math>H_{active}</math> ===<br />
<br />
The average read count of <math>H_{active}</math> across all 100 bins is calculated for all the active genes (ON, or labeled as <math>+1</math>) in the three selected cell types. The proposed AttentiveChrome and LSTM-<math>\alpha</math> methods are compared with two widely used visualization techniques, (1) class-based optimization and (2) saliency maps, applied to the baseline DeepChrome model (the CNN-based prior work). Using these visualization methods, the authors calculate the importance weights for <math>H_{prom}</math> (the promoter HM mark used in training) across the 100 bins. The Pearson correlation between these importance weights and the read counts of <math>H_{active}</math> (the HM mark used for validation) across the same 100 bins is then computed. The <math>H_{active}</math> read counts indicate the actual active regions of those cells. <br />
<br />
[[File: pc.JPG|center|thumb| 700px | Figure 4: Pearson Correlation between a known active HM mark]]<br />
<br />
<br />
The results indicate that the proposed models consistently achieve the highest correlation with <math>H_{active}</math> for all three cell types. Thus, the proposed method successfully captures the important signals.<br />
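<br />
The validation step amounts to a Pearson correlation between two vectors of equal length: per-bin importance weights and per-bin read counts. A minimal sketch is given below, using five made-up bins instead of 100; the function name and the numbers are illustrative assumptions, not values from the paper.<br />

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy data: attention weights over 5 bins and read counts that are
# exactly proportional to them, so the correlation is (numerically) 1.
attention = [0.05, 0.10, 0.50, 0.25, 0.10]
read_counts = [2.0, 4.0, 20.0, 10.0, 4.0]
r = pearson(attention, read_counts)
```

A value of <code>r</code> close to 1 would mean that the bins the model attends to are the bins where the validation mark is most abundant.<br />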
<br />
=== Visualization of Attention Weight of bins for each HM of a specific cell type GM12878===<br />
<br />
To visualize bin level attention weights, the authors plotted the average bin-level attention weights for each HM for a specific cell type GM12878 (blood cell) for expressed (ON) genes and suppressed (OFF) genes separately. <br />
<br />
[[File: figure2.png|center|thumb| 700px |]]<br />
<br />
For the “ON” genes, the attention profiles are well defined for the HM marks <math>H_{prom}</math>, <math>H_{enhc}</math>, and <math>H_{struct}</math>. On the other hand, the weights are low for <math>H_{reprA}</math> and <math>H_{reprB}</math>. The average trend reverses for the “OFF” genes, where the repressor HM marks have more influence than <math>H_{prom}</math>, <math>H_{enhc}</math>, and <math>H_{struct}</math>. This observation agrees with the biological finding that the <math>H_{prom}</math>, <math>H_{enhc}</math>, and <math>H_{struct}</math> marks stimulate gene activation, while the <math>H_{reprA}</math> and <math>H_{reprB}</math> marks restrain the genes.<br />
<br />
=== Attention Weight of bins with <math>H_{active}</math>===<br />
<br />
The average read counts of <math>H_{active}</math> for the same 100 bins across all the active (ON) genes for the cell type GM12878 are plotted in Figure 2(b). In addition, for AttentiveChrome, the bin-level attention weights averaged over all the genes that are predicted ON for GM12878 are plotted. The plots show that the <math>H_{prom}</math> profile is similar to that of <math>H_{active}</math>.<br />
<br />
=== Visualization of HM-level Attention Weight for Gene PAX5 ===<br />
<br />
To visualize the HM-level attention weights, the authors produce a heatmap for a differentially regulated gene, PAX5, for the three aforementioned cell types. The heatmap is presented in Figure 2(c). PAX5 plays a significant role in gene regulation when stem cells convert to blood cells. This gene is OFF in stem cells (H1-hESC); however, it becomes activated when the stem cell is transformed into a blood cell (GM12878). The <math>\beta_j</math> weight for <math>H_{repr}</math> is high when the gene is OFF in H1-hESC, and the weight decreases when the gene is ON in GM12878. In contrast, for the <math>H_{prom}</math> mark the <math>\beta_j</math> weight increases from H1-hESC to GM12878 as the gene becomes activated. This information extracted by the deep learning model is also supported by the biological literature [16].<br />
<br />
= Related Works/Studies =<br />
<br />
In the last few years, deep learning models have obtained unprecedented success in diverse research fields. Though not as rapidly as in other fields, deep learning based algorithms are gaining popularity among bioinformaticians.<br />
<br />
== Attention-based Deep Models ==<br />
<br />
The idea of the attention technique in deep learning is adapted from the human visual perception system. Humans tend to focus on some parts of a scene more than others while perceiving it. This mechanism, combined with deep neural networks, has achieved excellent results in several research areas, such as machine translation. Various types of attention models, e.g. soft [6], location-aware [7], or hard [8, 9] attention, have been proposed in the literature. In the soft attention model, a soft weight vector is calculated over the feature vectors. The magnitude of each weight is correlated with the importance of the corresponding feature in the prediction. In practice, RNNs are often used to help implement such models.<br />
<br />
== Visualization and Apprehension of Deep Models ==<br />
<br />
Prior studies mostly focused on interpreting convolutional neural networks (CNNs) for image classification. Deconvolution approaches [10] attempt to map hidden layer representations back to the input space. Saliency maps [11, 12] use a Taylor expansion to approximate the network and identify the most relevant input features. Class optimization [12] based visualization techniques attempt to find the best example member of each class. Some recent works [13, 14] have tried to understand recurrent neural networks (RNNs) for text-based problems. By looking into the features the model attends to, we can interpret the output of a deep model.<br />
<br />
== Deep Learning in Bioinformatics ==<br />
Deep learning is also becoming popular in bioinformatics because it can extract meaningful representations from datasets. Researchers use deep learning to model protein and DNA sequences, to predict gene expression, and to make sense of the effects of non-coding variants.<br />
<br />
== Previous model for gene expression predictions ==<br />
Multiple machine learning models have been used to predict gene expression from histone modification data (surveyed in [19]), such as linear regression [21], support vector machines [17], random forests [18], rule-based learning [19], and CNNs [22]. These studies designed different feature selection strategies to accommodate a large number of histone modification signals as input. The strategies included averaging the signal across all relevant positions, selecting input signals at the positions most highly correlated with the target gene expression, and using a CNN (called DeepChrome [22]) to learn combinatorial interactions among histone modification marks. DeepChrome outperformed all previous methods (see Supplementary) on this task and used a class optimization-based technique for visualizing the learned model. However, this class-level visualization lacks the granularity needed to understand the signals from multiple chromatin marks at the individual gene level.<br />
<br />
= Conclusion = <br />
<br />
The paper introduces an attention-based approach called "AttentiveChrome" that handles both understanding and prediction, with several advantages over previous architectures: higher accuracy than state-of-the-art baselines, and clearer interpretation than saliency maps and class optimization, which lets the authors view what the model ‘sees’ during prediction. Another advantage is that it can model modular feature inputs that are sequentially structured. According to the authors, this is the first implementation of deep attention to understand gene regulation, and the first attention-based model applied to a molecular biology dataset. The authors expect that this deep attention mechanism will give biologists a better understanding of epigenomic data. The model supports both understanding and prediction of hard-to-interpret biological data, as it grants insight into its predictions by locating ‘what’ and ‘where’ AttentiveChrome has focused.<br />
<br />
= Critiques =<br />
<br />
This paper does not make a considerable algorithmic contribution: the authors have only applied existing methods to this application. The deep learning based method is shown to perform better than simple machine learning models such as linear regression and SVMs, but it is considerably harder to implement and has many more hyperparameters to tune. The training time is considerably higher, especially because all the parameters are learned together. The dataset considered here also seems to have only a limited number of samples for a study of such high complexity. Model hyperparameters appear to have been chosen without any explanation or intuition, and the authors have not cited any relevant literature that would explain where these numbers came from. <br />
<br />
The discussion of attention scores for interpretation does not provide a clear definition or mention previous literature that uses them. References to the literature on H3K27ac, and on how its read counts represent the active regions of a cell, should be included. No reasoning is given for why only one specific cell type is used to visualize the bin-level attention weights. Examples of other real-world problems where this model could be useful should be provided.<br />
<br />
Moreover, this paper relies heavily on intuition. Because of the complicated structure, it must be challenging to provide algorithmic or theoretical justifications. This means that there is no proper guidance on how the hyperparameters should be chosen, or on what treatment other datasets would require.<br />
<br />
= Additional Resources =<br />
<br />
# [https://qdata.github.io/deep4biomed-web/ Official DeepChrome Website]<br />
# [http://papers.nips.cc/paper/7255-attend-and-predict-understanding-gene-regulation-by-selective-attention-on-chromatin-supplemental.zip Supplemental Resources]<br />
# [https://github.com/QData/AttentiveChrome/blob/master/NIPS%20poster.pdf Poster]<br />
# [https://www.youtube.com/watch?v=tfgmXvSgsQE&feature=youtu.be Video Presentation]<br />
<br />
= Reference =<br />
<br />
[1] Andrew J Bannister and Tony Kouzarides. Regulation of chromatin by histone modifications. Cell Research, 21(3):381–395, 2011.<br />
<br />
[2] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.<br />
<br />
[3] Singh, Ritambhara, et al. "Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin." Advances in Neural Information Processing Systems. 2017.<br />
<br />
[4] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. Deepchrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.<br />
<br />
[5] Joanna Boros, Nausica Arnoult, Vincent Stroobant, Jean-François Collet, and Anabelle Decottignies. Polycomb repressive complex 2 and h3k27me3 cooperate with h3k9 methylation to maintain heterochromatin protein 1α at chromatin. Molecular and cellular biology, 34(19):3662–3674, 2014.<br />
<br />
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.<br />
<br />
[7] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.<br />
<br />
[8] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics.<br />
<br />
[9] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.<br />
<br />
[10] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.<br />
<br />
[11] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. volume 11, pages 1803–1831, 2010.<br />
<br />
[12] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. 2013.<br />
<br />
[13] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and understanding recurrent networks. 2015.<br />
<br />
[14] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in nlp. 2015.<br />
<br />
[15] Xianjun Dong and Zhiping Weng. The correlation between histone modifications and gene expression. Epigenomics, 5(2):113–116, 2013.<br />
<br />
[16] Shane McManus, Anja Ebert, Giorgia Salvagiotto, Jasna Medvedovic, Qiong Sun, Ido Tamir, Markus Jaritz, Hiromi Tagoh, and Meinrad Busslinger. The transcription factor pax5 regulates its target genes by recruiting chromatin-modifying proteins in committed b cells. The EMBO journal, 30(12):2388–2404, 2011.<br />
<br />
[17] Chao Cheng, Koon-Kiu Yan, Kevin Y Yip, Joel Rozowsky, Roger Alexander, Chong Shou, Mark Gerstein, et al. A statistical framework for modeling gene expression using chromatin features and application to modencode datasets. Genome Biol, 12(2):R15, 2011.<br />
<br />
[18] Xianjun Dong, Melissa C Greven, Anshul Kundaje, Sarah Djebali, James B Brown, Chao Cheng, Thomas R Gingeras, Mark Gerstein, Roderic Guigó, Ewan Birney, et al. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol, 13(9):R53, 2012.<br />
<br />
[19] Xianjun Dong and Zhiping Weng. The correlation between histone modifications and gene expression. Epigenomics, 5(2):113–116, 2013.<br />
<br />
[20] Bich Hai Ho, Rania Mohammed Kotb Hassen, and Ngoc Tu Le. Combinatorial roles of dna methylation and histone modifications on gene expression. In Some Current Advanced Researches on Information and Computer Science in Vietnam, pages 123–135. Springer, 2015.<br />
<br />
[21] Rosa Karlić, Ho-Ryun Chung, Julia Lasserre, Kristian Vlahoviček, and Martin Vingron. Histone modification levels are predictive for gene expression. Proceedings of the National Academy of Sciences, 107(7):2926–2931, 2010.<br />
<br />
[22] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. Deepchrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=MULTI-VIEW_DATA_GENERATION_WITHOUT_VIEW_SUPERVISION&diff=42387MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION2018-12-11T00:48:01Z<p>Msminhas: Editorial</p>
<hr />
<div>This page contains a summary of the paper "[https://openreview.net/forum?id=ryRh0bb0Z Multi-View Data Generation Without View Supervision]" by Mickael Chen, Ludovic Denoyer, Thierry Artieres. It was published at the International Conference on Learning Representations (ICLR) in 2018. An implementation of the models presented in this paper is available [https://github.com/mickaelChen/GMV here].<br />
<br />
==Introduction==<br />
<br />
===Motivation===<br />
High-dimensional generative models have seen a surge of interest lately with the introduction of Variational Auto-Encoders and Generative Adversarial Networks. This paper focuses on a particular problem where one aims at generating samples corresponding to a number of objects under various views. The distribution of the data is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles of the same object). The paper proposes two models that use this disentanglement of the latent space: a generative model and a conditional variant of it. The authors claim that, unlike many multiview approaches, the proposed model doesn’t need any supervision on the views but only on the content.<br />
<br />
===Related Work===<br />
<br />
The problem of handling multi-view inputs has mainly been studied from the predictive point of view, where one wants, for example, to learn a model able to predict/classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variation in multiview datasets. The underlying idea is that a particular data sample may be thought of as a mix of content information (e.g. related to its class label, like a given person in a face dataset) and side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with/without glasses...). So, all the samples of the same class contain the same content but different views. A number of approaches have been proposed to disentangle the content from the view (i.e. methods based on unlabeled samples), also referred to as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). The two common limitations of the earlier approaches, as claimed by the paper, are that (i) they usually consider discrete views that are characterized by a domain or a set of discrete (binary/categorical) attributes (e.g. face with/without glasses, the color of the hair, etc.) and cannot easily scale to a large number of attributes or to continuous views, and (ii) most models are trained using view supervision (e.g. the view attributes), which of course greatly helps in learning such models, yet prevents their use on many datasets where this information is not available.<br />
<br />
Recently, attempts have been made to learn such models without supervision, but they cannot disentangle high-level concepts, as only simple features can be reliably captured without any guidance.<br />
<br />
===Contributions===<br />
<br />
The contributions that the authors claim are the following: (i) a new generative model able to generate data with various content and high view diversity using supervision on the content information only; (ii) an extension of the generative model to a conditional model that allows generating new views of any input sample; (iii) experimental results on four different image datasets showing that the models can generate realistic samples and capture (and generate with) the diversity of views.<br />
<br />
Precisely, two models have been proposed:<br />
# a generative model ('''GMV - Generative Multi-view Model''') that generates objects under various views (multiview generation), <br />
# and a conditional extension, '''conditional GMV (C-GMV)''' of this model that generates a large number of views of any input object (conditional multi-view generation). <br />
<br />
Both models are based on the adversarial training schema of Generative Adversarial Networks (GAN) proposed in Goodfellow et al. (2014). The simple but strong idea is to focus on distributions over pairs of examples (e.g. images representing the same object in different views) rather than distributions over single examples.<br />
<br />
==Paper Overview==<br />
<br />
===Background===<br />
<br />
The paper builds on the popular GAN (Generative Adversarial Network) framework proposed by Goodfellow et al. (2014).<br />
<br />
GENERATIVE ADVERSARIAL NETWORK:<br />
<br />
Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other (thus the “adversarial”). GANs were introduced in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs, Facebook’s AI research director Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML.”<br />
<br />
Let us denote <math>X</math> an input space composed of multidimensional samples <math>x</math> e.g. vector, matrix or tensor. Given a latent space <math>R^n</math> and a prior distribution <math>p_z(z)</math> over this latent space, any generator function <math>G : R^n → X</math> defines a distribution <math>p_G </math> on <math> X</math> which is the distribution of samples <math>G(z)</math> where <math>z ∼ p_z</math>. A GAN defines, in addition to <math>G</math>, a discriminator function <math>D : X → [0; 1]</math> which aims at differentiating between real inputs sampled from the training set and fake inputs sampled from <math>p_G</math>, while the generator learns to fool the discriminator <math>D</math>. Usually both <math>G</math> and <math>D</math> are implemented with neural networks. The objective function is based on the following adversarial criterion:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{\min} \ \underset{D}{\max} \ E_{p_x}[\log D(x)] + E_{p_z}[\log(1 - D(G(z)))]</math></div><br />
<br />
where <math>p_x</math> is the empirical data distribution on <math>X</math> .<br />
It has been shown in Goodfellow et al. (2014) that if G∗ and D∗ are optimal for the above criterion, the Jensen-Shannon divergence between <math>p_{G∗}</math> and the empirical distribution of the data <math>p_x</math> in the dataset is minimized, making GAN able to estimate complex continuous data distributions.<br />
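<br />
As a minimal sketch of how the adversarial criterion above can be estimated from samples, the following Python function averages the two log terms over a batch of discriminator outputs. The function name <code>gan_value</code> and the toy numbers are illustrative assumptions, not part of the paper.<br />

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    Both argument names are hypothetical, chosen for this sketch.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two sets well scores higher.
confident = gan_value([0.9, 0.9], [0.1, 0.1])
```

The discriminator tries to increase this value while the generator tries to decrease it; when the discriminator outputs 0.5 on every input, the estimate equals <math>-\log 4</math>.<br />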
<br />
CONDITIONAL GENERATIVE ADVERSARIAL NETWORK:<br />
<br />
In the Conditional GAN (CGAN), the generator learns to generate a fake sample with a specific condition or characteristic (such as a label associated with an image or a more detailed tag) rather than a generic sample from an unknown noise distribution. The conditionality of a CGAN is determined by defining a generator function <math>G</math> which takes a noise vector <math>z</math> and a condition <math>y</math> as inputs. To add such a condition to both the generator and the discriminator, we simply feed the vector <math>y</math> into both networks. Hence, both the discriminator <math>D(x,y)</math> and the generator <math>G(y,z)</math> are conditioned on <math>y</math>. A target <math>x</math> for a given input <math>y</math> can be obtained by first sampling the latent vector <math>z ∼ p_z</math>, then computing <math>G(y, z)</math>. The discriminator takes both the condition <math>y</math> and the datapoint <math>x</math> as inputs.<br />
<br />
Now, the objective function of CGAN is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{\min} \ \underset{D}{\max} \ E_{p_x}[\log D(x,y)] + E_{p_z}[\log(1 - D(G(y,z),y))]</math></div><br />
<br />
The paper also suggests that many studies have reported that when dealing with high-dimensional input spaces, CGAN tends to collapse the modes of the data distribution, mostly ignoring the latent factor <math>z</math> and generating <math>x</math> only based on the condition <math>y</math>, exhibiting an almost deterministic behavior. At this point, the CGAN also fails to produce a satisfying amount of diversity in generated samples.<br />
<br />
===Generative Multi-View Model===<br />
<br />
''' Objective and Notations: ''' The distribution of the data x ∈ X is assumed to be driven by two latent factors: a content factor denoted c, which corresponds to the invariant properties of the object, and a view factor denoted v, which corresponds to the factors of variation. Typically, if X is the space of people’s faces, c stands for the intrinsic features of a person’s face while v stands for the transient features and the viewpoint of a particular photo of the face, including the photo exposure and additional elements like a hat, glasses, etc. These two factors c and v are assumed to be independent, and they are the factors that need to be learned.<br />
<br />
The paper defines two tasks: <br />
(i) '''Multi-View Generation''': we want to be able to sample over X by controlling the two factors c and v. Given two priors, p(c) and p(v), this sampling will be possible if we are able to estimate p(x|c, v) from a training set.<br />
(ii) '''Conditional Multi-View Generation''': the second objective is to be able to sample different views of a given object. Given a prior p(v), this sampling will be achieved by learning the probability p(c|x), in addition to p(x|c, v). Learning generative models that generate from a disentangled latent space would allow controlling the sampling on the two different axes, the content and the view. The authors claim that the originality of the work is to learn such generative models without using any view labeling information.<br />
<br />
The paper introduces the vectors '''c''' and '''v''' to represent latent vectors in R<sup>c</sup> and R<sup>v</sup>.<br />
<br />
<br />
''' Generative Multi-view Model: '''<br />
<br />
Consider two prior distributions over the content and view factors, denoted as <math>p_c</math> and <math>p_v</math>. Moreover, we consider a generator G that implements a distribution over samples x, denoted as <math>p_G</math>, by computing G(c, v) with <math>c ∼ p_c</math> and <math>v ∼ p_v</math>. The objective is to learn this generator so that its first input c corresponds to the content of the generated sample while its second input v captures the underlying view of the sample. Doing so would allow one to control the output sample of the generator by tuning its content or its view (i.e. c and v).<br />
<br />
The key idea that the authors propose is to focus on the distribution of pairs of inputs rather than on the distribution over individual samples. When no view supervision is available, the only valuable pairs of samples that one may build from the dataset consist of two samples of a given object under two different views. When we randomly choose two samples of the same object from the dataset, it is most likely that we get two different views. The paper explains that there are three goals here: (i) As in a regular GAN, each sample generated by G needs to look realistic. (ii) As real pairs are composed of two views of the same object, the generator should generate pairs of the same object. Since the two sampled view factors v1 and v2 are different, the only way this can be achieved is by encoding the invariant information in the content vector c. (iii) It is expected that the discriminator can easily distinguish a pair of samples corresponding to the same object under different views from a pair of samples corresponding to the same object under the same view. Because the pair shares the same content factor c, this should force the generator to use the view factors v1 and v2 to produce diversity in the generated pair.<br />
<br />
Now, the objective function of GMV Model is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{\min} \ \underset{D}{\max} \ E_{x_1,x_2}[\log D(x_1,x_2)] + E_{v_1,v_2}[\log(1 - D(G(c,v_1),G(c,v_2)))]</math></div><br />
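<br />
The pair construction behind this objective can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (<code>sample_vector</code>, <code>generated_pair</code>, <code>real_pair</code>, <code>toy_generator</code>) are hypothetical, the priors are taken to be standard Gaussians, and the toy generator simply concatenates its inputs so that the shared content is easy to inspect. A real implementation would use neural networks for G and D.<br />

```python
import random


def sample_vector(dim, rng):
    """Draw a latent vector from a standard Gaussian prior (an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def generated_pair(generator, content_dim, view_dim, rng):
    """Fake pair: one shared content vector c, two independent views v1 and v2."""
    c = sample_vector(content_dim, rng)
    v1 = sample_vector(view_dim, rng)
    v2 = sample_vector(view_dim, rng)
    return generator(c, v1), generator(c, v2)


def real_pair(dataset_by_object, rng):
    """Real pair: two distinct samples of the same object (content label only).

    Assumes every object in the dataset has at least two samples.
    """
    obj = rng.choice(sorted(dataset_by_object))
    first, second = rng.sample(dataset_by_object[obj], 2)
    return first, second


def toy_generator(c, v):
    """Stand-in for G(c, v): concatenates content and view vectors."""
    return c + v


rng = random.Random(0)
x1, x2 = generated_pair(toy_generator, content_dim=3, view_dim=2, rng=rng)
data = {"chair_a": ["a0", "a1", "a2"], "chair_b": ["b0", "b1"]}
r1, r2 = real_pair(data, rng)
```

Note that no view label is used anywhere: the only supervision is which samples belong to the same object.<br />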
<br />
Once the model is learned, the generator G can generate single samples by first sampling c and v following <math>p_c</math> and <math>p_v</math>, then computing G(c, v). By freezing c or v, one may then generate samples corresponding to multiple views of any particular content, or corresponding to many contents under a particular view. One can also interpolate between two given views over a particular content, or between two contents using a particular view.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:GMV.png]]</div><br />
<br />
===Conditional Generative Model (C-GMV)===<br />
<br />
C-GMV is proposed by the authors to be able to change the view of a given object that is provided as an input to the model. This model extends the generative model with the ability to extract the content factor from any given input and to use this extracted content to generate new views of the corresponding object. To achieve this goal, we must add to the generative model an encoder function denoted <math>E : X → R^C</math> that maps any input in X to the content space <math>R^C</math>.<br />
<br />
Input sample x is encoded in the content space using an encoder function, noted E (implemented as a neural network).<br />
This encoder serves to generate a content vector c = E(x) that will be combined with a randomly sampled view <math>v ∼ p_v</math> to generate an artificial example. The artificial sample is then combined with the original input x to form a negative pair. The issue with this approach is that CGAN is known to easily miss modes of the underlying distribution: the generator enters a state where it ignores the noisy component v. To overcome this phenomenon, we use the same idea as in GMV. We build negative pairs <math>(G(c, v_1), G(c, v_2))</math> by randomly sampling two views <math>v_1</math> and <math>v_2</math> that are combined with a unique content c, where c is computed from a sample x using the encoder E, i.e. c = E(x). By doing so, the ability of the approach to generate pairs with view diversity is preserved. Since this diversity can only be captured by taking into account the two different view vectors provided to the model (<math>v_1</math> and <math>v_2</math>), this will encourage G(c, v) to generate samples containing both the content information c and the view v. Positive pairs are sampled from the training set and correspond to two views of a given object.<br />
<br />
The objective function for C-GMV is:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{G}{\min} \ \underset{D}{\max} \ E_{x_1,x_2 \sim p_x \mid l(x_1)=l(x_2)}[\log D(x_1,x_2)] + E_{v_1,v_2 \sim p_v, x \sim p_x}[\log(1 - D(G(E(x),v_1),G(E(x),v_2)))] + E_{v \sim p_v, x \sim p_x}[\log(1 - D(G(E(x), v), x))]</math></div><br />
<br />
<div style="text-align: center;font-size:100%">[[File:CGMV.png]]</div><br />
<br />
<br />
At inference time, as with the GMV model, we are interested in getting the encoder E and the generator G. These models may be used for generating new views of any object which is observed as an input sample x by computing its content vector E(x), then sampling <math>v ∼ p_v</math>, and finally computing the output G(E(x), v).<br />
<br />
==Experiments and Results==<br />
<br />
The authors have given an exhaustive set of results and experiments.<br />
<br />
Datasets: The two models were evaluated by performing experiments over four image datasets of various domains. Note that when supervision is available on the views (like CelebA for example where images are labeled with attributes) it is not used for learning models. The only supervision that is used is if two samples correspond to the same object or not.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:table_data.png]]</div><br />
<br />
<br />
Model Architecture: The same architectures were used for every dataset. The images were rescaled to 3×64×64 tensors. The generator G and the discriminator D follow the DCGAN implementation proposed in Radford et al. (2015). The encoder E is similar to D, the only differences being the batch-normalization in the first layer and the last layer, which does not have a non-linearity. The Adam optimizer was used, with a batch size of 128. The learning rates for G and D were set to 1*10<sup>-3</sup> and 2*10<sup>-4</sup> respectively for the GMV experiments. In the C-GMV experiments, learning rates of 5*10<sup>-5</sup> were used. Alternating gradient descent was used to optimize the different objectives of the network components (generator, encoder and discriminator).<br />
<br />
Baselines: Most existing methods are learned on datasets with view labeling. To compare fairly with alternative models, the authors built baselines working under the same conditions as the models in this paper. In addition, the models are compared with the model from Mathieu et al. (2016). Results obtained with two implementations are reported: the first one is based on the implementation provided by the authors (denoted Mathieu et al. (2016)), and the second one (denoted Mathieu et al. (2016) (DCGAN)) implements the same model using architectures inspired by DCGAN (Radford et al. (2015)), which is more stable and was tuned to allow a fair comparison with the proposed approach. For the pure multi-view generative setting, the generative model (GMV) is compared with standard GANs that are learned to approximate the joint generation of multiple samples: DCGANx2 is learned to output pairs of views over the same object, DCGANx4 is trained on quadruplets, and DCGANx8 on eight different views. <br />
<br />
===Generating Multiple Contents and Views===<br />
<br />
Figure 1 shows examples of generated images by our model and Figure 4 shows images sampled by the DCGAN based models (DCGANx2, DCGANx4, and DCGANx8) on 3DChairs and CelebA datasets.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig1_gmv.png]]</div><br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig4_gmv.png]]</div><br />
<br />
<br />
Figure 5 shows additional results, using the same presentation, for the GMV model only on two other datasets. In the left hand block of Figure 5, each row shows different views generated given the same content. <br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig5_gmv.png]]</div><br />
<br />
Figure 6 shows generated samples obtained by interpolation between two different view factors (left) or two content factors (right). Again, in the left and right hand block of Figure 6, each row shows different views generated given the same content. It allows us to have a better idea of the underlying view/content structure captured by GMV. We can see that the approach is able to smoothly move from one content/view to another content/view while keeping the other factor constant. This also illustrates that content and view factors are handled independently by the generator, i.e. changing the view does not modify the content and vice versa.<br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig6_gmv.png]]</div><br />
<br />
===Generating Multiple Views of a Given Object===<br />
<br />
The second set of experiments evaluates the ability of C-GMV to capture a particular content from an input sample and to use this content to generate multiple views of the same object. Figures 7 and 8 illustrate the diversity of views in samples generated by the model and compare the results with those obtained with the CGAN model and with models from Mathieu et al. (2016). For each row, the input sample is shown in the left column. New views are generated from that input and shown to the right, with those generated from C-GMV in the centre, and those generated from CGAN on the far right.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig7_gmv.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:fig8_gmv.png]]</div><br />
<br />
=== Evaluation of the Quality of Generated Samples ===<br />
<br />
There are usually several metrics to evaluate generative models. Some of them are: <br />
<ol><br />
<li>Inception Score: In a general sense, the Inception Score is a metric used to quantify the “realness” of a generated image. It is calculated across a set of generated images, and considers two criteria. First, all images of the same class should be similar (low in-class variance). Second, the distribution of classes should not be dominated by any particular class. The better these criteria are met, the higher the Inception Score.</li><br />
<li>Latent Space Interpolation</li><br />
<li>log-likelihood (LL) score</li><br />
<li> minimum description length (MDL) score</li><br />
<li>minimum message length (MML) score</li><br />
<li>Akaike Information Criterion (AIC) score</li><br />
<li>Bayesian Information Criterion (BIC) score</li><br />
</ol><br />
<br />
<br />
<br />
<br />
<br />
The authors did sets of experiments aimed at evaluating the quality of the generated samples. They have been made on the CelebA dataset and evaluate (i) the ability of the models to preserve the identity of a person in multiple generated views, (ii) to generate realistic samples, (iii) to preserve the diversity in the generated views and (iv) to capture the view distributions of the original dataset.<br />
<br />
<div style="text-align: center;font-size:100%">[[File:tab3.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:tab4.png]]</div><br />
<br />
<br />
<div style="text-align: center;font-size:100%">[[File:table.png]]</div><br />
<br />
==Conclusion==<br />
<br />
The paper proposed a generative model that can be learned from multi-view data without any view supervision. Moreover, it introduced a conditional version that allows generating new views of an input image. The experiments indicate that the model can capture content and view factors. In the future, the authors plan to use the method of this paper for data augmentation, which can enrich training datasets.<br />
<br />
==Future Work==<br />
The authors of the paper mentioned that they plan to explore using their model for data augmentation, as it can produce other data views for training, in both semi-supervised and one-shot/few-shot learning settings. <br />
<br />
==Critique==<br />
<br />
The main idea is to train the model with pairs of images with different views. It is not that clear what defines a view in particular. The algorithms are largely based on the earlier concepts of GAN and CGAN. The authors give reference to the previous papers tackling the same problem and clearly state that the novelty of this approach is not making use of view labels. The authors give a very thorough list of experiments which clearly establish the superiority of the proposed models over the baselines.<br />
<br />
However, this paper only tested the model on rather constrained examples. As was observed in the results the proposed approach seems to have a high sample complexity relying on training samples covering the full range of variations for both specified and unspecified variations. Also, the proposed model does not attempt to disentangle variations within the specified and unspecified components.<br />
<br />
The method that the paper presented is novel and the paper is easy to follow. However, the authors only show a comparison between the proposed method and several baselines: DCGAN and CGAN and do not compare with the methods from Mathieu et al. 2016. In addition, the experiment result is empirical, we do not know the performance of this method in practice in the real world.<br />
<br />
==References==<br />
<br />
[1] Mickael Chen, Ludovic Denoyer, Thierry Artieres. MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION. Published as a conference paper at ICLR 2018<br />
<br />
[2] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.<br />
<br />
[3] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.<br />
<br />
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[5] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Reinforcement_Learning_of_Theorem_Proving&diff=42386Reinforcement Learning of Theorem Proving2018-12-11T00:45:57Z<p>Msminhas: Editorial</p>
<hr />
<div>== Introduction ==<br />
Automated reasoning over mathematical proofs was a major motivation for the development of computer science. Automated theorem provers (ATPs) can, in principle, be used to attack any formally stated mathematical problem, and automated theorem proving is a research area that has been present since the early 20th century [1]. As of today, state-of-the-art ATP systems rely on the fast implementation of complete proof calculi, such as resolution and tableau. However, they are still far weaker than trained mathematicians. Within current ATP systems, many heuristics are essential for their performance. As a result, in recent years machine learning has been used to replace such heuristics and improve the performance of ATPs.<br />
<br />
In this paper, the authors propose a reinforcement learning based ATP, rlCoP. The proposed ATP reasons within first-order logic. The underlying proof calculi are the connection calculi [2], and the reinforcement learning method is Monte Carlo tree search along with policy and value learning. It is shown that reinforcement learning results in a 42.1% performance increase compared to the base prover (without learning).<br />
<br />
== Related Work ==<br />
In 2015, C. Kaliszyk and J. Urban proposed a supervised learning based ATP, FEMaLeCoP, whose underlying proof calculus is the same as in this paper [3]. Their algorithm learns from existing proofs to choose the next tableau extension step. Since the MaLARea [8] system, a number of iterations of a feedback loop between proving and learning have been explored, remarkably improving over human-designed heuristics when reasoning in large theories. However, such systems are known to only learn a high-level selection of relevant facts from a large knowledge base and delegate the internal proof search to standard ATP systems. S. Loos, et al. developed a supervised learning ATP system in 2017 [4], with superposition as their proof calculus. However, they chose deep neural networks (CNNs and RNNs) as feature extractors. These systems are treated as black boxes in the literature, with little understanding of their performance possible. <br />
<br />
In leanCoP [9], one of the simpler connection tableau systems, the next tableau extension step could be selected using supervised learning. In addition, the first experiments with Monte-Carlo guided proof search [5] have been done for connection tableau systems. The improvement over the baseline measured in that work is much less significant than here. This is closest to the authors' approach but the performance is poorer than this paper.<br />
<br />
On a different note, A. Alemi, et al. proposed a deep sequence model for premise selection in 2016 [6], and they claim to be the first team to involve deep neural networks in ATPs. Although premise selection is not directly linked to automated reasoning, it is still an important component in ATPs, and their paper provides some insights into how to process datasets of formally stated mathematical problems.<br />
<br />
== First Order Logic and Connection Calculi ==<br />
Here we assume basic first-order logic and theorem proving terminology, and we will offer a brief introduction of the bare prover and connection calculi. Let us try to prove the following first-order sentence.<br />
<br />
[[file:fof_sentence.png|frameless|450px|center]]<br />
<br />
This sentence can be transformed into a formula in Skolemized Disjunctive Normal Form (DNF), which is referred to as the "matrix".<br />
<br />
[[file:skolemized_dnf.png|frameless|450px|center]] <br />
[[file:matrix.png|frameless|center]] <br />
<br />
The original first-order sentence is valid if and only if the Skolemized DNF formula is a tautology. The connection calculi attempt to show that the Skolemized DNF formula is a tautology by constructing a tableau. We will start at the special node, root, which is an open leaf. At each step, we select a clause (for example, clause <math display="inline">P \wedge R</math> is selected in the first step), and add the literals as children for an existing open leaf. For every open leaf, examine the path from the root to this leaf. If two literals on this path are unifiable (for example, <math display="inline">Qx'</math> is unifiable with <math display="inline">\neg Qc</math>), this leaf is then closed. An example of closed tableaux is shown in Figure 1. In standard terminology, it states that a connection is found on this branch.<br />
<br />
[[file:tableaux_example.png|thumb|center|Figure 1. An example of closed tableaux. Adapted from [2]]]<br />
<br />
The goal is to close every leaf, i.e. on every branch, there exists a connection. If such a state is reached, we have shown that the Skolemized DNF formula is a tautology, thus proving the original first-order sentence. As we can see from the constructed tableaux, the example sentence is indeed valid.<br />
<br />
In formal terms, the rules of the connection calculi are shown in Figure 2, and the formal tableaux for the example sentence are shown in Figure 3. Each leaf is denoted as <math display="inline">subgoal, M, path</math> where <math display="inline">subgoal</math> is a list of literals for which we still need to find connections, <math display="inline">M</math> stands for the matrix, and <math display="inline">path</math> stands for the path leading to this leaf.<br />
<br />
[[file:formal_calculi.png|thumb|600px|center|Figure 2. Formal connection calculi. Adapted from [2].]]<br />
[[file:formal_tableaux.png|thumb|600px|center|Figure 3. Formal tableaux constructed from the example sentence. Adapted from [2].]]<br />
<br />
To sum up, the bare prover follows a very simple algorithm. Given a matrix, a non-negated clause is chosen as the first subgoal. The function ''prove(subgoal, M, path)'' is stated as follows:<br />
* If ''subgoal'' is empty<br />
** return ''TRUE''<br />
* If reduction is possible<br />
** Perform reduction, generating ''new_subgoal'', ''new_path''<br />
** return ''prove(new_subgoal, M, new_path)''<br />
* For all clauses in ''M''<br />
** If a clause can do extension with ''subgoal''<br />
*** Perform extension, generating ''new_subgoal1'', ''new_path'', ''new_subgoal2''<br />
*** return ''prove(new_subgoal1, M, new_path)'' and ''prove(new_subgoal2, M, path)''<br />
* return ''FALSE''<br />
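<br />
The steps above can be sketched in Python for the propositional case, where "unifiable" reduces to "complementary" and no substitutions are needed. This is a simplified illustration rather than the prover used in the paper: the string encoding of literals (a leading <code>-</code> for negation), the backtracking over alternative extensions, the <code>max_depth</code> bound, and all function names are assumptions made for this sketch.<br />

```python
def negate(lit):
    """Complement of a literal; '-P' is the negation of 'P'."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def prove(subgoal, matrix, path, max_depth=20):
    """Try to close every literal in `subgoal` against `matrix` (DNF clauses)."""
    if not subgoal:
        return True
    lit, rest = subgoal[0], subgoal[1:]
    # Reduction: the complement of the literal is already on the path.
    if negate(lit) in path and prove(rest, matrix, path, max_depth):
        return True
    # Extension: connect the literal to a clause containing its complement.
    # The depth bound (an assumption) prevents infinite branches.
    if len(path) < max_depth:
        for clause in matrix:
            if negate(lit) in clause:
                new_subgoal = [l for l in clause if l != negate(lit)]
                if (prove(new_subgoal, matrix, path + [lit], max_depth)
                        and prove(rest, matrix, path, max_depth)):
                    return True
    return False


def is_valid(matrix, max_depth=20):
    """Start from each clause that contains no negated literal."""
    starts = [c for c in matrix if all(not l.startswith("-") for l in c)]
    return any(prove(list(c), matrix, [], max_depth) for c in starts)


# (P and Q) or (not P) or (not Q) is a tautology.
tautology = [["P", "Q"], ["-P"], ["-Q"]]
# P or Q is not a tautology.
not_tautology = [["P"], ["Q"]]
```

Unlike the listing above, this sketch tries the remaining clauses when one extension fails; the first-order prover additionally has to unify literals and carry substitutions along the branch.<br />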
<br />
It is important to note that the bare prover implemented in this paper is incomplete. Here is a pathological example. Suppose the following matrix (which is trivially a tautology) is fed into the bare prover. Let clause <math display="inline">P(0)</math> be the first subgoal. Clearly, choosing <math display="inline">\neg P(0)</math> to extend will complete the proof.<br />
<br />
[[file:pathological.png|frameless|400px|center]] <br />
<br />
However, if we choose <math display="inline">\neg P(x) \lor P(s(x))</math> to do extension, the algorithm will generate an infinite branch <math display="inline">P(0), P(s(0)), P(s(s(0))) ...</math>. It is the task of reinforcement learning to guide the prover in such scenarios towards a successful proof.<br />
<br />
A technique called iterative deepening can be used to avoid such infinite loops, making the bare prover complete. Iterative deepening forces the prover to try all shorter proofs before moving on to longer ones. It is effective, but it also wastes valuable computing resources enumerating all short proofs.<br />
<br />
In addition, the provability of first-order sentences is generally undecidable (this result is known as Church's theorem, or the Church-Turing theorem), which sheds light on the difficulty of automated theorem proving.<br />
<br />
== Mizar Math Library ==<br />
Mizar Math Library (MML) [7, 10] is a library of mathematical theories. The axioms behind the library are those of the Tarski-Grothendieck set theory, written in first-order logic. The library contains 57,000+ theorems and their proofs, along with many other lemmas, as well as unproven conjectures. Figure 4 shows a Mizar article of the theorem "If <math display="inline"> p </math> is prime, then <math display="inline"> \sqrt p </math> is irrational."<br />
<br />
[[file:mizar_article.png|thumb|center|Figure 4. An article from MML. Adapted from [6].]]<br />
<br />
The training and testing data for this paper is a subset of MML, the Mizar40, which consists of 32,524 theorems proved by automated theorem provers. Below is an example from the Mizar40 library; it states that with ''d3_xboole_0'' and ''t3_xboole_0'' as premises, we can prove ''t5_xboole_0''.<br />
<br />
[[file:mizar40_0.png|frameless|400px|center]]<br />
[[file:mizar40_1.png|frameless|600px|center]]<br />
[[file:mizar40_2.png|frameless|600px|center]]<br />
[[file:mizar40_3.png|frameless|600px|center]]<br />
<br />
== Monte Carlo Guidance ==<br />
<br />
Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes. The focus of Monte Carlo tree search is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. The results of the expansion are then used to weight the nodes in the search tree.<br />
<br />
In the reinforcement learning setting, the action is defined as one inference (either reduction or extension). The proof state is defined as the whole tableaux. To implement Monte-Carlo tree search, each proof state <math display="inline"> i </math> needs to maintain three parameters, its prior probability <math display="inline"> p_i </math>, its total reward <math display="inline"> w_i </math>, and number of its visits <math display="inline"> n_i </math>. If no policy learning is used, the prior probabilities are all equal to one. <br />
<br />
A simple heuristic is used to estimate the future reward of leaf states: suppose leaf state <math display="inline"> i </math> has <math display="inline"> G_i </math> open subgoals, the reward is computed as <math display="inline"> 0.95 ^ {G_i} </math>. This will be replaced once value learning is implemented.<br />
<br />
The standard UCT formula is chosen to select the next actions in the playouts<br />
\begin{align}<br />
{\frac{w_i}{n_i}} + 2 \cdot p_i \cdot {\sqrt{\frac{\log N}{n_i}}}<br />
\end{align}<br />
where <math display="inline"> N </math> stands for the total number of visits of the parent node.<br />
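The selection rule and the leaf heuristic above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' OCaml implementation: the function names, the dictionary layout of a child node, and the choice to try unvisited children first are assumptions made for the example.<br />

```python
import math


def leaf_reward(open_subgoals, base=0.95):
    """Heuristic reward of a leaf state with G_i open subgoals: 0.95 ** G_i."""
    return base ** open_subgoals


def uct_score(w_i, n_i, p_i, parent_visits, c=2.0):
    """UCT value w_i / n_i + c * p_i * sqrt(log N / n_i), with c = 2 as in the formula."""
    if n_i == 0:
        # Assumption for this sketch: unvisited children are always tried first.
        return math.inf
    return w_i / n_i + c * p_i * math.sqrt(math.log(parent_visits) / n_i)


def select_child(children, parent_visits):
    """Return the index of the child with the highest UCT score.

    Each child is a dict with keys "w" (total reward), "n" (visits), "p" (prior).
    """
    scores = [uct_score(ch["w"], ch["n"], ch["p"], parent_visits) for ch in children]
    return max(range(len(children)), key=scores.__getitem__)
```

With equal priors, a rarely visited child receives a larger exploration bonus, so it can be selected even when its average reward is similar to that of a frequently visited sibling.<br />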
<br />
The bare prover is asked to play <math display="inline"> b </math> playouts of length <math display="inline"> d </math> from the empty tableau; each playout backpropagates the values of the proof states it visits. After these <math display="inline"> b </math> playouts a special action (inference) is made, corresponding to an actual move, resulting in a new bigstep tableau. The next <math display="inline"> b </math> playouts will start from this tableau, followed by another bigstep, etc.<br />
<br />
== Policy Learning and Guidance ==<br />
<br />
Many runs of MCT provide an estimate of the optimal prior probability of actions (inferences) in particular proof states. We extract the frequency of each action <math display="inline"> a </math> and normalize it by dividing by the average action frequency at that state, resulting in a relative proportion <math display="inline"> r_a \in (0, \infty) </math>. We characterize the proof states for policy learning by extracting human-engineered features. We also characterize actions by extracting features from the chosen clause and the chosen literal. Thus we have a feature vector <math display="inline"> (f_s, f_a) </math>. <br />
<br />
The feature vector <math display="inline"> (f_s, f_a) </math> is regressed against the associated <math display="inline"> r_a </math>.<br />
<br />
During the proof search, the prior probabilities <math display="inline"> p_i </math> of available actions <math display="inline"> a_i </math> in a state <math display="inline"> s </math> are computed as the softmax of their predictions.<br />
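The two computations described in this section, the regression target <math display="inline"> r_a </math> and the softmax over predictions, can be illustrated with a short Python sketch. The function names and the dictionary of action counts are hypothetical; the learned regressor that sits between the two steps is omitted.<br />

```python
import math


def relative_proportions(action_counts):
    """Regression targets r_a: each action's frequency divided by the
    average action frequency at the same state."""
    mean = sum(action_counts.values()) / len(action_counts)
    return {action: count / mean for action, count in action_counts.items()}


def softmax(predictions):
    """Turn the regressor's predictions for the available actions into
    prior probabilities that sum to one."""
    shift = max(predictions)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in predictions]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, if three actions were chosen 6, 2 and 1 times at a state, the average frequency is 3, so the targets are 2, 2/3 and 1/3.<br />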
<br />
Training examples are only extracted from bigstep states, making the amount of training data manageable.<br />
<br />
== Value Learning and Guidance ==<br />
<br />
Bigstep states are also used for proof state evaluation. For a proof state <math display="inline"> s </math>, if it corresponds to a successful proof, the value is assigned as <math display="inline"> v_s = 1 </math>. If it corresponds to a failed proof, the value is assigned as <math display="inline"> v_s = 0 </math>. For other scenarios, denote the distance between state <math display="inline"> s </math> and a successful state as <math display="inline"> d_s </math>; then the value is assigned as <math display="inline"> v_s = 0.99^{d_s} </math>.<br />
<br />
The proof state feature <math display="inline"> f_s </math> is regressed against the value <math display="inline"> v_s </math>. During the proof search, the rewards of leaf states are computed from this prediction.<br />
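The value targets described above can be written as a small Python function. This is only a sketch of the labelling rule; the outcome labels "proved", "failed" and "open" are names assumed for the example.<br />

```python
def value_target(outcome, distance_to_success=None, discount=0.99):
    """Training target v_s for a bigstep proof state.

    outcome: "proved" for a successful proof, "failed" for a failed proof,
             "open" for a state that is d_s steps from a successful state.
    """
    if outcome == "proved":
        return 1.0
    if outcome == "failed":
        return 0.0
    return discount ** distance_to_success
```

States closer to a successful proof therefore receive targets closer to 1.<br />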
<br />
== Features and Learners ==<br />
For proof states, features are collected from the whole tableau (subgoals, matrix, and paths). Each unique symbol is represented by an integer, so the tableau can be represented as a sequence of integers. Term walks combine a short sequence of integers into a single integer by multiplying the components by a fixed large prime and adding them up. The resulting integer is then reduced to a smaller feature space by taking it modulo a large prime.<br />
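The hashing scheme can be sketched as follows. The two primes, the walk length of 3, and the function names are assumptions chosen for illustration; the summary does not state the values used by the authors.<br />

```python
from collections import Counter

BIG_PRIME = 1_000_003   # hypothetical multiplier
FEATURE_SPACE = 131_071  # hypothetical modulus defining the feature space size


def walk_hash(symbols):
    """Combine a short sequence of symbol ids into one integer by repeated
    multiply-and-add, then reduce it modulo the feature space size."""
    h = 0
    for s in symbols:
        h = h * BIG_PRIME + s
    return h % FEATURE_SPACE


def term_walk_features(sequence, walk_len=3):
    """Count the hashed term walks (consecutive windows of symbol ids)."""
    feats = Counter()
    for i in range(len(sequence) - walk_len + 1):
        feats[walk_hash(sequence[i:i + walk_len])] += 1
    return feats
```

The result is a sparse count vector indexed by hashed walks, which is a convenient input format for a gradient-boosted tree learner.<br />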
<br />
For actions the feature extraction process is similar, but the term walk is over the chosen literal and the chosen clause.<br />
<br />
In addition to the term walks, the authors also added several common features: number of goals, total symbol size of all goals, length of active paths, number of current variable instantiations, and most common symbols.<br />
<br />
The whole project is implemented in OCaml, and XGBoost is ported into OCaml as the learner.<br />
<br />
== Experimental Results ==<br />
In the paper, the dataset used is Mizar40. The authors divided the Mizar40 dataset into training and testing sets with a ratio of 9 to 1. According to the authors, the split is random. Mizar40 consists of the 32,524 statements, out of 146,700, that automated theorem provers were able to prove. A key step in the authors' approach is transforming the data from first-order logic form into disjunctive normal form (DNF).<br />
The authors use the M2k dataset to compare the performance of mlCoP, the bare prover and rlCoP using only UCT. There were 577 test problems on which rlCoP was trained. <br />
*Performance without Learning<br />
Table 3 shows the baseline result. The performance of the bare prover is significantly lower than that of mlCoP and of rlCoP without policy/value.<br />
[[file:table3.png|550px|center]]<br />
*Reinforcement Learning of Policy Only<br />
In this experiment, the authors evaluated rlCoP with UCT on the dataset using policy learning only. They used the policy training data from previous iterations to train a new predictor after each iteration. This means that only the first iteration ran without a policy, while all remaining iterations used the previous policy training data.<br />
From Table 4, rlCoP is better than mlCoP run with the much higher <math>4 \times 10^{6}</math> inference limit after the fourth iteration. <br />
[[file:table4.png|550px|center]]<br />
*Reinforcement Learning of Value Only<br />
This experiment was similar to the last one; however, the authors used only value learning rather than a learned policy. From Table 5, the performance of rlCoP is close to mlCoP but below it after 20 iterations, and it is far below rlCoP using only policy learning.<br />
[[file:table5.png|550px|center]]<br />
*Reinforcement Learning of Policy and Value<br />
From Table 6, the performance of rlCoP is 19.4% higher than that of mlCoP with <math>4 \times 10^{6}</math> inferences, 13.6% higher than the best iteration of rlCoP with policy only, and 44.3% higher than the best iteration of rlCoP with value only after 20 iterations.<br />
[[file:table6.png|550px|center]]<br />
Besides, they also evaluated the effect of the joint reinforcement learning of both policy and value. Replacing final policy and value with the best one from policy-only or value-only both decreased performance.<br />
<br />
*Evaluation on the Whole Mizar40 Dataset<br />
The authors split the Mizar40 dataset into 90% training examples and 10% testing examples. 200,000 inferences are allowed for each problem. 10 iterations of policy and value learning are performed (based on MCT). The training and testing results are shown as follows. In the table, ''mlCoP'' stands for the bare prover with iterative deepening (i.e. a complete automated theorem prover with connection calculi), and ''bare prover'' stands for the prover implemented in this paper, without MCT guidance.<br />
<br />
[[file:atp_result0.jpg|frame|550px|center|Figure 5a. Experimental result on Mizar40 dataset]]<br />
[[file:atp_result1.jpg|frame|550px|center|Figure 5b. More experimental result on Mizar40 dataset]]<br />
<br />
As shown by these results, reinforcement learning leads to a significant performance increase for automated theorem proving. The 42.1% performance improvement is unusually high, since published improvements in this field are typically between 3% and 10%. [1]<br />
<br />
Besides these results, the authors also found that some test problems could be solved easily by rlCoP but not by mlCoP.<br />
<br />
[[file:picture3.png|frame|550px|center|Figure 6: The MCTS tree for the WAYBEL 0:28 problem at the moment when the proof is found. For each node we display the predicted probability p, the number of visits n and the average reward r=w/n. For the (thicker) nodes leading to the proof the corresponding local proof goals are presented on the right.]]<br />
<br />
== Conclusions ==<br />
In this work, the authors developed an automated theorem prover that uses no domain engineering and instead relies on MCT guided by reinforcement learning. The resulting system is more than 40% stronger than the baseline system. The authors believe that this is a landmark in the field of automated reasoning, demonstrating that building general problem solvers by reinforcement learning is a viable approach. [1]<br />
<br />
The authors propose that future research could include stronger learning algorithms to characterize mathematical data. The development of suitable deep learning architectures would help the algorithm characterize semantic and syntactic features of mathematical objects, which will be crucial for creating strong assistants for mathematics and the hard sciences.<br />
<br />
== Critiques ==<br />
Automated reasoning is still relatively new to the field of machine learning, and this paper shows a lot of promise in this research area.<br />
<br />
The feature extraction part of this paper is less than optimal. It is my opinion that with proper neural network architecture, deep learning extracted features will be superior to human-engineered features, which is also shown in [4, 6].<br />
<br />
Also, the policy-value learning iteration is quite inefficient. The learning loop is:<br />
* Loop <br />
** Run MCT with the previous model on an entire dataset<br />
** Collect MCT data<br />
** Train a new model<br />
If we adapt this to an online learning scheme, learning as soon as MCT generates new data and updating the model immediately, there might be some performance increase.<br />
<br />
The experimental design of this paper has some flaws. The authors compare the performance of ''mlCoP'' and ''rlCoP'' by limiting them to the same number of inference steps. However, every inference step of ''rlCoP'' requires additional machine learning prediction, which costs more time. A better way to compare their performance is to set a time limit.<br />
<br />
It would also be interesting to study automated theorem proving in another logic system, like higher-order logic, because many mathematical concepts can only be expressed in higher-order logic.<br />
<br />
== References ==<br />
[1] C. Kaliszyk, et al. Reinforcement Learning of Theorem Proving. NIPS 2018.<br />
<br />
[2] J. Otten and W. Bibel. leanCoP: Lean Connection-Based Theorem Proving. Journal of Symbolic Computation, vol. 36, pp. 139-161, 2003.<br />
<br />
[3] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly Efficient Machine Learning Connection Prover. Lecture Notes in Computer Science. vol. 9450. pp. 88-96, 2015.<br />
<br />
[4] S. Loos, et al. Deep Network Guided Proof Search. LPAR-21, 2017.<br />
<br />
[5] M. Färber, C. Kaliszyk, and J. Urban. Monte Carlo tableau proof search. In L. de Moura, editor, 26th International Conference on Automated Deduction (CADE), volume 10395 of LNCS, pages 563–579. Springer, 2017.<br />
<br />
[6] A. Alemi, et al. DeepMath-Deep Sequence Models for Premise Selection. NIPS 2016.<br />
<br />
[7] Mizar Math Library. http://mizar.org/library/<br />
<br />
[8] J. Urban, G. Sutcliffe, P. Pudlák, and J. Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In A. Armando, P. Baumgartner, and G. Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441–456. Springer, 2008.<br />
<br />
[9] J. Otten and W. Bibel. leanCoP: lean connection-based theorem proving. J. Symb. Comput., 36(1-2):139–161, 2003.<br />
<br />
[10] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Rea-<br />
soning, 3(2):153–245, 2010</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Predicting_Floor_Level_For_911_Calls_with_Neural_Network_and_Smartphone_Sensor_Data&diff=42385Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data2018-12-11T00:41:05Z<p>Msminhas: Editorial</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
In highly populated cities with many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of the essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.<br />
<br />
The motivation for this problem is in the context of 911 calls: victims trapped in a tall building who seek immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult. <br />
<br />
In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.<br />
<br />
In large cities with tall buildings, relying on GPS or Wi-Fi signals does not always lead to an accurate location of a caller.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]<br />
[[File:19floor.png|250px]]</div><br />
<br />
<br />
In this work, there are two major contributions. The first is that they trained an LSTM to classify whether a smartphone was either inside or outside a building using GPS, Received signal strength indication (RSSI), and magnetometer sensor readings. The model is compared with baseline models like feed-forward neural networks, logistic regression, SVM, HMM, and Random Forests. The second contribution is an algorithm, which uses the output of the trained LSTM, to predict change in the barometric pressure of the smartphone from when it first entered the building against that of its current location within the building. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.<br />
<br />
The model does not rely on external sensors placed inside the building, prior knowledge of the building, or user movement behaviour. The only inputs it looks at are the GPS and barometric signals from the phone. Finally, they also talk about the application of this algorithm in a variety of other real-world situations. <br />
<br />
All the code and data related to this article are available [https://github.com/williamFalcon/Predicting-floor-level-for-911-Calls-with-Neural-Networks-and-Smartphone-Sensor-Data here].<br />
<br />
=Related Work=<br />
<br />
<br />
In general, previous work falls under two categories. The first category consists of classification methods based on the user's activity. <br />
These methods leverage the user's activity, such as running, walking, or riding an elevator, to predict the floor level from the offset in the user's movement [2].<br />
The second category focuses on the use of a barometer, which measures atmospheric pressure and can therefore capture changes in altitude.<br />
<br />
Avinash Parnandi and his coauthors used multiple classifiers to predict the floor level [2]. The steps in their algorithmic process are: <br />
<ol><br />
<li> Classifier to predict whether the user is indoors or outdoors</li><br />
<li> Classifier to identify the activity of the user, e.g. walking, standing still, etc. </li><br />
<li> Classifier to measure the displacement</li><br />
</ol><br />
<br />
One downside of this work is that achieving high accuracy requires the user's step size, so the method relies heavily on pre-training for specific users. In a real-world application, this would not be practical.<br />
<br />
<br />
Song and his colleagues model the cause of ascent, that is, whether the ascent was a result of taking the elevator, stairs, or escalator [3]. Then, by using infrastructure support from the buildings as well as additional tuning, they are able to predict the floor level. <br />
This method also suffers from relying on data specific to the building. <br />
<br />
Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.<br />
<br />
=Method=<br />
<br />
<br />
In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].<br />
<br />
To collect data, the authors developed an iOS application (called Sensory) that runs on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength, GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second, and each datum contained the different sensor measurements mentioned earlier along with environment contexts like building floors, environment activity, city name, country name, and magnetic strength.<br />
<br />
The data collection procedure for '''indoor-outdoor classifier''' was described as follows:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost door) set indoors to 1. 6) As soon as we exit, set indoors to 0. 7) Stop recording. 8) Save data as CSV for analysis. This procedure can start either outside or inside a building without loss of generality.<br />
<br />
The following procedure generates data used to '''predict a floor change''' from the entrance floor to the end floor:<br />
1) Start outside a building. 2) Turn Sensory on, set indoors to 0. 3) Start recording. 4) Walk into and out of buildings over the next n seconds. 5) As soon as we enter the building (cross the outermost door) set indoors to 1. 6) Finally, enter a building and ascend/descend to any story. 7) Ascend through any method desired, stairs, elevator, escalator, etc. 8) Once at the floor, stop recording. 9) Save data as CSV for analysis.<br />
<br />
Their algorithm for predicting floor level is a three-part process:<br />
<br />
<ol><br />
<li> Classifying whether smartphone is indoor or outdoor </li><br />
<li> Indoor/Outdoor Transition detector</li><br />
<li> Estimating vertical height and resolving to absolute floor level </li><br />
</ol><br />
<br />
==1) Classifying Indoor/Outdoor ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div><br />
<br />
They use 6 features, which were found through forests-of-trees feature reduction [5]. The features are the smartphone's barometric pressure (<math>P</math>), GPS vertical accuracy (<math>GV</math>), GPS horizontal accuracy (<math>GH</math>), GPS speed (<math>S</math>), device RSSI level (<math>rssi</math>), and magnetometer total reading (<math>M</math>).<br />
<br />
The magnetometer total reading was calculated from the given 3-dimensional reading <math>x, y, z </math>:<br />
<br />
<br />
<div style="text-align: center;">Total Magnetic field strength <math>= M = \sqrt{x^{2} + y^{2} + z^{2}}</math></div><br />
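As a small illustration, the total field strength can be computed from the three magnetometer axes as below. The function name is an assumption made for the example.<br />

```python
import math


def magnet_total(x, y, z):
    """Total magnetic field strength M = sqrt(x^2 + y^2 + z^2)."""
    return math.sqrt(x * x + y * y + z * z)
```

Because M is the Euclidean norm of the reading, it does not depend on how the phone is oriented, which is one plausible reason to prefer it over the individual axes.<br />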
<br />
They used a 3-layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if the smartphone is indoors and <math> y = 0 </math> if the smartphone is outdoors.<br />
<br />
In their design they set <math> d = 3</math> by random search [6]. The intent was for the network to learn the relationship given a small amount of information from both the past and the future.<br />
<br />
For the overall signal sequence <math> \{x_1, x_2, \ldots, x_j, \ldots, x_n\}</math>, the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, \ldots, x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math>, as noted above.<br />
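One way to build such training examples is to slide a window of length <math> d </math> over the recording. The sketch below assumes a window centred on the labelled sample, so that it sees one step of the past and one of the future when <math> d = 3 </math>; this centring and the function name are assumptions, not details confirmed by the summary.<br />

```python
def make_windows(readings, labels, d=3):
    """Pair each window of d consecutive readings with the indoor/outdoor
    label of its centre sample. Assumes d is odd."""
    if d % 2 == 0:
        raise ValueError("d must be odd so that the window has a centre sample")
    half = d // 2
    windows, targets = [], []
    for i in range(half, len(readings) - half):
        windows.append(readings[i - half:i + half + 1])
        targets.append(labels[i])
    return windows, targets
```

Each window is a list of <math> d </math> feature vectors and can be fed to a recurrent model as one short sequence.<br />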
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Table5.png|750px]] </div><br />
<br />
This is a critical part of their system and they only focus on the predictions in the subspace of being indoors. <br />
<br />
They trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math> and the indoor state predicted by the LSTM. <br />
<br />
The cost function is shown below:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|450px]] </div><br />
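For reference, the standard binary cross-entropy can be written in plain Python as follows. Averaging over the examples and clipping the predictions are choices made for this sketch; the exact normalisation used by the authors is the one shown in the figure above.<br />

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between true labels (0 or 1) and
    predicted indoor probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

A prediction of 0.5 for every sample gives a loss of log 2, and the loss approaches 0 as the predicted probabilities approach the true labels.<br />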
<br />
The final output of the LSTM is a time series <math> T = \{t_1, t_2, \ldots, t_i, \ldots, t_n\} </math> where <math> t_i = 0 </math> if the point is outside and <math> t_i = 1 </math> if it is inside.<br />
<br />
==2) Transition Detector ==<br />
<br />
Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div><br />
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions <math> T </math> and pick a subset <math>s_i </math> such that the Jaccard distance (defined below) is <math> \geq 0.4 </math>.<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|250px]] </div><br />
Jaccard distance:<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|450px]]</div><br />
<br />
After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.<br />
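The transition search can be sketched as a sliding-window comparison. Several details here are assumptions made for illustration: the filter contents and length, the names ENTER_FILTER and EXIT_FILTER, and the way the Jaccard score is computed for binary sequences (positions where the window and the filter agree, divided by the size of the union). The authors' exact filters and definition are given in the figures above.<br />

```python
ENTER_FILTER = [0] * 5 + [1] * 5  # hypothetical outdoor-to-indoor pattern
EXIT_FILTER = [1] * 5 + [0] * 5   # hypothetical indoor-to-outdoor pattern


def jaccard(window, filt):
    """Agreement-based Jaccard score between two equal-length binary sequences."""
    agree = sum(1 for u, v in zip(window, filt) if u == v)
    return agree / (len(window) + len(filt) - agree)


def find_transitions(predictions, filt, threshold=0.4):
    """Slide the filter over the LSTM predictions T and return
    (index, score) pairs for every window whose score meets the threshold."""
    size = len(filt)
    hits = []
    for start in range(len(predictions) - size + 1):
        score = jaccard(predictions[start:start + size], filt)
        if score >= threshold:
            hits.append((start + size // 2, score))
    return hits


def best_transition(hits):
    """Pick the index of the best-matching window, or None if there are no hits."""
    return max(hits, key=lambda hit: hit[1])[0] if hits else None
```

Neighbouring windows around a real transition all score above the threshold, so this sketch keeps only the best-scoring one as the transition index.<br />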
<br />
==3) Vertical height and floor level ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div><br />
<br />
[3] suggested the use of a reference barometer or beacons as a way to determine the entrances to a building.<br />
<br />
However, this need is eliminated by the authors' approach. The authors' second key contribution is to use the LSTM IO predictions to help identify these indoor transitions into the building. The LSTM provides a self-contained estimator of a building’s entrance without relying on external sensor information on a user’s body or beacons placed inside a building’s lobby. [4]<br />
<br />
In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.<br />
<br />
The second parameter, <math> p_1 </math>, is the current pressure reading. Together, <math> p_0 </math> and <math> p_1 </math> are used to generate the relative change in height <math> m_\Delta</math>.<br />
<br />
After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].<br />
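The height computation can be illustrated with the standard international barometric formula, which is used here as a stand-in because the authors' exact formula appears only in a figure. The constants 44330 and 5.255, the sea-level reference of 1013.25 hPa, the 15-sample window (15 seconds at 1 sample per second), and the function names are assumptions of this sketch.<br />

```python
SEA_LEVEL_HPA = 1013.25  # standard reference pressure in hPa


def altitude_m(pressure_hpa, reference_hpa=SEA_LEVEL_HPA):
    """Approximate altitude in metres from pressure (barometric formula)."""
    return 44330.0 * (1.0 - (pressure_hpa / reference_hpa) ** (1.0 / 5.255))


def entrance_pressure(pressures, transition_index, window=15):
    """p_0: the lowest pressure in a window centred on the last transition."""
    half = window // 2
    start = max(0, transition_index - half)
    return min(pressures[start:transition_index + half + 1])


def height_change_m(p0_hpa, p1_hpa):
    """m_delta: height of the current reading p_1 relative to the entrance p_0."""
    return altitude_m(p1_hpa) - altitude_m(p0_hpa)
```

Near sea level a drop of about 1.2 hPa corresponds to a climb of roughly 10 metres, which gives a feel for the sensor resolution needed to separate floors.<br />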
<br />
In order to resolve to an absolute floor level, they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.<br />
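The floor lookup can be sketched with a simple one-dimensional grouping of previously observed height changes. The gap-based clustering, the 1.5 metre gap, and the function names are assumptions for illustration and are not necessarily the clustering method used by the authors.<br />

```python
def cluster_heights(heights, max_gap=1.5):
    """Group sorted height changes; a gap larger than max_gap metres
    starts a new cluster (one cluster per floor)."""
    clusters = []
    for h in sorted(heights):
        if clusters and h - clusters[-1][-1] <= max_gap:
            clusters[-1].append(h)
        else:
            clusters.append([h])
    return clusters


def floor_of(m_delta, clusters):
    """Return the 1-based index of the cluster whose mean is closest to m_delta."""
    centers = [sum(c) / len(c) for c in clusters]
    nearest = min(range(len(centers)), key=lambda k: abs(centers[k] - m_delta))
    return nearest + 1
```

With height changes gathered around 0, 3 and 5 metres, a new reading of 5.1 metres falls in the third cluster and therefore resolves to floor 3, matching the example in the text.<br />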
<br />
=Experiments and Results=<br />
<br />
==Dataset==<br />
<br />
In this paper, an iOS app called Sensory is developed which is used to collect data on an iPhone 6. The following sensor readings were recorded: '''indoors''', '''created at''', '''session id''', '''floor''', '''RSSI strength''', '''GPS latitude''', '''GPS longitude''', '''GPS vertical accuracy''', '''GPS horizontal accuracy''', '''GPS course''', '''GPS speed''', '''barometric relative altitude''', '''barometric pressure''', '''environment context''', '''environment mean building floors''', '''environment activity''', '''city name''', '''country name''', '''magnet x''', '''magnet y''', '''magnet z''', '''magnet total'''.<br />
<br />
As soon as the user enters or exits a building, the indoor-outdoor data has to be manually entered. To gather the data for the floor level prediction, the authors conducted 63 trials among five different buildings throughout New York City. Since unsupervised learning was being used, the actual floor level was recorded manually for the validation purposes only.<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|450px]] </div><br />
<br />
All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation. <br />
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006. <br />
Although the baselines performed considerably well, the objective here was to show that an LSTM could be used in the future to model the entire system.<br />
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|650px]] </div><br />
<br />
The above chart shows the success that their system is able to achieve in the floor level prediction.<br />
<br />
The performance was measured in terms of how many floors were travelled rather than the absolute floor number, because different buildings might number their floors differently. They used different m values in 2 tests: one applied the same m value across all buildings, and the other applied specific m values to different buildings. The results showed that building-specific m values greatly increased the accuracy.<br />
<br />
=Future Work=<br />
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate the floor level predictions solely from sensor reading data.<br />
<br />
=Critique=<br />
<br />
In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which had not been done before. Previous work relied heavily on pre-training and on prior information about the building or users. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit of their system is that it doesn't need any additional infrastructure support in advance, making it a practical solution for deployment. <br />
<br />
With the rising number of smartphone users, cellular network capacity is reaching its limits and is unable to cater to multiple users. One of the major concerns is indoor cellular coverage and seamless mobility between indoor and outdoor cellular networks. The proposed solution can help provide this connectivity between cells, for example handover from an indoor pico/nano cell or Wi-Fi network to an outdoor macro cell network; moreover, with the floor detection algorithm, connectivity can be improved for users in low-coverage areas such as basements, underground car parking, etc. Hence this can be integrated into one of the 5G use cases for improved network coverage. <br />
<br />
A weakness is that they claim they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%. Also, the article's ideas are sometimes out of order and are repeated in cycles.<br />
<br />
It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.<br />
<br />
They also go against their claim stated at the beginning of the paper, where they say the method "..does not require the use of beacons, prior knowledge of the building infrastructure...", as in their clustering step they are in a way using prior knowledge from previous visits [4].<br />
<br />
The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm is implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.<br />
<br />
Overall this paper is not too novel, as the authors don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out-of-the-box solutions. There is no clear intuition for why the proposed method works well. This application could be solved using simpler methods like having an emergency push button on each floor. Moreover, the authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.<br />
<br />
The proposed model could introduce privacy risks such as illegal surveillance of mobile phone user and private facilities.<br />
<br />
=Potential Pitfall of the System=<br />
<br />
One of the main criticisms for barometric pressure based systems is the unpredictability of barometric pressure as a sensor measurement due to external factors and changing weather conditions.<br />
<br />
=References=<br />
<br />
[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):<br />
1735–1780, 1997.<br />
<br />
[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.<br />
<br />
[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.<br />
<br />
[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018<br />
<br />
[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).<br />
<br />
[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.<br />
<br />
[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Searching_For_Efficient_Multi_Scale_Architectures_For_Dense_Image_Prediction&diff=42384Searching For Efficient Multi Scale Architectures For Dense Image Prediction2018-12-11T00:38:20Z<p>Msminhas: Editorial</p>
<hr />
<div><br />
=Introduction=<br />
<br />
The design of neural network architectures is an important component for the success of machine learning and data science projects. In recent years, the field of Neural Architecture Search (NAS) has emerged, which is to automatically find an optimal neural architecture for a given task in a well-defined architecture space. The resulting architectures have often outperformed networks designed by human experts on tasks such as image classification and natural language processing. [2,3,4] <br />
<br />
This paper presents a meta-learning technique to have computers search for a neural architecture that performs well on the task of dense image segmentation, mainly focused on the problem of scene labeling.<br />
<br />
=Motivation=<br />
<br />
The success of deep neural networks (DNNs) is largely due to the fact that they greatly reduce the work in feature engineering. This is because DNNs have the ability to extract useful features given the raw input. However, this creates a new paradigm to look at: network engineering. In order to extract significant features, an appropriate network architecture must be used. Hence, the engineering work is shifted from feature engineering to network architecture design for better abstraction of features.<br />
<br />
The motivation for NAS is to establish a guiding theory for designing optimal network architectures. Given that abundant computational resources are available, an intuitive solution is to define a finite search space in which a computer searches for optimal network structures and hyperparameters.<br />
<br />
=Related Work =<br />
<br />
This paper draws on two main lines of research: neural architecture search (NAS) and multi-scale representation for dense image prediction. Neural architecture search trains a controller network to generate neural architectures. The following are the important research directions in this area:<br />
<br />
1) One kind of research transfers architectures learned on a proxy dataset to more challenging datasets and demonstrates superior performance over many human-invented architectures.<br />
<br />
2) Reinforcement learning, evolutionary algorithms and sequential model-based optimization have been used to learn network structures. <br />
<br />
3) Some other works focus on increasing model size, sharing model weights to accelerate model search or a continuous relaxation of the architecture representation. <br />
<br />
4) Some recent methods embed an exponentially large number of architectures in a grid arrangement for semantic segmentation tasks.<br />
<br />
In the area of multi-scale representation for dense image prediction, the following prior work is relevant:<br />
<br />
1) State of the art methods use Convolutional Neural Nets. There are different methods proposed for supplying global features and context information to perform pixel level classification. <br />
<br />
2) Some approaches focus on how to efficiently encode multi-scale context information in a network architecture, for example by designing models that take an image pyramid as input so that large-scale objects are captured by the downsampled image.<br />
<br />
3) Other research investigates how best to tune the architecture to extract context information. Some works focus on the sampling rates in atrous convolution to encode multi-scale context. Others build a context module by gradually increasing the rate on top of belief maps.<br />
<br />
=NAS Overview=<br />
<br />
NAS essentially turns a design problem into a search problem. As with any search problem, we need a clear definition of three things:<br />
<ol><br />
<li> Search space</li><br />
<li> Search strategy</li><br />
<li> Performance Estimation Strategy</li><br />
</ol><br />
The search space is easy to understand: it is, for instance, the hyperparameter space in which we look for our optimal solution. In the field of NAS, the search space depends heavily on the assumptions we make about the neural architecture. The search strategy details how to explore the search space. The performance estimation strategy takes a candidate set of hyperparameters as input and estimates how well the resulting model performs. In the field of NAS, the goal is typically to find architectures that achieve high predictive performance on unseen data. [5]<br />
<br />
We will take a deep dive into the above three dimensions of NAS in the following sections.<br />
<br />
=Search Space=<br />
The purpose of the architecture search space is to define a space that can express various state-of-the-art architectures and in which good models can be identified.<br />
<br />
There are typically three ways of defining the search space.<br />
==Chain-structured neural networks ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen_Shot_2018-11-10_at_6.03.00_PM.png|150px]]<br />
</div><br />
[5]<br />
A chain-structured network can be viewed as a sequence of n layers, where layer <math> i</math> receives its input from layer <math> i-1</math> and its output serves as the input to layer <math> i+1</math>.<br />
<br />
The search space is then parametrized by:<br />
1) The number of layers n<br />
2) The type of operation that can be executed at each layer<br />
3) The hyperparameters associated with each layer<br />
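<br />
The three parameters above can be made concrete with a minimal sketch that samples one chain-structured architecture. The operation names and hyperparameter ranges in <code>OPERATIONS</code> are hypothetical placeholders chosen for illustration; they do not come from the paper or the survey.<br />

```python
import random

# Hypothetical operation types and their per-layer hyperparameter choices
# (illustrative placeholders, not taken from the paper).
OPERATIONS = {
    "conv": {"kernel_size": [1, 3, 5], "filters": [32, 64, 128]},
    "pool": {"kernel_size": [2, 3]},
    "dense": {"units": [64, 128, 256]},
}


def sample_chain(max_layers, rng):
    """Sample one chain-structured architecture as a list of layer specs."""
    n = rng.randint(1, max_layers)  # 1) number of layers n
    layers = []
    for _ in range(n):
        op = rng.choice(sorted(OPERATIONS))  # 2) operation type of this layer
        # 3) hyperparameters associated with the chosen operation
        hparams = {name: rng.choice(values)
                   for name, values in OPERATIONS[op].items()}
        layers.append({"op": op, **hparams})
    return layers


architecture = sample_chain(max_layers=6, rng=random.Random(0))
for index, layer in enumerate(architecture):
    print(index, layer)
```

Layer i in the returned list implicitly takes the output of layer i-1 as its input, which is what makes the structure a chain.<br />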
<br />
==Multi-branch networks ==<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.03.08 PM.png|400px]]</div><br />
<br />
[5]<br />
This architecture allows significantly more degrees of freedom. It allows shortcuts and parallel branches. Some of the ideas are inspired by human hand-crafted networks. For example, shortcuts from shallow layers directly to deep layers come from networks like ResNet [6].<br />
<br />
The search space includes the search space of chain-structured networks, with additional freedom of adding shortcut connections and allowing parallel branches to exist.<br />
<br />
==Cell/Block ==<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.03.31 PM.png|600px]]</div><br />
<br />
[6]<br />
This architecture defines a cell which is used as the building block of the neural network. A good analogy is to think of a cell as a Lego piece: you can define different types of cells as different Lego pieces and then combine them to form a new neural structure.<br />
<br />
The search space includes the internal structure of the cell and how to combine these blocks to form the resulting architecture.<br />
<br />
==What they used in this paper ==<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.50.04 PM.png|500px]]<br />
</div><br />
[1]<br />
This paper's approach is very close to the Cell/Block approach above.<br />
<br />
The paper defines two components: the "network backbone" and a cell unit called "DPC", which is represented by a directed acyclic graph (DAG) with five branches (a value chosen to give a good balance between flexibility and computational tractability). A DAG is a finite directed graph with no directed cycles: it consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex <math>v</math> and follow a consistently-directed sequence of edges that eventually loops back to <math>v</math>. The network backbone's job is to take the input image as a tensor and return a feature map f that is supposedly a good abstraction of the image. The DPC, short for Dense Prediction Cell, is what this paper introduces: a recursive search space that encodes multi-scale context information for dense prediction tasks. In theory, the search space consists of the choice of network backbone and the internal structure of the DPC. In practice, they used only MobileNet and a modified Xception net as backbones, so the search space consists only of the internal structure of the DPC.<br />
<br />
For the network backbone, they simply choose from existing mature architectures, such as Mobile-Net-v2 and Inception-Net. For the structure of the DPC, they define a smaller unit called a branch. A branch is a triple (Xi, OP, Yi), where Xi is an input tensor, OP is the operation applied to that tensor, and Yi is the resulting output tensor.<br />
<br />
In the paper, each DPC consists of 5 branches, which balances expressivity and computational tractability.<br />
<br />
The operator space, OP, is defined as the following set of functions:<br />
<ol><br />
<li>Convolution with a 1 × 1 kernel.</li><br />
<li>3×3 atrous separable convolution with rate rh×rw, where rh and rw ∈ {1, 3, 6, 9, . . . , 21}. </li><br />
<li>Average spatial pyramid pooling with grid size gh × gw, where gh and gw ∈ {1, 2, 4, 8}. </li><br />
</ol><br />
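<br />
As a quick check on the size of this operator space, the following sketch enumerates it explicitly. It assumes that the elided rates in {1, 3, 6, 9, . . . , 21} are 12, 15 and 18 (eight values in total); the operator names are illustrative labels only.<br />

```python
from itertools import product

# Atrous rates r_h, r_w. The values 12, 15, 18 are an assumed expansion of
# the "..." in {1, 3, 6, 9, ..., 21}.
RATES = [1, 3, 6, 9, 12, 15, 18, 21]
# Pooling grid sizes g_h, g_w.
GRIDS = [1, 2, 4, 8]


def operator_space():
    """Enumerate every operator as a tuple (name, height_param, width_param)."""
    ops = [("conv1x1",)]
    ops += [("atrous_sep_conv3x3", rh, rw) for rh, rw in product(RATES, RATES)]
    ops += [("avg_pyramid_pool", gh, gw) for gh, gw in product(GRIDS, GRIDS)]
    return ops


OPS = operator_space()
print(len(OPS))  # 1 + 8*8 + 4*4 = 81
```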
<br />
For the spatial pyramid pooling operation, average pooling is performed in each grid cell. After the average pooling, a 1×1 convolution is applied and the features are then resized back to the same spatial resolution as the input tensor.<br />
<br />
Separable convolution with 256 filters is employed for all convolutions, and 3×3 atrous convolutions with sampling rates rh × rw allow for capturing object scales with different aspect ratios. This is illustrated in the diagram below:<br />
<br />
[[File:NAS_fig2.png|center|500px]]<br />
<br />
Average spatial pyramid pooling performs mean pooling on the last convolution layer (either convolution or sub sampling) and produces a N*B dimensional vector (where N=Number of filters in the convolution layer, B= Number of Bins). The vector is in turn fed to the fully connected layer. The number of bins is a constant value. Therefore, the vector dimension remains constant irrespective of the input image size.<br />
<br />
The resulting search space is able to encode the main state-of-the-art architectures (e.g., Deformable Convnets [11], ASPP, Dense-ASPP [12]), but the encoded architectures are more diverse, since each branch of a DPC cell can build contextual information through parallel or cascaded representations. The number of potential architectures indicates the potential diversity of the search space. For the <math display="inline">i</math>-th branch, there are <math display="inline">i</math> possible inputs (the last feature maps produced by the network backbone and the outputs of all previous branches, i.e., <math display="inline">Y_1,...,Y_{i-1}</math>) and 1 + 8×8 + 4×4 = 81 functions in the operator space, resulting in <math display="inline">i \times 81</math> possible options. Therefore, for B = 5, the search space size is <math display="inline">B! \times 81^B \approx 4.2 \times 10^{11}</math> configurations.<br />
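<br />
The count can be verified with a few lines of arithmetic. This is only a sanity check of the formula in the text: the i-th branch contributes i × 81 options, and multiplying over the B = 5 branches gives B! × 81^B.<br />

```python
from math import factorial, prod

NUM_OPS = 1 + 8 * 8 + 4 * 4  # 81 operators in the operator space
B = 5                        # number of branches in a DPC


def search_space_size(num_branches, num_ops=NUM_OPS):
    """Branch i (1-indexed) has i possible inputs and num_ops operators."""
    return prod(i * num_ops for i in range(1, num_branches + 1))


size = search_space_size(B)
print(size)                                   # 418414128120
print(size == factorial(B) * NUM_OPS ** B)    # True
print(f"{size:.2e}")                          # 4.18e+11
```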
<br />
[[File:picture2.png|center|500px]]<br />
<br />
=Search Strategy=<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:search_strategy.png|600px]]<br />
</div><br />
<br />
There are several common search strategies in the field of NAS, such as reinforcement learning, random search, evolutionary algorithms, and grid search.<br />
<br />
The one used in the paper is random search. It samples points from the search space uniformly at random and also samples some points that are close to the currently observed best point. Intuitively this makes sense because it combines exploration and exploitation: sampling points close to the current best point is exploitation, and sampling points uniformly at random is exploration.<br />
<br />
The pseudocode for a general random search algorithm is provided below.<br />
<br />
[[File:Pseudoc.png |700px|center]]<br />
<br />
It repeatedly samples randomly within a hypersphere around the current state and updates only if the newly found vector increases the reward function. The approach is highly non-parametric and generalizes easily to complex problems such as architecture search once the parameters are properly defined. Although random search can return a reasonable approximation of the optimal solution when the problem dimensionality is low, it is commonly cited as performing poorly under higher dimensionality. Here random search is used to find highly complex architectures with millions of parameters; this could explain why the improvements over human-created state-of-the-art networks in the experiments section are only marginal, despite the heavy machinery used to arrive at the new architectures.<br />
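<br />
A minimal sketch of random search that mixes uniform sampling (exploration) with sampling near the current best point (exploitation) is given here. It is an illustration under stated assumptions, not the algorithm implemented in Google Vizier: the toy reward, the search space of five integers in [0, 80], the perturbation rule and the <code>explore_prob</code> parameter are all hypothetical.<br />

```python
import random


def random_search(reward, sample_uniform, perturb, n_trials, explore_prob, rng):
    """Keep the best point seen so far; accept a candidate only if it improves."""
    best = sample_uniform(rng)
    best_reward = reward(best)
    for _ in range(n_trials):
        if rng.random() < explore_prob:
            candidate = sample_uniform(rng)   # exploration: uniform sample
        else:
            candidate = perturb(best, rng)    # exploitation: near current best
        candidate_reward = reward(candidate)
        if candidate_reward > best_reward:
            best, best_reward = candidate, candidate_reward
    return best, best_reward


# Toy stand-in for an architecture encoding: five integer choices in [0, 80].
TARGET = [10, 20, 30, 40, 50]  # hypothetical optimum of the toy reward


def toy_reward(x):
    return -sum((a - b) ** 2 for a, b in zip(x, TARGET))


def sample_uniform(rng):
    return [rng.randint(0, 80) for _ in TARGET]


def perturb(x, rng):
    y = list(x)
    i = rng.randrange(len(y))
    y[i] = min(80, max(0, y[i] + rng.choice([-3, -2, -1, 1, 2, 3])))
    return y


start_reward = toy_reward(sample_uniform(random.Random(0)))
best, best_reward = random_search(
    toy_reward, sample_uniform, perturb,
    n_trials=2000, explore_prob=0.2, rng=random.Random(0),
)
print(start_reward, best_reward, best)
```

In a real NAS setting the call to <code>reward</code> is the expensive part, since it requires training and evaluating a network, which is why the cost of each evaluation matters so much.<br />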
<br />
They cite another paper claiming that random search is competitive with reinforcement learning and other learning techniques [7]. In the implementation, they used Google's black-box optimization tool, Google Vizier. It is not open source, but there is an open source implementation of it [8]. A more recent and detailed survey covering other methods, such as Bayesian optimization strategies for neural architecture search, can be found in [13].<br />
<br />
=Performance Evaluation Strategy=<br />
<br />
The evaluation in this particular task is very tricky, because what is being evaluated is a neural network: to evaluate it, we need to train it first. Since the task is pixel-level classification on high-resolution images, the naive approach would require a tremendous amount of computational resources.<br />
<br />
The paper solves this by defining a proxy task: a task that requires substantially fewer computational resources while still giving a good estimate of the performance of the network. In most image classification tasks in NAS, the proxy task is to train the network on images of lower resolution. The assumption is that if the network performs well on images with lower resolution, it should perform reasonably well on images with higher resolution.<br />
<br />
However, the above approach does not work in this case, because dense prediction tasks innately require high-resolution images as training data. The approach used in the paper is the following:<br />
<ol><br />
<li> Use a smaller backbone for the proxy task.</li><br />
<li> Cache the feature maps produced by the network backbone on the training set and build a single DPC directly on top of them.</li><br />
<li> Stop early: train for 30k iterations with a batch size of 8.</li><br />
</ol><br />
<br />
Training on the large-scale backbone without fixing the backbone weights would take about one week per network on a P100 GPU; the proxy task cuts this down to about 90 minutes. They then rank the architectures by proxy-task performance, choose the top 50, and perform a full evaluation on them.<br />
<br />
The evaluation metric they used is mIOU, the mean pixel-level intersection over union: for each class, the area of the intersection of the ground truth and the prediction divided by the area of their union, averaged over classes.<br />
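<br />
A small sketch of the metric on flattened label maps follows. The convention of skipping classes that appear in neither the ground truth nor the prediction is an assumption of this sketch; benchmark toolkits may handle that case differently.<br />

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean over classes of |intersection| / |union| of pixel sets."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Six pixels, three classes (flattened ground-truth and predicted label maps).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2; mean = 0.5.
print(mean_iou(y_true, y_pred, num_classes=3))
```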
<br />
=Result=<br />
<br />
This method achieves state-of-the-art performance on many datasets. The following table quantifies the performance gain on each dataset.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.51.14 PM.png| 800px]]<br />
</div><br />
They chose the modified Xception network as the backbone, and the following is the resulting architecture for the DPC.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-12 at 12.32.05 PM.png|1000px]]<br />
</div><br />
<br />
Table 2 describes the results on the scene parsing dataset. The model sets a new state-of-the-art performance of 82.7% mIOU and outperforms other state-of-the-art models in 11 of the 19 categories.<br />
<br />
Table 3 describes the results on the person part segmentation dataset. The model achieves state-of-the-art performance of 71.34% mIOU and outperforms other state-of-the-art models in 6 of the 7 categories.<br />
<br />
Table 4 describes the results on the semantic image segmentation dataset. The model achieves state-of-the-art performance of 87.9% mIOU and outperforms other state-of-the-art models in 6 of the 20 categories.<br />
<br />
As we can see, the searched DPC model achieves better performance (measured by mIOU) with less than half of the parameters and 37% fewer operations (additions and multiplications).<br />
<br />
=Future work=<br />
The authors suggest that increasing the number of branches in the DPC might further improve performance on the image segmentation task. However, random search in an exponentially growing space may become more challenging, so a more intelligent search strategy may be needed. They hope that some form of meta-learning on metadata can lead to future insight and be advantageous.<br />
<br />
The authors hope that these architecture search techniques can be ported to other domains, such as depth prediction and object detection, to achieve similar gains over human-invented designs.<br />
<br />
=Critique=<br />
<br />
1. Rich man's game<br />
<br />
The technique described in the paper can only be applied by parties with abundant computational resources, such as Google, Facebook, and Microsoft. For small research groups and companies, this method is not that useful due to the lack of computational power. Future improvement is needed in the design of an even more efficient proxy task, one that can tell whether a network will perform well while requiring fewer computations.<br />
<br />
2. Benefit/Cost ratio<br />
<br />
The technique here does outperform human-designed networks in many cases, but the gain is not huge. On the Cityscapes dataset the performance gain is 0.7%, whereas on the PASCAL-Person-Part dataset the gain is 3.7%, and on the PASCAL VOC 2012 dataset it does not outperform human experts (all measured by mIOU). Pushing the state of the art is always worth celebrating, but one could argue that after spending so many resources on the search, the computer should achieve superhuman performance (as with a chess engine versus a chess grandmaster). In practice, one may simply go with the current state-of-the-art model to avoid the expensive search cost.<br />
<br />
3. Still Heavily influenced by Human Bias<br />
<br />
When we define the search space, we introduce human bias. Firstly, the network backbone is chosen from previous mature architectures, which may not actually be optimal. Secondly, the internal branches of the DPC consist of layers whose operations are defined by humans, based on previous experience. That also prevents the search algorithm from finding something revolutionary.<br />
<br />
4. May have the potential to take away entry-level data science jobs.<br />
<br />
If there is a significant reduction in the search cost, it will be more cost effective to apply NAS than to hire data scientists. Once matured, this technology will have the potential to take away entry-level data science jobs and leave data science jobs only to high-level researchers.<br />
<br />
There are some real-world applications that already deploy NAS techniques in production. Two good examples are Google AutoML and Microsoft Custom Vision AI [9, 10].<br />
<br />
=References=<br />
1. Searching For Efficient Multi-Scale Architectures For Dense Image Prediction, [https://arxiv.org/abs/1809.04184].<br />
<br />
2. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv:1802.01548, 2018.<br />
<br />
3. C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018.<br />
<br />
4. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.<br />
<br />
5. Neural Architecture Search: A Survey [https://arxiv.org/abs/1808.05377]<br />
<br />
6. Deep Residual Learning for Image Recognition [https://arxiv.org/pdf/1512.03385.pdf]<br />
<br />
7. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.<br />
Google Vizier, the black-box optimization tool used in the implementation, is described in: D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In SIGKDD, 2017.<br />
<br />
8. GitHub implementation of Google Vizier, a black-box optimization tool [https://github.com/tobegit3hub/advisor]<br />
<br />
9. AutoML: https://cloud.google.com/automl/ <br />
<br />
10. Custom-vision: https://azure.microsoft.com/en-us/services/cognitive-services/custom-vision-service/<br />
<br />
11. J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.<br />
<br />
12. M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018.<br />
<br />
13. Elsken, Thomas, Jan Hendrik Metzen, and Frank Hutter. "Neural architecture search: A survey." arXiv preprint arXiv:1808.05377 (2018).</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Zero-Shot_Visual_Imitation&diff=42383Zero-Shot Visual Imitation2018-12-11T00:35:00Z<p>Msminhas: Editorial</p>
<hr />
<div>This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018. <br />
<br />
==Introduction==<br />
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach the desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time, which the agent forms into observation-action pairs and then distills into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable, especially since each new task needs a new set of demonstrations for the robot to learn from. In this paper, an alternative paradigm is pursued wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss.<br />
Videos, models, and more details are available at [https://pathak22.github.io/zeroshot-imitation/].<br />
<br />
===Paper Overview===<br />
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images, instead of observation-action pairs. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.<br />
<br />
[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Multi-step GSP with previous action history; (c) Multi-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Multi-step GSP with forward consistency loss proposed in this work.]]<br />
<br />
This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'', and is learned in a self-supervised way by re-labeling states that the agent visited as goals and the actions the agent took as prediction targets. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.<br />
<br />
A major challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'', which essentially says that reaching the goal is more important than how it is reached. First, a forward model that predicts the next observation from the given action and current observation is learned. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss does not inadvertently penalize actions that are ''consistent'' with the ground-truth action, even though the actions are not exactly the same (but lead to the same next state). <br />
<br />
As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.<br />
<br />
Of course, when introducing something like forward-consistent loss, issues related to the number of steps needed to reach a certain goal become of interest, since different goals require different numbers of steps. To address this, the paper pairs the GSP with a goal recognizer (as an optimizer) to determine whether the goal has been satisfied with respect to some metric. Figure 1 shows various GSPs, with diagram (d) showing the forward-consistent loss proposed in this paper.<br />
<br />
The paper refers to this method as zero-shot, as the agent never has access to expert actions regardless of being in the training or task demonstration phase. This is different from one-shot imitation learning, where agents have full knowledge of actions and expert demos during the training phase. The agent learns to imitate instead of learning by imitation. The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation, and a series of navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.<br />
<br />
===Related Work===<br />
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.<br />
<br />
Imitation Learning: The two main threads are behavioral cloning and inverse reinforcement learning. Recent work in imitation learning requires access to expert actions, whereas the approach in this paper does not.<br />
<br />
Visual Demonstration: Several papers focus on relaxing this supervision to visual observations alone, with end-to-end learning improving results.<br />
<br />
Forward/Inverse Dynamics and Consistency: Forward dynamics models have been learned for planning actions, but without an optimizer that enforces consistency between the forward and inverse dynamics.<br />
<br />
Goal Conditioning: In this paper, systems work from high-dimensional visual inputs instead of knowledge of the true states and do not use a task reward during training.<br />
<br />
==Learning to Imitate Without Expert Supervision==<br />
<br />
In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described. <br />
<br />
Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment. This exploration data is used to learn the GSP policy.<br />
<br />
<br />
<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div><br />
<br />
<br />
The learned GSP policy (<math display="inline">π</math>) takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs a sequence of actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> starting from the current observation <math display="inline">x_i</math>. The states (observations) <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and need not be consecutive. Given the start and stop states, the number of actions <math display="inline">K</math> is also known. <math display="inline">π</math> can be thought of as a deep network with parameters <math display="inline">θ_π</math>.<br />
<br />
At test time, the expert demonstrates a task from which the agent captures a sequence of observations. This set of images is denoted by <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math>. The sequence needs to have at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses its learned policy to start from initial state <math display="inline">x_0</math> and generate actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to follow the observations in <math display="inline">D</math>.<br />
<br />
The agent does not have access to the sequence of actions performed by the expert. Hence, it must use the observations to determine if it has reached the goal. A separate ''goal recognizer'' network is needed to ascertain if the current observation is close to the current goal or not. This is because multiple actions might be required to reach close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and if not, the agent executes <math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. Then after reaching sufficiently close to <math display="inline">x_1^d</math>, the agent sets <math display="inline">x_2^d</math> as the goal and executes actions. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached.<br />
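<br />
The test-time procedure described above can be sketched as a simple loop. Everything concrete here is a hypothetical stand-in: in the paper the policy and the goal recognizer are learned networks operating on images, whereas this toy uses integer observations, actions of -1 or +1, and exact equality as the recognizer. The <code>max_steps_per_goal</code> cap is an added safeguard of this sketch, not something taken from the text.<br />

```python
def imitate(x0, demo, policy, goal_reached, step, max_steps_per_goal=50):
    """Follow the demonstration's goal observations one after another."""
    x = x0
    trajectory = [x]
    for goal in demo:                    # x_1^d, x_2^d, ..., x_N^d
        for _ in range(max_steps_per_goal):
            if goal_reached(x, goal):    # goal recognizer says "close enough"
                break
            a = policy(x, goal)          # a = pi(x, goal; theta_pi)
            x = step(x, a)               # execute the action, observe x'
            trajectory.append(x)
    return trajectory


# Toy 1-D world standing in for the learned components (all hypothetical).
def toy_policy(x, goal):
    return 1 if goal > x else -1


def toy_goal_reached(x, goal):
    return x == goal


def toy_step(x, a):
    return x + a


result = imitate(0, [3, 1], toy_policy, toy_goal_reached, toy_step)
print(result)  # [0, 1, 2, 3, 2, 1]
```

Note that the loop never looks at expert actions: only the demonstrated observations are used, both to condition the policy and to decide when to switch to the next goal.<br />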
<br />
===Learning the Goal-Conditioned Skill Policy (GSP)===<br />
<br />
In this section, the one-step version of the GSP policy is described first. Next, it is extended to the multi-step version.<br />
<br />
A one-step trajectory can be described as <math display="inline">(x_t; a_t; x_{t+1})</math>. Given <math display="inline">(x_t, x_{t+1})</math> the GSP policy estimates an action, <math display="inline">\hat{a}_t = π(x_t; x_{t+1}; θ_π)</math>. During training, cross-entropy loss is used to learn GSP parameters <math display="inline">θ_π</math>:<br />
<br />
<br />
<div style="text-align: center;"><math>L(a_t; \hat{a}_t) = p(a_t|x_t; x_{t+1}) \log( \hat{a}_t)</math></div><br />
<br />
<br />
<math display="inline">a_t</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted actions respectively. The conditional distribution <math display="inline">p</math> is not readily available so it needs to be empirically approximated using the data. In a standard deep learning problem it is common to assume <math display="inline">p</math> is a delta function at <math display="inline">a_t</math>; given a specific input, the network outputs a single output. However, in this problem multiple actions can lead to the same output. Multiple outputs given a single input can be modeled using a variational auto-encoder. However, the authors use a different approach, explained in Sections 2.2-2.4 of the paper and in the following sections.<br />
<br />
===Forward Consistency Loss===<br />
<br />
To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (the observation from executing the action predicted by GSP <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients (for actions that result in the same next observation) and aid the learning process. This is what is denoted as ''forward consistency loss''.<br />
<br />
To operationalize the forward consistency loss, we need a differentiable "forward dynamics" model that can reliably predict the results of an action. The forward dynamics <math display="inline">f</math> are learned from the data by another model. Given an observation and the action performed, <math display="inline">f</math> predicts the next observation, <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math>, so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>, where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground-truth transitions, while the second term ensures consistency with the GSP network. In summary, the loss function is given below:<br />
<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f}{\min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div><br />
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div><br />
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div><br />
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div><br />
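<br />
As a rough illustration, the joint one-step objective can be evaluated for a single transition as follows. This is a sketch under stated assumptions, not the paper's code: actions are assumed to be discrete and one-hot encoded, the action loss is written with the usual negative sign of a cross-entropy, the default <code>lam</code> value is arbitrary, and <code>policy</code> and <code>forward_model</code> are hypothetical callables standing in for <math>π</math> and <math>f</math>.<br />

```python
import numpy as np


def forward_consistency_loss(x_t, a_t, x_next, policy, forward_model, lam=0.1):
    """Return the joint one-step objective for a single transition.

    policy(x_t, x_next)  -> probability vector over discrete actions (a_hat)
    forward_model(x, a)  -> predicted next observation for action vector a
    """
    a_hat = policy(x_t, x_next)                    # predicted action distribution
    x_tilde = forward_model(x_t, a_t)              # prediction from the ground-truth action
    x_hat = forward_model(x_t, a_hat)              # prediction from the GSP's action
    dynamics = np.sum((x_next - x_tilde) ** 2)     # ||x_{t+1} - x_tilde_{t+1}||^2
    consistency = np.sum((x_next - x_hat) ** 2)    # ||x_{t+1} - x_hat_{t+1}||^2
    action = -np.sum(a_t * np.log(a_hat + 1e-12))  # cross-entropy L(a_t, a_hat)
    return dynamics + lam * consistency + action


# Toy check (hypothetical): both actions move the state by (1, 0), so the action is ambiguous.
B = np.array([[1.0, 1.0], [0.0, 0.0]])
toy_forward = lambda x, a: x + B @ a
toy_policy = lambda x, x_next: np.array([0.5, 0.5])
x_t = np.array([0.0, 0.0])
a_t = np.array([1.0, 0.0])
x_next = toy_forward(x_t, a_t)
loss = forward_consistency_loss(x_t, a_t, x_next, toy_policy, toy_forward)
```

In this toy case both state terms vanish even though the policy is undecided between the two actions, because either action leads to the same next observation; only the action term remains.<br />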
<br />
Past works have shown that learning forward dynamics in a feature space is more robust than learning them in the raw observation space. This paper incorporates this by making predictions in a feature representation <math>\phi(x_t), \phi(x_{t+1})</math> of the observations rather than in the input space. <br />
<br />
Learning the two models <math>θ_π,θ_f</math> simultaneously from scratch can cause noisier gradient updates. This is addressed by pre-training the forward model with the first term and GSP separately by blocking gradient flow. Fine-tuning is then done with <math>θ_π,θ_f</math> jointly. <br />
<br />
The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:<br />
<br />
<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{\min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div><br />
<br />
<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div><br />
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div><br />
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div><br />
<br />
<br />
The forward consistency loss is computed at each time step <math>t</math> and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network that takes as input the current state, the goal state, the action at the previous time step and the internal hidden representation <math> h_{t-1}</math>, and outputs the action to take.<br />
<br />
===Goal Recognizer===<br />
<br />
The goal recognizer network was introduced to determine whether the current goal has been reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition is treated as a binary classification problem: given an observation <math>x_i</math> and a goal <math>x_g</math>, the network infers whether <math>x_i</math> is close to <math>x_g</math>. Because there is no expert supervision of goals, goal observations are drawn at random from the agent's own experience, which guarantees that they are feasible. Additionally, a maximum number of iterations is used to prevent the sequence of actions from getting too long.<br />
<br />
The goal recognizer was trained on data from the agent's random exploration. Pseudo-goal states were sampled from the visited states, and all observations within a few timesteps of these were considered positive examples (close to the goal). The goal classifier was trained using the standard cross-entropy loss. <br />
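<br />
The labeling scheme described above can be sketched as follows. This is only an illustration: states are represented by their time index in one exploration trajectory, the <code>window</code> size is an assumed value for "a few timesteps", and treating every other state of the trajectory as a negative example is an assumption made here, not a detail taken from the paper.<br />

```python
import random


def make_goal_recognizer_pairs(num_states, num_goals, window=2, seed=0):
    """Build (state_index, goal_index, label) triples from one exploration trajectory.

    A state is labeled 1 if it lies within `window` time steps of the sampled
    pseudo-goal and 0 otherwise.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_goals):
        g = rng.randrange(num_states)  # pseudo-goal drawn from the visited states
        for i in range(num_states):
            label = 1 if abs(i - g) <= window else 0
            pairs.append((i, g, label))
    return pairs


pairs = make_goal_recognizer_pairs(num_states=10, num_goals=3, window=2)
```

The resulting triples can then be fed to any binary classifier trained with cross-entropy, with the images at the two indices as its inputs.<br />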
<br />
The authors found that training a separate goal recognition network outperformed simply adding a 'stop' action to the action space of the policy network.<br />
<br />
===Ablations and Baselines===<br />
<br />
To summarize, the GSP formulation is composed of (a) a recurrent variable-length skill policy network, (b) explicit encoding of the previous action in the recurrence, (c) a goal recognizer, (d) the forward consistency loss function, and (e) learning forward dynamics in the feature space instead of the raw observation space. <br />
<br />
To show the importance of each component, a systematic ablation (removal) of components is performed in each experiment and its impact on visual imitation is measured. The following methods are evaluated in the experiments section: <br />
<br />
# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM. <br />
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup. <br />
# '''GSP-NoPrevAction-NoFwdConst''' refers to the paper's recurrent GSP without previous action history and without the forward consistency loss. <br />
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without the forward consistency objective. <br />
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.<br />
# '''GSP''' refers to the complete method with all the components.<br />
<br />
==Experiments==<br />
<br />
The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.<br />
<br />
===Rope Manipulation===<br />
<br />
Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.<br />
<br />
In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration, the robot picked up the rope at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper. <br />
<br />
For this experiment, the Baxter robot is set up exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. In testing, the robot was only provided with images of intermediate states of the rope, and not the actions taken by the human trainer. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.<br />
<br />
The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of the grasp, the direction of displacement and the magnitude of displacement. All models were optimized using Adam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen; they were then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.<br />
<br />
The approach of this paper is compared to that of Nair et al. (2017), who ran similar experiments using an inverse model. The results in Figure 3 show that GSP outperforms the baselines on the 'S' shape construction as measured by TPS-RPM error and that, for knot-tying, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.<br />
<br />
[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The<br />
robotics system setup. (b) The sequence of human demonstration images provided by the human<br />
during inference for the task of knot-tying (top row), and the sequences of observation states reached<br />
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human<br />
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’<br />
shape. Our agent is able to successfully imitate the demonstration.]]<br />
<br />
[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]<br />
<br />
===Navigation in Indoor Office Environments===<br />
In this experiment, the robot was shown a single image or multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that did not require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided in the appendix of the paper). The robot was then placed on an unseen floor of the building, with different textures and furniture layout, for performing visual imitation at test time.<br />
<br />
The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with a learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.<br />
<br />
Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fares much better than all previous baselines.<br />
<br />
[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image<br />
(top-left). Since the initial and goal image has no overlap, the robot first explores the environment<br />
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42<br />
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and<br />
such exploratory behavior naturally emerged from the self-supervised learning.]]<br />
<br />
[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image<br />
of goal in an unseen environment. Each column represents a different run of our system for a<br />
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given<br />
a successful run but reaches the goal successfully at a much higher rate.]]<br />
<br />
Figure 5 and Table 2 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. This was required when the end goal was far away from the robot, such as in another room. Notably, zero-shot visual imitation is robust to a changing environment: not every frame needs to match the demonstrated frame, because the demonstration only provides sparse landmarks.<br />
<br />
[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of<br />
images (top row). The TurtleBot is positioned in a manner such that the first image in the demonstration<br />
has no overlap with its current observation. Even under this condition, the robot is able to move closer<br />
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration<br />
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint<br />
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from<br />
WayPoint-1.]]<br />
<br />
[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in<br />
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three<br />
runs of two different demonstrations. Results show that our method outperforms the baselines. Note<br />
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions<br />
and neither model succeeded. Detailed results are available in the supplementary materials.]]<br />
<br />
===3D Navigation in VizDoom===<br />
<br />
To round off the experiments, a VizDoom simulation environment was used to test the GSP. VizDoom is a popular Doom-based reinforcement learning testbed. It allows agents to play Doom using only the screen buffer. It is a 3D simulation environment that is traditionally considered harder than 2D domains such as Atari. The goal was to measure the robustness of each method with proper error bars, the role of the initial self-supervised data collection, and the quantitative difference between modeling the forward consistency loss in feature space and in raw visual space. <br />
<br />
Data were collected using two methods: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.<br />
<br />
Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.<br />
<br />
[[File:8-Table3.png | 650px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual<br />
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for<br />
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]<br />
<br />
==Discussion==<br />
<br />
This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.<br />
<br />
A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending it to third-person demonstrations may yield a more general framework. Also, the current framework assumes that the visual observations of the expert and the agent are similar. When the expert performs a demonstration in one setting, such as daylight, and the agent performs the task in another, such as the evening, results may worsen. <br />
<br />
The expert demonstrations are also purely imitated; that is, the agent does not learn from the demonstrations. Future work could look into learning from the demonstrations so as to enrich its exploration techniques.<br />
<br />
This work used a sequence of images to provide a demonstration but the work, in general, does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work. Future work would also explore how multiple tasks can be combined into a single model, where different tasks might come from different contexts. Finally, it would be exciting to explore explicit handling of domain shift in future work, so as to handle large differences in embodiment and learn skills directly from videos of human demonstrators obtained, for example, from the Internet.<br />
<br />
==Critique==<br />
1. The paper is well written and easy to understand. In addition, the experimental evaluations are promising. The proposed method is also novel and interesting, and it could be used as an alternative to pure RL. <br />
<br />
2. The authors do not clearly explain why zero-shot imitation should be used instead of a trained reinforcement learning model. More detail on this issue is needed.<br />
<br />
3. It is notable that the experimental evaluations were carried out on real robots. However, the scalability of the method is not demonstrated: it is unclear how to extend it to higher-dimensional action spaces and whether it would be expensive in such spaces.<br />
<br />
4. I think another test in which the goal is fixed and the robot remains in its original position would provide some interesting insight. Having the obstacles move around is another scenario that could be integrated into the tests.<br />
<br />
==References==<br />
<br />
[1] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.<br />
<br />
[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning<br />
from demonstration. Robotics and autonomous systems, 2009.<br />
<br />
[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood<br />
Cliffs, NJ, 1977.<br />
<br />
[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke<br />
by poking: Experiential learning of intuitive physics. NIPS, 2016.<br />
<br />
[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination<br />
for robotic grasping with large-scale data collection. In ISER, 2016.<br />
<br />
[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and<br />
700 robot hours. ICRA, 2016.<br />
<br />
[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey<br />
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.<br />
ICRA, 2017.<br />
<br />
[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration<br />
by self-supervised prediction. In ICML, 2017.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent&diff=42382A Bayesian Perspective on Generalization and Stochastic Gradient Descent2018-12-11T00:32:54Z<p>Msminhas: Editorial</p>
<hr />
<div>==Introduction==<br />
This paper shows how Bayesian principles can explain many recent observations in the deep learning literature and provide practical new insights. This work builds on Zhang et al. (2016), who showed that deep convolutional neural networks continue to perform well on the training data even after random relabelling of the input data points, but that their ability to generalize, and hence their performance on the test data points, goes down after the random relabelling. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? <br />
<br />
The paper shows that the same phenomenon occurs even in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.<br />
<br />
The authors propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, the authors identify the “noise scale” <math display="inline"> g \approx \epsilon N/B </math>, where <math display="inline">\epsilon</math> is the learning rate, <math display="inline">N</math> the training set size and <math display="inline">B</math> the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, <math display="inline">B_{opt} \propto \epsilon N</math>. The authors verify these predictions empirically.<br />
<br />
==Motivation and Related Work==<br />
Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? <br />
<br />
Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows improving the estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds.<br />
<br />
Many authors have suggested “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation, by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.<br />
<br />
==Contribution==<br />
<br />
The main contributions of this paper are to show that:<br />
* The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. Overparameterization occurs when a model is able to effectively “remember” training data. This occurs when there are enough parameters that the system of equations ends up with an infinite number of possible solutions. One can see why this over-training would lead to poor results in test cases, as this “memorization” learns noise as opposed to the inherent structure of different classes. It is demonstrated that this phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization.<br />
* SGD integrates a stochastic differential equation whose “noise scale” is <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size, and <math>B</math> the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is '''proportional to the learning rate and training set size'''.<br />
<br />
Zhang et al. (2016) showed high training competency of neural networks under informative labels, but drastic overfitting on improper labels. This implies weak generalizability even when a small proportion of labels are improper. The authors show that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Bayesians tend to make distributional assumptions on gradient updates by adding isotropic Gaussian noise. This paper builds upon these Bayesian principles by arguing that noise drives SGD away from sharp minima and towards broad minima (broader minima generalize better because they are less sensitive to small perturbations of the input). The stochastic differential equation used as a component of gradient updates effectively serves as injected noise that improves a network's generalizability.<br />
<br />
==Main Results==<br />
<br />
The weakly regularized model memorizes random labels, but it generalizes properly on informative labels. Its predictions are also overconfident. The authors also showed that the test accuracy peaks at an optimal batch size if one holds the other SGD hyper-parameters constant. It is postulated that the optimum represents a tradeoff between depth and breadth in the Bayesian evidence. However, it is the underlying scale of random fluctuations in the SGD dynamics which controls the tradeoff, not the batch size itself. Furthermore, this test accuracy peak shifts as the training set size rises. The authors observed that the best batch size found is proportional to the learning rate. This scaling rule allowed the authors to increase the learning rate while simultaneously increasing the batch size, with no loss in test accuracy and no increase in computational cost, so that parallelism across multiple GPUs can be fully leveraged to decrease training time. The scaling rule could also be applied to production models by progressively increasing the batch size as new training data is introduced.<br />
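<br />
The scaling rule can be illustrated with a short calculation. The sketch below simply applies <math>g \approx \epsilon N/B</math>; the function names and the numerical values (learning rates, batch size, training set size) are hypothetical and chosen only for illustration.<br />

```python
def noise_scale(learning_rate, train_size, batch_size):
    """Approximate SGD noise scale g ~ eps * N / B."""
    return learning_rate * train_size / batch_size


def rescaled_batch_size(batch_size, old_lr, new_lr, old_n=None, new_n=None):
    """Batch size that keeps g constant when the learning rate (and optionally N) changes."""
    scale = new_lr / old_lr
    if old_n is not None and new_n is not None:
        scale *= new_n / old_n
    return batch_size * scale


# Quadrupling the learning rate calls for a batch size four times larger.
g_before = noise_scale(0.1, 50_000, 128)
b_after = rescaled_batch_size(128, old_lr=0.1, new_lr=0.4)
g_after = noise_scale(0.4, 50_000, b_after)
```

Keeping the noise scale fixed in this way is what allows the learning rate and batch size to grow together; in practice the result would be rounded to a batch size the hardware supports.<br />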
<br />
==Bayesian Model Comparison==<br />
<br />
===Introduction to Bayesian Statistics===<br />
Bayes' theorem is a fundamental theorem in Bayesian statistics, as it is used by Bayesian methods to update probabilities, which are degrees of belief, after obtaining new data. Given two events <math>A</math> and <math>B</math>, Bayes' theorem gives the conditional probability of <math>A</math> given that <math>B </math> is true as<br />
\begin{align*}\displaystyle P(A\mid B)={\frac {P(B\mid A)P(A)}{P(B)}}\end{align*}<br />
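<br />
A small numerical example may help. The sketch below expands <math>P(B)</math> with the law of total probability; the function name and the probabilities used are made up for illustration.<br />

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A|B) via Bayes' theorem, with P(B) expanded by the law of total probability."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: a rare event A (1%) and an observation B that is
# likely under A (90%) but also occasionally seen without A (5%).
p = posterior(prior_a=0.01, likelihood_b_given_a=0.9, likelihood_b_given_not_a=0.05)
```

Even though the observation is strongly associated with <math>A</math>, the posterior stays well below one half because the prior is small.<br />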
<br />
Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if <math>m </math> parent nodes represent <math>m </math> Boolean variables then the probability function could be represented by a table of <math>2^{m} </math> entries, one entry for each of the <math>2^{m} </math> possible parent combinations. <br />
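<br />
The <math>2^{m}</math>-entry table mentioned above can be built explicitly. This is a minimal sketch; the helper <code>make_cpt</code> and the example probabilities are hypothetical.<br />

```python
from itertools import product


def make_cpt(num_parents, prob_true):
    """Return {parent_assignment: P(node=True | parents)} with 2**m rows."""
    return {
        parents: prob_true(parents)
        for parents in product((False, True), repeat=num_parents)
    }


# Hypothetical node with three Boolean parents: likely True if any parent is True.
cpt = make_cpt(3, lambda parents: 0.9 if any(parents) else 0.1)
```

With three parents the table has eight rows, one per combination of parent values, and it doubles with every additional parent.<br />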
<br />
===Bayesian Model Comparison in Neural Networks===<br />
MacKay (1992) applied Bayesian model comparison to neural networks. An overview is presented below. <br />
<br />
We first consider a classification model <math>M </math> with a single parameter <math>\omega </math>, training inputs <math>x </math> and training labels <math>y </math>. We can infer a posterior probability distribution over the parameter by applying Bayes' theorem:<br />
<br />
\begin{align*}P(\omega\mid y,x;M) = \frac{P(y\mid \omega,x;M)P(\omega;M) }{P(y\mid x;M)}\end{align*}<br />
<br />
The likelihood is <math>P(y\mid \omega,x;M) = \prod_i P(y_i\mid \omega,x_i;M) = e^{-H(\omega;M)} </math>, where <math>H(\omega;M) = -\sum_i \ln P(y_i\mid \omega,x_i;M) </math> denotes the cross-entropy of unique categorical labels. Using a Gaussian prior, <math>P(\omega;M) = \sqrt{\lambda/2\pi}\, e^{-\lambda\omega^2/2} </math>, the posterior probability density of the parameter given the training data is <math>P(\omega\mid y,x;M) \propto \sqrt{\lambda/2\pi}\, e^{-C(\omega;M)} </math>, where <math>C(\omega;M) = H(\omega;M) + \lambda\omega^2/2 </math> denotes the L2 regularized cross entropy, or “cost function”, and <math>\lambda </math> is the regularization coefficient. <br />
<br />
The value <math>\omega_0 </math> which minimizes the cost function lies at the maximum of this posterior. To predict an unknown label <math>y_t </math> of a new input <math>x_t </math>, we should compute the integral,<br />
<br />
\begin{align*} P(y_t\mid x_t,y,x;M) &= \int d\omega \, P(y_t\mid \omega,x_t;M) P(\omega\mid y,x;M)\\ &= \frac{\int d \omega P(y_t \mid \omega ,x_t;M)e^{-C(\omega;M)}}{\int d \omega e^{-C(\omega;M)}} \end{align*}<br />
<br />
However, these integrals are dominated by the region near <math>\omega_0 </math>, so we usually approximate <math>P(y_t\mid x_t,x,y;M) \approx P(y_t\mid \omega_0,x_t;M) </math>. Having minimized <math>C(\omega;M) </math> to find <math>\omega_0 </math>, we now wish to compare two different models and select the best one. We use the probability ratio<br />
<br />
\begin{align*}\frac{P(M_1\mid y,x)}{P (M_2\mid y, x)} = \frac{P(y\mid x;M_1) P(M_1)}{ P (y\mid x; M_2) P (M_2)} . \end{align*} <br />
<br />
The second factor on the right is the prior ratio, which describes which model is most plausible. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile, the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs.<br />
<br />
Germain et al. (2016) showed that maximizing the evidence (or “marginal likelihood”) minimizes a PAC-Bayes generalization bound. To compute it, we evaluate <br />
\begin{align*}P(y\mid x;M) &= \int d\omega P(y\mid \omega,x;M)P(\omega;M) \\ &=\sqrt{\frac{\lambda}{2\pi}}\int d \omega e^{-C(\omega;M)}\end{align*}<br />
<br />
Notice that the evidence is computed by integrating out the parameters; and consequently, it is invariant to the model parameterization. <br />
Since this integral is dominated by the region near the minimum <math>\omega_0 </math>, we can estimate the evidence by Taylor expanding <math>C(\omega; M) \approx C(\omega_0) + C''(\omega_0)(\omega - \omega_0)^2/2</math>. This gives us<br />
<br />
\begin{align*} P(y\mid x;M) &\approx e^{-C(\omega_0)}\sqrt{\frac{\lambda}{2\pi}} \int d \omega\, e^{-C''(\omega_0)(\omega - \omega_0)^2/2}\\ &= \exp \big\{- C(\omega_0)-\frac{1}{2}\ln(C''(\omega_0)/\lambda) \big\}.\end{align*}<br />
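
The Gaussian integral behind this estimate can be checked numerically. The following is a minimal sketch, assuming a hypothetical one-parameter model whose cross-entropy is exactly quadratic, <math>H(\omega) = a(\omega-b)^2/2 + c</math> (the constants <math>a, b, c</math> and both function names are illustrative and do not come from the paper). For such a cost the Laplace estimate <math>\exp\{-C(\omega_0) - \tfrac{1}{2}\ln(C''(\omega_0)/\lambda)\}</math> should agree with direct numerical integration of <math>\sqrt{\lambda/2\pi}\int d\omega\, e^{-C(\omega)}</math>.

```python
import numpy as np


def laplace_evidence(cost_min, curvature, lam):
    """Laplace estimate exp(-C(w0) - 0.5 * ln(C''(w0) / lam))."""
    return np.exp(-cost_min - 0.5 * np.log(curvature / lam))


def numeric_evidence(cost, lam, lo=-20.0, hi=20.0, n=400001):
    """sqrt(lam / 2pi) * integral of exp(-C(w)) dw, via a fine Riemann sum."""
    w = np.linspace(lo, hi, n)
    dw = w[1] - w[0]
    return np.sqrt(lam / (2.0 * np.pi)) * np.sum(np.exp(-cost(w))) * dw


# Hypothetical quadratic cross-entropy H(w) = a * (w - b)^2 / 2 + c.
a, b, c, lam = 3.0, 1.5, 0.7, 0.5


def cost(w):
    """L2-regularized cost C(w) = H(w) + lam * w^2 / 2."""
    return a * (w - b) ** 2 / 2.0 + c + lam * w ** 2 / 2.0


w0 = a * b / (a + lam)   # minimizer of the cost
curvature = a + lam      # second derivative C''(w0)

approx = laplace_evidence(cost(w0), curvature, lam)
exact = numeric_evidence(cost, lam)
```

For a non-quadratic cost the two numbers would differ, and the size of that difference is one way to judge how far the Laplace approximation can be trusted.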
<br />
The evidence is controlled by the value of the cost function at the minimum, and by the logarithm of the ratio of the curvature about this minimum to the regularization constant. In models with <math>p</math> parameters, the curvature is replaced by the determinant of the Hessian of the cost function at <math>\omega_0</math>, whose eigenvalues we denote <math>\lambda_i</math>: <br />
\begin{align*} P(y\mid x;M) &\approx \frac{\lambda^{p/2}\, e^{-C(\omega_0)}}{|\nabla\nabla C(\omega)|_{\omega_0}^{1/2}} \\ &= \exp \big\{- C(\omega_0)-\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) \big\}.\end{align*}<br />
<br />
Occam’s factor arises from the log ratio <math>\ln (\lambda_i/\lambda) </math>. It describes the fraction of the prior parameter space consistent with the data, and so penalizes the amount of information the model must learn about the parameters to accurately model the training data. Since this fraction is always less than one, the authors propose to approximate <math>P(y\mid x;M) </math> away from local minima by only performing the summation over eigenvalues <math>\lambda_i \geq \lambda </math>.<br />
<br />
The authors compare the evidence against a null model which assumes the labels are entirely random. This model has no parameters, and so its evidence is controlled by the likelihood alone: <math>P(y\mid x;NULL) = (1/n)^N = e^{-N \ln(n)} </math>, where <math>n </math> denotes the number of classes and <math>N</math> the number of training labels. The evidence ratio is<br />
\begin{equation*}\frac{P(y\mid x;M) }{P(y\mid x;NULL) } = e ^{-E(\omega_0)}, \end{equation*}<br />
where <math>E(\omega_0) = C(\omega_0)+\frac{1}{2} \sum_{i=1}^p \ln (\lambda_i/\lambda) - N\ln (n) </math> is the log evidence ratio in favor of the null model.<br />
The authors assign confidence to the predictions of a model if and only if <math>E(\omega_0) < 0 </math>.<br />
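
Dividing the Laplace estimate of the evidence by the null evidence <math>e^{-N\ln(n)}</math> gives <math>E(\omega_0) = C(\omega_0) + \tfrac{1}{2}\sum_i \ln(\lambda_i/\lambda) - N\ln(n)</math>. The sketch below evaluates this quantity from a list of Hessian eigenvalues, keeping only eigenvalues <math>\lambda_i \geq \lambda</math> as the authors propose. The function name and the numbers in the example are hypothetical and are not results from the paper.

```python
import numpy as np


def log_evidence_ratio(cost_min, hessian_eigs, lam, n_labels, n_classes):
    """E(w0) = C(w0) + 0.5 * sum_i ln(l_i / lam) - N * ln(n).

    Only Hessian eigenvalues l_i >= lam enter the sum, so each term
    of the Occam penalty is non-negative.
    """
    eigs = np.asarray(hessian_eigs, dtype=float)
    kept = eigs[eigs >= lam]
    occam_penalty = 0.5 * np.sum(np.log(kept / lam))
    return cost_min + occam_penalty - n_labels * np.log(n_classes)


# Made-up values: a binary task with 800 training labels.
E = log_evidence_ratio(cost_min=300.0, hessian_eigs=[0.1, 2.0, 8.0],
                       lam=1.0, n_labels=800, n_classes=2)
model_is_trusted = E < 0   # negative E favors the model over the null model
```

A negative value means the model is more plausible than assigning equal probability to every class.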
<br />
The evidence supports the intuition that broad minima generalize better than sharp minima, but unlike the curvature, it does not depend on the model parameterization. Dinh et al. (2017) showed one can increase the Hessian eigenvalues by rescaling the parameters, but they must simultaneously rescale the regularization coefficients, otherwise the model changes. Since Occam’s factor arises from the log ratio, <math>\ln (\lambda_i/\lambda) </math> , these two effects cancel out. Note however that while the evidence itself is invariant to model parameterization, one can find reparameterizations which change the approximate evidence after the Laplace approximation. It is difficult to evaluate the evidence for deep networks, as we cannot compute the Hessian of millions of parameters. Additionally, neural networks exhibit many equivalent minima, since we can permute the hidden units without changing the model. To compute the evidence we must carefully account for this “degeneracy”. The authors argue these issues are not a major limitation, since the intuition they build studying the evidence in simple cases will be sufficient to explain the results of both Zhang et al. (2016) and Keskar et al. (2016).<br />
<br />
==Bayes Theorem and Generalization==<br />
Zhang et al. (2016) showed that deep neural networks generalize well on training inputs with informative labels, but the same model can overfit on the same input images when the labels are randomized, perfectly memorizing the training set. To demonstrate that these observations are not unique to deep networks, the authors use logistic regression. They form a small balanced training set comprising 800 images from MNIST, of which half have the true label “0” and half the true label “1”. The test set is balanced, comprising 5000 MNIST images of zeros and 5000 MNIST images of ones. There are two tasks. In the first task, the labels of both the training and test sets are randomized. In the second task, the labels are informative, matching the true MNIST labels. The model has 784 weights and 1 bias.<br />
<br />
The accuracy of the model predictions on both the training and test sets is shown in figure 1. When trained on the informative labels, the model generalizes well to the test set, so long as it is weakly regularized. However the model also perfectly memorizes the random labels, replicating the observations of Zhang et al. (2016) in deep networks. No significant improvement in model performance is observed as the regularization coefficient increases. For completeness, we also evaluate the mean margin between training examples and the decision boundary. For both random and informative labels, the margin drops significantly as we reduce the regularization coefficient. When weakly regularized, the mean margin is roughly 50% larger for informative labels than for random labels.<br />
<br />
[[File:bg1.png|800px|thumb|center|]]<br />
<br />
Now consider figure 2, where we plot the mean cross-entropy of the model predictions, evaluated on both training and test sets, as well as the Bayesian log evidence ratio defined in the previous section. Looking first at the random label experiment in figure 2a, while the cross-entropy on the training set vanishes when the model is weakly regularized, the cross-entropy on the test set explodes. Not only does the model make random predictions, but it is extremely confident in those predictions. As the regularization coefficient is increased the test set cross-entropy falls, settling at <math>\ln(2)</math>, the cross-entropy of assigning equal probability to both classes. Now consider the Bayesian evidence, which we evaluate on the training set. The log evidence ratio is large and positive when the model is weakly regularized, indicating that the model is exponentially less plausible than assigning equal probabilities to each class. As the regularization parameter is increased, the log evidence ratio falls, but it is always positive, indicating that the model can never be expected to generalize well.<br />
Now consider figure 2b (informative labels). Once again, the training cross-entropy falls to zero when the model is weakly regularized, while the test cross-entropy is high. Even though the model makes accurate predictions, those predictions are overconfident. As the regularization coefficient increases, the test cross-entropy falls below <math>\ln(2)</math>, indicating that the model is successfully generalizing to the test set. Now consider the Bayesian evidence. The log evidence ratio is large and positive when the model is weakly regularized, but as the regularization coefficient increases, the log evidence ratio drops below zero, indicating that the model is exponentially more plausible than assigning equal probabilities to each class. As we further increase the regularization, the log evidence ratio rises to zero while the test cross-entropy rises to <math>\ln(2)</math>. Test cross-entropy and Bayesian evidence are strongly correlated, with minima at the same regularization strength.<br />
<br />
Bayesian model comparison has explained our results in a logistic regression. Meanwhile, Krueger et al. (2017) showed the largest Hessian eigenvalue also increased when training on random labels in deep networks, implying the evidence is falling. We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot. Dziugaite & Roy (2017) recently demonstrated the results of Zhang et al. (2016) can also be understood by minimising a PAC-Bayes generalization bound which penalizes sharp minima.<br />
[[File:bg2.png|800px|thumb|center|]]<br />
==Bayes Theorem and Stochastic Gradient Descent ==<br />
<br />
We showed above that generalization is strongly correlated with the Bayesian evidence, a weighted combination of the depth of a minimum (the cost function) and its breadth (the Occam factor). Consequently Bayesians often add isotropic Gaussian noise to the gradient (Welling & Teh, 2011). In appendix A, we show this drives the parameters towards broad minima whose evidence is large. The noise introduced by small batch training is not isotropic, and its covariance matrix is a function of the parameter values, but empirically Keskar et al. (2016) found it has similar effects, driving the SGD away from sharp minima. This paper therefore proposes Bayesian principles also account for the “generalization gap”, whereby the test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant). Since the gradient drives the SGD towards deep minima, while noise drives the SGD towards broad minima, we expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.<br />
We were unable to observe a generalization gap in linear models (since linear models are convex there are no sharp minima to avoid). Instead we consider a shallow neural network with 800 hidden units and ReLU hidden activations, trained on MNIST without regularization. We use SGD with a momentum parameter of 0.9. Unless otherwise stated, we use a constant learning rate of 1.0 which does not depend on the batch size or decay during training. Furthermore, we train on just 1000 images, selected at random from the MNIST training set. This enables us to compare small batch to full batch training. We emphasize that we are not trying to achieve optimal performance, but to study a simple model which shows a generalization gap between small and large batch training.<br />
In figure 3, we exhibit the evolution of the test accuracy and test cross-entropy during training. Our small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches. Now consider figure 3b. While the test cross-entropy for small batch training is lower at the end of training, the cross-entropy of both small and large training batches is increasing, which is indicative of over-fitting. Both models exhibit a minimum test cross-entropy, although after different numbers of gradient updates. Intriguingly, we show in appendix B that the generalization gap between small and large batch training shrinks significantly when we introduce L2 regularization.<br />
<br />
[[File:bg3.png|800px|thumb|center|]]<br />
<br />
From now on we focus on the test set accuracy (since this converges as the number of gradient updates increases). In figure 4a, we exhibit training curves for a range of batch sizes between 1 and 1000. We find that the model cannot train when the batch size <math>B \leq 10</math>. In figure 4b we plot the mean test set accuracy after 10,000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with Bayesian intuition. The results of Keskar et al. (2016) focused on the decay in test accuracy above this optimum batch size.<br />
[[File:bg4.png|800px|thumb|center|]]<br />
<br />
==Stochastic Differential Equations and Scaling Rules==<br />
The results shown above indicate that the test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant. The authors argue that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence. However, it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics. This section identifies this SGD “noise scale”, and uses it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size and momentum coefficient. <br />
First, interpret the gradient update as the discrete update of a stochastic differential equation, <br />
\begin{equation*}\frac{d\omega}{dt} = -\frac{dC}{d\omega} + \eta(t),\end{equation*}<br />
where <math>\eta</math> represents noise with <math>\langle \eta(t) \rangle = 0</math> and <math> \langle \eta (t)\eta (t')\rangle = gF (\omega)\delta (t-t')</math>.<br />
Here <math>t</math> is a continuous variable and <math>F(\omega)</math> is a matrix describing the gradient covariances.<br />
The SGD noise scale is taken to be <math>g \approx \epsilon N/B</math>, where <math>\epsilon</math> is the learning rate, <math>N</math> the training set size and <math>B</math> the batch size.<br />
[[File:bg5.png|800px|thumb|center|]]<br />
[[File:bg6.png|800px|thumb|center|]]<br />
[[File:bg7.png|800px|thumb|center|]]<br />
The noise scale falls when the batch size <math>B</math> increases, consistent with our earlier observation of an optimal batch size <math>B_{opt}</math> while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. A similar analysis of the SGD was recently performed by Mandt et al. (2017), although their treatment only holds near local minima where the covariances <math>F(\omega)</math> are stationary. Our analysis holds throughout training, which is necessary since Keskar et al. (2016) found that the beneficial influence of noise was most pronounced at the start of training.<br />
When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies that <math>B_{opt} \propto \epsilon N</math>. In figure 5a, we plot the test accuracy as a function of batch size after <math>(10000/\epsilon)</math> training steps, for a range of learning rates. Exactly as predicted, the peak moves to the right as <math>\epsilon</math> increases. Additionally, the peak test accuracy achieved at a given learning rate does not begin to fall until <math>\epsilon \sim 3</math>, indicating that there is no significant discretization error in integrating the stochastic differential equation below this point. Above this point, the discretization error begins to dominate and the peak test accuracy falls rapidly. In figure 5b, we plot the best observed batch size as a function of learning rate, observing a clear linear trend, <math>B_{opt} \propto \epsilon</math>. The error bars indicate the distance from the best observed batch size to the next batch size sampled in our experiments.<br />
<br />
This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times (Goyal et al., 2017). A similar scaling rule was independently proposed by Jastrzebski et al. (2017) and Chaudhari & Soatto (2017), although neither work identifies the existence of an optimal noise scale. A number of authors have proposed adjusting the batch size adaptively during training (Friedlander & Schmidt, 2012; Byrd et al., 2012; De et al., 2017), while Balles et al. (2016) proposed linearly coupling the learning rate and batch size within this framework. In Smith et al. (2017), we show empirically that decaying the learning rate during training and increasing the batch size during training are equivalent.<br />
In figure 6a we exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps (<math>\epsilon = 1</math> everywhere). Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases. In figure 6b, we plot the best observed batch size as a function of training set size, observing another linear trend, <math>B_{opt} \propto N</math>. This scaling rule could be applied to production models, progressively growing the batch size as new training data is collected. We expect production datasets to grow considerably over time, and consequently large batch training is likely to become increasingly common.<br />
For SGD with momentum coefficient <math>m</math>, the noise scale becomes <math>g \approx \epsilon N/(B(1-m))</math>, which reduces to the noise scale of conventional SGD as <math>m \rightarrow 0</math>. When <math>m > 0</math>, we obtain an additional scaling rule <math>B_{opt} \propto 1/(1 - m)</math>. This scaling rule predicts that the optimal batch size will increase when the momentum coefficient is increased. In figure 7a we plot the test set performance as a function of batch size after 10000 gradient updates (<math>\epsilon = 1</math> everywhere), for a range of momentum coefficients. In figure 7b, we plot the best observed batch size as a function of the momentum coefficient, and fit our results to the scaling rule above, obtaining remarkably good agreement.<br />
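
The three scaling rules can be combined into a single helper. This is a minimal sketch, assuming the noise scale with momentum is <math>g \approx \epsilon N/(B(1-m))</math>, so that holding <math>g</math> fixed gives <math>B_{opt} \propto \epsilon N/(1-m)</math>. Both function names are hypothetical and the example values are for illustration only.

```python
def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale g = lr * N / (B * (1 - m))."""
    return lr * train_size / (batch_size * (1.0 - momentum))


def rescaled_batch_size(batch_size, lr, train_size, momentum,
                        new_lr, new_train_size, new_momentum):
    """Batch size that keeps the noise scale g unchanged after the
    learning rate, training set size or momentum is changed."""
    g = noise_scale(lr, train_size, batch_size, momentum)
    return new_lr * new_train_size / (g * (1.0 - new_momentum))


# Doubling the learning rate should double the batch size.
doubled_lr = rescaled_batch_size(30, 1.0, 1000, 0.9,
                                 new_lr=2.0, new_train_size=1000,
                                 new_momentum=0.9)
```

The result is a real number, so in practice it would be rounded to a nearby usable batch size.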
<br />
==Critiques==<br />
<br />
#At present, Bayesian statistics has not been proven to be a theory that explains why a learning algorithm works. The Bayesian theory is too optimistic: we introduce a prior and a model and then trust both implicitly. Relative to any particular prior and model (likelihood), the Bayesian posterior is the optimal summary of the data, but if either part is misspecified, then the Bayesian posterior carries no optimality guarantee. The prior is chosen for convenience here. <br />
#There is no discussion of the information bottleneck analysis, which also addresses the generalization ability of a model. <br />
#There is no discussion of real online learning with streaming data, where the total number of data points is unknown.<br />
#The paper presents how mini-batch noise in SGD can improve the performance of neural networks. However, the usefulness of the approach could be described and analyzed in greater detail if the authors provided results on a range of well-known real-life datasets.<br />
<br />
==Conclusion==<br />
<br />
The paper showed that mini-batch noise helps SGD move away from sharp minima, and provided evidence that there is an optimal batch size which maximizes the test accuracy. Interpreting SGD as the integration of a stochastic differential equation, the authors showed that this batch size is proportional to the learning rate and to the training set size. Moreover, they showed that <math>B_{opt} \propto 1/(1 - m) </math>, where <math>m</math> is the momentum coefficient. Further analysis of the relation between the learning rate, effective learning rate, and batch size was presented at ICLR 2018, where the authors showed experimentally that the benefits of decaying the learning rate can be obtained by instead increasing the batch size, which also dramatically reduces the number of parameter updates, and that literature hyper-parameters can be reused without additional tuning (Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le).<br />
<br />
==References==<br />
<br />
#Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.<br />
#Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.<br />
#Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012. <br />
#Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.<br />
#Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
#Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
#Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.<br />
#Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.<br />
#Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
#Crispin W Gardiner. Handbook of Stochastic Methods, volume 4. Springer Berlin, 1985.<br />
#Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.<br />
#Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
#Stephen F Gull. Bayesian inductive inference and maximum entropy. In Maximum-entropy and Bayesian methods in science and engineering, pp. 53–74. Springer, 1988.<br />
#Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.<br />
#Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
#Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
#Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.<br />
#Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.<br />
#Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.<br />
#Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
#David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via memorization. ICLR Workshop, 2017.<br />
#Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017.<br />
#David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.<br />
#Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
#Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.<br />
#Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.<br />
#Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
#Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Co-Teaching&diff=42381Co-Teaching2018-12-11T00:28:54Z<p>Msminhas: Editorial</p>
<hr />
<div>=Introduction=<br />
==Title of Paper==<br />
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels<br />
==Contributions==<br />
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed approach, named ‘co-teaching’, maintains two networks simultaneously, each of which is trained on selected clean instances, and avoids estimating the noise transition matrix. In addition, the networks are trained using stochastic optimization with momentum, and because nonlinear deep networks can memorize the clean data, they become robust. The experiments are conducted on noisy versions of the MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), deep learning models trained by the co-teaching approach are much more robust than state-of-the-art baselines.<br />
<br />
==Terminology==<br />
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data. <br />
<br />
Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels. This can result in false positives.<br />
<br />
=Intuition=<br />
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually when learning on a batch of noisy data, only the error from the network itself is transferred back to facilitate learning. In the case of Co-teaching, however, the two networks are able to filter different types of errors, and the filtered error flows back both to the network itself and to the other network. As a result, the two models learn together, each from itself and from its partner network.<br />
<br />
=Motivation=<br />
The paper draws motivation from two key facts:<br />
• That many data collection processes yield noisy labels. <br />
• That deep neural networks have a high capacity to overfit to noisy labels. <br />
Because of these facts, it is challenging to train deep networks to be robust with noisy labels. <br />
=Related Works=<br />
<br />
1. Statistical learning methods: Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into 3 strands: surrogate loss, noise estimation, and probabilistic modelling. In the surrogate loss category, one work proposes an unbiased estimator to provide the noise corrected loss approach. Another work presented a robust non-convex loss, which is the special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of ROC curve. In the probabilistic modelling category, there is a two coin model proposed to handle noise labels from multiple annotators. <br />
<br />
2. Deep learning methods: There are also deep learning approaches that can be used to approach data with noisy labels. One work proposed a unified framework to distill knowledge from clean labels and knowledge graphs. Another work trained a label cleaning network by a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously. <br />
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper adds a crowd layer after the output layer for noisy labels from multiple annotators. <br />
<br />
3. Learning to teach methods: This is another approach to the problem. These methods consist of a teacher network and a student network. The teacher network selects more informative instances for better training of the student network. Most works did not account for noisy labels, with the exception of MentorNet, which applied the idea to data with noisy labels.<br />
<br />
=Co-Teaching Algorithm=<br />
<br />
[[File:Co-Teaching_Algorithm.png|600px|center]]<br />
<br />
The idea as shown in the algorithm above is to train two deep networks simultaneously. In each mini-batch using mini-batch gradient descent, each network selects its small-loss instances as useful knowledge and then teaches these useful instances to the peer network. <math>R(T)</math> governs the percentage of small-loss instances to be used in updating the parameters of each network.<br />
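
The selection and exchange step described above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the per-example losses are assumed to have been computed already by each network, and the schedule <math>R(T) = 1 - \min(\tau T/T_k, \tau)</math> (with noise rate <math>\tau</math> and a warm-up length <math>T_k</math>) is an assumed form of <math>R(T)</math> that this summary does not spell out.

```python
import numpy as np


def keep_rate(epoch, noise_rate, warmup_epochs):
    """Assumed schedule R(T) = 1 - min(T / T_k * tau, tau): keep the
    whole mini-batch at first, then drop up to a fraction tau of it."""
    return 1.0 - min(epoch / warmup_epochs * noise_rate, noise_rate)


def small_loss_indices(losses, rate):
    """Indices of the fraction `rate` of examples with the smallest loss."""
    n_keep = max(1, int(rate * len(losses)))
    return np.argsort(losses)[:n_keep]


def co_teaching_selection(losses_f, losses_g, rate):
    """Each network picks its small-loss examples and hands them to its
    peer. Returns (indices used to update f, indices used to update g)."""
    picked_by_f = small_loss_indices(np.asarray(losses_f), rate)
    picked_by_g = small_loss_indices(np.asarray(losses_g), rate)
    return picked_by_g, picked_by_f   # swapped on purpose


# Toy per-example losses of the two networks on one mini-batch.
for_f, for_g = co_teaching_selection([0.1, 2.0, 0.3, 5.0],
                                     [4.0, 0.2, 0.1, 0.5], rate=0.5)
```

In a full training loop each network would then take a gradient step on the examples chosen by its peer.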
<br />
=Summary of Experiment=<br />
==Proposed Method==<br />
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network. <br />
[[File:Co-Teaching Fig 1.png|600px|center]] <br />
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate). <br />
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself. <br />
==Dataset Corruption==<br />
The datasets used in this paper are MNIST, CIFAR-10 and CIFAR-100. A summary of these datasets is shown below. <br />
<br />
[[File:co_teaching_data.png|600px|center]] <br />
<br />
To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transition matrix <math>Q</math>, where <math>Q_{ij} = Pr(\widetilde{y} = j|y = i)</math> given that noisy <math>\widetilde{y}</math> is flipped from clean <math>y</math>. Two methods are used for generating such noise transition matrices: pair flipping and symmetry. <br />
[[File:Co-Teaching Fig 2.png|600px|center]] <br />
Three noise conditions are simulated for comparing co-teaching with baseline methods.<br />
<br />
Note: Corruption of Dataset here means randomly choosing a wrong label instead of the target label by applying noise. <br />
<br />
{| class="wikitable"<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Method<br />
|width="100pt"|Noise Rate<br />
|width="700pt"|Rationale<br />
|-<br />
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels. <br />
|-<br />
| Symmetry || 50% || Half of the instances have noisy labels. Labels have a constant probability of being corrupted. Further rationale can be found at [1].<br />
|-<br />
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario. <br />
|}<br />
|}<br />
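
The two corruption schemes can be written as explicit matrices <math>Q</math>. The sketch below assumes one common construction, which is not spelled out in this summary: pair flipping moves a fraction of each class to the next class (cyclically), and symmetric noise spreads the same fraction evenly over all other classes. All function names are hypothetical.

```python
import numpy as np


def pair_flip_matrix(n_classes, noise_rate):
    """Q[i, i] = 1 - rate and Q[i, i+1] = rate (wrapping around)."""
    q = (1.0 - noise_rate) * np.eye(n_classes)
    for i in range(n_classes):
        q[i, (i + 1) % n_classes] = noise_rate
    return q


def symmetric_matrix(n_classes, noise_rate):
    """Q[i, i] = 1 - rate and Q[i, j] = rate / (n - 1) for j != i."""
    q = np.full((n_classes, n_classes), noise_rate / (n_classes - 1))
    np.fill_diagonal(q, 1.0 - noise_rate)
    return q


def corrupt_labels(labels, q, seed=0):
    """Sample a noisy label for each clean label y from row Q[y]."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(q), p=q[y]) for y in labels])


pair_45 = pair_flip_matrix(10, 0.45)
sym_50 = symmetric_matrix(10, 0.50)
noisy = corrupt_labels(np.zeros(5000, dtype=int), pair_45)
```

Each row of <math>Q</math> is a probability distribution over noisy labels, so every row sums to one.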
<br />
==Baseline Comparisons==<br />
The co-teaching method is compared with several baseline approaches, which have varying:<br />
• proficiency in dealing with a large number of classes,<br />
• ability to resist heavy noise,<br />
• need to combine with specific network architectures, and<br />
• need to be pretrained. <br />
<br />
[[File:Co-Teaching Fig 3.png|600px|center]] <br />
===Bootstrap===<br />
The general idea behind bootstrapping is to dynamically change (correct) noisy labels during training. The training target is derived from both the original label and the predicted class: the final label is a convex combination of the two. The weighting of the prediction is increased over time to account for the model itself improving. This procedure needs to be finely tuned to prevent it from rampantly changing correct labels before the model becomes accurate [2].<br />
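
The convex combination can be illustrated as follows. This is a minimal sketch assuming a soft target of the form <math>\beta\, y + (1-\beta)\, q</math>, where <math>y</math> is the one-hot given label, <math>q</math> is the model's predicted class distribution, and <math>\beta</math> is a hypothetical mixing weight that would be lowered over time to trust the prediction more.

```python
import numpy as np


def bootstrap_target(noisy_onehot, predicted_probs, beta):
    """Convex combination beta * given label + (1 - beta) * prediction."""
    y = np.asarray(noisy_onehot, dtype=float)
    q = np.asarray(predicted_probs, dtype=float)
    return beta * y + (1.0 - beta) * q


# Given label is class 1, but the model currently favors class 0.
target = bootstrap_target([0, 1, 0], [0.7, 0.2, 0.1], beta=0.8)
```

Because both inputs are probability vectors, the resulting target is one as well.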
<br />
===S-Model===<br />
Using an additional softmax layer to model the noise transition matrix [3].<br />
===F-Correction===<br />
Correcting the prediction by using a noise transition matrix which is estimated by a standard network [4].<br />
===Decoupling===<br />
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].<br />
===MentorNet===<br />
A mentor network weights the probability of data instances being clean/noisy in order to train the student network on cleaner instances [6].<br />
<br />
As shown in the above table, the co-teaching method does not rely on any specific network architecture, can deal with a large number of classes, and is more robust to heavy noise. In addition, it can be trained from scratch. This makes co-teaching more appealing for practical usage.<br />
<br />
==Implementation Details==<br />
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. The networks also utilize dropout and batch normalization. <br />
<br />
[[File: Co-Teaching Table 3.png|center]] <br />
=Results and Discussion=<br />
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows. <br />
==MNIST==<br />
The results of testing on the MNIST dataset are shown below. The Symmetry-20% case can be taken as a near-baseline; all methods perform well. However, under the Symmetry-50% case, all methods except MentorNet and Co-Teaching drop below 90% accuracy. Under the Pair-45% case, all methods except MentorNet and Co-Teaching drop below 60%. Under both high-noise conditions, the Co-Teaching method produces the highest accuracy. Similar patterns can be seen in the two additional sets of test results, though the specific accuracy values are different. Co-Teaching performs best under the high-noise situations.<br />
<br />
The images labelled 'Figure 3' show test accuracy with respect to epoch of the various algorithms. Many algorithms show evidence of over-fitting or being influenced by noisy data, after reaching initial high accuracy. MentorNet and Co-Teaching experience this less than other methods, and Co-Teaching generally achieves higher accuracy than MentorNet. <br />
<br />
Robustness to noise plays an important role in the evaluation. The plots show that the proposed method is more robust than, or comparable to, the other methods.<br />
<br />
[[File:Co-Teaching Table 4.png|550px|center]]<br />
<br />
[[File:Co-Teaching Graphs MNIST.PNG|center]]<br />
<br />
==CIFAR10==<br />
The observations here are consistent with those for the MNIST dataset.<br />
[[File:Co-Teaching Table 5.png|550px|center]] <br />
<br />
[[File:Co-Teaching Graphs CIFAR10.PNG|center]]<br />
==CIFAR100==<br />
[[File:Co-Teaching Table 6.png|550px|center]] <br />
<br />
[[File: Co-Teaching Graphs CIFAR100.PNG|center]]<br />
==Choice of R(T) and <math> \tau</math>==<br />
The authors followed some principles when choosing R(T) and <math> \tau</math>. At the beginning, R(T)=1, so no instances need to be dropped: the parameters can safely be updated in the early stage using the whole noisy dataset, since deep neural networks do not memorize the noisy data at first. However, more instances need to be dropped at later stages, because the model would eventually try to fit the noisy data.<br />
<br />
<math>R(T) = 1 - \tau \cdot \min\{T^{c}/T_{k}, 1\}</math> with <math> \tau=\epsilon </math>, where <math> \epsilon </math> is the noise level.<br />
In this case, <math>c \in \{0.5, 1, 2\}</math> is considered. Table 7 shows that the test accuracy is stable across these values.<br />
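<br />
The schedule can be sketched as follows. The function names are hypothetical, and the selection step assumes, as in co-teaching, that each network keeps the R(T) fraction of a mini-batch with the smallest loss.<br />

```python
import numpy as np

def keep_rate(epoch, tau, t_k, c=1.0):
    """R(T) = 1 - tau * min(T^c / T_k, 1): the fraction of each mini-batch that is kept."""
    return 1.0 - tau * min(epoch ** c / t_k, 1.0)

def select_small_loss(losses, rate):
    """Indices of the `rate` fraction of instances with the smallest loss."""
    num_keep = int(rate * len(losses))
    return np.argsort(losses)[:num_keep]
```

With this form, all instances are kept at epoch 0 and the kept fraction settles at <math>1 - \tau</math> once <math>T^{c} \geq T_{k}</math>.<br />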
<br />
[[File: Co-Teaching Table 7.png|550px|center]] <br />
<br />
For <math> \tau</math>, the values <math> \tau \in \{0.5, 0.75, 1, 1.25, 1.5\}\epsilon</math> are considered. Table 8 shows that the performance can be improved by dropping more instances.<br />
[[File: Co-Teaching Table 8.png|550px|center]]<br />
<br />
=Conclusions=<br />
The main goal of the paper is to introduce the “Co-teaching” learning paradigm, which uses two deep neural networks learning simultaneously to avoid noisy labels. Experiments are performed on several datasets: MNIST, CIFAR-10, and CIFAR-100. The performance varied depending on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios (pair-45% and symmetry-50%), the co-teaching method outperforms the baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).<br />
<br />
=Future Work=<br />
For future work, the paper can be extended in the following ways. First, the co-teaching paradigm can be adapted to train deep models under other forms of weak supervision, e.g., positive and unlabeled data. Second, theoretical guarantees for co-teaching can be investigated. The current approach also seems to have potential application in eliminating noisy labels/data from biomedical signals, for example EEG data. This is important as EEG data are generally collected based on an experimental protocol and under controlled lab conditions. When data are collected in this way, signals may be labelled incorrectly according to the experimental protocol even though the underlying brain process does not correspond to that label. Such cases of wrong labeling/data need to be eliminated from the training process, and this is one scenario where co-teaching could possibly be applied. This method also seems to have potential application to data collected via crowd-sourcing, or to the same data being labelled by multiple human subjects. Further, there is no analysis of generalization performance for deep learning with noisy labels, which can also be studied in the future.<br />
<br />
=Critique=<br />
The paper evaluates the performance considering the complexity of computations and implementations of the algorithms. Co-teaching methodology seems an interesting idea but can possibly become tricky to implement. Technically, such complexity can negatively impact the performance of the algorithm. <br />
==Lack of Task Diversity==<br />
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality. <br />
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==<br />
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm. <br />
==Lack of Theoretical Development (Mentioned in conclusion)==<br />
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly.<br />
<br />
=References=<br />
[1] B. Van Rooyen, A. Menon, and B. Williamson. Learning with symmetric label noise: The<br />
importance of being unhinged. In NIPS, 2015.<br />
<br />
[2] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural<br />
networks on noisy labels with bootstrapping. In ICLR, 2015.<br />
<br />
[3] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer.<br />
In ICLR, 2017.<br />
<br />
[4] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to<br />
label noise: A loss correction approach. In CVPR, 2017.<br />
<br />
[5] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In<br />
NIPS, 2017.<br />
<br />
[6] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum<br />
for very deep neural networks on corrupted labels. In ICML, 2018.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN&diff=42380Annotating Object Instances with a Polygon RNN2018-12-11T00:26:39Z<p>Msminhas: Editorial</p>
<hr />
<div>Summary of the CVPR '17 best [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf ''paper'']<br />
<br />
A presentation video of the paper is available [https://www.youtube.com/watch?v=S1UUR4FlJ84 here].<br />
<br />
= Background =<br />
<br />
If a snapshot of an image is given to a human, how will he/she describe a scene? He/she might identify that there is a car parked near the curb, or that the car is parked right beside a street light. This ability to decompose objects in scenes into separate entities is key to understanding what is around us and it helps to reason about the behavior of objects in the scene.<br />
<br />
Automating this process is a classic computer vision problem and is often termed "object detection". There are four distinct levels of detection (refer to Figure 1 for a visual cue):<br />
<br />
1. Classification + Localization: This is the most basic method; it detects whether '''an''' object is present or absent in the image and then identifies the position of the object within the image in the form of a bounding box overlaid on the image.<br />
<br />
2. Object Detection: The classic definition of object detection refers to the detection and localization of '''multiple''' objects of interest in the image. The output of the detection is still a bounding box overlaid on the image at the position corresponding to the location of each object in the image.<br />
<br />
3. Semantic Segmentation: This is a pixel-level approach, i.e., each pixel in the image is assigned a category label. Instances are not distinguished: the output states, for example, that objects from three distinct categories are present in the image, without tracking or reporting the number of instances within each category. <br />
<br />
4. Instance Segmentation (''This paper performs this''): The goal is not only to assign pixel-level categorical labels, but also to identify each entity separately as sheep 1, sheep 2, sheep 3, grass, and so on.<br />
<br />
[[File:Figure_1.jpeg | 450px|thumb|center|Figure 1: Different levels of detection in an image.]]<br />
<br />
<br />
== Motivation ==<br />
<br />
Semantic segmentation helps us achieve a deeper understanding of images than image classification or object detection. Over and above this, instance segmentation is crucial in applications where multiple objects of the same category are to be tracked, especially in autonomous driving, mobile robotics, and medical image processing. This paper deals with a novel method to tackle the instance segmentation problem pertaining specifically to the field of autonomous driving, but shown to generalize well in other fields such as medical image processing.<br />
A polygon is a natural form of annotation. Current instance segmentation datasets annotated by humans use polygons because a polygon is a sparse representation: it describes an object with a small number of vertices instead of many pixels, and it makes it easy to incorporate user modifications.<br />
<br />
[[File:polygon.png|600px|center]]<br />
<br />
== Goal ==<br />
<br />
Most of the recent approaches to instance segmentation are based on deep neural networks and have demonstrated impressive performance. Given that these approaches require a lot of computational resources and that their performance depends on the amount of accessible training data, there has been an increase in the demand to label/annotate large-scale datasets. This is both expensive and time-consuming. <br />
<br />
{| class=wikitable width=700 align=center<br />
|Thus, the '''main goal''' of the paper is to enable '''semi-automatic''' annotation of object instances.<br />
|}<br />
<br />
Figure 2 shows what the interface looks like.<br />
<br />
Most of the datasets available pass through a stage where annotators manually outline the objects with a closed polygon. Polygons allow annotation of objects with a small number of clicks (30 - 40) compared to other methods. This approach works as the silhouette of an object is typically connected without holes. <br />
<br />
{| class=wikitable width=900 align=center<br />
|Thus, the authors adopt this same technique of annotating images using polygons, except that they aim to automate the method and replace or reduce manual labeling. The '''intuition''' behind the success of this method is the '''sparse''' nature of these polygons, which allows an object to be annotated through a cluster of pixels rather than through classification at the pixel level.<br />
|}<br />
<br />
[[File:Annotating Object Instances Example.png | 450px|thumb|center|Figure 2: Given a bounding box, a polygon outlining the object instance inside the box is predicted. This approach is designed to facilitate annotation, and easily incorporates user corrections of points to improve the overall object’s polygon. ]]<br />
<br />
<br />
= Related Works =<br />
<br />
Some of the techniques used in semi-automatic annotation are as follows:<br />
<br />
1. '''GrabCut''': In general, GrabCut is a method to separate the foreground and background of an image with minimal user interaction. Specifically, the user needs to only create a rectangular bounding box containing the foreground, and the algorithm will extract the object in the foreground. A major contribution of the paper is that labeling (of the object in the foreground) was not required, as the algorithm was able to identify where significant changes in colour pattern occurred. In this sense, it mimics automatic segmentation when combined with a Region Proposal Network. <br />
<br />
[[File:GrabCut_Example.png | 450px|thumb|center|Figure 3: Illustration of GrabCut.]]<br />
<br />
2. '''GrabCut + CNN''': Scribbles have also been used to train CNNs for semantic image segmentation. <br />
<br />
3. '''Superpixels''': Superpixels in the form of small polygons where the color intensity within each superpixel is similar, to a certain threshold, have been used to provide a sparse representation of large number of pixels in an image. However, the performance of this technique depends on the scale of the superpixels and hence sometimes merges small objects.<br />
<br />
[[File:Superpixel_idea.jpg | 450px|thumb|center|Figure 4: Illustration of the superpixel idea.]]<br />
<br />
= Model =<br />
<br />
As an '''input''' to the model, an annotator or perhaps another neural network provides a bounding box containing an object of interest and the model auto-generates a polygon outlining the object instance using a Recurrent Neural Network which they call: Polygon-RNN.<br />
<br />
The RNN model predicts the vertices of the polygon at each time step given a CNN representation of the image, the vertices from the last two time steps, and the first vertex location. The first vertex is predicted differently, as described below. The information regarding the previous two time steps helps the RNN create a polygon in a specific direction, and the first vertex provides a cue for loop closure of the polygon edges.<br />
<br />
The polygon is parametrized as a sequence of 2D vertices and it is assumed that the polygon is closed. Since a polygon is a cyclic structure, there are multiple ways to write it as a sequence; the generation is therefore fixed to follow a clockwise orientation. The starting point of the sequence, however, can be any of the vertices of the polygon.<br />
<br />
== Architecture ==<br />
<br />
There are two primary networks at play: 1. CNN with skip connections, and 2. One-to-many type RNN.<br />
<br />
[[File:Figure_2_Neel.JPG | 800px|thumb|center|Figure 5: Model architecture for Polygon-RNN depicting a CNN with skip connections feeding into a 2 layer ConvLSTM (One-to-many type) ('''Note''': A possible point of confusion - the authors have only shown the layers of VGG16 architecture here that have the skip connections introduced).]]<br />
<br />
1. '''CNN with skip connections''':<br />
<br />
The authors have adopted the VGG16 feature extractor architecture with a few modifications so that features from several layers are preserved and fused together in a tensor that can feed into the RNN (refer to Figure 5). Namely, the last max-pooling layer (''pool5'') present in the VGG16 CNN has been removed. The image fed into the CNN is resized to a 224x224x3 tensor (3 being the red, green, and blue channels). The image passes through 2 pooling layers and 2 convolutional layers. Since the features extracted at each of these four steps are to be preserved and fused later on, the idea is to have a tensor with a common width of 512; so the output tensor at pool2 is convolved with 4 3x3x128 filters and the output tensor at pool3 is convolved with 2 3x3x256 filters. The skip connections from the four layers allow the CNN to extract low-level edge and corner features (which help to follow the object's boundaries) as well as boundary/semantic information about the instances (which helps to identify the object). Finally, a 3x3 convolution followed by a ReLU non-linearity results in a 28x28x128 tensor that contains semantic information pertinent to the image frame and is taken as an input by the RNN.<br />
<br />
2. '''RNN - 2 Layer ConvLSTM'''<br />
<br />
The RNN is employed to capture information about the previous vertices in the time series. Specifically, a Convolutional LSTM is used as a decoder. The ConvLSTM preserves the 2D spatial information received from the CNN and reduces the number of parameters compared to a fully connected RNN. The polygon is modeled with a kernel size of 3x3 and 16 channels, outputting a vertex at each time step. At time step t, the ConvLSTM takes as input a tensor that concatenates 4 features: the CNN feature representation of the image, the one-hot encoding of the previously predicted vertex, the one-hot encoding of the vertex predicted two time steps ago, and the one-hot encoding of the first predicted vertex. <br />
<br />
The Convolutional LSTM computes the hidden state <math display = "inline">h_t</math> given the input <math display = "inline">x_t</math> based on the following equations:<br />
<center><br />
<math display="block"><br />
\begin{pmatrix}<br />
i_t \\<br />
f_t \\<br />
o_t \\<br />
g_t \\<br />
\end{pmatrix}<br />
= W_h * h_{t-1} + W_x * x_t + b<br />
</math><br />
<br />
<math display="block"><br />
c_t = \sigma(f_t) \odot c_{t-1} + \sigma(i_t) \odot \tanh(g_t)<br />
</math><br />
<br />
<math display="block"><br />
h_t = \sigma(o_t) \odot \tanh(c_t)<br />
</math><br />
</center><br />
where <math display = "inline">i, f, o</math> denote the input, forget, and output gates, <math display = "inline">g</math> is the candidate cell update, <math display = "inline">h</math> is the hidden state and <math display = "inline">c</math> is the cell state. Also, <math display = "inline">\sigma</math> denotes the sigmoid function, <math display = "inline">\odot</math> indicates an element-wise product and <math display = "inline">*</math> a convolution. <math display = "inline">W_h</math> denotes the hidden-to-state convolution kernel and <math display = "inline">W_x</math> the input-to-state convolution kernel.<br />
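<br />
A single ConvLSTM step following these equations can be sketched in plain NumPy. This is an illustrative sketch rather than the authors' implementation: the helper names are hypothetical, the convolution is a stride-1, zero-padded cross-correlation as is usual in deep learning libraries, and the four gate pre-activations are assumed to be stacked along the channel axis in the order <math display = "inline">i, f, o, g</math>.<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k; zero padding, stride 1."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # shifted view of the input
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out

def convlstm_step(x_t, h_prev, c_prev, w_x, w_h, b):
    """One step: gates = W_h * h_{t-1} + W_x * x_t + b, then the cell and hidden updates."""
    gates = conv2d_same(h_prev, w_h) + conv2d_same(x_t, w_x) + b[:, None, None]
    i_t, f_t, o_t, g_t = np.split(gates, 4, axis=0)
    c_t = sigmoid(f_t) * c_prev + sigmoid(i_t) * np.tanh(g_t)
    h_t = sigmoid(o_t) * np.tanh(c_t)
    return h_t, c_t
```

Here <code>w_x</code> has shape (4·hidden, input channels, k, k), <code>w_h</code> has shape (4·hidden, hidden, k, k) and <code>b</code> has length 4·hidden, so that the stacked gates can be split into four equal parts.<br />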
<br />
The authors treat vertex prediction as a classification task: the location of each vertex is given by a one-hot representation of dimension DxD + 1 (D is chosen to be 28 by the authors in tests). The one additional dimension is the cue for loop closure of the polygon (an end-of-sequence token). Because the one-hot representations of the two previously predicted vertices and of the first vertex are taken as input, a clockwise direction (or, for that matter, any fixed direction) can be enforced when creating the polygon. Coming back to the prediction of the first vertex: since a polygon is a closed cycle, any of its vertices can be used as the starting point. The authors therefore treat the starting point as special and predict it through a further modification of the CNN, adding two DxD layers: one branch predicts object instance boundaries, while the other takes this output together with the image features to predict the vertices of the polygon. Boundary and vertex prediction are each treated as a binary classification problem in every cell of the output grid. This CNN is trained separately. Here, <math display = "inline">y_t</math> denotes the one-hot encoding of the vertex and is the output at time step <math>t</math>.<br />
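<br />
The DxD + 1 encoding can be illustrated with a small sketch. The function names and the row-major layout of the grid are assumptions made for illustration; only the sizes (D = 28, one extra slot for loop closure) come from the text above.<br />

```python
import numpy as np

D = 28  # grid resolution used by the authors

def encode_vertex(row, col, d=D):
    """One-hot vector of length d*d + 1 for a vertex at grid cell (row, col)."""
    v = np.zeros(d * d + 1)
    v[row * d + col] = 1.0
    return v

def encode_end(d=D):
    """One-hot vector whose extra last slot signals that the polygon is closed."""
    v = np.zeros(d * d + 1)
    v[-1] = 1.0
    return v

def decode(v, d=D):
    """Return (row, col) of the most likely cell, or None for the loop-closure slot."""
    idx = int(np.argmax(v))
    if idx == d * d:
        return None
    return divmod(idx, d)
```

Decoding the network output at each time step until <code>decode</code> returns <code>None</code> yields the ordered list of polygon vertices.<br />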
<br />
== Training ==<br />
<br />
The training of the model is done as follows:<br />
<br />
1. Cross-entropy is used for the RNN loss function. To avoid over-penalizing mispredictions that are close to the ground-truth vertex, non-zero probability mass is assigned to locations that are within a distance of 2 in the D × D output grid.<br />
<br />
2. The typical training regime, in which the model makes a prediction at each time step but is fed the ground-truth vertex at the next step, is followed. Instead of stochastic gradient descent, Adam is used for optimization, with batch size = 8 and learning rate = 1e-4 (the learning rate decays by a factor of 10 after 10 epochs). This choice of optimizer makes development easier, but switching back to SGD may give better experimental results because of convergence problems of Adam.<br />
<br />
3. For the first vertex prediction, the modified CNN mentioned previously, is trained using a multi-task cost function. In particular, the authors used the logistic loss for every location in the grid.<br />
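<br />
The smoothed target from item 1 can be sketched as follows. The summary does not state which distance or which weights are used, so the chessboard distance and the 1/(1 + distance) weighting below are assumptions, as are the function names; the extra loop-closure slot is omitted for brevity.<br />

```python
import numpy as np

def smoothed_target(row, col, d=28, radius=2):
    """Target distribution on the d x d grid with mass on cells near the true vertex."""
    rows, cols = np.mgrid[0:d, 0:d]
    dist = np.maximum(np.abs(rows - row), np.abs(cols - col))  # chessboard distance (assumed)
    weights = np.where(dist <= radius, 1.0 / (1.0 + dist), 0.0)  # hypothetical weighting
    return weights / weights.sum()

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return float(-(target * np.log(predicted + eps)).sum())
```

With such a target, a prediction one cell away from the ground-truth vertex still receives some credit, whereas a plain one-hot target would penalize it as heavily as a prediction on the other side of the grid.<br />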
<br />
The reported time for training is one day on a Nvidia Titan-X GPU.<br />
<br />
The resolution of the polygon is 28 x 28, based on the downsampling factor and the ConvLSTM resolution. The authors simplified each polygon by removing redundant vertices that lie on a grid line and duplicate vertices that fall in the same grid cell. For data augmentation, they also randomly flipped images, enlarged the original bounding boxes and randomly selected the starting vertex of the polygon annotation.<br />
<br />
== Importance of Human Annotator in the Loop ==<br />
<br />
The model allows for the prediction at a given time step to be corrected and this corrected vertex is then fed into the next time step of the RNN, effectively rejecting the network predicted vertex. This has the simple effect of putting the model "back on the right track". Note that this is only possible due to the adoption of the RNN architecture i.e. the inherent nature of the RNN to accept previous outputs allows incorporation of the user's judgement. The typical inference time as quoted by the paper is 250ms per object.<br />
<br />
= Results =<br />
<br />
== Evaluation Metrics ==<br />
<br />
The evaluation of the model performance was conducted based on the Cityscapes and KITTI Datasets. There are two metrics used for evaluation:<br />
<br />
1. '''IoU''': The standard Intersection over Union (IoU) measure is used for comparison. The calculation of IoU takes both the predicted and ground-truth object boundaries. The intersection (the area contained in both boundaries at once) is divided by the union (the area contained in at least one of the boundaries). A low score means that there is little overlap between the boundaries, or large areas of non-overlap, and a score of 1.0 indicates that the two boundaries contain exactly the same area.<br />
<br />
An example of the IoU is illustrated in the figure below:<br />
<br />
[[File:IoU_figure.png|500px|center]] Source:https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/<br />
<br />
2. '''Number of Clicks''': To evaluate the speed-up factor, the chessboard distance is used to measure the distance between the ground truth (GT) and the output of the Polygon-RNN. A set of distance thresholds <math display = "inline">T \in \{1,2,3,4\}</math> is used: if the distance exceeds the given threshold, the annotator corrects the vertex to match the GT. The resulting '''Number of Clicks''' is used to evaluate the speed-up factor.<br />
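<br />
Both metrics can be sketched briefly. The function names are hypothetical, IoU is computed here on binary masks rather than on polygons, and returning 1.0 for two empty masks is a convention chosen for this sketch.<br />

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention for this sketch: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)

def chessboard_distance(p, q):
    """Chessboard (Chebyshev) distance between two grid points (row, col)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def needs_correction(pred_vertex, gt_vertex, threshold):
    """A simulated annotator clicks when the prediction is further than the threshold."""
    return chessboard_distance(pred_vertex, gt_vertex) > threshold
```

Counting how often <code>needs_correction</code> returns true over a polygon gives the simulated number of clicks for a given threshold.<br />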
<br />
== Baseline Techniques ==<br />
<br />
1. '''SharpMask''': a 50 layer ResNet considered as the state of the art annotation method.<br />
<br />
2. '''DeepMask''': a build-up on the 50 layer ResNet with an addition of another CNN.<br />
<br />
3. '''Dilation10''': another simple technique using purely convolutional operations.<br />
<br />
4. '''SquareBox''': a simple technique where an entire bounding box is labeled as an object.<br />
<br />
== Quantitative Results ==<br />
<br />
The IoU metric is reported in Table 1. The Polygon-RNN method outperforms the baselines in 6 out of the 8 categories and has a mean IoU greater than all of the baselines. In particular, in the car, person, and rider categories, a 12%, 7%, and 6% higher performance than SharpMask is achieved.<br />
<br />
[[File:Table_1_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]<br />
<br />
In addition, with the help of the annotator, the speedup factor was 7.3 times with under 5 clicks which the authors claim is the main advantage of this method.<br />
<br />
[[File:Table_0_Neel.JPG | 800px|thumb|center|Table 2: IoU performance on Cityscapes data with annotator intervention.]]<br />
<br />
The method also works well with other datasets such as KITTI:<br />
<br />
[[File:Table_2_Neel.JPG | 800px|thumb|center|Table 3: IoU performance on KITTI data.]]<br />
<br />
== Effect of object size ==<br />
Fig. 4 shows how the model performs with respect to the baselines on different instance sizes. For small instances, the model performs significantly better than the baselines. For larger objects, the baselines have an advantage due to their larger output resolution. <br />
<br />
[[File:IoU_vs_size_of_instance.PNG | 500px|thumb|center|Fig 4: IoU_vs_size_of_instance.]]<br />
<br />
== Qualitative Results ==<br />
<br />
In addition, most of the comparisons with human annotators show that the method is on par with human-level annotation.<br />
<br />
<gallery widths=500px heights=500px perrow=2 mode="packed"><br />
File:Figure_3_Neel.JPG|Figure 6: Qualitative results: comparison with human annotator.|alt=alt language<br />
File:Figure_4_Neel.JPG|Figure 7: Qualitative results: comparison with human annotator.|alt=alt language<br />
</gallery><br />
<br />
=Conclusion=<br />
<br />
The important conclusions from this paper are:<br />
<br />
1. The paper presented a powerful generic annotation tool for modelling complex annotations as a simple polygon that works on different unseen datasets. <br />
<br />
2. Significant improvement in annotation time can be achieved with the Polygon-RNN method itself (speed-up factor of 4.74).<br />
<br />
3. However, the flexibility of having inputs from a human annotator helps increase the IoU for a certain range of clicks.<br />
<br />
4. The model architecture has a down-sampling factor of 16, and the final output resolution and accuracy are sensitive to object size.<br />
<br />
5. Another downside of the model architecture is that training time is increased due to the training of the CNN for the first vertex.<br />
<br />
=Critique=<br />
<br />
1. With the human annotator in the loop, the model speeds up the process of annotation by over 7 times which is perhaps a big cost and time cutting improvement for companies.<br />
<br />
2. Given that this model uses the VGG16 architecture compared to the 50 layer ResNet in SharpMask, this method is quite efficient.<br />
<br />
3. This paper requires training of an entire CNN for the first vertex and is inefficient in that sense as it introduces additional parameters adding to the computation time and resource demand.<br />
<br />
4. The baseline methods have the upper hand over this model when it comes to larger objects, because of the down-scaled representation adopted by this model.<br />
<br />
5. In terms of future work, elimination of the additional CNN for the first vertex as well as an enhanced architecture to remain insensitive to the size of the object to be annotated should be implemented.<br />
<br />
6. Compared to other models, the model was shown to not perform as well for larger objects (see Table 3). This is likely because vertex locations are determined in a highly compressed (28x28) representation compared to the input image (224x224). For larger objects the bounding boxes are larger, so each vertex represents many pixels. When the polygon is scaled back up to the input image or bounding box size, this may lead to errors, especially since a very precise evaluation metric (intersection over union) is used. Potentially, the results could be improved by using a higher resolution for the internal representation, or one that scales with the size of the bounding box.<br />
<br />
7. While the model outperforms the baseline for certain categories of object, it is surprising that it underperforms in categories such as 'bus' and 'train'. With human annotators in the loop, one would expect the model to outperform in all categories.<br />
<br />
8. One of the major contributions of this paper is that it presents a method with practical value in the real world. The paper shows that the method can greatly reduce human labeling effort and that, with human collaboration, the algorithm can tackle the image labeling problem much more efficiently. However, it does not provide a theoretical explanation of why an RNN would work better than a CNN in this case; a more in-depth analysis would make the paper better.<br />
<br />
=Code=<br />
# [https://github.com/AlexMa011/pytorch-polygon-rnn] (unofficial)<br />
# Code for an updated version of the model is available at [https://github.com/fidler-lab/polyrnn-pp] (official)</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach&diff=42379Learning to Teach2018-12-11T00:22:32Z<p>Msminhas: Editorial</p>
<hr />
<div><br />
<br />
=Introduction=<br />
<br />
This paper proposed the "learning to teach" (L2T) framework with two intelligent agents: a student model/agent, corresponding to the learner in traditional machine learning algorithms, and a teacher model/agent, determining the appropriate data, loss function, and hypothesis space to facilitate the learning of the student model.<br />
<br />
In modern human society, the role of teaching is heavily implicated in our education system; the goal is to equip students with the necessary knowledge and skills in an efficient manner. This is the fundamental ''student'' and ''teacher'' framework on which education stands. However, in the field of artificial intelligence (AI) and specifically machine learning, researchers have focused most of their efforts on the ''student'' (i.e., designing various optimization algorithms to enhance the learning ability of intelligent agents). The paper argues that a formal study on the role of ‘teaching’ in AI is required. Analogous to teaching in human society, the teaching framework can: select training data that corresponds to the appropriate teaching materials (e.g. textbooks selected for the right difficulty), design loss functions that correspond to targeted examinations, and define the hypothesis space that corresponds to imparting the proper methodologies. Furthermore, an optimization framework (instead of heuristics) should be used to update the teaching skills based on the feedback from students, so as to achieve teacher-student co-evolution.<br />
<br />
Thus, the training phase of L2T consists of several episodes of interactions between the teacher and the student model. Based on the state information at each step, the teacher model updates its teaching actions so that the student model can perform better on the machine learning problem. The student model then provides reward signals back to the teacher model, which uses them to update its parameters as part of a reinforcement learning process. In this paper, a policy gradient algorithm is used. This process is end-to-end trainable, and the authors are convinced that once it has converged, the teacher model can be applied to new learning scenarios and even new students without extra re-training effort.<br />
<br />
To demonstrate the practical value of the proposed approach, the '''training data scheduling''' problem is chosen as an example. The authors show that by using the proposed method to adaptively select the most suitable training data, they can significantly improve the accuracy and convergence speed of various neural networks, including multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), for different applications including image classification and text understanding.<br />
Furthermore, the teacher model obtained from one task can be smoothly transferred to other tasks. As an example, with the teacher model trained on MNIST with the MLP learner, one can achieve satisfactory performance on CIFAR-10 using only roughly half of the training data to train a ResNet model as the student.<br />
<br />
=Related Work=<br />
The L2T framework connects with two emerging trends in machine learning. The first is the movement from simple to advanced learning. This includes meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012) which explores automatic learning by transferring learned knowledge from meta tasks [1]. This approach has been applied to few-shot learning scenarios and in designing general optimizers and neural network architectures. (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li & Malik, 2016; Zoph & Le, 2017)<br />
<br />
The second is teaching, which can be classified into either machine teaching (Zhu, 2015) [2] or hardness-based methods. The former seeks to construct a minimal training set for the student to learn a target model (i.e., an oracle). The latter assumes that ordering the data from easy instances to hard ones is beneficial to the learning process, with hardness determined in different ways. Curriculum learning (CL) (Bengio et al., 2009; Spitkovsky et al., 2010; Tsvetkov et al., 2016) [3] measures hardness through a heuristic understanding of the data, while self-paced learning (SPL) (Kumar et al., 2010; Lee & Grauman, 2011; Jiang et al., 2014; Supancic & Ramanan, 2013) [4] measures hardness by the loss on the data. Another teaching method called 'pedagogical teaching', with applications in inverse reinforcement learning, is quite close in setup to the method proposed in this paper. The similarity can be observed in the way the teacher adjusts its behaviour to facilitate student learning and in the way the teacher communicates with the student. <br />
<br />
The limitations of these works include the lack of a formal definition of the teaching problem, whereas the learning problem has a formal mathematical definition. This makes it difficult to differentiate between teaching and learning problems. Other limitations are the reliance on heuristics and fixed rules, which hinders generalization of the teaching task.<br />
<br />
=Learning to Teach=<br />
To introduce the problem and framework, without loss of generality, consider the setting of supervised learning.<br />
<br />
In supervised learning, each sample <math>x</math> is from a fixed but unknown distribution <math>P(x)</math>, and the corresponding label <math> y </math> is from a fixed but unknown distribution <math>P(y|x) </math>. The goal is to find a function <math>f_\omega(x)</math> with parameter vector <math>\omega</math> that minimizes the gap between the predicted label and the actual label.<br />
<br />
<br />
<br />
==Problem Definition==<br />
In supervised learning, the goal is to choose a function <math display="inline">f_\omega(x)</math>, with <math display="inline">\omega</math> as the parameter vector, that predicts the supervisor's label as well as possible. The goodness of a function <math display="inline">f_\omega</math> is evaluated by the risk function: <br />
<br />
\begin{align*}R(\omega) = \int \mathcal{M}(y, f_\omega(x))\,dP(x,y)\end{align*}<br />
<br />
where <math display="inline">\mathcal{M}(\cdot,\cdot)</math> is the metric which evaluates the gap between the label and the prediction.<br />
<br />
The student model, denoted <math>\mu(\cdot)</math>, takes the set of training data <math> D </math>, the function class <math> \Omega </math>, and the loss function <math> L </math> as input and outputs a function <math> f_{\omega^*} </math>, with the parameter <math>\omega^*</math> chosen to minimize the risk <math>R(\omega)</math>, as in:<br />
<br />
\begin{align*}<br />
\omega^* = \arg\min_{\omega \in \Omega} \sum_{(x,y) \in D} L(y, f_\omega(x)) =: \mu (D, L, \Omega)<br />
\end{align*}<br />
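<br />
As a concrete illustration of <math>\mu(D, L, \Omega)</math>, the sketch below fits a one-parameter model <math>f_\omega(x) = \omega x</math> by projected gradient descent on the summed loss. The function name <code>student_mu</code>, the squared loss, the interval standing in for <math>\Omega</math>, and the toy data are all assumptions made for illustration; none of them come from the paper.<br />
<br />
```python
def student_mu(data, loss_grad, omega_range, lr=0.01, steps=500):
    """Toy student mu(D, L, Omega): minimize the sum of L(y, f_omega(x)) over D.

    Assumptions (illustrative only): f_omega(x) = omega * x, and the
    hypothesis space Omega is the closed interval omega_range.
    loss_grad(y, pred) must return dL/dpred.
    """
    low, high = omega_range
    omega = 0.0
    for _ in range(steps):
        # Chain rule: dL/domega = dL/dpred * x, summed over the training set.
        grad = sum(loss_grad(y, omega * x) * x for x, y in data)
        omega -= lr * grad
        # Project back onto the hypothesis space Omega.
        omega = min(max(omega, low), high)
    return omega


def squared_loss_grad(y, pred):
    # Derivative of (pred - y)^2 with respect to pred.
    return 2.0 * (pred - y)


D = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data generated by y = 2x
omega_star = student_mu(D, squared_loss_grad, omega_range=(-10.0, 10.0))
```
<br />
Shrinking <code>omega_range</code> shows how the choice of <math>\Omega</math> changes the solution the student can reach, which is one of the three levers the teacher controls.<br />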
<br />
The teaching model, denoted <math> \phi </math>, tries to provide <math> D </math>, <math> L </math>, and <math> Ω </math> (or any combination, denoted <math> A </math>) to the student model such that the student model either achieves lower risk <math>R(\omega)</math> or progresses as fast as possible.<br />
In contrast to traditional machine learning, which is only concerned with the student model, the learning to teach framework is also concerned with a teacher model, which tries to provide appropriate inputs to the student model so that it can achieve a low risk functional as efficiently as possible. The teacher can act on the following:<br />
<br />
<br />
::'''Training Data''': Outputting a good training set <math> D \in \mathcal{D} </math>, where <math>\mathcal{D}</math> is the Borel set on the input space and label space. This is analogous to human teachers providing students with proper learning materials such as textbooks. <br />
::'''Loss Function''': Designing a good loss function <math> L \in \mathcal{L} </math>, where <math>\mathcal{L}</math> is the set of all possible loss functions. This is analogous to providing useful assessment criteria for students.<br />
::'''Hypothesis Space''': Defining a good function class <math> Ω \in \mathcal{W}</math>, where <math>\mathcal{W}</math> is the set of possible hypothesis spaces, which the student model can select from. This is analogous to human teachers providing appropriate context, e.g. middle school students are taught math with basic algebra while undergraduate students are taught with calculus. Different choices of Ω lead to different errors and optimization problems (Mohri et al., 2012).<br />
<br />
==Framework==<br />
The training phase consists of the teacher providing the student with the subset <math> A_{train} </math> of <math> A </math> and then taking feedback to improve its own parameters. After the training process converges, the teacher model can be used to teach either new student models or the same student models in new learning scenarios, such as when another subset <math> A_{test} </math> is provided. Such a generalization is feasible as long as the state representations <math> S </math> are the same across different student models and different scenarios. The L2T process is outlined in the figure below:<br />
<br />
[[File: L2T_process.png | 500px|center]]<br />
<br />
* <math> s_t \in S </math> represents the information available to the teacher model at time <math> t </math>. <math> s_t </math> is typically constructed from the current student model <math> f_{t-1} </math> and the past teaching history of the teacher model. <math> S </math> represents the set of states.<br />
* <math> a_t \in A </math> represents the action taken by the teacher model at time <math> t </math>, given state <math>s_t</math>. <math> A </math> represents the set of actions, where the action(s) can be any combination of teaching tasks involving the training data, loss function, and hypothesis space. <br />
* <math> \phi_\theta : S \to A </math> is the policy used by the teacher model to generate its action <math> \phi_\theta(s_t) = a_t </math>.<br />
* The student model takes <math> a_t </math> as input and outputs the function <math> f_t </math> by using conventional ML techniques.<br />
<br />
Mathematically, taking data teaching as an example where <math>L</math> and <math>\Omega</math> are fixed, the objective of the teacher in the L2T framework is <br />
<br />
<center> <math>\max\limits_{\theta}{\sum\limits_{t}{r_t}} = \max\limits_{\theta}{\sum\limits_{t}{r(f_t)}} = \max\limits_{\theta}{\sum\limits_{t}{r(\mu(\phi_{\theta}(s_t), L, \Omega))}}</math> </center><br />
<br />
Once the training process converges, the teacher model may be utilized to teach a different subset of <math> A </math> or teach a different student model.<br />
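<br />
The interaction described above can be sketched as a single episode of data teaching. Everything named here is hypothetical and chosen only to keep the example runnable: <code>MeanStudent</code> is a stand-in student that estimates a scalar mean, the policy returns a keep probability per instance, and the reward is the negative estimation error. The paper's actual student is a neural network trained with mini-batch SGD.<br />
<br />
```python
import random


class MeanStudent:
    """Toy student (an assumption): estimates a scalar mean from its data."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def state(self, t):
        # s_t: built from the current student model and the teaching history.
        return {"iteration": t, "estimate": self.predict(), "seen": self.count}

    def predict(self):
        return self.total / self.count if self.count else 0.0

    def update(self, selected):
        # Stand-in for a conventional learning step on the selected data.
        for value in selected:
            self.total += value
            self.count += 1


def run_episode(policy, student, batches, reward_fn, rng):
    """One episode: the teacher picks data, the student learns, a reward follows."""
    trajectory, rewards = [], []
    for t, batch in enumerate(batches):
        state = student.state(t)                                    # s_t
        action = [rng.random() < policy(state, x) for x in batch]   # a_t
        student.update([x for x, keep in zip(batch, action) if keep])  # f_t
        rewards.append(reward_fn(student))                          # r_t = r(f_t)
        trajectory.append((state, action))
    return trajectory, rewards


def keep_everything(state, instance):
    return 1.0  # probability of keeping the instance


batches = [[1.0, 2.0], [3.0, 4.0]]
student = MeanStudent()
trajectory, rewards = run_episode(
    keep_everything, student, batches,
    reward_fn=lambda s: -abs(s.predict() - 2.5),
    rng=random.Random(0),
)
```
<br />
The recorded <code>trajectory</code> of (state, action) pairs together with the rewards is what a policy gradient update of the teacher would consume.<br />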
<br />
=Application=<br />
<br />
There are different approaches to training the teacher model; this paper applies reinforcement learning, with <math> φ_θ </math> being the ''policy'' that interacts with <math> S </math>, the ''environment''. The paper applies data teaching to train a deep neural network student, <math> f </math>, for several classification tasks. Thus the student feedback measure is classification accuracy. Its learning rule is mini-batch stochastic gradient descent, where batches of data arrive sequentially in random order. The teacher model is responsible for providing the training data, which in this case means it must determine which instances (subset) of the mini-batch of data will be fed to the student. In order to reach convergence faster, the reward is tied to how quickly the student model learns. <br />
<br />
The authors also designed a state feature vector <math> g(s) </math> to efficiently represent the current state, which includes the arrived training data and the student model. The state features fall into three categories: data features, student model features, and features that combine the data and the learner model. The state features are computed each time a mini-batch of data arrives.<br />
<br />
Data features contain information about a data instance, such as its label category, (for texts) the length of the sentence, linguistic features for text segments (Tsvetkov et al., 2016), or (for images) gradient histogram features (Dalal & Triggs, 2005).<br />
<br />
Student model features include signals reflecting how well the current neural network is trained. The authors collect several simple features, such as the number of mini-batches passed (i.e., the iteration), the average historical training loss, and the historical validation accuracy.<br />
<br />
Some additional features are collected to represent the combination of both data and learner model. By using these features, the authors aim to represent how important the arrived training data is for the current learner. The authors mainly use three such signals in their classification tasks: 1) the predicted probabilities of each class; 2) the loss value on that data, which appears frequently in self-paced learning (Kumar et al., 2010; Jiang et al., 2014a; Sachan & Xing, 2016); 3) the margin value.<br />
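<br />
A minimal sketch of how such a feature vector could be assembled for one training instance is given below. The function name, the feature ordering, the use of cross-entropy as the loss, and the margin definition (probability of the true class minus the largest other class probability) are assumptions for illustration; the summary above only names the feature categories.<br />
<br />
```python
import math


def state_features(probs, label, iteration, max_iterations,
                   avg_train_loss, best_val_acc):
    """Build a toy g(s) for one instance (layout is an assumption).

    probs: predicted class probabilities from the current student.
    label: index of the true class.
    """
    # Data feature: one-hot encoding of the label category.
    one_hot = [1.0 if i == label else 0.0 for i in range(len(probs))]

    # Student model features: training progress and historical performance.
    model_feats = [iteration / max_iterations, avg_train_loss, best_val_acc]

    # Combined features: class probabilities, loss on this instance, margin.
    p_true = probs[label]
    loss = -math.log(p_true)  # cross-entropy, assumed as the loss
    runner_up = max(p for i, p in enumerate(probs) if i != label)
    margin = p_true - runner_up  # assumed margin definition

    return one_hot + model_feats + list(probs) + [loss, margin]


g = state_features([0.7, 0.2, 0.1], label=0, iteration=50,
                   max_iterations=100, avg_train_loss=0.9, best_val_acc=0.6)
```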
<br />
The objective for training the teacher model is to maximize the expected reward: <br />
<br />
\begin{align} <br />
J(\theta) = E_{\phi_\theta(a|s)}[R(s,a)]<br />
\end{align}<br />
<br />
This objective is non-differentiable w.r.t. <math> \theta </math>, thus a likelihood ratio policy gradient algorithm is used to optimize <math> J(\theta) </math> (Williams, 1992) [4]. The estimation is based on the gradient <math>\nabla_{\theta} J(\theta) = \sum_{t=1}^{T}E_{\phi_{\theta}(a_t|s_t)}[\nabla_{\theta}\log(\phi_{\theta}(a_t|s_t))R(s_t, a_t)]</math>, which is empirically estimated as <math>\sum_{t=1}^{T} \nabla_{\theta}\log(\phi_{\theta}(a_t|s_t))v_t</math>. Here <math>v_t</math> is the sampled estimate of the reward <math>R(s_t, a_t)</math> from one execution of the policy. Given that the only reward is the terminal reward <math>r_T</math>, we have <math>\nabla_{\theta} J(\theta) \approx \sum_{t=1}^{T} \nabla_{\theta}\log(\phi_{\theta}(a_t|s_t))r_T</math>.<br />
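<br />
The terminal-reward estimate above can be sketched for a simple Bernoulli keep/filter policy. The logistic-linear form <math>\phi_\theta(a=1|s) = \sigma(\theta^\top g(s))</math> used here is an assumption chosen because its score function has the closed form <math>(a - p)\,g(s)</math>; the summary does not specify the teacher's parameterization, and the helper names are hypothetical.<br />
<br />
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def keep_probability(theta, features):
    # Assumed policy: phi_theta(a=1 | s) = sigmoid(theta . g(s)).
    return sigmoid(sum(w * f for w, f in zip(theta, features)))


def reinforce_gradient(theta, episode, terminal_reward):
    """Estimate sum_t grad log phi_theta(a_t | s_t) * r_T from one episode.

    episode: list of (features, action) pairs with action in {0, 1}.
    """
    grad = [0.0] * len(theta)
    for features, action in episode:
        p = keep_probability(theta, features)
        # For a Bernoulli-logistic policy: grad log phi(a|s) = (a - p) * g(s).
        for i, f in enumerate(features):
            grad[i] += (action - p) * f * terminal_reward
    return grad


def ascent_step(theta, grad, lr=0.1):
    # Gradient ascent, since the teacher maximizes the expected reward.
    return [w + lr * g for w, g in zip(theta, grad)]


theta = [0.0, 0.0]
episode = [([1.0, 0.5], 1), ([1.0, -0.5], 0)]
grad = reinforce_gradient(theta, episode, terminal_reward=2.0)
theta = ascent_step(theta, grad)
```
<br />
In this toy episode the first feature contributes equally to a kept and a filtered instance, so its gradient cancels, while the second feature is pushed in the direction that makes both sampled actions more likely.<br />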
<br />
==Experiments==<br />
<br />
The L2T framework is tested on the following student models: multi-layer perceptron (MLP), ResNet (CNN), and long short-term memory (LSTM) network (RNN). <br />
<br />
The student tasks are image classification on MNIST and CIFAR-10, and sentiment classification on the IMDB movie review dataset. <br />
<br />
The strategy will be benchmarked against the following teaching strategies:<br />
<br />
::'''NoTeach''': NoTeach removes the entire Teacher-Student paradigm and reverts back to the classical machine learning paradigm. In the context of data teaching, we consider the architecture fixed, and feed data in a pre-determined way. One would pre-define batch-size and cross-validation procedures as needed.<br />
::'''Self-Paced Learning (SPL)''': Teaching by ''hardness'' of data, defined as the loss. This strategy begins by filtering out data with larger loss values to train the student with "easy" data and gradually increases the hardness. Mathematically speaking, training data <math>d </math> satisfying <math>l(d) > \eta </math> are filtered out, where the threshold <math> \eta </math> grows during the training process. To improve the robustness of SPL, following a widely used trick in common SPL implementations (Jiang et al., 2014b), the authors filter training data using its loss rank within one mini-batch rather than the absolute loss value: they filter the data instances with the top <math>K </math> largest training loss values within an <math>M</math>-sized mini-batch, where <math>K</math> drops linearly from <math>M - 1 </math> to 0 during training.<br />
<br />
::'''L2T''': The Learning to Teach framework.<br />
::'''RandTeach''': Randomly filter data in each epoch according to the logged ratio of filtered data instances per epoch (as opposed to deliberate and dynamic filtering by L2T).<br />
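<br />
The rank-based SPL baseline described in the list above can be sketched as follows. The function names and the rounding of the linear schedule to an integer are assumptions; the source only states that <math>K</math> drops linearly from <math>M - 1</math> to 0.<br />
<br />
```python
def spl_num_filtered(step, total_steps, batch_size):
    """K: number of instances to filter, dropping linearly from M - 1 to 0."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Rounding to the nearest integer is an assumption of this sketch.
    return round((batch_size - 1) * (1.0 - progress))


def spl_select(losses, num_filtered):
    """Return the indices kept after dropping the num_filtered largest losses."""
    by_loss = sorted(range(len(losses)), key=lambda i: losses[i])
    kept = by_loss[: len(losses) - num_filtered]
    return sorted(kept)


# Early in training almost everything is filtered; at the end nothing is.
k_start = spl_num_filtered(step=0, total_steps=100, batch_size=8)
k_end = spl_num_filtered(step=100, total_steps=100, batch_size=8)
kept = spl_select([0.9, 0.1, 0.5, 0.3], num_filtered=2)
```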
<br />
For all teaching strategies, they make sure that the base neural network model will not be updated until <math>M </math> un-trained, yet selected data instances are accumulated. That is to guarantee that the convergence speed is only determined by the quality of taught data, not by different model updating frequencies. The model is implemented with Theano and run on one NVIDIA Tesla K40 GPU for each training/testing process.<br />
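<br />
One way to realize this accumulation rule is a small buffer that releases a batch only once <math>M</math> selected instances are available. The class below is a hypothetical sketch, not the authors' Theano implementation.<br />
<br />
```python
class SelectionBuffer:
    """Accumulate teacher-selected instances until M are available."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # M
        self.pending = []

    def add(self, selected):
        """Add selected instances; return the full batches ready for an update."""
        self.pending.extend(selected)
        ready = []
        while len(self.pending) >= self.batch_size:
            ready.append(self.pending[: self.batch_size])
            self.pending = self.pending[self.batch_size:]
        return ready


buf = SelectionBuffer(batch_size=4)
first = buf.add([1, 2, 3])   # not enough instances yet, no update
second = buf.add([4, 5])     # one full batch is released, 5 stays pending
```
<br />
Because every strategy then updates the student with the same number of instances per step, differences in convergence speed can be attributed to which instances were selected.<br />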
===Training a New Student===<br />
<br />
In the first set of experiments, the datasets are divided into two folds. The first fold is used to train the teacher; this is done by having the teacher train a student network on that half of the data, with a certain portion being used for computing rewards. After training, the teacher parameters are fixed and used to train a new student network (with the same structure) on the second half of the dataset. When teaching a new student with the same model architecture, L2T achieves significantly faster convergence than the other strategies across all tasks, especially compared to the NoTeach and RandTeach methods:<br />
<br />
[[File: L2T_speed.png | 1100px|center]]<br />
<br />
===Filtration Number===<br />
<br />
When investigating the details of filtered data instances per epoch, for the two image classification tasks, the L2T teacher filters an increasing amount of data as training goes on. The authors' intuition for the two image classification tasks is that the student model can learn from harder instances of data from the beginning, and thus the teacher can filter redundant data. In contrast, for the natural language task, the student model must first learn from easy data instances.<br />
<br />
[[File: L2T_fig3.png | 1100px|center]]<br />
<br />
===Teaching New Student with Different Model Architecture===<br />
<br />
In this part, a teacher model is first trained by interacting with a student model. The trained teacher model is then used to teach another student model that has a different architecture.<br />
The results of applying the teacher trained on ResNet32 to teach other architectures are shown below. L2T can be seen to obtain higher accuracies earlier than the SPL, RandTeach, or NoTeach strategies.<br />
<br />
[[File: L2T_fig4.png | 1100px|center]]<br />
<br />
===Training Time Analysis===<br />
<br />
The learning curves demonstrate the efficiency in accuracy achieved by the L2T over the other strategies. This is especially evident during the earlier training stages.<br />
<br />
[[File: L2T_fig5.png | 600px|center]]<br />
<br />
===Accuracy Improvement===<br />
<br />
When comparing training accuracy on the IMDB sentiment classification task, the L2T teaching policy improves over NoTeach and SPL.<br />
<br />
[[File: L2T_t1.png | 500px|center]]<br />
<br />
Table 1 shows that, besides boosting the convergence speed, the teacher model also improves final accuracy. The student model is the LSTM network trained on IMDB. Prior to teaching the student model, the authors train the teacher model on half of the training data and define the terminal reward as the accuracy on a held-out set after the teacher model trains the student for 15 epochs. The teacher model is then applied to train the student model on the full dataset until convergence. The state features are kept the same as those in previous experiments. L2T achieves better classification accuracy for training the LSTM network, surpassing the SPL baseline by more than 0.6 points (with p value < 0.001).<br />
<br />
=Future Work=<br />
<br />
Several directions for future work follow from this paper: <br />
<br />
1) Recent advances in multi-agent reinforcement learning could be tried on the reinforcement learning problem formulation of this paper. <br />
<br />
2) Some human-in-the-loop architectures like CHAT and HAT (https://www.ijcai.org/proceedings/2017/0422.pdf) might give better results for the same framework. <br />
<br />
3) It would be interesting to try out the framework suggested in this paper (L2T) in imperfect-information and partially observable settings. <br />
<br />
4) As the authors have focused on data teaching, exploring loss function teaching would be interesting.<br />
<br />
=Critique=<br />
<br />
While the conceptual framework of L2T is sound, the paper only experimentally demonstrates efficacy for ''data teaching'', which would seem to be the simplest to implement. The feasibility and effectiveness of teaching the loss function and hypothesis space are not explored in a real-world scenario. Also, this paper does not provide enough mathematical foundation to prove that this model can be generalized to other datasets and other general problems. The method presented here, where the teacher model filters data, does not seem to provide enough action space for the teacher model. Furthermore, the experimental results for data teaching suggest that the speed of convergence is the main improvement over other teaching strategies, whereas the difference in accuracy is less remarkable. The paper also assesses accuracy only by comparing L2T with NoTeach and SPL on the IMDB classification task; the improvement (or lack thereof) on the other classification tasks and teaching strategies is omitted. Again, this distinction is not possible to assess in loss function or hypothesis space teaching within the scope of this paper. The authors could have included larger datasets such as ImageNet and CIFAR-100 in their experiments, which would have provided some more insight.<br />
<br />
Also, teaching should not be limited to data, loss function and hypothesis space. In a human teacher-student model, the teaching contents are concepts and logical rules, similar to weights of hidden layers in neural networks. How to transfer such knowledge is interesting to investigate.<br />
<br />
The idea of having a generalizable teacher model to enhance student learning is admirable. In fact, the L2T framework is similar to the reinforcement learning actor-critic model, which is known to be effective. In general, one expects that an effective teacher model would facilitate transfer learning and significantly reduce student model training time. However, the L2T framework seems to fall short of that goal. Consider the CIFAR-10 training scenario: the L2T model achieves 85% accuracy after 2 million training instances, which is only about 3% higher than a no-teacher model. Perhaps in the future, the L2T framework can improve and produce better performance.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding&diff=42378stat946w18/Towards Image Understanding From Deep Compression Without Decoding2018-12-11T00:18:19Z<p>Msminhas: Editorial</p>
<hr />
<div>Paper Title: Towards Image Understanding from Deep Compression Without Decoding - ICLR 2018<br />
<br />
Presented By: Aravind Ravi<br />
<br />
== Introduction ==<br />
Recent advances in the deep neural network (DNN) based image compression methods have shown potential improvements in image quality, savings in storage and bandwidth reduction. These methods leverage common neural network architectures such as convolutional autoencoders or recurrent neural networks to compress and reconstruct RGB images and outperform classical techniques such as JPEG2000 and BPG on perceptual metrics such as structural similarity index (SSIM) and multi-scale structural similarity index (MS-SSIM).<br />
<br />
These approaches encode an image <math>x </math> to some feature map (compressed representation), which is subsequently quantized to a set of symbols <math>z </math>. These symbols are then losslessly compressed to a bitstream, from which a decoder reconstructs an image <math>{\hat{x}} </math>, of the same dimensions as <math>x </math>.<br />
<br />
Learned compression algorithms have an advantage over engineered compression algorithms in that they can be much more easily adapted to specific domains. For example, a learned compression algorithm might learn to compress medical images well without the algorithm being specifically tuned for them.<br />
<br />
In this paper, the authors explore the idea of applying the learned representations to perform inference without reconstructing the compressed image. Specifically, instead of reconstructing an RGB image from the compressed representation and feeding it to a network for inference, the paper proposes to use a modified network that bypasses reconstruction of the RGB image.<br />
<br />
The rationale behind this approach is that the neural network architectures commonly used for learned compression (in particular the encoders) are similar to the ones commonly used for inference, and learned image encoders are hence, in principle, capable of extracting features relevant for inference tasks. The encoder might learn features relevant for inference purely by training on the compression task and can be forced to learn these features by training on the compression and inference tasks jointly.<br />
<br />
The advantage of learning an encoder for image compression which produces compressed representation containing features relevant for inference is obvious in scenarios where images are transmitted (e.g. from a mobile device) before processing (e.g. in the cloud), as it saves reconstruction of the RGB image as well as part of the feature extraction and hence speeds up processing. A typical use case is a cloud photo storage application where every image is processed immediately upon upload for indexing and search purposes.<br />
<br />
Note: [https://en.wikipedia.org/wiki/Structural_similarity More Information on SSIM, MSSIM]<br />
<br />
== Intuition ==<br />
<br />
Compression techniques (something as common as zipping) are routinely used in day-to-day file handling. Most often these are engineered compression techniques. Deep neural networks (DNNs) are nonlinear function approximators that act as feature extractors, extracting features from inputs such as images or sound files. They can be seen as learning-based compression techniques: they can perform compression and can be trained using backpropagation. If image classification can be done on the compressed files, large image datasets such as hyperspectral images and MRI images can be stored efficiently, and the compressed files can be used directly by DNNs for classification or reinforcement learning tasks.<br />
<br />
==Motivation and Contributions==<br />
The authors propose to perform image understanding tasks such as image classification and segmentation directly on DNN based compressed representations. Performing the image understanding tasks on the compressed representations/encoded feature maps has two advantages. <br />
# This method bypasses the process of decoding the image into the RGB space before classification.<br />
# The authors show that it reduces the overall computational complexity by up to 2 times.<br />
<br />
=== Contributions of the Paper ===<br />
* A method to perform image classification and semantic segmentation from compressed representations. Learning directly from a compressed representation is particularly relevant for large-scale image understanding problems. <br />
* The proposed method offers classification accuracy similar to that achieved on decompressed images while reducing the computational complexity by 2 times.<br />
* Semantic segmentation on compressed representations is shown to be as accurate as on decompressed images for moderate compression rates and more accurate for aggressive compression rates. In addition, this method achieves lower computational complexity.<br />
* Joint training for image compression and classification is shown to improve image quality and to increase the accuracy of classification and segmentation.<br />
<br />
==Related Work==<br />
<br />
The prior work has shown image classification from compressed images based on engineered codecs. Some of the works in this area are:<br />
<br />
* In video analysis domain: Action recognition (Yeo et al., 2008; Kantorov & Laptev, 2014)<br />
* Classification of compressed hyperspectral images (Hahn et al., 2014; Aghagolzadeh & Radha, 2015)<br />
* Discrete Cosine Transform based compression performed on images before feeding them into a neural network, which improves training speed by up to 10 times (Fu & Guimaraes, 2016)<br />
* Video analysis on compressed video (using engineered codecs) has also been studied in the past (Babu et al., 2016)<br />
* Criticism on document image analysis methods (Javed et al., 2017)<br />
<br />
The authors propose a method that does inference on top of learned feature representation and hence has a direct relation to unsupervised feature learning using autoencoders.<br />
They also claim that so far there hasn't been any work using learned compressed representations for image classification and segmentation.<br />
<br />
==Learned Deeply Compressed Representations==<br />
<br />
The image compression task is performed based on a convolutional autoencoder architecture proposed by Theis et al. (2017) (shown in the figure below), and a variant of the training procedure described by Agustsson et al. (2017). <br />
<br />
[[File:AR_theisAutoencoder.png|600px|center]]<br />
<br />
Some points to better understand the architecture:<br />
<br />
1. Most convolutions are performed in a downsampled, lower-dimensional space to speed up computation<br />
<br />
2. Different activation functions are used. Blank arrows indicate the identity function (no additional nonlinearity), while black arrows indicate leaky rectifications<br />
<br />
3. The “round” box simply rounds all elements in the tensor to the nearest integer<br />
<br />
4. The “subpix” block is just an upsampling /reconstruction block where the feature map’s coefficients are reshuffled after a convolution<br />
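<br />
The “round” and “subpix” blocks in the list above can be sketched on nested Python lists laid out as channels × height × width. The channel ordering used for the reshuffle follows one common sub-pixel convolution convention and is an assumption, as are the function names; note also that Python's built-in <code>round</code> rounds exact halves to the nearest even integer.<br />
<br />
```python
def quantize(tensor):
    """The "round" block: round every element to the nearest integer."""
    return [[[round(v) for v in row] for row in channel] for channel in tensor]


def pixel_shuffle(tensor, r):
    """The reshuffle in "subpix": (C*r*r, H, W) -> (C, H*r, W*r).

    Assumed convention:
    out[c][h*r + i][w*r + j] = in[c*r*r + i*r + j][h][w].
    """
    in_channels = len(tensor)
    height, width = len(tensor[0]), len(tensor[0][0])
    out_channels = in_channels // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        src = tensor[c * r * r + i * r + j][h][w]
                        out[c][h * r + i][w * r + j] = src
    return out


# Four 1x1 channels become one 2x2 channel when upsampling by r = 2.
shuffled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], r=2)
symbols = quantize([[[0.4, 1.6]]])
```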
<br />
<br />
<br />
=== Compression Architecture ===<br />
<br />
The compression network is an autoencoder that takes an input image <math>x </math> and outputs <math>{\hat{x}} </math> as the approximation to the input. <br />
<br />
[[File:AR_Fig2a.png|300px|center]]<br />
<br />
The encoder has the following structure: It starts with 2 convolutional layers with spatial subsampling by a factor of 2, followed by 3 residual units, and a final convolutional layer with spatial subsampling by a factor of 2. This results in a <math>w/8</math> x <math>h/8</math> x <math>C</math> dimensional representation, where <math>w </math> and <math>h </math> are the spatial dimensions of <math>x </math>, and the number of channels C is a hyperparameter related to the rate <math>R </math>. This representation is then quantized to a discrete set of symbols, forming a compressed representation, <math>z </math>.<br />
<br />
To get the reconstruction <math>{\hat{x}} </math>, the compressed representation is fed into the decoder, which mirrors the encoder, but uses upsampling and deconvolutions instead of subsampling and convolutions.<br />
<br />
Quantizing the compressed representation imposes a distortion <math>D </math> on <math>{\hat{x}} </math> w.r.t. <math>x </math>, i.e., it increases the reconstruction error. This is traded for a decrease in the entropy of the quantized compressed representation <math>z </math>, which leads to a decrease in the length of the bitstream as measured by the rate <math>R </math>. Thus, to train the image compression network, the classical rate-distortion trade-off <math>D + \beta R</math> is minimized. As a metric for <math>D </math>, the mean squared error (MSE) between <math>x </math> and <math>{\hat{x}} </math> is used, and <math>R</math> is estimated using <math>H(q)</math>, the entropy of the probability distribution over the symbols, which is itself estimated using a histogram of the probability distribution (as done by Agustsson et al., 2017). The trade-off between MSE and the entropy is controlled by adjusting <math>\beta </math>. For each <math>\beta </math> an operating point is derived where the images have a certain bit rate, as measured by bits per pixel (bpp), and a corresponding MSE. To better control the bpp, the authors introduce a target entropy <math>H_t</math> and formulate the loss as:<br />
<br />
\begin{align}<br />
\mathcal{L_c} = \text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0)<br />
\end{align}<br />
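<br />
A minimal numeric sketch of this loss is given below. It uses a hard histogram over already-quantized symbols and measures entropy in bits; both are assumptions for illustration, and a hard histogram is not differentiable, so this shows only how the terms combine, not how the network is trained.<br />
<br />
```python
import math
from collections import Counter


def histogram_entropy(symbols):
    """Estimate H(q) in bits from the empirical symbol histogram."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def compression_loss(x, x_hat, symbols, beta, target_entropy):
    """L_c = MSE(x, x_hat) + beta * max(H(q) - H_t, 0)."""
    rate_penalty = max(histogram_entropy(symbols) - target_entropy, 0.0)
    return mse(x, x_hat) + beta * rate_penalty


# Two equally likely symbols give exactly 1 bit of entropy.
h = histogram_entropy([0, 0, 1, 1])
loss = compression_loss([0.0, 1.0], [0.0, 0.0], [0, 0, 1, 1],
                        beta=2.0, target_entropy=0.5)
```
<br />
Once the entropy falls below the target <math>H_t</math>, the penalty term vanishes and only the distortion is optimized, which is what allows the bit rate to be steered toward a chosen operating point.<br />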
<br />
Agustsson et al. (2017) proposed a differentiable approximation to the quantization step to overcome its non-differentiability. This method has been adapted to suit the application in this paper.<br />
<br />
Three operating points at 0.0983 bpp (C=8), 0.330 bpp (C=16), and 0.635 bpp (C=32) are obtained empirically. All further experiments are performed with these three operating points and the results for the same are presented in the following sections.<br />
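<br />
To relate the number of channels <math>C</math> to the bit rate, the sketch below computes the shape of the compressed representation and the resulting bits per pixel. It assumes three stride-2 stages (as in the encoder described above) and an ideal entropy coder that spends <math>H(q)</math> bits per symbol on average; the function names and the example entropy value are hypothetical.<br />
<br />
```python
def compressed_shape(width, height, channels, num_stride2_layers=3):
    """Shape of the representation after repeated subsampling by 2."""
    factor = 2 ** num_stride2_layers
    return (width // factor, height // factor, channels)


def bits_per_pixel(entropy_bits, channels, num_stride2_layers=3):
    """bpp under an ideal entropy coder (an assumption of this sketch).

    Each image pixel corresponds to channels / factor^2 symbols, and each
    symbol costs entropy_bits on average.
    """
    factor = 2 ** num_stride2_layers
    return entropy_bits * channels / (factor * factor)


shape = compressed_shape(224, 224, channels=8)
# Hypothetical entropy: about 0.79 bits/symbol would give 0.0983 bpp at C=8.
bpp = bits_per_pixel(0.7864, channels=8)
```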
<br />
==Image Classification from Compressed Representations==<br />
<br />
=== Classification on RGB Images ===<br />
<br />
For the image classification task based on the RGB images, the authors use the ResNet-50 architecture. <br />
Further information on residual networks can be found in the following links: <br />
[https://youtu.be/K0uoBKBQ1gA ResNets Part-1]<br />
[https://youtu.be/GSsKdtoatm8 ResNets Part-2]<br />
<br />
The details of the architecture are presented in the table below:<br />
<br />
[[File:AR_Tab1.png|400px|center]]<br />
<br />
In this paper, the number of 14x14 (conv4_x) blocks has been modified to obtain a new architecture called ResNet-71. <br />
<br />
=== Classification on Compressed Representations ===<br />
<br />
For input images with spatial dimension 224x224, the encoder of the compression network outputs a compressed representation with dimensions 28x28xC, where C is the number of channels. To use this compressed representation as input to the classification network, a simple variant of the ResNet architecture is proposed. This variant is referred to as cResNet-k, where c stands for “compressed representation” and k is the number of convolutional layers in the network. These networks are constructed by simply “cutting off” the front of the regular (RGB) ResNet. The root-block of the network and the residual layers that have a larger spatial dimension than 28x28 are removed. To adjust the number of layers k, the ResNet architecture proposed by He et al. (2015) is used and the number of 14x14 (conv4_x) residual blocks is modified.<br />
<br />
In this way, three different architectures are derived:<br />
* cResNet-39 is ResNet-50 with the first 11 layers removed as described above, and this significantly reduces computational cost<br />
* cResNet-51<br />
* cResNet-72<br />
<br />
cResNet-51 and cResNet-72 are obtained by adding 14x14 residual blocks to match the computational cost of ResNet-50 and ResNet-71 respectively.<br />
<br />
Detailed descriptions of all the network architectures are presented below:<br />
<br />
[[File:AR_Tab3.png|600px|center]]<br />
<br />
==Semantic Segmentation from Compressed Representations==<br />
<br />
For semantic segmentation, the ResNet-based DeepLab architecture is adapted for the proposed application. The cResNet and ResNet image classification architectures are re-purposed with atrous convolutions, where the filters are upsampled instead of downsampling the feature maps. This is done to increase their receptive field and to prevent aggressive subsampling of the feature maps. For segmentation, the ResNet architecture is restructured such that the output feature map has a spatial dimension 8 times smaller than the original RGB image (instead of subsampling by a factor of 32 as for classification). When using the cResNets, the output feature map has the same spatial dimensions as the input compressed representation (instead of subsampling by a factor of 4 as for classification). This results in comparably sized feature maps for both the compressed representation and the reconstructed RGB images. Finally, the last 1000-way classification layer of these classification architectures is replaced by an atrous spatial pyramid pooling (ASPP) module with four parallel branches with rates {6, 12, 18, 24}, which provides the final pixel-wise classification.<br />
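<br />
To make the effect of the atrous rate concrete, the following minimal one-dimensional sketch (an illustration only, not the DeepLab implementation; the function names are hypothetical) applies a kernel whose taps are spaced <code>rate</code> samples apart and reports the resulting receptive field.<br />
<br />
```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D convolution (cross-correlation form) with dilation `rate`."""
    span = (len(kernel) - 1) * rate + 1  # input samples covered by one output
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are placed `rate` samples apart; no input is subsampled.
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out


def receptive_field(kernel_size, rate):
    """Number of input samples spanned by a single dilated kernel."""
    return (kernel_size - 1) * rate + 1
```
<br />
For a kernel of size 3, the rates 6, 12, 18 and 24 span 13, 25, 37 and 49 input samples respectively, without adding any weights to the kernel.<br />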
<br />
==Joint Training for Compression and Image Classification==<br />
<br />
The authors propose a joint training strategy to combine compression and classification tasks. To do this, the proposed method combines the compression network and the cResNet-51 architecture. The figure below shows the combined pipeline:<br />
<br />
[[File:AR_Fig2b.png|300px|center]]<br />
<br />
All parts, encoder, decoder, and inference network, are trained at the same time. The compressed representation is fed<br />
to the decoder to optimize for mean-squared reconstruction error and to a cResNet-51 network to<br />
optimize for classification using a cross-entropy loss. The combined loss function takes the form:<br />
<br />
\begin{align}<br />
\mathcal{L_c} = \gamma(\text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0))+l_{ce}(y,{\hat{y}})<br />
\end{align}<br />
<br />
where the loss terms for the compression network, <math> \mathcal{L_c} = \text{MSE}(x,{\hat{x}})+\beta\max({H(q)}-{H_t},0)</math>, are the same as in training for compression only. <math> l_{ce}</math> is the cross-entropy loss for classification.<br />
<math>\gamma </math> controls the trade-off between the compression loss and the classification loss.<br />
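<br />
As a rough sketch of how the combined loss is assembled (hypothetical function names, a single example, and already-computed MSE and entropy values are assumptions made for illustration):<br />
<br />
```python
import math


def cross_entropy(probs, label):
    """Cross-entropy for one example, given predicted class probabilities."""
    return -math.log(probs[label])


def joint_loss(mse_value, entropy, target_entropy, beta, gamma, probs, label):
    """gamma * (MSE + beta * max(H(q) - H_t, 0)) + cross-entropy."""
    compression = mse_value + beta * max(entropy - target_entropy, 0.0)
    return gamma * compression + cross_entropy(probs, label)
```
<br />
A small <math>\gamma</math> lets the classification term dominate, while a large <math>\gamma</math> pushes the encoder toward better reconstructions.<br />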
<br />
==Experiments and Results==<br />
<br />
=== Learned Deeply Compressed Representations Results ===<br />
<br />
All experiments have been performed on the ILSVRC2012 dataset.<br />
<br />
The metrics used to measure the compression quality are as follows: <br />
* PSNR (Peak Signal-to-Noise Ratio) is a standard measure that depends monotonically on the mean squared error (MSE). It is defined as: <br />
<br />
\begin{align}<br />
\text{PSNR} = 10\log_{10}\left(\frac{255^2}{\text{MSE}}\right)<br />
\end{align}<br />
<br />
* SSIM (Structural Similarity Index) and MS-SSIM (Multi-Scale SSIM) are metrics proposed to measure the similarity of images as perceived by humans<br />
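<br />
The PSNR definition above translates directly into code. The sketch below assumes 8-bit images (peak value 255); the function name is hypothetical.<br />
<br />
```python
import math


def psnr(mse_value, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse_value == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse_value)
```
<br />
Because the MSE appears in the denominator, a lower MSE always gives a higher PSNR, which is the monotonic relationship mentioned above.<br />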
<br />
The figure below depicts the performance of the deep compression models vs. standard JPEG and JPEG2000. Higher values are better. The proposed technique outperforms JPEG and JPEG2000 at the operating points used in this paper.<br />
<br />
[[File:AR_Fig8.png|600px|center]]<br />
<br />
The learned compressed representations are illustrated in the figure below. <br />
<br />
[[File:AR_Fig9.png|500px|center]]<br />
<br />
In the above figure, the original RGB image is shown along with compressed versions of the RGB image, which are reconstructed from the compressed representations. The 4 channels with the highest entropy are shown in the visualizations. These visualizations indicate how the networks compress an image: as the rate (bpp) gets lower, the entropy cost of the network forces the compressed representation to use fewer quantization levels. For the most aggressive compression, the channel maps use only 2 levels for the compressed representation.<br />
<br />
=== Classification on Compressed Representations ===<br />
<br />
All experiments have been performed on the ILSVRC2012 dataset. It consists of 1.28 million training images and 50k validation images. These images are distributed across 1000 diverse classes. For image classification, the top-1 classification accuracy and top-5 classification accuracy are reported on the validation set on 224x224 center crops for RGB images and 28x28 center crops for the compressed representation.<br />
<br />
==== Training Procedure ====<br />
<br />
The compression network is fixed while training the classification network, both when training with compressed representations and with reconstructed compressed RGB images. For the compressed representations, the output of the fixed encoder (the compressed representation) is provided as input to the cResNets (the decoder is not needed). When training on the reconstructed compressed RGB images, the output of the fixed encoder-decoder (an RGB image) is provided as input to the ResNet. This is done for each operating point.<br />
<br />
Refer to Appendix A, Section A4 of the paper for details on the hyperparameters and optimization used for training the network [1].<br />
<br />
==== Classification Results ====<br />
<br />
The tables below present the results of the classification at each operating point, both classifying from the compressed representation and the corresponding reconstructed compressed RGB images.<br />
<br />
[[File:AR_Tab2.png|400px|center]]<br />
<br />
The figure below shows the validation curves for ResNet-50, cResNet-51, and cResNet-39. <br />
<br />
[[File:AR_Fig3.png|700px|center]]<br />
<br />
For the two classification architectures with the same computational complexity (ResNet-50 and cResNet-51), the validation curves at the 0.635 bpp compression operating point almost coincide, with ResNet-50 performing slightly better. As the rate (bpp) gets smaller, this performance gap gets smaller. The table above shows the classification results when the different architectures have converged. At the 0.635 bpp operating point, ResNet-50 only performs 0.5% better in top-5 accuracy than cResNet-51, while for the 0.0983 bpp operating point this difference is only 0.3%.<br />
Using the same pre-processing and the same learning rate schedule but starting from the original uncompressed RGB images yields 89.96% top-5 accuracy. The top-5 accuracy obtained from the compressed representation at the 0.635 bpp compression operating point, 87.85%, is competitive with that obtained for the original images at a significantly lower storage cost. Specifically, at 0.635 bpp the ImageNet dataset requires 24.8 GB of storage space instead of 144 GB for the original version, a reduction by a factor of 5.8.<br />
<br />
Notes on top-1 and top-5 accuracy:<br />
<br />
* Top-1 accuracy: This is the conventional accuracy metric used in machine learning. An input is counted as correctly classified if its true label matches the class to which the CNN's output layer assigns the highest predicted probability; otherwise it is counted as incorrectly classified.<br />
* Top-5 accuracy: An input is counted as correctly classified if its true label matches any of the 5 classes to which the model assigns the highest probabilities; otherwise it is counted as incorrectly classified.<br />
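<br />
Both metrics are special cases of top-k accuracy, which can be sketched as follows (the function name and the list-of-lists score format are assumptions made for illustration):<br />
<br />
```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest scores."""
    correct = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest predicted score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)
```
<br />
Setting <code>k=1</code> gives top-1 accuracy and <code>k=5</code> gives top-5 accuracy; top-k accuracy can never decrease as k grows.<br />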
<br />
===Semantic Segmentation Results===<br />
<br />
All experiments have been performed on the PASCAL VOC-2012 dataset for semantic segmentation. It has 20 object foreground classes and 1 background class. The dataset<br />
consists of 1464 training and 1449 validation images. In every image, each pixel is annotated with<br />
one of the 20 + 1 classes. The original dataset is furthermore augmented with extra annotations, so the final dataset has 10,582 images for training and 1449 images for validation.<br />
<br />
Performance is measured by the pixel-wise intersection-over-union (IoU) averaged over all the classes, i.e. the mean intersection-over-union (mIoU), on the validation set. <br />
<br />
[https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ Details on IoU]<br />
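<br />
A minimal sketch of the mIoU computation is given below. It assumes flat lists of per-pixel class indices for a single prediction/ground-truth pair and skips classes absent from both; benchmark evaluations typically accumulate the counts over the whole validation set instead. The function name is hypothetical.<br />
<br />
```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class intersection-over-union for two flat label maps."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious)
```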
<br />
==== Training Procedure ====<br />
The cResNet/ResNet networks are pre-trained on the ImageNet dataset using the procedure described earlier for the image classification task; the encoder and decoder are fixed as in the earlier scenario. The architectures are then adapted with dilated convolutions (cResNet-d/ResNet-d) and finetuned on the semantic segmentation task.<br />
<br />
Refer to Appendix A, Section A5 of the paper for details on the hyperparameters and optimization used for training the network [1].<br />
<br />
==== Segmentation Results ====<br />
<br />
The table below shows the mIoU results for the segmentation task.<br />
<br />
[[File:AR_Tab2.png|450px|center]]<br />
<br />
The figure below illustrates the segmentation results with respect to each compression operating point.<br />
<br />
[[File:AR_Fig4.png|700px|center]]<br />
<br />
For semantic segmentation ResNet-50-d and cResNet-51-d perform equally well at the 0.635 bpp compression operating point. For the<br />
0.330 bpp operating point, segmentation from the compressed representation performs slightly better, 0.37%, and at the 0.0983 bpp operating point segmentation from the compressed representation<br />
performs considerably better than for the reconstructed compressed RGB images, by 1.65%.<br />
<br />
[[File:AR_Fig5.png|600px|center]]<br />
<br />
The above figure shows the predicted segmentation visually for both the cResNet-51-d and the ResNet-50-d<br />
architecture at each operating point. Along with the segmentation, it also shows the original uncompressed<br />
RGB image and the reconstructed compressed RGB image. These images highlight<br />
the challenging nature of these segmentation tasks, but they can nevertheless be performed using the<br />
compressed representation. They also clearly indicate that the compression affects the segmentation,<br />
as lowering the rate (bpp) progressively removes details in the image. Comparing the segmentation<br />
from the reconstructed RGB images to the segmentation from the compressed representation visually,<br />
the performance is similar.<br />
<br />
The figure below is another example of visual results of segmentation from compressed representation and reconstructed RGB<br />
images. The performance is visually similar for all operating points except for the 0.0983<br />
bpp operating point where the reconstructed RGB image fails to capture the back part of<br />
the train, while the compressed representation manages to capture that aspect of the image in the<br />
segmentation.<br />
<br />
[[File:AR_Fig10.png|600px|center]]<br />
<br />
=== Results on Computational Gains ===<br />
<br />
[[File:AR_Fig6.png|400px|center]]<br />
<br />
=====Computational Gains on Classification=====<br />
<br />
The figure on the left illustrates the top-5 classification accuracy as a function of computational complexity for the 0.0983 bpp compression operating point.<br />
At a fixed computational cost, the reconstructed compressed RGB images perform about 0.25% better. At a fixed classification performance, inference from the compressed representation costs about <math>0.6 \times 10^9</math> FLOPs more. However, when accounting for the decoding cost at a fixed classification performance, inference from the reconstructed compressed RGB images costs <math>2.2 \times 10^9</math> FLOPs more than inference from the compressed representation.<br />
<br />
=====Computational Gains on Segmentation=====<br />
<br />
The figure on the right shows the mIoU validation performance as a function of computational complexity for the 0.0983 bpp compression operating point. <br />
Here, even without accounting for the decoding cost of the reconstructed images, the compressed representation performs better. At a fixed computational cost, segmentation from the compressed representation gives about 0.7% better mIoU, and at a fixed mIoU the computational cost is about <math>3.3 \times 10^9</math> FLOPs lower for compressed representations. Accounting for the decoding costs, this difference becomes <math>6.1 \times 10^9</math> FLOPs. Due to the nature of the dilated convolutions and the increased feature map size, the relative computational gains for segmentation are not as pronounced as for classification.<br />
<br />
===Joint Training for Compression and Image Classification===<br />
<br />
==== Training Procedure ====<br />
<br />
When doing joint training, the compression network and the classification networks are first initialized<br />
from a trained state obtained as described previously. After initialization, the networks are<br />
both finetuned jointly. For a detailed<br />
description of hyperparameters used and the training schedule see Appendix A8 in the paper.<br />
<br />
To verify that the change in classification accuracy is not only due to (1) a better compression operating point or (2) the fact that the cResNet is trained longer, the following control is performed. A new operating point is obtained by finetuning the compression network only, using the schedule described above. The cResNet-51 is trained on top of this new operating point from scratch. Finally, the compression network is fixed at the new operating point, and the cResNet-51 is trained for 9 epochs. <br />
<br />
To obtain segmentation results, the jointly trained network is used. The operating point is fixed and the jointly finetuned classification network is adapted for segmentation (cResNet-51-d).<br />
<br />
==== Joint Training Results ====<br />
<br />
[[File:AR_Fig7.png|400px|center]]<br />
<br />
It can be seen from the figure that the classification and segmentation results “move up” from the baseline through finetuning. When training jointly, the improvement for classification is larger and a significant improvement for segmentation is achieved. For the 0.635 bpp operating point, the classification performance is similar for training the network jointly and training the compression network only, but when using these operating points for segmentation the difference is considerable.<br />
<br />
The results presented by the authors suggest an improvement in classification by 2%, a performance gain which would<br />
require an additional 75% of the computational complexity of cResNet-51. The segmentation<br />
performance after training the networks jointly is 1.7% better in mIoU than training only<br />
the compression network.<br />
<br />
==Critique==<br />
<br />
The paper shows how previous work on auto-encoders and image compression can be extended effectively to a novel combined task of image compression and recognition. It provides extensive experimental evaluation and evidence suggesting that learned compressed representations can be effective in classification and segmentation tasks. While maintaining state-of-the-art performance, the authors show that the proposed method can offer significant computational gains. Potential applications include multimedia communication, wireless transmission of images, video surveillance on the mobile edge, etc. With the advent of 5G and other new wireless technologies, this method can help conserve wireless bandwidth and save storage while retaining the perceptual quality of images.<br />
The joint training of the compression and classification networks provides some added advantages and also shows that, at aggressive compression rates, the performance in classification and segmentation can be improved significantly.<br />
<br />
The authors mention that the complexity of the current approach is still high in comparison with methods like JPEG or JPEG2000. They also mention that this can be overcome when the networks are trained and run on GPUs. Although this has been seen as a drawback, subsequent improvements in physical hardware and more specialized deep learning platforms may overcome this limitation of the current approach. While the authors did thorough experiments and gave extensive results on compressed representations and their advantages, the idea itself is not very novel. Finally, in providing extensive experimental contributions, the authors have written quite a lengthy paper. Some ideas are repeated frequently; avoiding this repetition would have led to a better-balanced article length.<br />
<br />
* ([https://openreview.net/forum?id=HkXWCMbRW]) As mentioned in the paper, solving a vision problem directly from a compressed image is not a novel method (e.g. DCT coefficients have been used for both vision and audio data to solve a task without any decompression).<br />
<br />
==Conclusion==<br />
<br />
The paper proposes performing inference directly on compressed image representations, without the need to decode them, for classification and semantic segmentation. The paper demonstrates the approach on the intended tasks through a set of rigorous experiments. The results show significant improvements in computational complexity while maintaining state-of-the-art classification and segmentation performance. As future work, the authors intend to explore other computer vision tasks based on compressed representations. They also suggest that this could lead to a better understanding of the features/compressed representations learned by image compression networks, with applications in unsupervised or semi-supervised learning.<br />
<br />
==References==<br />
# Torfason, R., Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., & Van Gool, L. (2018). Towards image understanding from deep compression without decoding. arXiv preprint arXiv:1803.06131.<br />
# Theis, L., Shi, W., Cunningham, A., & Huszár, F. (2017). Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.<br />
# Agustsson, E., Mentzer, F., Tschannen, M., Cavigelli, L., Timofte, R., Benini, L., & Gool, L. V. (2017). Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems (pp. 1141-1151).<br />
# He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
# Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4), 834-848.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks&diff=42377stat946w18/Wavelet Pooling For Convolutional Neural Networks2018-12-11T00:15:09Z<p>Msminhas: Editorial</p>
<hr />
<div>=Wavelet Pooling For Convolutional Neural Networks=<br />
<br />
<br />
<br />
== Introduction, Important Terms and Brief Summary==<br />
<br />
This paper focuses on the following important techniques: <br />
<br />
1) Convolutional Neural Nets (CNNs): These are networks with layered structures that conform to the shape of inputs rather than vector-based features and consistently obtain high accuracies in the classification of images and objects. Researchers continue to focus on CNNs to improve their performance. <br />
<br />
2) Pooling: Pooling subsamples the results of the convolution layers and gradually reduces spatial dimensions of the data throughout the network. It is done to reduce parameters, increase computational efficiency and regulate overfitting. <br />
<br />
Some of the pooling methods, including max pooling and average pooling, are deterministic. Deterministic pooling methods are efficient and simple, but can hinder the potential for optimal network learning. In contrast, mixed pooling and stochastic pooling use a probabilistic approach, which can address some problems of deterministic methods. The neighborhood approach is used in all the mentioned pooling methods due to its simplicity and efficiency. Nevertheless, the approach can cause edge halos, blurring, and aliasing, which need to be minimized. This paper introduces wavelet pooling, which uses a second-level wavelet decomposition to subsample features. The nearest neighbor interpolation is replaced by an organic, subband method that more accurately represents the feature contents with fewer artifacts. The method decomposes features into a second-level decomposition and discards the first-level subbands to reduce feature dimensions. This method is compared to other state-of-the-art pooling methods to demonstrate superior results. Tests are conducted on benchmark classification datasets: MNIST, CIFAR10, SVHN and KDEF.<br />
<br />
For further information on wavelets, follow this link to MathWorks' [https://www.mathworks.com/videos/understanding-wavelets-part-1-what-are-wavelets-121279.html Understanding Wavelets] video series.<br />
<br />
== Intuition ==<br />
<br />
Convolutional networks commonly employ convolutional layers to extract features and use pooling methods for spatial dimensionality reduction. In this study, wavelet pooling is introduced as an alternative to traditional neighborhood pooling, providing a more structural method of feature dimension reduction. Max pooling is noted to be prone to overfitting, and average pooling is noted to smooth out or 'dilute' details in features.<br />
<br />
Pooling is often introduced within networks to provide local invariance, which helps prevent overfitting to small translational shifts within an image. Despite their effectiveness, traditional pooling methods such as max pooling introduce this translational invariance by discarding information, using methods analogous to nearest neighbour interpolation. To provide a more organic way of pooling, the authors use all of the information in the cells passed to a pooling operation, in the hope that the resulting dimension-reduced features contain information from all high-level cells through various dot products.<br />
<br />
== History ==<br />
<br />
The study introduces and references the following history of pooling methods:<br />
* Manual subsampling in 1979<br />
* Max pooling in 1992<br />
* Mixed pooling in 2014<br />
* Pooling methods with probabilistic approaches in 2014 and 2015<br />
<br />
== Background ==<br />
Average Pooling and Max Pooling are well-known pooling methods and are popular techniques used in the literature. These pooling methods reduce input data dimensionality by taking the maximum value or the average value of specific areas, condensing each area into a single value. While these methods are simple and effective, they still have some limitations. The authors identify the following limitations:<br />
<br />
'''Limitations of Max Pooling and Average Pooling'''<br />
<br />
'''Max pooling''': takes the maximum value of a region <math>R_{ij} </math> and selects it to obtain a condensed feature map. It can '''erase the details''' of the image (happens if the main details have less intensity than the insignificant details) and also commonly '''over-fits''' the training data. The max-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \max_{(p,q)\in R_{ij}}(a_{kpq})<br />
\end{align}<br />
<br />
'''Average pooling''': calculates the average value of a region and selects it to obtain a condensed feature map. Depending on the data, this method can '''dilute pertinent details''' from an image (happens for data with values much lower than the significant details). The avg-pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>a_{kij}</math> is the output activation of the <math>k^{th}</math> feature map at <math>(i,j)</math>, <math>a_{kpq}</math> is the input activation at<br />
<math>(p,q)</math> within <math>R_{ij}</math>, and <math>|R_{ij}|</math> is the size of the pooling region. Figure 1 gives a quick visual example of max and average pooling:<br />
<br />
[[File: pooling.png| 700px|center]]<br />
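<br />
The two definitions above can be sketched for non-overlapping regions as follows. This is a minimal illustration that assumes a single feature map stored as a list of lists whose side lengths are divisible by the pooling size; the function names are hypothetical.<br />
<br />
```python
def pool2d(feature, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size region R_ij."""
    rows, cols = len(feature), len(feature[0])
    pooled = []
    for i in range(0, rows, size):
        pooled_row = []
        for j in range(0, cols, size):
            # Collect the activations a_pq that belong to the current region.
            region = [feature[p][q]
                      for p in range(i, i + size)
                      for q in range(j, j + size)]
            pooled_row.append(reduce_fn(region))
        pooled.append(pooled_row)
    return pooled


def max_pool(feature, size=2):
    return pool2d(feature, size, max)


def avg_pool(feature, size=2):
    return pool2d(feature, size, lambda region: sum(region) / len(region))
```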
<br />
Figure 2 provides an example of the weaknesses of these two methods using toy images:<br />
<br />
[[File: fig0001.PNG| 700px|center]]<br />
<br />
<br />
'''How do researchers try to combat these issues?'''<br />
By using '''probabilistic pooling methods''' such as:<br />
<br />
1. '''Mixed pooling''': In general, when facing a new problem in which one would want to use a CNN, it is not intuitively known whether average or max pooling should be preferred. Notably, both techniques have significant drawbacks. Average pooling forces the network to consider low-magnitude (and possibly irrelevant) information in constructing representations, while max pooling can force the network to ignore fundamental differences between neighboring groups of pixels. To counteract this, mixed pooling probabilistically decides which to use during training / testing. It should be noted that, for training, it is only probabilistic in the forward pass. During back-propagation the network defaults to the earlier chosen method. Mixed pooling can be applied in 3 different ways.<br />
<br />
* For all features within a layer<br />
* Mixed between features within a layer<br />
* Mixed between regions for different features within a layer<br />
<br />
Mixed Pooling is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = \lambda \cdot \max_{(p,q)\in R_{ij}}(a_{kpq})+(1-\lambda) \cdot \frac{1}{|R_{ij}|}\sum_{(p,q)\in R_{ij}}{{a_{kpq}}}<br />
\end{align}<br />
<br />
Where <math>\lambda</math> is a random value 0 or 1, indicating max or average pooling for a particular region/feature/layer.<br />
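<br />
For a single pooling region, the mixed pooling rule can be sketched as below. The function names are hypothetical, and drawing <math>\lambda</math> with equal probability for 0 and 1 is an assumption made for illustration.<br />
<br />
```python
import random


def mixed_pool_region(region, lam):
    """lam * max(region) + (1 - lam) * mean(region), with lam in {0, 1}."""
    return lam * max(region) + (1 - lam) * (sum(region) / len(region))


def sample_lambda(rng):
    """Draw lam uniformly from {0, 1}: 1 selects max, 0 selects average."""
    return rng.choice([0, 1])
```
<br />
Whether one <math>\lambda</math> is drawn per layer, per feature, or per region corresponds to the three ways of applying mixed pooling listed above.<br />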
<br />
2. '''Stochastic pooling''': improves upon max pooling by randomly sampling from neighborhood regions based on the probability values of each activation. This is defined as:<br />
<br />
\begin{align}<br />
a_{kij} = a_l ~ \text{where } ~ l\sim P(p_1,p_2,...,p_{|R_{ij}|})<br />
\end{align}<br />
<br />
with probability of activations within each region defined as follows:<br />
<br />
\begin{align}<br />
p_{pq} = \dfrac{a_{pq}}{\sum_{(p,q) \in R_{ij}} a_{pq}}<br />
\end{align}<br />
<br />
The figure below describes the process of stochastic pooling. The figure on the left shows the activations of a given region, and the corresponding probabilities are shown in the center. The activation with the highest probability is the most likely to be selected by the pooling method; however, any activation can be selected. In this case, the midrange activation, with a probability of 13%, is selected. <br />
<br />
[[File: stochastic pooling.jpeg| 700px|center]]<br />
<br />
As stochastic pooling is based on probability and is not deterministic, it avoids the shortcomings of max and average pooling and enjoys some of the advantages of max pooling.<br />
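<br />
A minimal sketch of stochastic pooling for one region is shown below. It assumes non-negative activations (for example after a ReLU) with a positive sum, and the function name is hypothetical.<br />
<br />
```python
import random


def stochastic_pool_region(region, rng):
    """Sample one activation with probability proportional to its value."""
    total = sum(region)
    probs = [a / total for a in region]  # p_pq = a_pq / sum of region
    index = rng.choices(range(len(region)), weights=probs, k=1)[0]
    return region[index], probs
```
<br />
Larger activations are more likely to be chosen, but smaller non-zero activations still have a chance, which is what distinguishes this scheme from max pooling.<br />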
<br />
3. '''Top-k activation pooling''': picks the top-k activations in every pooling region, so that as much information as possible can pass through the subsampling gates. It is used together with max pooling: after max pooling, to further improve the representation capability, the top-k activations are picked and summed, and the sum is constrained by a constant. <br />
Details are given in this paper: https://www.hindawi.com/journals/wcmc/2018/8196906/<br />
<br />
'''Wavelets and Wavelet Transform'''<br />
A wavelet is a representation of some signal. For use in wavelet transforms, they are generally represented as combinations of basis signal functions.<br />
<br />
The wavelet transform involves taking the inner product of a signal (in this case, the image), with these basis functions. This produces a set of coefficients for the signal. These coefficients can then be quantized and coded in order to compress the image.<br />
<br />
One issue of note is that wavelets offer a tradeoff between resolution in frequency, or in time (or presumably, image location). For example, a sine wave will be useful to detect signals with its own frequency, but cannot detect where along the sine wave this alignment of signals is occurring. Thus, basis functions must be chosen with this tradeoff in mind.<br />
<br />
Source: Compressing still and moving images with wavelets<br />
<br />
The following images show the result of applying a wavelet transform to an image for denoising:<br />
<br />
[[File: Noise Wavelet.jpg| 700px]] [[File: Denoised Wavelet.jpg| 700px]]<br />
<br />
Images were taken from [https://en.wikipedia.org/wiki/Discrete_wavelet_transform#Example_in_Image_Processing here].<br />
<br />
== Proposed Method ==<br />
<br />
The previously highlighted pooling methods use neighborhoods to subsample, almost identical to nearest neighbor interpolation.<br />
<br />
The proposed pooling method uses wavelets (i.e. small waves, generally used in signal processing) to reduce the dimensions of the feature maps. The authors use the wavelet transform to minimize artifacts resulting from neighborhood reduction. They postulate that their approach, which discards the first-order sub-bands, more organically captures the data compression. According to the authors, this organic reduction therefore lessens the creation of jagged edges and other artifacts that may impede correct image classification.<br />
<br />
* '''Forward Propagation'''<br />
<br />
The proposed wavelet pooling scheme pools features by performing a 2nd order decomposition in the wavelet domain according to the fast wavelet transform (FWT) which is a more efficient implementation of the two-dimensional discrete wavelet transform (DWT) as follows:<br />
<br />
\begin{align}<br />
W_{\varphi}[j+1,k] = h_{\varphi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
\begin{align}<br />
W_{\psi}[j+1,k] = h_{\psi}[-n]*W_{\varphi}[j,n]|_{n=2k,k\leq0}<br />
\end{align}<br />
<br />
where <math>\varphi</math> is the approximation function, <math>\psi</math> is the detail function, and <math>W_{\varphi},W_{\psi}</math> are called the approximation and detail coefficients. <math>h_{\varphi}[-n]</math> and <math>h_{\psi}[-n]</math> are the time-reversed scaling and wavelet vectors, <math>n</math> represents the sample in the vector, and <math>j</math> denotes the resolution level.<br />
<br />
When using the FWT on images, it is applied twice (once on the rows, then again on the columns). By doing this in combination, the detail sub-bands (LH, HL, HH) at each decomposition level and the approximation sub-band (LL) for the highest decomposition level are obtained.<br />
After performing the 2nd order decomposition, the image features are reconstructed, but only using the 2nd order wavelet sub-bands. This method pools the image features by a factor of 2 using the inverse FWT (IFWT) which is based on the inverse DWT (IDWT).<br />
<br />
\begin{align}<br />
W_{\varphi}[j,k] = h_{\varphi}[-n]*W_{\varphi}[j+1,n]+h_{\psi}[-n]*W_{\psi}[j+1,n]|_{n=\frac{k}{2},k\leq0}<br />
\end{align}<br />
<br />
[[File: wavelet pooling forward.PNG| 700px|center]]<br />
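The forward pass described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' MatConvNet implementation: it uses an orthonormal Haar wavelet, a single-channel feature map whose height and width are divisible by 4, and hypothetical helper names (<code>haar_dwt2</code>, <code>haar_idwt2</code>, <code>wavelet_pool</code>). The sub-band names follow one common convention and may be labelled differently elsewhere.<br />

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt2(x):
    """One level of a 2D orthonormal Haar DWT (assumed wavelet basis).

    Filters pairs of rows first, then pairs of columns, and returns the
    approximation sub-band (ll) and three detail sub-bands.
    """
    a = (x[0::2, :] + x[1::2, :]) / SQRT2   # low-pass across row pairs
    d = (x[0::2, :] - x[1::2, :]) / SQRT2   # high-pass across row pairs
    ll = (a[:, 0::2] + a[:, 1::2]) / SQRT2
    lh = (a[:, 0::2] - a[:, 1::2]) / SQRT2
    hl = (d[:, 0::2] + d[:, 1::2]) / SQRT2
    hh = (d[:, 0::2] - d[:, 1::2]) / SQRT2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the array from its four sub-bands."""
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return x

def wavelet_pool(x):
    """Wavelet pooling sketch: 2-level decomposition, drop level-1 details,
    reconstruct from the level-2 sub-bands only (output is half the size)."""
    ll1, _, _, _ = haar_dwt2(x)              # level 1: details discarded
    ll2, lh2, hl2, hh2 = haar_dwt2(ll1)      # level 2
    return haar_idwt2(ll2, lh2, hl2, hh2)    # inverse transform of level 2

x = np.arange(64, dtype=float).reshape(8, 8)
pooled = wavelet_pool(x)                     # shape (4, 4)
```

With this particular normalization, reconstructing from all level-2 sub-bands returns the level-1 approximation exactly, so the pooled map equals twice the 2x2 average-pooled map. Other wavelet bases or boundary handling would not reduce to average pooling in this way.<br />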
<br />
<br />
* '''Backpropagation'''<br />
<br />
The proposed wavelet pooling algorithm performs backpropagation by reversing the process of its forward propagation. First, the image feature being backpropagated undergoes 1st order wavelet decomposition. After decomposition, the detail coefficient sub-bands are up-sampled by a factor of 2 to create a new 1st level decomposition. The initial decomposition then becomes the 2nd level decomposition. Finally, this new 2nd order wavelet decomposition reconstructs the image feature for further backpropagation using the IDWT. Figure 5 illustrates the wavelet pooling backpropagation algorithm in detail:<br />
<br />
[[File:wavelet pooling backpropagation.PNG| 700px|center]]<br />
<br />
== Results and Discussion ==<br />
<br />
All experiments were performed using the MatConvNet (Vedaldi & Lenc, 2015) toolbox, with stochastic gradient descent for training. For the proposed method, the Haar wavelet was chosen as the basis wavelet for its property of having even, square sub-bands. All CNN structures except the one for MNIST use a network loosely based on Zeiler's network (Zeiler & Fergus, 2013). The experiments are repeated with Dropout (Srivastava, 2013), and Local Response Normalization (Krizhevsky, 2009) is replaced with Batch Normalization (Ioffe & Szegedy, 2015) for CIFAR-10 and SVHN (Dropout only), to examine how these regularization techniques change the pooling results. The authors tested the proposed method on four different datasets, as shown in the figure:<br />
<br />
[[File: selection of image datasets.PNG| 700px|center]]<br />
<br />
Max, Average, Mixed, Stochastic, and Wavelet pooling are each used in the pooling stage of every architecture. Accuracy and model energy are the metrics used to evaluate the methods, and their performance is compared across the different datasets.<br />
<br />
* MNIST:<br />
<br />
The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All other parameters are the same. The figure below shows their network structure for the MNIST experiments.<br />
<br />
[[File: CNN MNIST.PNG| 700px|center]]<br />
<br />
The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The table below shows that their proposed method outperforms all other methods. Given the small number of epochs, max pooling is the only method that starts to over-fit the data during training. Mixed and stochastic pooling show a rocky trajectory but do not over-fit. Average and wavelet pooling show a smoother descent in learning and error reduction. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: MNIST pooling method energy.PNG| 700px|center]]<br />
<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
<br />
[[File: MNIST perf.PNG| 700px|center]]<br />
<br />
* CIFAR-10:<br />
<br />
The authors perform two sets of experiments with the pooling methods. The first is a regular network structure with no dropout layers. They use this network to observe each pooling method without extra regularization. The second uses dropout and batch normalization and performs over 30 more epochs to observe the effects of these changes. <br />
<br />
[[File: CNN CIFAR.PNG| 700px|center]]<br />
<br />
The input training and test data come from the CIFAR-10 dataset. <br />
The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. For both cases, with and without dropout, the tables below show that the proposed method has the second highest accuracy.<br />
<br />
[[File: fig0000.jpg| 700px|center]]<br />
<br />
Max pooling over-fits fairly quickly, while wavelet pooling resists over-fitting. The change in learning rate prevents their method from over-fitting, and it continues to show a slower propensity for learning. Mixed and stochastic pooling maintain a consistent progression of learning, and their validation sets trend at a similar, but better, rate than the proposed method. Average pooling shows the smoothest descent in learning and error reduction, especially in the validation set. The energy of each method per epoch is also shown below:<br />
<br />
[[File: CIFAR_pooling_method_energy.PNG| 700px|center]]<br />
<br />
<br />
* SVHN:<br />
<br />
Two sets of experiments are performed with the pooling methods. The first is a regular network structure with no dropout layers. As with the previous datasets, this network is used to observe each pooling method without extra regularization.<br />
The second network uses dropout to observe the effects of this change. The figure below shows their network structure for the SVHN experiments:<br />
<br />
[[File: CNN SHVN.PNG| 700px|center]]<br />
<br />
The input training and test data come from the SVHN dataset. For the case with no dropout, they use 55,000 images from the training set. For the case with dropout, they use the full training set of 73,257 images, a validation set of 30,000 images that they extract from the extra training set of 531,131 images, and the full testing set of 26,032 images. For both cases, with and without dropout, the tables below show that their proposed method has the second lowest accuracy.<br />
<br />
[[File: SHVN perf.PNG| 700px|center]]<br />
<br />
Max and wavelet pooling both slightly over-fit the data. Their method follows the path of max pooling but performs slightly better in maintaining some stability. Mixed, stochastic, and average pooling maintain a slow progression of learning, and their validation sets trend at near identical rates. The figure below shows the energy of each method per epoch.<br />
<br />
[[File: SHVN pooling method energy.PNG| 700px|center]]<br />
<br />
* KDEF:<br />
<br />
They run one set of experiments with the pooling methods that includes dropout. The figure below shows their network structure for the KDEF experiments:<br />
<br />
[[File:CNN KDEF.PNG| 700px|center]]<br />
<br />
The input training and test data come from the KDEF dataset. This dataset contains 4,900 images of 35 people displaying seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) using facial expressions. They display emotions at five poses (full left and right profiles, half left and right profiles, and straight).<br />
<br />
This dataset contains a few errors that they have fixed (missing or corrupted images, uncropped images, etc.). All of the missing images are at angles of -90, -45, 45, or 90 degrees. They fix the missing and corrupt images by mirroring their counterparts in MATLAB and adding them back to the dataset. They manually crop the uncropped images to match the dimensions set by the creators (762 x 562).<br />
KDEF does not designate a training or test data set. They shuffle the data and separate 3,900 images as training data, and 1,000 images as test data. They resize the images to 128x128 because of memory and time constraints.<br />
<br />
The dropout layers regulate the network and maintain stability in spite of some pooling methods known to over-fit. The table below shows their proposed method has the second highest accuracy. Max pooling eventually over-fits, while wavelet pooling resists over-fitting. Average and mixed pooling resist over-fitting but are unstable for most of the learning. Stochastic pooling maintains a consistent progression of learning. Wavelet pooling also follows a smoother, consistent progression of learning.<br />
The figure below shows the energy of each method per epoch.<br />
<br />
[[File: KDEF pooling method energy.PNG| 700px|center]]<br />
<br />
The accuracies for both paradigms are shown below:<br />
<br />
[[File: KDEF perf.PNG| 700px|center]]<br />
<br />
<br />
<br />
* Computational Complexity:<br />
The experiments and implementations of wavelet pooling above were more of a proof of concept than an optimized method. In terms of mathematical operations, wavelet pooling is the least computationally efficient of all the pooling methods mentioned above. Average pooling is the most efficient, max pooling and mixed pooling are at a similar level, and wavelet pooling is far more expensive to compute.<br />
<br />
== Conclusion ==<br />
<br />
The authors show that wavelet pooling has the potential to equal or eclipse some of the traditional methods currently used in CNNs. Their proposed method outperforms all others on the MNIST dataset, outperforms all but one on the CIFAR-10 and KDEF datasets, and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset. The addition of dropout and batch normalization shows the proposed method's response to network regularization. As in the non-dropout cases, it outperforms all but one method on both the CIFAR-10 and KDEF datasets and performs within a respectable range of the pooling methods that outdo it on the SVHN dataset.<br />
<br />
The authors' results confirm previous studies showing that no one pooling method is superior, but that some perform better than others depending on the dataset and network structure (Boureau et al., 2010; Lee et al., 2016). Furthermore, many networks alternate between different pooling methods to maximize the effectiveness of each method. [1]<br />
<br />
Future work and improvements in this area could include varying the wavelet basis to explore which basis performs best for pooling. Altering the upsampling and downsampling factors in the decomposition and reconstruction could lead to better image feature reductions outside of the 2x2 scale. Retaining the sub-bands that are currently discarded for backpropagation could lead to higher accuracies and fewer errors. Improving the FWT implementation could greatly increase computational efficiency. Finally, analyzing the structural similarity (SSIM) of wavelet pooling versus other methods could further prove the vitality of the authors' approach. [1]<br />
<br />
== Suggested Future work ==<br />
<br />
The upsampling and downsampling factors in decomposition and reconstruction could be changed to achieve more feature reduction.<br />
The sub-bands that are currently discarded could be kept to obtain higher accuracies. To achieve higher computational efficiency, the FWT implementation needs to be improved.<br />
<br />
== Critiques and Suggestions ==<br />
* The backpropagation process, which could be a strong point of the study, is not described in enough detail compared to the existing methods.<br />
* The study centers on wavelet decomposition, yet the reasons for choosing Haar as the mother wavelet and for the number of decomposition levels are not described; they are merely mentioned as future work. <br />
* At the beginning, the study mentions that pooling methods do not receive the attention they should. In the end, the results show that the choice of pooling method depends on the dataset, and the authors mention trial and testing as a reasonable approach to choosing the pooling method. In my view, the authors have not really focused on providing a pooling method that effectively improves on the current situation. At the least, trying to extract a clearer pattern relating the results to the dataset structure would have been helpful.<br />
* The origin of average pooling, which is one of the main pooling algorithms used for comparison, is not even referenced in the introduction.<br />
* Combining wavelet, max, and average pooling would be an interesting option to investigate further, both in sequence (max/average after wavelet pooling) and combined as in mixed pooling.<br />
* While the current datasets demonstrate the performance of the proposed method appropriately, it would be a good idea to evaluate the method on some larger datasets. This might help to show whether the size of a dataset affects the overfitting behavior of max pooling mentioned in the paper.<br />
* Adding asymptotic notation to the computational complexity of the proposed algorithm would be meaningful, particularly since the given results are for a single, fixed input size (one image in forward propagation) and consequently are not generalizable. <br />
* The authors could have compared against the Fast Fourier Transform (FFT); a non-wavelet transform seems an obvious candidate for comparison.<br />
* Going beyond the 2x2 pooling window would have further supported their method.<br />
* ([https://openreview.net/forum?id=rkhlb8lCZ]) The experiments are largely conducted with very small scale datasets. As a result, I am not sure if they are representative enough to show the performance difference between different pooling methods.<br />
* ([https://openreview.net/forum?id=rkhlb8lCZ]) No comparison to non-wavelet methods. For example, one obvious comparison would have been to look at using a DCT or FFT transform where the output would discard high-frequency components (this can get very close to the wavelet idea!). This critique might also provide some interesting research directions, since DCT or FFT transforms as pooling have not been thoroughly studied yet.<br />
* Convolutional neural networks are not used only in image-related tasks. It would be interesting to evaluate the efficiency of wavelet pooling in convolutional neural networks applied to natural language or other applicable areas. Such experiments would also show whether the approach generalizes. <br />
<br />
== References ==<br />
<br />
Williams, Travis, and Robert Li. "Wavelet Pooling for Convolutional Neural Networks." (2018).<br />
<br />
Hilton, Michael L., Björn D. Jawerth, and Ayan Sengupta. "Compressing still and moving images with wavelets." Multimedia systems 2.5 (1994): 218-227.<br />
<br />
<br />
== Revisions == <br />
<br />
*Two reviewers really liked the paper, and one of them placed it in the top 15% of papers at the conference, which supports the novelty and potential of the idea. One other reviewer, however, believed that the paper was not good enough to be accepted; the main reason given for rejection was the linear nature of wavelets (which was not convincingly described). <br />
<br />
*The main concern of two of the reviewers was the size of the datasets used to test the method, and the authors have mentioned future work on testing the method with bigger datasets.<br />
<br />
*The computational cost section was not included in the paper initially and was added in response to one reviewer's concern, so the other reviewers did not comment on it. However, the explanation of the inefficient implementation seemed to satisfy that reviewer, and the paper was accepted. <br />
<br />
[https://openreview.net/forum?id=rkhlb8lCZ Revisions]<br />
<br />
Finally, for readers interested in implementing the method, the authors are willing to share their code, but only after making it efficient. There may therefore be another paper on reducing the computational cost on larger datasets, with publishable code.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DeepVO_Towards_end_to_end_visual_odometry_with_deep_RNN&diff=42376DeepVO Towards end to end visual odometry with deep RNN2018-12-11T00:08:06Z<p>Msminhas: Editorial</p>
<hr />
<div>== Introduction ==<br />
Visual Odometry (VO) is a computer vision technique for estimating an object’s position and orientation from camera images. It is an important technique commonly used for “pose estimation and robot localization” with notable applications in Mars Exploration Rovers and Autonomous Vehicles [x1] [x2]. While the research field of VO is broad, this paper focuses on the topic of monocular visual odometry. In particular, the authors examine prominent VO methods and argue that mainstream geometry-based monocular VO methods should be amended with deep learning approaches. Deep Learning (DL) has recently achieved promising results in many computer vision tasks but has not yet been widely applied in the VO field; the paper therefore proposes a novel deep-learning-based end-to-end VO algorithm and empirically demonstrates its viability.<br />
<br />
== Related Work ==<br />
<br />
Visual odometry algorithms can be grouped into two main categories. The first is known as the conventional methods, and they are based on established principles of geometry. Specifically, an object’s position and orientation (pose) are obtained by identifying reference points and calculating how those points change over the image sequence. Algorithms in this category can be further divided into two: sparse feature based methods and direct methods, which differ in the method employed to select reference points. Sparse feature based methods establish reference points using image salient features such as corners and edges [8]. Direct methods, on the other hand, make use of the whole image and consider every pixel as a reference point [11]. Recently, semi-direct methods that combine the benefits of both approaches are gaining popularity [16].<br />
<br />
Today, most of the state-of-the-art VO algorithms belong to the geometry family. However, they suffer from significant limitations. For example, direct methods assume “photometric consistency” [11]. Sparse feature based methods are also prone to “drifting” because of outliers and noise. As a result, the paper argues that geometry-based methods are difficult to engineer and calibrate, limiting their practicality. Figure 1 illustrates the general architecture of geometry-based algorithms. It outlines necessary drift correction techniques such as Camera Calibration, Feature Detection, Feature Matching (tracking), Outlier Rejection, Motion Estimation, Scale Estimation, and Local optimization (bundle adjustment).<br />
<br />
[[File:DeepVO_Figure_1.png | center]]<br />
<br />
<div align="center">Figure 1. Architectures of the conventional geometry-based monocular VO method.</div><br />
<br />
The second category of VO algorithms is based on learning. Namely, they try to learn an object’s motion model from labeled optical flows. Initially, these models were trained using classic Machine Learning techniques such as k-nearest neighbors (KNNs) [15], Gaussian Processes [16], and Support Vector Machines [17]. However, these models were inefficient at handling highly non-linear and high-dimensional inputs, leading to poor performance in comparison with geometry-based methods. More recently, deep learning based approaches have come to dominate research and are producing many promising results. For example, Convolutional Neural Network (CNN) based models can now recognize places based on appearance [18] and detect direction and velocity from stereo inputs [20]. Moreover, a deep learning model has even achieved robust VO with blurred and under-exposed images [21]. While these successes are encouraging, the authors observe that a CNN based architecture is “incapable of modeling sequential information.” Instead, they propose using Recurrent Neural Networks (RNNs) to tackle this problem.<br />
<br />
== End-to-End Visual odometry through RCNN ==<br />
<br />
=== Architecture Overview ===<br />
An end-to-end monocular VO model is proposed by utilizing a deep Recurrent Convolutional Neural Network (RCNN). Figure 2 depicts the end-to-end model, which comprises three main stages. First, the model takes a monocular video as input and pre-processes the image sequences by “subtracting the mean RGB values of all frames” from each frame. Then, consecutive images are stacked to form tensors, which become the inputs for the CNN stage. The purpose of the CNN stage is to extract salient features from the image tensors. The structure of the CNN is inspired by FlowNet [24], which is a model designed to extract optical flows. Details of the CNN structure are shown in Table 1. In this architecture, the size of the receptive fields in the network is gradually reduced from 7x7 to 5x5 and then to 3x3 to capture small interesting features. Zero-padding is introduced either to adapt to the configurations of the receptive fields or to preserve the spatial dimension of the tensor after convolution. The CNN takes raw RGB images as input. The output is a compressed representation of the features of optical flow. Using the CNN optical flow features as input, the RNN stage tries to estimate the temporal and sequential relations among the features. The RNN stage does this by utilizing two Long Short-Term Memory (LSTM) networks, which estimate object poses for each time step using both long-term and short-term dependencies. Figure 3 illustrates the RNN architecture.<br />
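The pre-processing stage can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the video is already loaded as an array of shape (T, H, W, 3), that "stacking" means concatenating each pair of consecutive frames along the channel axis (as in FlowNet-style inputs), and the function name <code>preprocess_and_stack</code> is hypothetical.<br />

```python
import numpy as np

def preprocess_and_stack(frames):
    """Subtract the mean RGB value of all frames, then stack consecutive pairs.

    frames: array of shape (T, H, W, 3).
    Returns an array of shape (T - 1, H, W, 6), where entry k holds
    frame k and frame k + 1 concatenated along the channel axis
    (pairwise stacking is an assumption of this sketch).
    """
    frames = np.asarray(frames, dtype=np.float64)
    mean_rgb = frames.mean(axis=(0, 1, 2))        # one mean per colour channel
    centered = frames - mean_rgb                  # broadcast over T, H, W
    return np.concatenate([centered[:-1], centered[1:]], axis=-1)

video = np.random.default_rng(0).uniform(0, 255, size=(5, 4, 6, 3))
tensors = preprocess_and_stack(video)             # shape (4, 4, 6, 6)
```

Each stacked tensor would then be passed through the CNN stage, so a sequence of T frames yields T - 1 feature vectors for the recurrent stage.<br />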
<br />
Without the LSTM framework, RNNs often experience vanishing or exploding gradients. If the gradient is small and the network is deep, then by the time it is propagated to the earlier layers during the backward pass it is often too small to have an effect on the weights. This forces standard RNN architectures to be relatively shallow in time. In other words, the weight update for recent events has a much larger effect on the network weights than the update for events that happened long ago. Visual odometry is a very complex problem, so the network must learn highly complex functions. Hence, to circumvent the vanishing gradient issue, the authors use LSTM units. An LSTM can handle long-term dependencies and has a deep temporal structure, but it needs depth across network layers to learn complex high-level representations. An LSTM defines three gates (forget gate, input gate, and output gate) to better capture long-term dependencies. Deep RNNs have been shown to perform well on complex dynamic representations (e.g. speech recognition), so the authors stack multiple LSTM layers to mitigate vanishing gradients without losing the network's ability to represent complex dynamics.<br />
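To make the gating idea concrete, here is a minimal sketch of a single step of a standard LSTM cell in NumPy. It is a generic textbook formulation, not the exact variant or parameterization used in the paper; the packing of all gate weights into one matrix <code>W</code>, the gate ordering, and the name <code>lstm_step</code> are assumptions of this sketch.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of a standard LSTM cell.

    x: input vector of size D; h_prev, c_prev: previous hidden and cell
    state of size H; W: weights of shape (4 * H, D + H); b: bias of
    shape (4 * H,). Gate order inside W and b is an assumed convention.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate cell content
    c = f * c_prev + i * g         # additive memory update
    h = o * np.tanh(c)
    return h, c

D, H = 3, 2
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * H, D + H)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```

The additive update of the cell state is what lets gradients flow over many time steps. Stacking two such layers, as the summary describes, would feed the hidden state <code>h</code> of the first layer as the input <code>x</code> of the second at every time step.<br />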
<br />
[[File:DeepVO_Figure_2.png | center]]<br />
<div align="center">Figure 2. Architectures of the proposed RCNN based monocular VO system.</div><br />
<br />
[[File:DeepVO_Table_1.png | center]]<br />
<div align="center">Table 1. CNN structure</div><br />
<br />
[[File:DeepVO_Figure_3.png | center]]<br />
<div align="center">Figure 3. Folded and unfolded LSTMs and its internal structure.</div><br />
<br />
=== Training and Optimization ===<br />
The proposed RCNN model can be represented as a conditional probability of poses <math> Y_{t} = (y_{1},\dots,y_{t}) </math> given an image sequence <math> X_{t} = (x_{1},\dots,x_{t}) </math>: <br />
<br />
\begin{align}<br />
p(Y_{t}|X_{t}) = p(y_{1},...,y_{t}|x_{1},...,x_{t})<br />
\end{align}<br />
<br />
To find the optimal parameters, the deep neural network (DNN) maximizes this conditional probability:<br />
<br />
\begin{align}<br />
\theta^{*}=\underset{\theta}{\operatorname{argmax}}\; p(Y_{t}|X_{t};\theta)<br />
\end{align}<br />
<br />
To learn the parameters <math>\theta</math> of the DNN, the Euclidean distance between the ground truth pose <math>(p_k,\varphi_k)</math> at time <math>k</math> and its estimate <math>(\hat{p}_k,\hat{\varphi}_k)</math> is minimized. The loss function is the mean squared error (MSE) over all positions <math>p</math> and orientations <math>\varphi</math>:<br />
<br />
\begin{align}<br />
\theta^{*}=\underset{\theta}{\operatorname{argmin}}\;\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{t}||\hat{p}_{k}-p_{k}||_{2}^{2}+\kappa||\hat{\varphi}_{k}-\varphi_{k}||_{2}^{2}<br />
\end{align}<br />
<br />
where <math>\|\cdot\|</math> is the <math>L_{2}</math> norm, <math>\kappa</math> (100 in the experiments) is a scale factor to balance the weights of positions and orientations, <math>N</math> is the number of samples, and the orientation <math>\varphi</math> is represented by Euler angles.<br />
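The loss above can be written directly in NumPy. This is a small sketch for illustration: the array layout (N samples, t time steps, 3 components each for position and Euler angles) and the function name <code>pose_mse_loss</code> are assumptions, while the default <math>\kappa = 100</math> follows the value quoted above.<br />

```python
import numpy as np

def pose_mse_loss(p_hat, p, phi_hat, phi, kappa=100.0):
    """Pose loss: squared position error plus kappa times squared
    orientation error, summed over time steps and averaged over samples.

    All arrays are assumed to have shape (N, t, 3).
    """
    pos_err = np.sum((p_hat - p) ** 2, axis=-1)        # ||p_hat - p||^2 per step
    ori_err = np.sum((phi_hat - phi) ** 2, axis=-1)    # ||phi_hat - phi||^2 per step
    n_samples = p.shape[0]
    return np.sum(pos_err + kappa * ori_err) / n_samples

N, t = 2, 4
rng = np.random.default_rng(0)
p, phi = rng.normal(size=(N, t, 3)), rng.normal(size=(N, t, 3))
loss = pose_mse_loss(p + 0.1, p, phi + 0.01, phi)
```

Because orientation errors in radians tend to be numerically much smaller than position errors in metres, the factor <math>\kappa</math> keeps the two terms on a comparable scale.<br />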
<br />
== Experiments and Results ==<br />
The paper evaluates the proposed RCNN VO model by comparing it empirically with the open-source VO library of LIBVISO2 [7], which is a well-known geometry based model. The comparison is done using the KITTI VO/SLAM benchmark [3], which contains 22 image sequences, 11 of which are labeled with ground truths. Two separate experiments are performed. <br />
<br />
1. Quantitative analysis is performed using only the labeled image sequences. Namely, 4 of the 11 labeled image sequences are used for training and the others are reserved for testing. Table 2 and Figure 6 outline the results, showing that the proposed RCNN model performs consistently better than the monocular VISO2_M model. However, it performs worse than the stereo VISO2_S model.<br />
<br />
<br />
[[File:DeepVO_Table_2.png |500px| center]]<br />
<br />
<br />
[[File:DeepVO_Figure_6.png |500px| center]]<br />
<br />
<br />
2. The generalizability of the proposed RCNN model is evaluated using the unlabeled image sequences. Figure 8 outlines the test result, showing that the proposed model is able to generalize better than the monocular VISO2_M model and performs roughly the same as the stereo VISO2_S model.<br />
<br />
<br />
[[File:DeepVO_Figure_8.png |600px| center]]<br />
<br />
== Conclusions ==<br />
The paper presents a new RCNN VO model that combines CNNs with RNNs in a deep RCNN. It achieves representation learning and sequential modeling of monocular VO at the same time. Although it is considered a viable approach, it is not expected to replace the classic geometry-based approach. However, the experimental results suggest that it can be a viable complement: combining geometry with the representations, knowledge, and models learned by DNNs could further improve VO's accuracy and robustness. The main contribution of the paper is threefold: <br />
<br />
# The authors demonstrate that the monocular VO problem can be addressed in an end-to-end fashion based on DL, i.e., directly estimating poses from raw RGB images. Neither prior knowledge nor parameter is needed to recover the absolute scale. <br />
# The authors propose an RCNN architecture enabling the DL based VO algorithm to be generalised to totally new environments by using the geometric feature representation learned by the CNN. <br />
# Sequential dependence and complex motion dynamics of an image sequence, which are of importance to the VO but cannot be explicitly or easily modeled by a human, are implicitly encapsulated and automatically learnt by the RCNN.<br />
<br />
== Critiques ==<br />
<br />
This paper cannot be considered a critical advance in the state of the art, as the authors merely suggest a method combining CNNs and RNNs for the visual odometry problem. The authors themselves state that deep learning, in the form of simple feed-forward neural networks and CNNs, has already been applied to this problem; only an RNN approach seems not to have been tried. Towards the end of the paper the authors propose combining the RCNN with geometry-based approaches, but it is not intuitive how these two potentially very different methods could be combined, and the authors do not explain any method for doing so. The authors do not build a compelling case against the state-of-the-art methods, nor do they convincingly prove the superiority of the RCNN or of a combined method. For example, both the RCNN and state-of-the-art geometry-based methods suffer from lower accuracy when the images show a large open area, as the authors mention. The authors put forth some techniques to solve this problem for the geometry-based approaches, but they state that they have no similar method for the deep-learning-based approaches. Thus, in such scenarios, the methods proposed by the authors do not seem to work at all. <br />
<br />
The paper advances the field of deep-learning based VO by creating a pioneering end-to-end model that is capable of extracting features and learning sequential dynamics from monocular videos. While the new model clearly outperforms the LIBVISO2_M algorithm, it fails to demonstrate any advantage over the LIBVISO2_S algorithm. Hence, one may question whether the complexity of deep-learning based monocular VO methods is justified and whether designers of robots or autonomous vehicles should opt for stereo vision wherever possible. Nonetheless, this end-to-end model is beneficial for situations where monocular VO is the only viable option. Furthermore, the paper would have benefited from a qualitative comparison of the algorithm’s computation requirements, such as hardware specification, engineering time, and training time. Although the justification for the input sequence pre-processing is not explained completely, it can be attributed to the use of standard pre-processing techniques such as mean subtraction and normalization, which make the cost function easier to optimize. Future work could involve adapting the model for real-time visual odometry.<br />
<br />
== Other Sources ==<br />
# Code (not original authors) can be found at [https://github.com/sladebot/deepvo] and [https://github.com/themightyoarfish/deepVO].<br />
# Presentation slides can be found here [https://www.slideshare.net/JackyLiu40/deepvo-towards-visual-odometry-with-deep-learning].<br />
<br />
== References ==<br />
[1] S. Wang, R. Clark, H. Wen and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 2043-2050.<br />
<br />
[2] M. Maimone, Y. Cheng, and L. Matthies, "Two years of Visual Odometry on the Mars Exploration Rovers," Journal of Field Robotics. 24 (3): 169–186, 2007.<br />
<br />
[3] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.<br />
<br />
[7] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3D reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011.<br />
<br />
[8] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.<br />
<br />
[11] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2320–2327.<br />
<br />
[15] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch, “Memory-based learning for visual odometry,” in Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 47–52.<br />
<br />
[16] V. Guizilini and F. Ramos, “Semi-parametric learning for visual odometry,” The International Journal of Robotics Research, vol. 32, no. 5, pp. 526–546, 2013.<br />
<br />
[17] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci, “Evaluation of non-geometric methods for visual odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014.<br />
<br />
[18] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” in Proceedings of Robotics: Science and Systems (RSS), 2015.<br />
<br />
[20] A. Kendall, M. Grimes, and R. Cipolla, “Convolutional networks for real-time 6-DoF camera relocalization,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.<br />
<br />
[21] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring representation learning with CNNs for frame-to-frame ego-motion estimation,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp.18–25, 2016.<br />
<br />
[24] A. Dosovitskiy, P. Fischer, E. Ilg, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox et al., “Flownet: Learning optical flow with convolutional networks,” in Proceedings of IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, pp. 2758–2766.<br />
<br />
[25]http://cs231n.github.io/neural-networks-2/</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Synthesizing_Programs_for_Images_usingReinforced_Adversarial_Learning&diff=42375Synthesizing Programs for Images usingReinforced Adversarial Learning2018-12-11T00:05:12Z<p>Msminhas: Editorial</p>
<hr />
<div>'''Synthesizing Programs for Images using Reinforced Adversarial Learning: ''' Summary of the ICML 2018 paper <br />
<br />
Paper: [[http://proceedings.mlr.press/v80/ganin18a.html]]<br />
Video: [[https://www.youtube.com/watch?v=iSyvwAwa7vk&feature=youtu.be]]<br />
<br />
== Presented by ==<br />
<br />
1. Nekoei, Hadi [Quest ID: 20727088]<br />
<br />
= Motivation =<br />
<br />
Conventional neural generative models have major problems. <br />
<br />
* It is not clear how to inject knowledge about the data into the model. <br />
<br />
* The latent space is not easily interpretable.<br />
<br />
The solution proposed in this paper is to generate programs that drive existing tools (e.g., graphics editors, illustration software, CAD), thereby '''creating a more meaningful API (a sequence of complex actions rather than raw pixels)'''.<br />
<br />
= Introduction =<br />
<br />
Humans frequently rely on the ability to recover structured representations from raw sensations in order to understand their environment. Examples include decomposing a picture of a hand-written character into strokes and understanding the layout of a building. Studying this ability can also shed light on how the brain actually works. <br />
<br />
In the visual domain, inversion of a renderer for the purposes of scene understanding is typically referred to as inverse graphics. However, training vision systems using the inverse graphics approach has remained a challenge. Renderers typically expect as input programs that have sequential semantics, are composed of discrete symbols (e.g., keystrokes in a CAD program), and are long (tens or hundreds of symbols). Additionally, matching rendered images to real data poses an optimization problem as black-box graphics simulators are not differentiable in general. <br />
<br />
To address these problems, the paper presents a new approach for interpreting and generating images using deep reinforced adversarial learning, which removes the need for large amounts of supervision and scales to larger real-world datasets. In this approach, an adversarially trained agent '''(SPIRAL)''' generates a program which is executed by a graphics engine to generate images, either conditioned on data or unconditionally. The agent is rewarded for fooling a discriminator network and is trained with distributed reinforcement learning without any extra supervision. The discriminator network itself is trained to distinguish between generated and real images.<br />
<br />
[[File:Fig1 SPIRAL.PNG | 400px|center]]<br />
<br />
== Related Work ==<br />
Related work in this field is summarized as follows:<br />
* There have been many studies on inverting simulators to interpret images (Nair et al., 2008; Paysan et al., 2009; Mansinghka et al., 2013; Loper & Black, 2014; Kulkarni et al., 2015a; Jampani et al., 2015)<br />
<br />
* Inferring motor programs for reconstruction of MNIST digits (Nair & Hinton, 2006)<br />
<br />
* Visual program induction in the context of hand-written characters on the OMNIGLOT dataset (Lake et al., 2015)<br />
<br />
* Inferring and learning feed-forward or recurrent procedures for image generation (LeCun et al., 2015; Hinton & Salakhutdinov, 2006; Goodfellow et al., 2014; Ackley et al., 1987; Kingma & Welling, 2013; Oord et al., 2016; Kulkarni et al., 2015b; Eslami et al., 2016; Reed et al., 2017; Gregor et al., 2015).<br />
<br />
'''However, all of these methods have limitations such as:''' <br />
<br />
* Difficulty scaling to larger real-world datasets<br />
<br />
* A need for hand-crafted parses and supervision in the form of sketches and corresponding images<br />
<br />
* An inability to infer structured representations of images<br />
<br />
= The SPIRAL Agent =<br />
=== Overview ===<br />
The paper aims to construct a generative model <math>\mathbf{G}</math> capable of sampling from a target data distribution <math>p_{d}</math>. The generative model consists of a recurrent network <math>\pi</math> (called the policy network or agent) and an external rendering simulator R that accepts a sequence of commands from the agent and maps them into the domain of interest; e.g., R could be a CAD program rendering descriptions of primitives into 3D scenes. <br />
To train the policy network <math>\pi</math>, the paper uses the generative adversarial network (GAN) framework. In this framework, the generator tries to fool a discriminator network, which is trained to distinguish between real and fake samples. As a result, the distribution generated by <math>\mathbf{G}</math> approaches <math>p_d</math>.<br />
<br />
== Objectives ==<br />
The authors give the training objectives for <math>\mathbf{G}</math> and <math>\mathbf{D}</math> as follows.<br />
<br />
'''Discriminator:''' Following (Gulrajani et al., 2017), the objective for <math>\mathbf{D}</math> is defined as: <br />
<br />
\begin{align}<br />
\mathcal{L}_D = -\mathbb{E}_{x\sim p_d}[D(x)] + \mathbb{E}_{x\sim p_g}[D(x)] + R<br />
\end{align}<br />
<br />
where <math>\mathbf{R}</math> is a regularization term softly constraining <math>\mathbf{D}</math> to stay in the set of Lipschitz continuous functions (for some fixed Lipschitz constant).<br />
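<br />
As a rough illustration of the discriminator objective, the following sketch evaluates <math>\mathcal{L}_D</math> on a batch of discriminator scores. The function name and the choice to pass the regularization term in as a precomputed number are assumptions made for this example; the summary does not specify how the regularizer is computed.<br />

```python
import numpy as np


def discriminator_loss(d_real, d_fake, regularizer=0.0):
    """Batch estimate of L_D = -E[D(x_real)] + E[D(x_fake)] + R.

    d_real: discriminator scores on samples from p_d
    d_fake: discriminator scores on samples from p_g
    regularizer: value of the regularization term R, assumed to be
        computed elsewhere (hypothetical interface for this sketch)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return -d_real.mean() + d_fake.mean() + regularizer
```

The loss decreases when real samples receive high scores and generated samples receive low scores, which is the behaviour the discriminator is trained towards.<br />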
<br />
'''Generator:''' To define the objective for <math>\mathbf{G}</math>, a variant of the REINFORCE (Williams, 1992) algorithm, advantage actor-critic (A2C) is employed:<br />
<br />
<br />
\begin{align}<br />
\mathcal{L}_G = -\sum_{t}\log\pi(a_t|s_t;\theta)[R_t - V^{\pi}(s_t)]<br />
\end{align}<br />
<br />
<br />
where <math>V^{\pi}</math> is an approximation to the value function, which is treated as independent of <math>\theta</math>, and <math>R_{t} = \sum_{k=t}^{N}r_{k}</math> is a 1-sample Monte-Carlo estimate of the return. Rewards are set to:<br />
<br />
<math><br />
r_t = \begin{cases}<br />
0 & \text{for } t < N \\<br />
D(\mathcal{R}(a_1, a_2, \cdots, a_N)) & \text{for } t = N<br />
\end{cases}<br />
</math><br />
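<br />
The generator objective and the terminal-only reward can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the log-probabilities and value estimates are passed in as plain arrays, and no gradients are computed (in a real implementation the value estimate would be treated as a constant when differentiating with respect to <math>\theta</math>).<br />

```python
import numpy as np


def returns_from_rewards(rewards):
    """Undiscounted reward-to-go: R_t = sum of r_k for k = t..N."""
    rewards = np.asarray(rewards, dtype=float)
    # Reverse, accumulate, and reverse back to get suffix sums.
    return np.cumsum(rewards[::-1])[::-1]


def generator_loss(log_probs, values, rewards):
    """L_G = -sum_t log pi(a_t | s_t) * (R_t - V(s_t))."""
    advantages = returns_from_rewards(rewards) - np.asarray(values, dtype=float)
    return -np.sum(np.asarray(log_probs, dtype=float) * advantages)


# Terminal-only reward: zero before the last step, and the
# discriminator score of the final render at step N.
final_score = 0.5
rewards = [0.0, 0.0, final_score]
print(returns_from_rewards(rewards))  # every step sees the final score
```

Because only the last reward is non-zero, every time step receives the same return, so the final discriminator score is credited to all actions in the episode.<br />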
<br />
<br />
<br />
One interesting aspect of this new formulation is that the search can be biased by introducing intermediate rewards which may depend not only on the output of R but also on commands used to generate that output.<br />
<br />
== Conditional generation: ==<br />
In some cases such as producing a given image <math>x_{target}</math>, conditioning the model on auxiliary inputs is useful. That can be done by feeding <math>x_{target}</math> to both policy and discriminator networks as:<br />
<math><br />
p_g = R(p_a(a|x_{target}))<br />
</math><br />
<br />
In this setting, <math>p_{d}</math> becomes a Dirac-<math>\delta</math> function centered at <math>x_{target}</math>, and the first two terms in the objective function for D reduce to <br />
<math><br />
-D(x_{target}|x_{target})+ \mathbb{E}_{x\sim p_g}[D(x|x_{target})] <br />
</math><br />
<br />
It can be shown that for this particular setting of <math>p_{g}</math> and <math>p_{d}</math>, the <math>l^2</math>-distance is an optimal discriminator. However, it is in general not the only solution of the objective function for D, and it may be a poor candidate for the reward signal of the generator.<br />
<br />
===Traditional GAN generation: ===<br />
Traditional GANs use the following minimax objective function to quantify optimality for relationships between D and G:<br />
<br />
[[File:edit7.png| 400px|center]]<br />
<br />
Minimizing the Jensen-Shannon divergence between the two distributions often leads to vanishing gradients as the discriminator saturates. The authors circumvent this issue by using the Wasserstein-style discriminator objective given in the Objectives section (Gulrajani et al., 2017), which is much better behaved.<br />
<br />
== Distributed Learning: ==<br />
The training pipeline is outlined in Figure 2b. It is an extension of the recently proposed '''IMPALA''' architecture (Espeholt et al., 2018). For training, three kinds of workers are defined:<br />
<br />
<br />
* Actors are responsible for generating the training trajectories through interaction between the policy network and the rendering simulator. Each trajectory contains a sequence <math>((\pi_{t}; a_{t}) | 1 \leq t \leq N)</math> as well as all intermediate renderings produced by R.<br />
<br />
<br />
* A policy learner receives trajectories from the actors, combines them into a batch and updates <math>\pi</math> by performing an '''SGD''' step on <math>\mathcal{L}_G</math> (2). Following common practice (Mnih et al., 2016), <math>\mathcal{L}_G</math> is augmented with an entropy penalty encouraging exploration.<br />
<br />
<br />
* In contrast to the base '''IMPALA''' setup, an additional discriminator learner is defined. This worker consumes random examples from <math>p_{d}</math>, as well as generated data (final renders) coming from the actor workers, and optimizes <math>\mathcal{L}_D</math> (1).<br />
<br />
[[File:Fig2 SPIRAL Architecture.png | 700px|center]]<br />
<br />
'''Note:''' no trajectories are omitted in the policy learner. Instead, the <math>D</math> updates are decoupled from the <math>\pi</math> updates by introducing a replay buffer that serves as a communication layer between the actors and the discriminator learner. This allows the latter to optimize <math>D</math> at a higher rate than the policy network is trained, which is possible because of the difference in network sizes (<math>\pi</math> is a multi-step RNN, while <math>D</math> is a plain '''CNN'''). Even though sampling from a replay buffer inevitably results in smoothing of <math>p_{g}</math>, this setup is found to work well in practice.<br />
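<br />
The role of the replay buffer can be illustrated with a minimal sketch. The class below is hypothetical and not taken from the paper or its code: it assumes a fixed-capacity buffer in which the oldest renders are dropped first and batches are sampled uniformly without replacement.<br />

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of final renders produced by the actors."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item when full.
        self._items = deque(maxlen=capacity)

    def add(self, render):
        """Called by actors when an episode finishes."""
        self._items.append(render)

    def sample(self, batch_size):
        """Called by the discriminator learner, at its own pace."""
        items = list(self._items)
        return random.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._items)
```

Since the discriminator learner only reads from the buffer, it can take several optimization steps per policy update without waiting for fresh trajectories. The buffer mixes renders from slightly older policies, which is the smoothing of <math>p_{g}</math> mentioned above.<br />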
<br />
= Experiments=<br />
<br />
<br />
== Environments ==<br />
Two rendering environments are introduced. For MNIST, OMNIGLOT and CELEBA generation, the open-source painting library ''libmypaint'' (libmypaint contributors, 2018) is used. The agent controls a brush and produces a sequence of (possibly disjoint) strokes on a canvas <math>C</math>. The state of the environment comprises the contents of <math>C</math> as well as the current brush location <math>l_{t}</math>. Each action <math>a_{t}</math> is a tuple of 8 discrete decisions <math>(a_t^1; a_t^2; ... ; a_t^8)</math> (see Figure 3). The first two components are the control point <math>p_{c}</math> and the endpoint <math>l_{t+1}</math> of the stroke.<br />
<br />
[[File:Fig3_agent_action_space.PNG | 450px|center]]<br />
<br />
The next 5 components represent the appearance of the stroke: the pressure that the agent applies to the brush (10 levels), the brush size, and the stroke color characterized by a mixture of red, green and blue (20 bins for each color component). The last element of <math>a_{t}</math> is a binary flag specifying the type of action: the agent can choose either to produce a stroke or to jump right to <math>l_{t+1}</math>.<br />
<br />
In the MUJOCO SCENES experiment, images are rendered using a MuJoCo-based environment (Todorov et al., 2012). At each time step, the agent has to decide on the object type (4 options), its location on a 16 <math>\times</math> 16 grid, its size (3 options) and the color (3 color components with 4 bins each). The resulting tuple is sent to the environment, which adds an object to the scene according to the specification.<br />
<br />
== Datasets ==<br />
<br />
=== MNIST ===<br />
For the MNIST dataset, two sets of experiments are conducted:<br />
<br />
1- In the first experiment, an unconditional agent is trained to model the data distribution. Along with the reward provided by the discriminator, the agent receives a small negative reward for each continuous sequence of strokes it starts, which encourages it to draw a digit in a single continuous motion of the brush. An example of such generation is depicted in Fig 4a. <br />
<br />
2- In the second experiment, an agent is trained to reproduce a given digit. <br />
Several examples of conditional generated digits are shown in Fig 4b. <br />
<br />
[[File:Fig4a MNIST.png | 450px|center]]<br />
<br />
=== OMNIGLOT ===<br />
The agents are now tested in a similar but more challenging setting of handwritten characters. As can be seen in Fig 5a, unconditional generation has lower quality than for the digits in the previous dataset. The conditional agents, on the other hand, reach a convincing quality (Fig 5b). Moreover, since OMNIGLOT contains many different symbols, the model learns a general notion of image reproduction rather than memorizing the training data. This was tested by feeding new, unseen line drawings to the trained agent, which produced excellent results, as shown in Figure 6. <br />
<br />
[[File:Fig5 OMNIGLOT.png | 450px|center]]<br />
<br />
<br />
For the MNIST dataset, two kinds of rewards, the discriminator score and the <math>l^{2}</math>-distance, have been compared. Note that the discriminator-based approach has a significantly lower training time and a lower final <math>l^{2}</math> error.<br />
Following (Sharma et al., 2017), a “blind” version of the agent, which does not receive any intermediate canvas states as input to <math>\pi</math>, is also trained. The training curve for this experiment is reported in Fig 8a (dotted blue line), together with the results of training agents with the discriminator-based and <math>l^{2}</math>-distance rewards.<br />
<br />
=== CELEBA ===<br />
<br />
Since the ''libmypaint'' environment is also capable of producing complex color paintings, this direction is explored by training a conditional agent on the CELEBA dataset. In this experiment, the agent does not receive any intermediate rewards. In addition to the reconstruction reward (either <math>l^2</math> or discriminator-based), the earth mover’s distance between the color histograms of the model’s output and <math>x_{target}</math> is penalized (Figure 7).<br />
<br />
[[File:Fig6 CELEBA.png | 450px|center]]<br />
<br />
Although blurry, the model’s reconstruction closely matches the high-level structure of each image, such as the background color, the position of the face, and the color of the person’s hair. In some cases, shadows around the eyes and the nose are visible.<br />
<br />
=== MUJOCO SCENES ===<br />
<br />
For the MUJOCO SCENES dataset, the trained agent is used to construct simple CAD programs that best explain input images. Only conditional generation is considered here. As before, the reward function for the generator can be either the <math>l^2</math> score or the discriminator output, and there are no auxiliary reward signals. Because the agent is unrolled for 20 time steps, the model has the capacity to infer and represent up to 20 objects and their attributes.<br />
<br />
As shown in Figure 8b, the agent trained to directly minimize <math>l^2</math> is unable to solve the task and has significantly higher pixel-wise error. In comparison, the discriminator-based variant solves the task and produces near-perfect reconstructions on a holdout set (Figure 10).<br />
<br />
[[File:Fig8 MUJOCO_SCENES.png | 500px|center]]<br />
For this experiment, the total number of possible execution traces is <math>M^N</math>, where <math>M = 4 \cdot 16^2 \cdot 3 \cdot 4^3 \cdot 3 </math> is the total number of attribute settings for a single object and <math>N = 20</math> is the length of an episode. As a baseline, a general-purpose Metropolis-Hastings inference algorithm that samples an execution trace defining attributes for a maximum of 20 primitives was run on a set of 100 images. These attributes are treated as latent variables. During each time step of inference, the block of attributes (including the presence/absence flag) corresponding to a single object is flipped uniformly over the appropriate range. The resulting trace is rendered by the environment into an output sample, which is then accepted or rejected using the Metropolis-Hastings update rule, with a Gaussian likelihood centered on the test image and a fixed diagonal covariance of 0.25. As Figure 9 shows, the MCMC search baseline cannot solve the task even after a large number of evaluations.<br />
[[File:figure9 mcmc.PNG| 500px|center]]<br />
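<br />
To get a feel for the size of this search space, the count of execution traces can be evaluated directly from the formula given above. The variable names are chosen for this example only.<br />

```python
# Attribute settings for a single object, using the factors quoted above.
M = 4 * 16**2 * 3 * 4**3 * 3
# Episode length (number of objects that can be placed).
N = 20

num_traces = M ** N  # Python integers have arbitrary precision
print(M)                     # 589824
print(len(str(num_traces)))  # 116 decimal digits, i.e. roughly 10^115
```

A space of roughly <math>10^{115}</math> traces cannot be enumerated, which is consistent with the difficulty reported for the general-purpose MCMC baseline.<br />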
<br />
= Discussion =<br />
As in the OMNIGLOT experiment, the <math>l^2</math>-based agent demonstrates some improvements over the random policy but gets stuck and, as a result, fails to learn sensible reconstructions (Figure 8b).<br />
<br />
[[File:Fig7 Results.png | 500px|center]]<br />
<br />
<br />
Scaling visual program synthesis to real-world and combinatorial datasets has been a challenge. The paper shows that it is possible to train an adversarial generative agent that employs a black-box rendering simulator. The results indicate that using the Wasserstein discriminator’s output as a reward function with asynchronous reinforcement learning can provide a scaling path for visual program synthesis. The current exploration strategy used in the agent is entropy-based, and future work should address this limitation by employing more sophisticated search algorithms for policy improvement. For instance, Monte Carlo Tree Search could be used, analogous to AlphaGo Zero (Silver et al., 2017). General-purpose inference algorithms could also be used for this purpose.<br />
<br />
= Critique and Future Work =<br />
* The architecture isn't new, but it's a nice application and it's fun to watch the video of the robot painting in real life. SPIRAL's GAN-like idea continues the vein of [https://arxiv.org/abs/1610.01945 connecting actor-critic RL with GANs], like "[https://arxiv.org/abs/1706.03741 Deep reinforcement learning from human preferences]" (Christiano et al., 2017) or GAIL.<br />
<br />
* Future work should explore different parameterizations of action spaces. For instance, the use of two arbitrary control points is perhaps not the best way to represent strokes, as it is hard to deal with straight lines. Actions could also directly parametrize 3D surfaces, planes, and learned texture models to invert richer visual scenes. <br />
<br />
* A potential application can be in the field of medical images specifically to enhance and recolor histopathology slides for better detection. Also, image restoration problems could be addressed based on these approaches.<br />
<br />
* On the reward side, using a joint image-action discriminator similar to BiGAN/ALI (Donahue et al., 2016; Dumoulin et al., 2016) (in this case, the policy can be viewed as an encoder, while the renderer becomes a decoder) could result in a more meaningful learning signal, since D will be forced to focus on the semantics of the image.<br />
<br />
= Other Resources =<br />
#Code implementation [https://github.com/carpedm20/SPIRAL-tensorflow]<br />
<br />
= References =<br />
<br />
# Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S.M. Ali Eslami, Oriol Vinyals, [[https://arxiv.org/abs/1804.01118]].</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity&diff=42374stat946F18/differentiableplasticity2018-12-11T00:01:56Z<p>Msminhas: Technical,Editorial</p>
<hr />
<div>'''Differentiable Plasticity: ''' Summary of the ICML 2018 paper https://arxiv.org/abs/1804.02464<br />
<br />
= Presented by =<br />
<br />
1. Ganapathi Subramanian, Sriram [Quest ID: 20676799]<br />
<br />
= Motivation =<br />
Machine Learning models often employ extensive training over a massive dataset of training examples in order to learn a single complex task very well. However, biological agents contrast this learning style by exhibiting a remarkable ability to learn quickly and efficiently from ongoing experience. <br />
<br />
1. Neural Networks naturally have a static architecture. Once a Neural Network is trained, the network architecture components (e.g., network connections) cannot be changed, so learning effectively stops with the training step. If a different task needs to be considered, then the agent must be trained again from scratch. <br />
<br />
2. Plasticity is a characteristic of biological systems, including humans, that allows network connections to change over time. For instance, animals can learn to navigate and remember the location of, and optimal path to, food sources. This enables lifelong learning in biological systems and thus allows adaptation to dynamic changes in the environment with great sample efficiency. This is called synaptic plasticity, which is based on Hebb's rule (i.e., if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened). Neural networks are very far from achieving synaptic plasticity. <br />
<br />
3. Differentiable plasticity is a step in this direction. The behavior of the plastic connection is trained using gradient descent so that the previously trained networks can adapt to changing conditions thus mimicking the dynamic learning of rewarding or detrimental behavior.<br />
<br />
Example: Using current state-of-the-art supervised learning, we can train a Neural Network to recognize the specific letters that it has seen during training. Using lifelong learning, the agent can develop knowledge about any alphabet, including those that it has never been exposed to during training.<br />
<br />
= Objectives =<br />
The paper has the following objectives: <br />
<br />
1. To tackle the problem of meta-learning (learning to learn). <br />
<br />
2. To design neural networks with plastic connections that can be trained by gradient descent through backpropagation. <br />
<br />
3. To use backpropagation to optimize both the base weights and the amount of plasticity in each connection. <br />
<br />
4. To demonstrate the performance of such networks in three different and complex domains, namely complex pattern memorization, one-shot classification, and reinforcement learning.<br />
<br />
= Important Terms =<br />
<br />
Hebb’s rule: This is a famous rule in neuroscience. It defines the relationship of activities between neurons with their connection. It states that if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened. Also summarized as "neurons that fire together, wire together".<br />
<br />
= Related Work =<br />
<br />
Previous Approaches to solving this problem are summarized below: <br />
<br />
1. Train standard recurrent neural networks to incorporate past experience in their future responses within each episode. To improve its learning abilities, the RNN is augmented with an external content-addressable memory bank. An attention mechanism within the controller network reads from and writes to the memory bank and thus enables fast memorization. <br />
<br />
2. Augment each weight with a plastic component that automatically grows and decays as a function of inputs and outputs. All connections have the same non-trainable plasticity, and only the corresponding weights are trained. Recent approaches have tried fast weights, which augment recurrent networks with fast-changing Hebbian weights and compute the activation function at each step. The network has a high bias towards recently seen patterns. <br />
<br />
3. Optimize the learning rule itself, instead of the connections. A parametrized learning rule is used where the structure of the network is fixed beforehand. <br />
<br />
4. Have all the weight updates computed on the fly by the network itself or by a separate network at each time step. The advantage is flexibility; the disadvantage is the large learning burden placed on the network. <br />
<br />
5. Perform gradient descent via backpropagation during the episode. Meta-learning then involves training the base network so that it can be fine-tuned using additional gradient descent. <br />
<br />
6. For classification tasks, the idea of learning a “new object” is analogous to understanding how the embedding of a test example relates to the embeddings of classes known in the test set. Specifically, once we have embeddings to represent a particular class, given new data, we simply extract the embedding of the test sample and connect it to an embedding with a known class (through whichever distance metric we decide to use). Note, however, this does not actually “learn-to-learn”, in that the process of prediction never changes. Embeddings are always held constant, unless the test cases, when classified, are used to redefine the prototypical embedding of a class.<br />
<br />
The advantages of trainable synaptic plasticity as a meta-learning approach are as follows: <br />
<br />
1. Great potential for flexibility. For example, Memory Networks enforce a specific memory storage model in which memories must be embedded in fixed-size vectors and retrieved through some attention mechanism. In contrast, trainable synaptic plasticity translates into very different forms of memory, the exact implementation of which can be determined by the (trainable) network structure.<br />
<br />
2. Fixed-weight recurrent networks, meanwhile, require neurons to be used for both storage and computation, which increases the computational burden on neurons. This is avoided in the approach suggested in the paper. <br />
<br />
3. Non-trainable plasticity networks can exploit network connectivity for storage of short-term information, but their uniform, non-trainable plasticity imposes a stereotypical behavior on these memories. With trainable synaptic plasticity, the amount and rate of plasticity are actively molded by the mechanism itself, which also allows for more sustained memory.<br />
<br />
= Model =<br />
<br />
The formulation proposed in the paper keeps the plastic and non-plastic components of each connection separate, while allowing multiple Hebbian rules to be easily defined. <br />
<br />
Model Components: <br />
<br />
1. A connection between any two neurons <math display = "inline">i</math> and <math display = "inline">j</math> has both a fixed component and a plastic component. <br />
<br />
2. The fixed part is just a traditional connection weight, <math display = "inline">w_{i,j}</math>. The plastic part is stored in a Hebbian trace, <math display = "inline">H_{i,j}</math>, which varies during a lifetime according to ongoing inputs and outputs.<br />
<br />
3. The relative importance of plastic and fixed components in the connection is structurally determined by the plasticity coefficient, <math display = "inline">\alpha_{i,j}</math>, which multiplies the Hebbian trace to form the full plastic component of the connection. <br />
<br />
The network equations for the output <math display = "inline">x_j(t)</math> of the neuron <math display = "inline">j</math> are as follows: <br />
<br />
<br />
<math display="block"><br />
x_j(t) = \sigma \Big\{\displaystyle \sum_{i \in ~\text{inputs}}[w_{i,j}x_i(t-1) + \alpha_{i,j} H_{i,j}(t)x_i(t-1)] \Big\}<br />
</math><br />
<br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = \eta x_i(t-1) x_j(t) + (1 - \eta) H_{i,j}(t) <br />
</math><br />
<br />
Here the first equation gives the activation of neuron <math display = "inline">j</math>, where <math display = "inline">w_{i,j}</math> is the fixed component and the remaining term (<math display = "inline"> \alpha_{i,j} H_{i,j}(t)x_i(t-1) </math>) is the plastic component. The <math display = "inline">\sigma</math> is a nonlinear function, chosen to be tanh in this paper. The <math display = "inline">H_{i,j}</math> in the second equation is updated as a function of ongoing inputs and outputs after being initialized to zero at each episode. In contrast, <math display = "inline">w_{i,j}</math> and <math display = "inline">\alpha_{i,j}</math> are the structural parameters trained by gradient descent and conserved across episodes.<br />
<br />
From the first equation above, a connection is fully fixed if <math display = "inline">\alpha = 0 </math>. Alternatively, a connection is fully plastic if <math display = "inline">w = 0</math>. Otherwise, the connection has both fixed and plastic components. <br />
<br />
<br />
The <math display = "inline">\eta</math> denotes the learning rate of the plastic connections, which is also an optimized parameter of the network. After this training, the agent can learn automatically from ongoing experience. With equation 2, the Hebbian traces decay to 0 in the absence of input. An alternative that avoids this decay is Oja's rule, which gives the following update: <br />
<br />
<br />
<math display="block"><br />
H_{i,j}(t+1) = H_{i,j}(t) + \eta x_j(t)(x_i(t-1) - x_j(t)H_{i,j}(t))<br />
</math><br />
<br />
The Hebbian trace is a representation of concurrent firing of <math>x_j, x_i</math> over past time-steps, and is meant to strengthen the connection between neurons that are often activated together.<br />
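<br />
The equations above can be turned into a short NumPy sketch of a single time step of a plastic recurrent layer. This is an illustration under assumptions, not the authors' implementation: the function name and the <code>rule</code> argument are made up for the example, all matrices are indexed as <code>[i, j]</code> (from neuron <math>i</math> to neuron <math>j</math>), and the training of <math>w</math>, <math>\alpha</math> and <math>\eta</math> by backpropagation is not shown.<br />

```python
import numpy as np


def plastic_step(x_prev, w, alpha, hebb, eta, rule="decay"):
    """One step of a plastic recurrent layer.

    x_prev: activations x_i(t-1), shape (n,)
    w, alpha, hebb: fixed weights, plasticity coefficients and
        Hebbian traces, each of shape (n, n), indexed [i, j]
    eta: learning rate of the Hebbian trace
    rule: "decay" for the decaying Hebbian update, "oja" for the
        alternative update (argument name is hypothetical)
    """
    # x_j(t) = tanh( sum_i (w_ij + alpha_ij * H_ij(t)) * x_i(t-1) )
    x_new = np.tanh(x_prev @ (w + alpha * hebb))

    if rule == "decay":
        # H_ij(t+1) = eta * x_i(t-1) * x_j(t) + (1 - eta) * H_ij(t)
        hebb_new = eta * np.outer(x_prev, x_new) + (1.0 - eta) * hebb
    else:
        # H_ij(t+1) = H_ij(t) + eta * x_j(t) * (x_i(t-1) - x_j(t) * H_ij(t))
        hebb_new = hebb + eta * x_new[None, :] * (
            x_prev[:, None] - x_new[None, :] * hebb
        )
    return x_new, hebb_new
```

Setting <code>alpha</code> to zero recovers an ordinary recurrent step, while setting <code>w</code> to zero leaves a connection driven only by its Hebbian trace, matching the fully fixed and fully plastic cases described above.<br />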
<br />
= Experiment 1 - Binary Pattern Memorization =<br />
<br />
<br />
<br />
This test involves quickly memorizing sets of arbitrary high-dimensional patterns and reconstructing them when exposed to partial, degraded versions. This is a very simple test, as it is already known that hand-designed recurrent networks with Hebbian plastic connections can solve it for binary patterns.<br />
<br />
<br />
<br />
[[File:binarypatternrecog.png | 650px|thumb|center|Figure 1: Pattern Memorization experiment - Input Structure and Architecture]]<br />
<br />
<br />
<br />
'''Steps in the experiment:''' <br />
<br />
1) The network is shown a set of five binary patterns in succession, as shown in Figure 1. Each of these patterns has 1,000 elements, each of which is binary-valued (1 or -1). Here, dark red corresponds to the value 1, and dark blue corresponds to the value -1. <br />
<br />
2) The few-shot learning paradigm is followed: each pattern is shown for 10 time steps, with 3 time steps of zero input between presentations, and the whole sequence of patterns is presented 3 times in random order. <br />
<br />
3) One of the presented patterns is chosen at random and degraded by setting half of its bits to 0. <br />
<br />
4) This degraded pattern is then fed to the network. The network has to reproduce the correct full pattern in its output using the memory it formed while the patterns were presented. <br />
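<br />
The data for one episode of this experiment can be generated as in the sketch below. The function name is hypothetical, and choosing the zeroed half at random positions is an assumption of this example; the summary only states that half of the bits are set to 0. The presentation schedule (10 time steps per pattern, 3 time steps of zero input, 3 repetitions) is not modelled here.<br />

```python
import numpy as np


def make_episode(rng, n_patterns=5, size=1000):
    """Return the patterns to memorize, the target and its degraded version."""
    # Each pattern has `size` elements, each equal to -1 or 1.
    patterns = rng.choice([-1.0, 1.0], size=(n_patterns, size))

    # Pick one of the presented patterns at random as the test target.
    target = patterns[rng.integers(n_patterns)]

    # Degrade it by setting half of its elements to 0
    # (positions chosen at random: an assumption of this sketch).
    degraded = target.copy()
    zeroed = rng.choice(size, size=size // 2, replace=False)
    degraded[zeroed] = 0.0
    return patterns, target, degraded


rng = np.random.default_rng(0)
patterns, target, degraded = make_episode(rng)
print(patterns.shape, np.count_nonzero(degraded))  # (5, 1000) 500
```

The network sees <code>degraded</code> as its final input and is scored on how closely its output matches <code>target</code>.<br />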
<br />
<br />
'''The architecture of the network is described as follows:''' <br />
<br />
1) It is a fully connected RNN with one neuron per pattern element, plus one fixed-output neuron (bias). There are a total of 1,001 neurons. <br />
<br />
2) The value of each neuron is clamped to the value of the corresponding element in the pattern if that value is not 0. If the value is 0, the corresponding neuron does not receive pattern input and must use what it gets from lateral connections to reconstruct the correct, expected output value. <br />
<br />
3) Outputs are read from the activation of the neurons. <br />
<br />
4) The performance evaluation is done by computing the loss between the final network output and the correct expected pattern. <br />
<br />
5) The gradient of the error over the <math display = "inline">w_{i,j}</math> and the <math display = "inline">\alpha_{i,j}</math> coefficients is computed by backpropagation, and these parameters are optimized with the Adam solver with learning rate 0.001. <br />
<br />
6) The simple decaying Hebbian formula in Equation 2 is used to update the Hebbian traces. The network has 2 trainable parameters, <math display = "inline">w</math> and <math display = "inline">\alpha</math>, for each connection, so there are a total of 1,001 <math display = "inline">\times</math> 1,001 <math display = "inline">\times</math> 2 = 2,004,002 trainable parameters. <br />
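<br />
To make the architecture concrete, here is a minimal pure-Python sketch of one time step of such a plastic network. It assumes a tanh nonlinearity and a decaying Hebbian update of the form Hebb(t+1) = η · x_i(t-1) · x_j(t) + (1 - η) · Hebb(t). The function names and the fixed <code>eta</code> argument are illustrative assumptions, not the authors' code, and the bias neuron is not treated specially here.<br />

```python
import math

def plastic_step(x_prev, w, alpha, hebb, eta, clamp=None):
    """One time step of a fully connected plastic RNN (illustrative sketch)."""
    n = len(x_prev)
    x_new = []
    for j in range(n):
        # Effective weight = fixed part w plus plastic part alpha * Hebbian trace.
        total = sum((w[i][j] + alpha[i][j] * hebb[i][j]) * x_prev[i]
                    for i in range(n))
        x_new.append(math.tanh(total))
    # Neurons with a non-zero pattern input are clamped to that input.
    if clamp is not None:
        x_new = [c if c != 0 else x for c, x in zip(clamp, x_new)]
    # Decaying Hebbian trace update (assumed form, see lead-in).
    new_hebb = [[eta * x_prev[i] * x_new[j] + (1 - eta) * hebb[i][j]
                 for j in range(n)] for i in range(n)]
    return x_new, new_hebb

def n_trainable(n_neurons):
    # One w and one alpha per connection in a fully connected network.
    return 2 * n_neurons * n_neurons

print(n_trainable(1001))  # 2004002
```

Only <code>w</code> and <code>alpha</code> would be updated by gradient descent between episodes. The Hebbian traces change within an episode and act as the network's fast memory.<br />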
<br />
[[File:exp1results.png | 650px|thumb|center|Figure 2:Experiment 1 - Pattern Memorization Results]]<br />
<br />
<br />
The results are shown in figure 2 where 10 runs are considered. The error becomes quite low after about 200 episodes of training. <br />
<br />
[[File:exp1nonplasticresults.png| 650px|thumb|center|Figure 3: Pattern Memorization results with non plastic networks]]<br />
<br />
<br />
<br />
'''Comparison with Non-Plastic Networks:''' <br />
<br />
1) In principle, non-plastic networks can solve this task, but they require additional neurons. In practice, the authors report that the task is not solved by a non-plastic RNN or an LSTM. <br />
<br />
2) Figure 3 shows the results using non-plastic networks. The best results required the addition of 2,000 extra neurons. <br />
<br />
3) For the non-plastic RNN, the error flattens at around 0.13, which is quite high. Using LSTMs, the task can be solved, albeit imperfectly, and the error drops drastically to around 0.001. <br />
<br />
4) The plastic network solves the task very quickly, with the mean error going below 0.01 within 2,000 episodes, which the authors report is 250 times faster than the LSTM.<br />
<br />
= Experiment 2 - Memorizing natural images =<br />
<br />
This is an image reconstruction task in which a network is trained on a set of natural images that it must memorize. Natural images with graded pixel values contain more information per element than the binary patterns of the previous experiment, so this experiment is inherently more complex. One image is then chosen at random and half of it is displayed to the network, whose task is to complete the image. The paper shows that this method effectively solves this task, which other state-of-the-art network architectures fail to solve. <br />
<br />
The experiment is as follows: <br />
<br />
1) Images are from the CIFAR-10 database where there are a total of 60000 images each of size 32 <math display = "inline">\times</math> 32. <br />
<br />
2) The architecture has 1025 neurons in total with a total of 2 <math display = "inline">\times</math> 1025 <math display = "inline">\times</math> 1025 = 2101250 parameters. <br />
<br />
3) Each episode has 3 pictures, each shown 3 times for 20 time steps at a time, with 3 time steps of zero input between the presentations. <br />
<br />
4) The images are degraded by zeroing out one full contiguous half of the image to prevent a trivial solution of simply reconstructing the missing pixel as the average of its neighbors.<br />
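<br />
As a small illustration of the degradation step, the sketch below zeroes one contiguous half of an image stored as a list of rows. The function name and the choice of which half to zero are assumptions made for illustration, since the summary only says that one full contiguous half is removed.<br />

```python
def zero_half(image, half="bottom"):
    """Return a copy of a 2-D image (list of rows) with one contiguous half zeroed."""
    h = len(image)
    w = len(image[0])
    out = [row[:] for row in image]
    for r in range(h):
        for c in range(w):
            if ((half == "top" and r < h // 2) or
                    (half == "bottom" and r >= h // 2) or
                    (half == "left" and c < w // 2) or
                    (half == "right" and c >= w // 2)):
                out[r][c] = 0.0
    return out

image = [[1.0] * 32 for _ in range(32)]   # stand-in for a 32 x 32 image
degraded = zero_half(image, "bottom")
```

Zeroing a whole half, rather than scattered pixels, means the missing values cannot be recovered by averaging neighbours. The count of 1,025 neurons is consistent with 32 × 32 = 1,024 pixels plus one bias neuron, assuming single-channel images.<br />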
<br />
[[File:exp2results.png| 650px|thumb|center|Figure 4: Natural Image memorization results]]<br />
<br />
<br />
<br />
The results are shown in figure 4. The final output of the network is shown in the last column which is the reconstructed image. The results show that the model has learned to perform this task. <br />
<br />
[[File:exp2weights.png| 650px|thumb|center|Figure 5: Final matrices and plasticity coefficients]]<br />
<br />
The final weight matrix and plasticity coefficient matrix are shown in Figure 5. The plasticity matrix shows a structure related to the high correlation of neighboring pixels and to the half-field zeroing in test images. <br />
<br />
The full plastic network is compared against a similar architecture with shared plasticity coefficients, in which all connections share the same <math display = "inline">\alpha</math> value. That is, a single <math display = "inline">\alpha</math> parameter, shared across all connections, is trained. <br />
<br />
[[File:independentvsshared.png| 650px|thumb|center|Figure 6: Comparing independent and shared <math display = "inline">\alpha</math> value runs]]<br />
<br />
Figure 6 shows the result of this comparison: the network with an independent plasticity coefficient for each connection performs better. Thus, the structure observed in the learned matrices is actually useful.<br />
<br />
<br />
= Experiment 3 - Omniglot task =<br />
<br />
This task involves handwritten symbol recognition. It is a standard task for one-shot and few-shot learning. <br />
<br />
===Experimental Setup: ===<br />
<br />
1) The Omniglot data set is a collection of handwritten characters from various writing systems, including 20 instances each of 1,623 different handwritten characters, written by different subjects.<br />
<br />
[[File:Omniglot Dataset.JPG|400px|center]]<br />
<br />
2) In each episode, N character classes are randomly selected and K instances from each class are sampled. <br />
<br />
3) These instances, together with the class label (from 1 to N), are shown to the model. <br />
<br />
4) Then, a new, unlabeled instance is sampled from one of the N classes and shown to the model.<br />
<br />
5) Model performance is defined as the model’s accuracy in classifying this unlabeled example.<br />
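<br />
The episode construction above (N classes, K labeled instances each, plus one unlabeled query) can be sketched as follows. The function <code>sample_episode</code> and the dictionary layout of the dataset are hypothetical and serve only to illustrate the protocol.<br />

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, seed=None):
    """dataset: dict mapping class name -> list of instances."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)
    support, query_pool = [], []
    for label, cls in enumerate(classes, start=1):
        # Draw K labeled instances plus one held-out instance per class.
        picks = rng.sample(dataset[cls], k_shot + 1)
        support.extend((inst, label) for inst in picks[:k_shot])
        query_pool.append((picks[k_shot], label))
    rng.shuffle(support)
    # The query is a new instance from one of the N classes.
    query, query_label = rng.choice(query_pool)
    return support, query, query_label

# Toy stand-in for Omniglot: 10 classes with 20 instances each.
toy = {f"char{c}": [f"char{c}_img{i}" for i in range(20)] for c in range(10)}
support, query, query_label = sample_episode(toy, n_way=5, k_shot=1, seed=0)
```

The model sees the labeled <code>support</code> pairs first and is then scored on whether it assigns <code>query_label</code> to <code>query</code>.<br />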
<br />
===Architecture: ===<br />
<br />
1) Model architecture has 4 convolutional layers with 3 <math display = "inline">\times</math> 3 receptive fields and 64 channels. <br />
<br />
2) All convolutions have a stride of 2 to reduce the dimensionality between layers. <br />
<br />
3) The output is a single vector of 64 features, which feeds into an N-way softmax. <br />
<br />
4) The label of the current character is also concurrently fed as a one-hot encoding to this softmax layer, to serve as a guide for the correct output when a label is present.<br />
<br />
===Plasticity in the architecture: ===<br />
<br />
1) Plasticity is applied to the weights from the final layer to the softmax layer, leaving the rest of the convolutional embedding non-plastic. <br />
<br />
2) The expectation is that the convolutional architecture will learn an adequate discriminant between arbitrary handwritten characters, while the plastic weights learn to memorize associations between observed patterns and outputs. <br />
<br />
===Data Preparation: ===<br />
<br />
1) The dataset is augmented with rotations by multiples of <math display = "inline">90</math> degrees. <br />
<br />
2) It is divided into 1,523 classes for training and 100 classes (together with their augmentations) for testing. <br />
<br />
3) The networks are trained with an Adam optimizer with a learning rate 3 <math display = "inline">\times 10^{-5}</math>, multiplied by 2/3 every 1M episodes over 5,000,000 episodes. <br />
<br />
4) To evaluate final model performance, 10 models are trained with different random seeds and each of those is tested on 100 episodes using previously unseen test classes.<br />
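<br />
The learning-rate schedule in item 3 can be written as a one-line step decay. The sketch below assumes the factor of 2/3 is applied once each block of 1M episodes has completed. The exact timing is not stated in the summary, and the function name is illustrative.<br />

```python
def learning_rate(episode, base_lr=3e-5, decay=2 / 3, every=1_000_000):
    """Step decay: multiply the base rate by `decay` once per `every` episodes."""
    return base_lr * decay ** (episode // every)

for ep in (0, 1_000_000, 4_999_999):
    print(ep, learning_rate(ep))
```

Under this assumption the rate falls from 3 × 10<sup>-5</sup> to roughly 5.9 × 10<sup>-6</sup> by the end of the 5,000,000 training episodes.<br />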
<br />
===Results: ===<br />
<br />
1) The overall accuracy (i.e. the proportion of episodes with correct classification, aggregated over all test episodes of all runs) is 98.3%, with a 95% confidence interval of 0.80%.<br />
<br />
2) The median accuracy across the 10 runs was 98.5%, indicating consistency in learning.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Memory Networks<br />
! Matching Networks<br />
! ProtoNets<br />
! Memory Module<br />
! MAML<br />
! SNAIL<br />
! DP(This paper)<br />
|-<br />
| 82.8%<br />
| 98.1%<br />
| 97.4%<br />
| 98.4%<br />
| 98.7% <math display = "inline">\pm</math> 0.4<br />
| 99.07% <math display = "inline">\pm</math> 0.16<br />
| 98.3% <math display = "inline">\pm</math> 0.80<br />
|}<br />
<br />
<br />
<br />
3) The table above compares this result with those of other, non-plastic approaches. The results of the plastic approach are largely similar to those reported for the computationally intensive MAML method and the classification-specialized Matching Networks method. <br />
<br />
4) The performances are slightly below those reported for the SNAIL method, which trains a whole additional temporal-convolution network on top of the convolutional architecture thus having many more parameters.<br />
<br />
5) The conclusion is that a few plastic connections to the output of the network allow for competitive one-shot learning over arbitrary man-made visual symbols.<br />
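<br />
As a sanity check on the reported interval, the sketch below computes a normal-approximation 95% confidence half-width for a proportion over 10 × 100 = 1,000 test episodes. The summary does not say how the interval was computed, so this is only one plausible reconstruction under that assumption.<br />

```python
import math

def binomial_ci_half_width(accuracy, n_trials, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_trials)

n_test_episodes = 10 * 100          # 10 models, 100 test episodes each
half_width = binomial_ci_half_width(0.983, n_test_episodes)
print(f"{100 * half_width:.2f}%")   # 0.80%
```

Under these assumptions, an accuracy of 98.3% over 1,000 episodes gives a half-width of about 0.80 percentage points, which matches the reported figure.<br />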
<br />
= Experiment 4 - Reinforcement learning Maze navigation task =<br />
<br />
This is a maze exploration task where the goal is to teach an agent to reach a goal. The plastic networks are shown to outperform non-plastic ones. <br />
<br />
Experimental setup: <br />
<br />
1) The maze is composed of 9 <math display = "inline">\times</math> 9 squares, surrounded by walls, in which every other square (in either direction) is occupied by a wall. <br />
<br />
[[File:exp4maze.png| 650px|thumb|center|Figure 7: Maze Environment]]<br />
<br />
<br />
2) The maze contains 16 wall squares arranged in a regular grid, as shown in Figure 7. <br />
<br />
3) In each episode, one non-wall square is randomly chosen as the reward location. When the agent hits this location, it receives a large reward (10.0) and is immediately transported to a random location in the maze. (A small negative reward of -0.1 is also given every time the agent tries to walk into a wall.)<br />
<br />
4) Each episode lasts 250 time steps, during which the agent must accumulate as much reward as possible. The reward location is fixed within an episode and randomized across episodes. <br />
<br />
5) The reward is invisible to the agent, and thus the agent only knows it has hit the reward location by the activation of the reward input at the next step.<br />
<br />
6) Inputs to the agent consist of a binary vector describing the 3 <math display = "inline">\times</math> 3 neighborhood centered on the agent (each element is set to 1 or 0 if the corresponding square is or is not a wall), together with the reward at the previous time step. <br />
<br />
7) A2C algorithm is used to meta train the network. <br />
<br />
8) The experiments are run under three conditions: full differentiable plasticity, no plasticity at all, and homogeneous plasticity in which all connections share the same (learnable) <math display = "inline">\alpha</math> parameter. <br />
<br />
9) For each condition, 15 runs with different random seeds are performed. <br />
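<br />
The environment described above is small enough to sketch directly. The code below is an illustrative reconstruction from the description, not the authors' implementation. It assumes the 9 × 9 area is enclosed by an extra one-square border of walls, which gives exactly 16 interior wall squares, and that teleportation picks any free square uniformly at random.<br />

```python
import random

SIZE = 9  # 9 x 9 squares, plus a one-square wall border on every side

def build_maze():
    """Return an 11 x 11 grid where 1 = wall and 0 = free."""
    n = SIZE + 2
    grid = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            border = r in (0, n - 1) or c in (0, n - 1)
            pillar = r % 2 == 0 and c % 2 == 0   # every other square, both directions
            if border or pillar:
                grid[r][c] = 1
    return grid

def observe(grid, pos, prev_reward):
    """3 x 3 wall neighborhood around the agent, plus the previous reward."""
    r, c = pos
    patch = [grid[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    return patch + [prev_reward]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(grid, pos, action, reward_pos, rng):
    dr, dc = MOVES[action]
    nxt = (pos[0] + dr, pos[1] + dc)
    if grid[nxt[0]][nxt[1]] == 1:
        return pos, -0.1                     # walked into a wall: stay, small penalty
    if nxt == reward_pos:
        free = [(r, c) for r in range(len(grid)) for c in range(len(grid))
                if grid[r][c] == 0]
        return rng.choice(free), 10.0        # reward, then random teleport
    return nxt, 0.0

grid = build_maze()
obs = observe(grid, (1, 1), 0.0)             # 9 wall bits + 1 reward value
```

Each observation has 10 entries, so the agent can never see the reward location directly. It has to infer it from the reward signal on the following step.<br />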
<br />
<br />
Architecture: <br />
<br />
1) It is a simple recurrent network with 200 neurons, with a softmax layer on top of it to select between the 4 possible actions (up, right, left or down).<br />
<br />
<br />
[[File:exp4performance.png| 650px|thumb|center|Figure 8: Performance curve for the maze navigation experiment]]<br />
<br />
<br />
Results: <br />
<br />
1) The results are shown in Figure 8. The plastic network shows considerably better performance than the other networks.<br />
<br />
2) The non-plastic and homogeneous networks get stuck on a sub-optimal policy. <br />
<br />
3) Thus, the conclusion is that, in this domain, individually sculpting the plasticity of each connection is crucial to reaping the benefits of plasticity.<br />
<br />
= Conclusions =<br />
<br />
<br />
The important contributions from this paper are as follows: <br />
<br />
1) The results show that simple plastic models support efficient meta-learning.<br />
<br />
2) Gradient descent itself is shown to be capable of optimizing the plasticity of a meta-learning system. <br />
<br />
3) The meta-learning is shown to vastly outperform alternative options in the considered experiments. <br />
<br />
4) The method achieved state of the art results on a hard Omniglot test set.<br />
<br />
= Open Source Code =<br />
<br />
Code for this paper can be found at: https://github.com/uber-common/differentiable-plasticity<br />
<br />
= Future Works = <br />
The dynamics of the Hebbian matrix enable the network to adapt dynamically. It would be interesting to make these dynamics more complex, or otherwise change the way in which plasticity comes into play. <br />
<br />
= Critiques =<br />
<br />
The paper addresses the important problem of learning to learn ("meta-learning") and provides a novel framework based on gradient descent to achieve this objective. It leaves a large scope for future work, as many widely used architectures such as LSTMs could be tried with a plastic component. It is also easy to see that the applications of such approaches in deep reinforcement learning are plentiful, and there is a good possibility of beating the current baselines on many popular testbeds such as Atari games using plastic networks. This paper opens up possibilities for a whole class of meta-learning algorithms. <br />
<br />
With regard to drawbacks, the paper does not mention how plastic networks will behave if the test sets are completely different from the training dataset. Will the performance be the same as that of non-plastic networks? It is not very clear whether this method will scale, as there are a large number of parameters to be determined even for the simplest of problems. Also, each experimental domain considered in this paper needed a significantly different network architecture (for example, in the Omniglot domain plasticity was applied only to the final layer). The paper does not give reasons for these specific decisions or say whether such differences will hold for other similar problems. There has been work on transfer learning applied to both supervised learning and reinforcement learning problems. The authors should ideally have compared plastic networks with some of those algorithms, as these methods transfer existing knowledge to other related problems and avoid the need to start training from scratch, much like the methods adopted in this paper.<br />
<br />
In Experiment 2, the reconstruction of CIFAR-10 images, the authors only provide sample reconstructed images. No quantitative assessment of the results is done, so it is difficult to judge how well the results generalize. Furthermore, from these results, the authors conclude that their model is good at reconstructing previously unseen images. This claim is quite broad given the relatively simple experiment that was conducted. They could have run experiments on a more complex dataset such as CIFAR-100 or perhaps SVHN. The simplicity is also evident from the network they used, which consisted of only about 1,000 neurons, compared with the network in Experiment 3, which used a deep 4-layer CNN for the relatively simpler task of classifying Omniglot characters. It would have been more useful if the authors had expanded on the image reconstruction task rather than displaying the learned plastic/non-plastic weights. For example, the removed pixels of test images could have been chosen more randomly, similar to Experiment 1.<br />
<br />
= References = <br />
Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., and Ionescu, C. Using fast weights to attend to the recent past. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 4331–4339. 2016.<br />
<br />
Bengio, Y., Bengio, S., and Cloutier, J. Learning a synaptic learning rule. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, volume 2, pp. 969–vol. IEEE, 1991.<br />
<br />
Dayan, P. and Abbott, L. F. Theoretical neuroscience, volume 806. Cambridge, MA: MIT Press, 2001. <br />
<br />
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. Rl2 : Fast reinforcement learning via slow reinforcement learning. 2016. URL http://arxiv.org/abs/1611.02779.<br />
<br />
Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135, 2017.<br />
<br />
Frank, M. J., Seeberger, L. C., and O’reilly, R. C. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–1943, 2004.<br />
<br />
Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. October 2014.<br />
<br />
Hebb, D. O. The organization of behavior: a neuropsychological theory. 1949.<br />
<br />
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.<br />
<br />
Hochreiter, S., Younger, A., and Conwell, P. Learning to learn using gradient descent. Artificial Neural Networks—ICANN 2001, pp. 87–94, 2001.<br />
<br />
Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.<br />
<br />
Kaiser, L., Nachum, O., Roy, A., and Bengio, S. Learning to remember rare events. In ICLR 2017, 2017.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pixels_to_Graphs_by_Associative_Embedding&diff=42373Pixels to Graphs by Associative Embedding2018-12-10T23:51:26Z<p>Msminhas: Editorial</p>
<hr />
<div>== Introduction == <br />
<br />
Extracting semantics from images is one of the main goals of computer vision. Recent years have seen rapid progress in the classification and localization of objects [7, 24, 10]. But a bag of labeled and localized objects is an impoverished representation of image semantics: it tells us what and where the objects are (“person” and “car”), but does not tell us about their relations and interactions (“person next to car”). A necessary step is thus to not only detect objects but to identify the relations between them. An explicit representation of this semantics is referred to as a scene graph where we represent objects grounded in the scene as vertices and the relationships between them as edges. [1]<br />
<br />
End-to-end training of convolutional networks has proven to be a highly effective strategy for image understanding tasks. It is therefore natural to ask whether the same strategy would be viable for predicting graphs from pixels. Existing approaches, however, tend to break the problem down into more manageable steps. For example, one might run an object detection system to propose all of the objects in the scene, then isolate individual pairs of objects to identify the relationships between them. This breakdown often restricts the visual features used in later steps and limits reasoning over the full graph and over the full contents of the image. [1]<br />
<br />
The paper presents a novel approach to generating a scene graph. A scene graph, as it relates to an image, is a graph with a vertex that represents each object identified in the image and an edge that represents relationships between the objects. <br />
<br />
An example of a scene graph:<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Scene Graph.PNG]]</div><br />
<br />
Current state-of-the-art techniques break down the construction of scene graphs by first identifying objects and then predicting the edges for any given pair of identified objects. With this technique, reasoning over the full graph is limited. In contrast, this paper introduces an architecture that defines the entire graph directly from the image, enabling the network to reason across the entirety of the image to understand relationships, as opposed to only predicting relationships using object labels. <br />
<br />
A key concern, given that the new architecture produces both vertices (objects) and edges (relationships), is connecting the two. Specifically, the output of the network is some set of relationships E, and some set of vertices V. The network needs to also output the “source” and “destination” of each relationship so that the final graph can be formed. In the image above, for example, the network would also need to tell us that “holding” comes from “person” and goes to “Frisbee”. To do this, the paper uses associative embeddings. Specifically, the network outputs a particular “embedding vector” for each vertex, as well as a “source embedding” and “destination embedding” for each relationship. A final post-processing step finds the vertex embedding closest to each of the source/destination embeddings of each relationship and in this way assigns the edges to pairs of vertices.<br />
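<br />
The post-processing step described above amounts to a nearest-neighbour lookup. The following sketch assumes squared Euclidean distance and uses hypothetical function names. The two-dimensional embeddings are made up purely for illustration.<br />

```python
def nearest_vertex(embedding, vertex_embeddings):
    """Index of the vertex embedding closest (squared Euclidean) to `embedding`."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vertex_embeddings)),
               key=lambda i: sqdist(embedding, vertex_embeddings[i]))

def assign_edges(relationships, vertex_embeddings):
    """relationships: list of (label, source_embedding, destination_embedding)."""
    return [(nearest_vertex(src, vertex_embeddings), label,
             nearest_vertex(dst, vertex_embeddings))
            for label, src, dst in relationships]

vertices = ["person", "frisbee"]
vertex_embeddings = [[0.0, 0.0], [5.0, 5.0]]
relationships = [("holding", [0.2, -0.1], [4.8, 5.3])]
for s, label, t in assign_edges(relationships, vertex_embeddings):
    print(vertices[s], label, vertices[t])   # person holding frisbee
```

Each relationship is turned into a directed edge by snapping its source and destination embeddings to the closest detected vertex.<br />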
<br />
== Previous Work == <br />
<br />
In the field of relationship detection, the existing state-of-the-art advances are as follows:<br />
<br />
1) Framing the task of identifying objects using localization from referential expressions, detection of human-object interactions, or the more general tasks of Visual Relationship Detection (VRD) and scene graph generation. <br />
<br />
2) Visual relationship detection methods like message passing RNNs and predicting over triplets of bounding boxes. <br />
<br />
In the field of associative embedding, the following are some interesting applications: <br />
<br />
1) Vector embeddings to group together body joints for multi-person pose estimation. <br />
<br />
2) Vector embeddings to detect body joints of the various people in an image.<br />
<br />
<br />
Reference Figure from the paper "Associative embedding: End-to-end learning for joint detection and grouping."<br />
<br />
[[File:Oct30_associative_embedding_appendix_fig2.jpg | center]]<br />
<br />
<br />
== Pixels To Graphs == <br />
The goal of the paper is to construct a graph from a set of pixels and, in particular, a graph grounded in the space of these pixels. This means that, in addition to identifying the vertices of the graph, we want to know their precise locations. A vertex, in this case, can refer to any object of interest in the scene, including people, cars, clothing, and buildings. The relationships between these objects are then captured by the edges of the graph. These relationships may include verbs (eating, riding), spatial relations (on the left of, behind), and comparisons (smaller than, same color as).<br />
<br />
Formally, we consider a directed graph <math display = "inline">G = (V, E)</math>. A given vertex <math display = "inline">v_i \in V</math> is grounded at a location <math display = "inline">(x_i, y_i)</math> and defined by its class and bounding box. Each edge <math display = "inline">e_i \in E</math> takes the form <math display = "inline">e_i = (v_s, v_t, r_i)</math>, defining a relationship of type <math display = "inline">r_i</math> from <math display = "inline">v_s</math> to <math display = "inline">v_t</math>. We train a network to explicitly define <math display = "inline">V</math> and <math display = "inline">E</math>. This training is done end-to-end on a single network, allowing the network to reason fully over the image and all possible components of the graph when making its predictions.<br />
<br />
== The Architecture: == <br />
: '''1. Detecting Graph Elements'''<br />
<br />
Given an image of dimensions h x w, a stacked hourglass network (Appendix 2) is used to generate an h x w x f representation of the image. Note that the resolution of this output, which is a design choice rather than a trained quantity, needs to fulfill certain criteria. Specifically, it must be large enough to minimize the number of pixels with multiple detections, while also being small enough to ensure that each 1 x 1 x f vector still contains the information needed for subsequent inference.<br />
<br />
A 1x1 convolution and sigmoid activation is performed on this result to generate a heat map (one for objects and one for relationships, using separately determined convolutions). The value at a given pixel can be interpreted as the likelihood of detection at that particular pixel in the original image. <br />
<br />
In order to claim that there is an element at some pixel, we need a likelihood threshold: if a given pixel in the heat map has a value greater than or equal to the threshold, we claim that there is an element at that pixel. The heat maps are supervised with a binary cross-entropy loss on their final values, and values with likelihoods above the threshold <math display = "inline">\hat{p}</math> are considered element detections. <br />
<br />
Finally, for each element that we detected, we extract the 1 x 1 x f feature vector. This is then used as an input to a set of Feed Forward Neural Networks (FFNNs), where we have a separate network for each characteristic of interest, and for each network, there's one hidden layer with f nodes. The object class and relationship (edges) could be supervised by softmax loss. Furthermore, in order to predict the bounding box of the object, we can use the approach proposed by the Faster-RCNN model[3]. The following image summarizes the process.<br />
<br />
<br />
[[File:Extraction Process.PNG|center|900px]]<br />
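<br />
The detection step can be illustrated with a small pure-Python sketch. A 1x1 convolution is simply the same linear map applied to the f-dimensional feature vector at every pixel. The function name, the weights, and the threshold value of 0.5 are placeholders chosen for illustration and are not taken from the paper.<br />

```python
import math

def detect(features, weights, bias, threshold=0.5):
    """features: h x w grid of f-dimensional vectors (nested lists).

    Returns (position, likelihood, feature vector) for every pixel whose
    heat-map value reaches the threshold.
    """
    detections = []
    for y, row in enumerate(features):
        for x, feat in enumerate(row):
            # 1x1 convolution: a per-pixel dot product with shared weights.
            logit = sum(w * f for w, f in zip(weights, feat)) + bias
            p = 1.0 / (1.0 + math.exp(-logit))   # sigmoid -> likelihood
            if p >= threshold:
                detections.append(((x, y), p, feat))
    return detections

features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [0.0, 0.0]]]
hits = detect(features, weights=[2.0, -2.0], bias=-1.0)
```

The feature vectors returned for the detected pixels are what would be passed on to the per-characteristic feed-forward networks.<br />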
<br />
:'''2. Connecting Elements with Associative Embeddings'''<br />
As explained earlier, to construct the scene graph, we need to know the source and destination of each edge. This is done through associative embeddings. <br />
<br />
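First, let us define an embedding <math display = "inline">h_i \in \mathbb{R}^d</math> produced for some vertex <math display = "inline">i</math>, and let us assume that we have <math display = "inline">n</math> object detections in a particular image. Now, define <math display = "inline">h'_{ik}</math>, for <math display = "inline">k = 1, \dots, K_i</math> (where <math display = "inline">K_i</math> is the number of edges in the graph that reference vertex <math display = "inline">i</math>), as the embedding associated with an edge that touches vertex <math display = "inline">i</math>. We define two loss functions on these sets.<br />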
<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 1.PNG]]</div><br />
<br />
The goal of <math display = "inline">L_{pull}</math> is to minimize the squared differences between the embedding of a given vertex and the embeddings of the edges that reference that vertex.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 2.PNG]]</div><br />
<br />
On the other hand, minimizing <math display = "inline">L_{push}</math> pushes the embeddings of different vertices apart. The further apart two vertex embeddings are, the lower the output of the max becomes, until it reaches 0 once their distance exceeds the margin <math display = "inline">m</math>, which is a constant. In the paper, the values used were <math display = "inline">m = 8</math> and <math display = "inline">d = 8</math> (that is, 8-dimensional embeddings). Combining these two loss functions (and weighting them equally) accomplishes the task of predicting embeddings such that vertices are differentiated, while the embedding of each edge reference is most similar to that of the vertex it references.<br />
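<br />
Since the two losses appear only as images, a small sketch may help. It assumes the pull loss is the mean squared distance between each vertex embedding and the embeddings of the edges that reference it, and that the push loss is a hinge on the Euclidean distance between every pair of vertex embeddings with margin m. The function names and the example embeddings are illustrative.<br />

```python
import math

def pull_loss(vertex_emb, ref_emb):
    """ref_emb[i] is the list of edge-reference embeddings for vertex i."""
    total, count = 0.0, 0
    for h_i, refs in zip(vertex_emb, ref_emb):
        for h_ref in refs:
            total += sum((a - b) ** 2 for a, b in zip(h_i, h_ref))
            count += 1
    return total / count if count else 0.0

def push_loss(vertex_emb, margin=8.0):
    """Hinge penalty for every pair of vertices closer than `margin`."""
    total = 0.0
    n = len(vertex_emb)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dist = math.dist(vertex_emb[i], vertex_emb[j])
            total += max(0.0, margin - dist)
    return total

vertex_emb = [[0.0, 0.0], [3.0, 4.0]]
ref_emb = [[[1.0, 0.0]], [[3.0, 4.0], [3.0, 6.0]]]
loss = pull_loss(vertex_emb, ref_emb) + push_loss(vertex_emb)  # equal weighting
```

In this toy example the two vertices are 5 units apart, so with m = 8 the push term is 3, while the pull term averages the squared errors 1, 0 and 4 over the three edge references.<br />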
<br />
:'''3. Support for Overlapping Detections'''<br />
An obvious concern is how the network would operate if there were more than one detection (be it object or relationship) at a given pixel. For example, detections of “shirt” and “person” may be centered at the exact same pixel. To account for this, the architecture is modified to allow for “slots” at each pixel. Specifically, <math display = "inline">s_o</math> object detections and <math display = "inline">s_r</math> relationship detections are allowed at a given pixel. <br />
<br />
In order to allow for this, some changes are required after the feature extraction step. Specifically, we now use the 1 x 1 x f vector as the input for <math display = "inline">s_o</math> (or <math display = "inline">s_r</math>) different sets of 4 FFNNs, where the outputs of the first three are as shown in figure 2, and the final FFNN outputs the probability of a detection existing in that particular slot, at that particular pixel. This new network is trained exclusively on whether or not a detection has been made in that slot and, at prediction time, is used to determine the number of slots to output at a given pixel. It is critical to note that these <math display = "inline">s_o</math> (or <math display = "inline">s_r</math>) sets of FFNNs share absolutely no weights; each is trained for detection in its assigned slot.<br />
<br />
It is important to note that this implies a change in the training procedure. We now have <math display = "inline">s_o</math> (or <math display = "inline">s_r</math>) different predictions (be it class, or class + bounding box) that we need to match with our set of ground-truth detections at a given pixel. Without this step, we would not be able to assign a value to the error for that sample. To do this, we build a reference vector for each ground-truth detection, a one-hot encoding of its class and bounding box anchor, and match these reference vectors against the <math display = "inline">s_o</math> (or <math display = "inline">s_r</math>) outputs provided at that pixel. The Hungarian method is used to find the best one-to-one matching between the outputs and the reference vectors, ensuring that we do not assign the same detection to multiple slots.<br />
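<br />
To illustrate the matching step, the sketch below finds the lowest-cost one-to-one assignment of ground-truth detections to slots. It deliberately uses a brute-force search over permutations rather than the Hungarian method itself. This gives the same optimal assignment for the handful of slots per pixel considered here, but it would not scale to large problems. The cost values are made up, and the function name is hypothetical.<br />

```python
from itertools import permutations

def match_slots(cost):
    """cost[g][s] = mismatch between ground-truth detection g and slot s.

    Returns (slots, total_cost), where slots[g] is the slot assigned to g.
    Each slot is used at most once; requires len(cost) <= number of slots.
    """
    n_gt = len(cost)
    n_slots = len(cost[0]) if cost else 0
    best, best_cost = None, float("inf")
    for slots in permutations(range(n_slots), n_gt):
        total = sum(cost[g][s] for g, s in enumerate(slots))
        if total < best_cost:
            best, best_cost = slots, total
    return list(best), best_cost

# Two ground-truth objects at one pixel, three object slots.
cost = [[0.9, 0.1, 0.5],
        [0.2, 0.8, 0.7]]
assignment, total = match_slots(cost)
print(assignment)   # [1, 0]
```

Once the assignment is fixed, the loss for each slot can be computed against the ground-truth detection it was matched with.<br />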
<br />
==Results==<br />
A quick note on notation: R@50 indicates the percentage of ground-truth subject-predicate-object tuples that appear among 50 proposed tuples. Since R@100 allows more proposals, it is always at least as high as R@50. A value of 6.7, for example, indicates that 6.7% of the ground-truth tuples appeared in the proposals of the network. <br />
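<br />
A simplified version of the metric can be computed as follows. This sketch compares tuples by exact label match only and ignores the bounding-box overlap check that a full evaluation would also require. The function name and the example tuples are made up for illustration.<br />

```python
def recall_at_k(ground_truth, ranked_proposals, k):
    """Fraction of ground-truth tuples found among the top-k proposals."""
    if not ground_truth:
        return 0.0
    top_k = set(ranked_proposals[:k])
    hits = sum(1 for t in ground_truth if t in top_k)
    return hits / len(ground_truth)

ground_truth = [("person", "holding", "frisbee"), ("person", "wearing", "shirt")]
proposals = [("person", "wearing", "shirt"),
             ("dog", "on", "grass"),
             ("person", "holding", "frisbee")]
print(recall_at_k(ground_truth, proposals, 2))   # 0.5
print(recall_at_k(ground_truth, proposals, 3))   # 1.0
```

Because the top-k set can only grow as k increases, the recall never decreases when moving from R@50 to R@100.<br />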
<br />
The authors tested the network against two other architectures designed to develop a semantic understanding of images. For this, they used the Visual Genome dataset, with <math display = "inline">s_o = 3</math> and <math display = "inline">s_r = 6</math>. Overall, the new architecture vastly outperformed past models. The results were as follows:<br />
<br />
The table can be interpreted as follows:<br />
<br />
[[File:Results Table.PNG|center|600px]]<br />
<br />
::'''SGGen (no RPN)''': Given a particular image, and without the use of a Region Proposal Network, the accuracy of the proposed scene graph. No class predictions are provided.<br />
::'''SGGen (with RPN)''': Same as above, except that the output of a Region Proposal Network is used to enhance the input for a given image. No class predictions are provided.<br />
::'''SGCls''': Ground-truth object bounding boxes are provided. The network is asked to classify them and determine relationships.<br />
::'''PredCls''': As above, except the classes are also provided. The only goal is to predict relationships.<br />
<br />
Further analysis into the accuracy, when looking at predicates individually, shows that the architecture is very sensitive to over-represented relationship predicates.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Results - Part 2.PNG]]</div><br />
<br />
As shown in Figure 5, for many ground-truth predicates (those that do not appear often in the ground truth), the network does poorly. Even when allowed to propose 100 tuples, the network does not offer the predicate. Figure 4 simply observes the fact that certain sets of relationship predicates appear predominantly in a subset of slots. No general explanation has been offered for this behavior.<br />
<br />
== Conclusion ==<br />
In conclusion, the paper offers a novel approach that enables the extraction of image semantics while perpetually reasoning over the entire context of the image. Associative embeddings are used to connect object and predicate relationships, and parallel “slots” allow for multiple detections in one pixel. While this approach offers noticeable improvements in accuracy, it is clear that work needs to be done to account for the non-uniform distributions of relationships in the dataset.<br />
<br />
<br />
== Critiques ==<br />
<br />
The paper's contributions towards patterning unordered network outputs and using associative embeddings for connecting vertices and edges are commendable. However, it should be noted that this paper is only an incremental improvement over existing well-studied architectures such as the hourglass architecture. The modifications are not sufficiently supported by mathematical reasoning. The authors say that they make a slight modification to the hourglass design, double the number of features, and weight all the losses equally, but no scientific justification for why this is needed is given. Also, the choice of 3 and 6 for the constants <math display = "inline"> s_o</math> and <math display = "inline"> s_r</math> is not clearly motivated, as it leaves a fraction of the cases uncovered. I am not sure the changes made are truly a critical advance, as the experiments are conducted on only a single dataset and the authors make no generalizability arguments. The methods might therefore work well only for this dataset. The theoretical analysis in the paper comes directly from the hourglass literature and cannot be counted as novel.<br />
The paper could have identified the effect of each modification by analyzing the structure of the network being presented. However, a detailed mathematical and structural analysis of each modification is lacking.<br />
<br />
== Appendices ==<br />
<br />
'''Appendix 1: Sample Outputs'''<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Sample Pixel Graph Outputs.PNG]]</div><br />
<br />
'''Appendix 2: Stacked Hourglass Architecture'''<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Hourglass.PNG]]</div><br />
<br />
Although this goes beyond the focus of the paper, I would like to add a brief overview of the stacked hourglass architecture used to generate the heat map. This architecture is unique in that it allows cyclical top-down, bottom-up inference and recombination of features. While most architectures focus on optimizing the bottom-up portion (reducing dimensionality), the stacked-hourglass gives the network more flexibility in how it generates a representation by allowing it to learn a series of down-sampling / up-sampling steps.<br />
<br />
When you downsample and then upsample, a high amount of information is potentially lost on the upsampled reconstruction. Using the naive approach, this often results in poor reconstruction. This problem is accentuated when we stack multiple layers of downsampling and upsampling in the stacked hourglass architecture. To alleviate this issue, we add skip layers. Skip layers essentially allow earlier layers to send outputs into multiple later layers. The added information from the earlier layers ensures that the reconstructed embedding doesn't have its dimensionality reduced too much.<br />
<br />
[[File:skip+layers+Max+fusion+made+learning+difficult+due+to+gradient+switching..jpg|center|900px]]<br />
<br />
== References ==<br />
1. Alejandro Newell and Jia Deng, “Pixels to Graphs by Associative Embedding,” in NIPS, 2017<br />
<br />
2. Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation. ECCV, 2016<br />
<br />
3. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS, pages 91–99, 2015.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search&diff=42372Hierarchical Representations for Efficient Architecture Search2018-12-10T23:47:55Z<p>Msminhas: Editorial</p>
<hr />
<div>Summary of the paper: [https://arxiv.org/abs/1711.00436 ''Hierarchical Representations for Efficient Architecture Search'']<br />
<br />
= Introduction =<br />
<br />
Deep Neural Networks (DNNs) have shown remarkable performance in several areas such as computer vision and natural language processing; however, improvements over previous benchmarks have required extensive research and experimentation by domain experts. In DNNs, the composition of linear and nonlinear functions produces internal representations of data which are in most cases better than handcrafted ones; consequently, researchers using Deep Learning techniques have lately shifted their focus from working on input features to designing optimal DNN architectures. However, the quest for an optimal DNN architecture by combining layers and modules requires frequent trial-and-error experiments, a task that resembles the earlier search for optimal handcrafted features. As researchers aim to solve more difficult challenges, the complexity of the resulting DNNs also increases; therefore, some studies have introduced automated techniques for searching for optimal architectures. The emerging field of Neural Architecture Search aims to tackle exactly this problem by transforming the design of a network into a search problem. A search problem needs a clear definition of three things: the search space, the search strategy, and the performance evaluation strategy. The search space is a high-level description of the architecture of the network. It needs to contain enough freedom that the resulting model has enough expressive power, but it cannot be so broad that the search becomes too computationally expensive. The search strategy is how to search the search space efficiently. The performance evaluation strategy is the method used to evaluate a network. Evaluation is the trickier part because, in order to evaluate a neural network, we need to train it first, and training takes time. It is therefore important to define a proxy task that can help us evaluate a network. This paper tackles these problems with a new hierarchical representation.<br />
<br />
Lately, the use of algorithms for finding optimal DNN architectures has attracted the attention of researchers, who have tackled the problem through four main groups of techniques. The first employs a supplementary network called a “HyperNet”, which generates network weights for a given random architecture. There are two main parts to generating an “optimal” architecture. First, we train the HyperNet. One training cycle consists of sampling a random architecture from the space of allowed architectures and generating its predicted weights with the HyperNet. Then, the validation score of this proposed network is calculated, and the error is backpropagated through the HyperNet. In this manner, the HyperNet learns to assign robustly good initial weights to a given architecture. At “test” time, we generate a random sample of architectures and predict initial weights for each with the trained HyperNet. We take the model with the highest validation score and train it as we would a regular architecture. This heuristic of “initial validation error” is used because the relative performance of networks typically stays constant throughout training; that is, a network that starts off better will very likely end better. The second technique is Monte Carlo Tree Search (MCTS), which repeatedly narrows the search space by focusing on the most promising architectures previously seen. The third group of techniques uses evolutionary algorithms, where fitness criteria are applied to filter the initial population of DNN candidates and new individuals are then added to the population by selecting the best-performing ones and modifying them with one or several random mutations, as in [https://arxiv.org/abs/1703.01041 [Real, 2017]]. The fourth and last group of techniques implements Reinforcement Learning, where a policy-based controller seeks to optimize the expected accuracy of new architectures based on rewards (accuracy) gained from previous proposals in the architecture space. Of these four groups of techniques, Reinforcement Learning has offered the best experimental results; however, the paper we are summarizing uses evolutionary algorithms as its main approach.<br />
<br />
Regardless of the technique used to look for an optimal architecture, searching the architecture space usually requires training and evaluating many DNN candidates; therefore, it demands huge computational resources and poses a significant limitation for practical applications. Consequently, most techniques narrow the search space with predefined heuristics, either at the beginning or dynamically during the search. In the paper we are summarizing, the authors reduce the number of feasible architectures by imposing a hierarchical structure on the network components. In other words, each DNN suggested as a candidate is formed by combining basic building blocks into small modules, and the same kind of structure used within the building blocks is then used to combine and stack networks on the upper levels of the hierarchy. This approach allows the search algorithm to sample highly complex and modularized networks similar to Inception or ResNet.<br />
<br />
Despite some weaknesses regarding the efficiency of evolutionary algorithms, this study reveals that these techniques can in fact generate architectures with competitive performance when a narrowing strategy is imposed on the search space. Accordingly, the main contributions of this paper are a well-defined set of hierarchical representations, which acts as the filtering criterion for picking DNN candidates, and a novel evolutionary algorithm, which produces image classifiers that achieve state-of-the-art performance among similar evolutionary techniques.<br />
<br />
=Architecture representations=<br />
<br />
==Flat architecture representation==<br />
All the evaluated network architectures are directed acyclic graphs with only one source and one sink. Each node in the network represents a feature map and consequently, each directed edge represents an operation that takes the feature map in the departing node as input and outputs a feature map on the arriving node. Under the previous assumption, any given architecture in the narrowed search space is formally expressed as a graph assembled by a series of operations (edges) among a defined set of adjacent feature maps (nodes).<br />
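<br />
As a minimal sketch of this representation (not the authors' code), a flat architecture can be stored as a lower-triangular adjacency matrix of operation indices. The names <code>OPS</code>, <code>edges</code> and <code>has_single_source_and_sink</code> are hypothetical, and the convention that <code>G[i][j]</code> describes the edge from node <code>j</code> to node <code>i</code> is an assumption made for illustration.<br />
<br />
```python
# Hypothetical encoding of a flat architecture (illustrative names only).
# G[i][j] = k means operation OPS[k] maps the feature map at node j to node i.
# Only entries with j < i are read, so the graph is acyclic by construction.
OPS = ["none", "1x1 conv", "3x3 depthwise conv", "3x3 separable conv",
       "3x3 max-pool", "3x3 avg-pool", "identity"]

def edges(G):
    """List (predecessor, successor, op_name) for every entry that is not 'none'."""
    return [(j, i, OPS[G[i][j]])
            for i in range(len(G)) for j in range(i) if G[i][j] != 0]

def has_single_source_and_sink(G):
    """True when node 0 is the only node without inputs and the last node
    is the only node without outputs."""
    n = len(G)
    e = edges(G)
    with_input = {succ for _, succ, _ in e}
    with_output = {pred for pred, _, _ in e}
    sources = [v for v in range(n) if v not in with_input]
    sinks = [v for v in range(n) if v not in with_output]
    return sources == [0] and sinks == [n - 1]

G = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],   # node 0 -> node 1 via 1x1 conv
    [4, 0, 0, 0],   # node 0 -> node 2 via 3x3 max-pool
    [0, 3, 6, 0],   # node 1 -> node 3 (separable conv), node 2 -> node 3 (identity)
]

for pred, succ, op in edges(G):
    print(f"{pred} -> {succ}: {op}")
print(has_single_source_and_sink(G))
```
<br />
Restricting edges to <code>j < i</code> is one simple way to guarantee a directed acyclic graph; the single-source and single-sink property still has to be checked separately.<br />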
<br />
[[File:flatarch.PNG | 650px|thumb|center|Flat architecture representation of neural networks]]<br />
<br />
Multiple primitive operations defined in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Primitive_operations section 2.3] are used to form small networks defined as ''motifs'' by the authors. To combine the outputs of multiple primitive operations and guarantee a unique output per motif the authors introduce a merge operation which in practice works as a depthwise concatenation that does not require inputs with the same number of channels.<br />
<br />
Accordingly, these motifs can also be combined to form more complex motifs on a higher level in the hierarchy until the network is complex enough to perform competitively in challenging classification tasks.<br />
<br />
==Hierarchical architecture representation==<br />
<br />
The composition of more complex motifs from simpler motifs at lower levels allows the authors to create a hierarchical representation of very complex DNNs starting with only a few primitive operations, as shown in Figure 1. In other words, an architecture with <math> L </math> levels has only primitive operations at its bottom and only one complex motif at its top. Any motif in between the bottom and top levels can be defined as the composition of motifs in lower levels of the hierarchy.<br />
<br />
Formally, the <math>m</math>-th motif in level <math>l</math>, <math>o_m^{(l)}</math>, is recursively defined as the composition of lower-level motifs <math>\textbf{o}^{(l-1)}</math> according to its network structure.<br />
<br />
<center><math> o_m^{(l)}=assemble(G_m^{(l)}, \textbf{o}^{(l-1)})</math></center><br />
<br />
[[File:hierarchicalrep.PNG | 700px|thumb|center|Figure 1. Hierarchical architecture representation]]<br />
<br />
In Figure 1, the architecture of the full model (its flat structure) is shown in the top right corner. The input (source) is the bottom-most node and the output (sink) is the topmost node. The paper presents an alternative hierarchical view of the model, shown on the left-hand side (before the assemble function). This view represents the same model in three levels. The first level is a set of primitive operations only (bottom row, middle column). In all other levels, component motifs (computational graphs) G are described by an adjacency matrix and a set of operations, where the operations are taken from the previous level. An example motif <math> G^{(2)}_{1}</math> in the second level is shown in the bottom row (left and middle columns). There are three unique motifs in the second level; these are shown in the middle of the top row. Note that the motifs in one level become the operations in the next level, and the higher level can use these motifs multiple times. Finally, the top-level graph, which contains only one motif, <math> G^{(3)}_{1}</math>, is shown in the top row, left column. Here, there are 4 nodes with 6 operations defined between them.<br />
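<br />
The recursion can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes that each node merges the outputs of the operations on its incoming edges, uses plain Python lists of "channels" as feature maps, uses list concatenation as a stand-in for the depthwise-concatenation merge, and replaces the real primitives with two toy functions (<code>identity</code> and <code>double</code>).<br />
<br />
```python
def assemble(G, ops):
    """Return a function computing the motif described by adjacency matrix G.

    G[i][j] = k (k >= 1) places ops[k - 1] on the edge from node j to node i;
    0 means no edge. Nodes are visited in index order, so every predecessor
    is computed before its successors.
    """
    def motif(x):
        feats = [x]                       # feature map at the source node
        for i in range(1, len(G)):
            incoming = [ops[G[i][j] - 1](feats[j])
                        for j in range(i) if G[i][j] != 0]
            # merge: list concatenation stands in for depthwise concatenation
            feats.append([c for out in incoming for c in out])
        return feats[-1]                  # feature map at the sink node
    return motif

# Level 1: toy primitives acting on a list of channels (hypothetical stand-ins).
def identity(x):
    return list(x)

def double(x):
    return [2 * c for c in x]

level1 = [identity, double]

# Level 2: two motifs built from the primitives.
G2_1 = [[0, 0, 0],
        [2, 0, 0],     # 0 -> 1: double
        [1, 1, 0]]     # 0 -> 2: identity, 1 -> 2: identity
G2_2 = [[0, 0],
        [2, 0]]        # 0 -> 1: double
level2 = [assemble(G2_1, level1), assemble(G2_2, level1)]

# Level 3: a single motif whose operations are the level-2 motifs.
G3_1 = [[0, 0, 0],
        [1, 0, 0],     # 0 -> 1: level-2 motif 1
        [0, 2, 0]]     # 1 -> 2: level-2 motif 2
cell = assemble(G3_1, level2)

print(cell([1]))   # [2, 4]
```
<br />
The same <code>assemble</code> function is reused at every level: the functions it returns at one level are passed in as the operations of the next, which is what makes the representation hierarchical.<br />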
<br />
==Primitive operations==<br />
<br />
The six primitive operations used as building blocks for connecting nodes in either flat or hierarchical representations are:<br />
* 1 × 1 convolution of C channels<br />
* 3 × 3 depthwise convolution<br />
* 3 × 3 separable convolution of C channels<br />
* 3 × 3 max-pooling<br />
* 3 × 3 average-pooling<br />
* Identity mapping<br />
<br />
The authors argue that convolution operations with larger receptive fields can be obtained by composing lower-level motifs with smaller receptive fields. Similarly, convolution operations with a larger number of channels can be generated by the depthwise concatenation of lower-level motifs. Batch normalization and the ''ReLU'' activation function are applied after each convolution in the network. There is also a seventh operation, called null (written ''none'' in the mutation rules below), which is used in the adjacency matrix <math> G </math> to state explicitly that there is no operation between two nodes.<br />
<br />
<br />
Side note:<br />
<br />
Some explanations of the different types of convolution:<br />
<br />
* Spatial convolution: Convolutions performed in spatial dimensions - width and height.<br />
* Depthwise convolution: Spatial convolution performed independently over each channel of an input.<br />
* 1x1 convolution: Convolution with a kernel of size 1x1, which combines channels at each spatial position.<br />
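<br />
To make the difference between these convolution types concrete, the following sketch counts their weights. It assumes that a separable convolution is a depthwise convolution followed by a 1x1 convolution, ignores biases and batch normalization parameters, and uses a hypothetical channel count of 64; the function names are illustrative.<br />
<br />
```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_params(k, c_in):
    """Weights of a k x k depthwise convolution: one k x k filter per input channel."""
    return k * k * c_in

def separable_params(k, c_in, c_out):
    """Depthwise k x k convolution followed by a 1 x 1 convolution across channels."""
    return depthwise_params(k, c_in) + conv_params(1, c_in, c_out)

C = 64  # hypothetical channel count, chosen only for illustration
print(conv_params(3, C, C))       # 36864
print(depthwise_params(3, C))     # 576
print(conv_params(1, C, C))       # 4096
print(separable_params(3, C, C))  # 4672
```
<br />
Under these assumptions, a 3 × 3 separable convolution needs roughly an eighth of the weights of a standard 3 × 3 convolution with the same number of channels.<br />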
<br />
[[File:convolutions.png | 350px|thumb|center]]<br />
<br />
=Evolutionary architecture search=<br />
<br />
Before moving forward we introduce the concept of genotypes in the context of the article. In this article, a genotype is a particular neural network architecture defined according to the components described in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_representations section 2]. In order to make the NN architectures ''evolve'', the authors implemented a three-stage process that consists of establishing the permitted mutations, creating an initial population, and making its members compete in a tournament where only the best candidates are selected to evolve.<br />
<br />
==Mutation==<br />
<br />
One mutation of a specific architecture is a sequence of five steps in the following order:<br />
<br />
* Sample a level in the hierarchy, other than the basic level.<br />
* Sample a motif in that level.<br />
* Sample a successor node <math>(i)</math> in the motif.<br />
* Sample a predecessor node <math>(j)</math> in the motif.<br />
* Replace the current operation between nodes <math>i</math> and <math>j</math> with one of the available operations.<br />
<br />
The original operation between the nodes <math>i</math> and <math>j</math> in the graph is defined as <math> [G_{m}^{\left ( l \right )}] _{ij} = k </math>. Therefore, a mutation between the same pair of nodes is defined as <math> [G_{m}^{\left ( l \right )}] _{ij} = {k}' </math>.<br />
<br />
The allowed mutations include:<br />
# Change the basic primitive between the predecessor and successor nodes (i.e. alter an existing edge): if <math>o_k^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} \neq o_k^{(l-1)}</math><br />
# Add a connection between two previously unconnected nodes. The connection between the nodes can have any of the six possible primitives: if <math>o_k^{(l-1)}=none</math> and <math>o_{k'}^{(l-1)} \neq none</math><br />
# Remove a connection between existing nodes: if <math>o_k^{(l-1)} \neq none</math> and <math>o_{k'}^{(l-1)} = none</math><br />
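<br />
The five sampling steps and the three resulting cases can be sketched as follows. This is an illustrative sketch, not the authors' code: the nested-list genotype layout, the function name <code>mutate</code> and the <code>num_ops</code> argument are assumptions, and for simplicity the sketch always picks an operation different from the current one.<br />
<br />
```python
import random

NONE = 0  # index reserved for the "none" (no edge) operation

def mutate(genotype, num_ops, rng):
    """Apply one mutation in place and return a description of it.

    genotype[l] is the list of motifs at hierarchy level l + 2 (level 1 holds
    the primitives and is never mutated). Each motif is a lower-triangular
    adjacency matrix G with G[i][j] in {0, ..., num_ops[l]}.
    """
    level = rng.randrange(len(genotype))        # 1. sample a non-primitive level
    m = rng.randrange(len(genotype[level]))     # 2. sample a motif in that level
    G = genotype[level][m]
    i = rng.randrange(1, len(G))                # 3. sample a successor node
    j = rng.randrange(i)                        # 4. sample a predecessor node (j < i)
    old = G[i][j]
    # Assumption of this sketch: the new operation always differs from the old one.
    new = rng.choice([k for k in range(num_ops[level] + 1) if k != old])
    G[i][j] = new                               # 5. replace the operation
    if old == NONE:
        kind = "add"        # previously unconnected nodes get an edge
    elif new == NONE:
        kind = "remove"     # an existing edge is deleted
    else:
        kind = "alter"      # an existing edge changes its operation
    return level, m, i, j, old, new, kind

# One level-2 motif with 3 nodes (six primitives available) and one
# level-3 motif with 2 nodes (one level-2 motif available).
genotype = [[[[0, 0, 0], [1, 0, 0], [0, 1, 0]]],
            [[[0, 0], [1, 0]]]]
print(mutate(genotype, num_ops=[6, 1], rng=random.Random(0)))
```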
<br />
==Initialization==<br />
<br />
An initial population is required to start the evolutionary algorithm; therefore, the authors introduced a trivial genotype (candidate solution, the hierarchical architecture of the model) composed only of identity mapping operations. Then a large number of random mutations was run over the ''trivial genotype'' to simulate a diversification process. The authors argue that this diversification process generates a representative population in the search space and at the same time prevents the use of any handcrafted NN structures. Surprisingly, some of these random architectures show a performance comparable to the performance achieved by the architectures found later during the evolutionary search algorithm.<br />
<br />
==Search algorithms==<br />
<br />
Tournament selection and random search are the two search algorithms used by the authors. <br />
<br />
=== Tournament Selection ===<br />
In one iteration of the tournament selection algorithm, 5% of the entire population is randomly selected, trained, and evaluated against a validation set. Then the best performing genotype is picked to go through the mutation process and put back into the population. No genotype is ever removed from the population, but the selection criteria guarantee that only the best performing models will be selected to ''evolve'' through the mutation process.<br />
<br />
In general, tournament selection can be described by the following pseudocode:<br />
<br />
1. Choose k (the tournament size) individuals from the population at random<br />
<br />
2. Choose the best individual from the tournament with probability p<br />
<br />
3. Choose the second best individual with probability p*(1-p)<br />
<br />
4. Choose the third best individual with probability p*((1-p)^2)<br />
<br />
5. Continue until the number of selected individuals equals the number we desire.<br />
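<br />
The pseudocode above can be sketched as a small function. The names <code>tournament_select</code>, <code>population</code> and <code>fitness</code> are hypothetical; with <code>p = 1</code> the function always returns the best contestant, which corresponds to the deterministic variant described at the start of this subsection.<br />
<br />
```python
import random

def tournament_select(population, fitness, k, p, rng):
    """Pick one individual: sample k contestants at random, rank them by
    fitness, and return the r-th best (r = 0, 1, ...) with probability
    p * (1 - p) ** r."""
    contestants = rng.sample(population, k)
    ranked = sorted(contestants, key=fitness, reverse=True)
    for individual in ranked:
        if rng.random() < p:
            return individual
    # Remaining probability mass goes to the weakest contestant.
    return ranked[-1]

# Toy usage: individuals are integers and fitness is the value itself.
rng = random.Random(0)
population = list(range(100))
selected = [tournament_select(population, lambda x: x, k=5, p=0.8, rng=rng)
            for _ in range(10)]
print(selected)
```
<br />
Raising <code>k</code> or <code>p</code> increases the selection pressure, since weaker individuals are then less likely to be returned.<br />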
<br />
Tournament selection is often chosen over alternative selection methods for genetic algorithms due to the following benefits: it is efficient to code, works on parallel architectures, and allows the selection pressure to be easily adjusted.<br />
<br />
=== Random Search ===<br />
In the random search algorithm every genotype from the initial population is trained and evaluated, then the best performing model is selected. In contrast to the tournament selection algorithm, the random search algorithm is much simpler and the training and evaluation process for every genotype can be run in parallel to reduce search time. This algorithm is not widely studied in the literature yet.<br />
<br />
==Implementation==<br />
<br />
To implement the tournament selection algorithm, two auxiliary algorithms are introduced. The first is called the controller and directs the evolution process over the population; in other words, the controller repeatedly picks 5% of the genotypes from the current population, sends them to the tournament, and then applies a random mutation to the best performing genotype from each group. <br />
<br />
[[File:asyncevoalgorithm1.PNG | 700px|thumb|center|Controller]]<br />
<br />
The second auxiliary algorithm is called the worker and is in charge of training and evaluating each genotype, a task that must be completed each time a new genotype is created and added to the population either by an initialization step or by an evolutionary step.<br />
<br />
[[File:asyncevoalgorithm2.PNG | 700px|thumb|center|Worker]]<br />
<br />
Both auxiliary algorithms work together asynchronously and communicate with each other through a shared tabular memory file where genotypes and their corresponding fitness are recorded.<br />
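<br />
A rough sketch of how a controller and several workers could cooperate is given below. It is not the authors' implementation: threads and an in-memory list replace the GPU workers and the shared memory file, the initial population is fully evaluated before evolution starts (a simplification), and <code>run_search</code>, <code>toy_mutate</code> and <code>toy_evaluate</code> are hypothetical names with toy behaviour.<br />
<br />
```python
import queue
import random
import threading

def run_search(trivial, mutate, evaluate, n_init, n_steps, n_workers, seed=0):
    rng = random.Random(seed)                 # used only by the controller
    memory = []                               # shared table of (genotype, fitness)
    lock = threading.Lock()
    tasks = queue.Queue(maxsize=n_workers)    # put() blocks while all workers are busy

    def worker():
        while True:
            genotype = tasks.get()
            if genotype is None:              # sentinel: no more work
                return
            fitness = evaluate(genotype)      # stands in for training + validation
            with lock:
                memory.append((genotype, fitness))
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # Initialization steps: diversify the trivial genotype with random mutations.
    for _step in range(n_init):
        g = trivial
        for _m in range(10):
            g = mutate(g, rng)
        tasks.put(g)
    tasks.join()                              # simplification: wait for the initial population

    # Evolutionary steps: tournament over 5% of the evaluated population.
    for _step in range(n_steps - n_init):
        with lock:
            k = max(1, len(memory) * 5 // 100)
            contestants = rng.sample(memory, k)
        best = max(contestants, key=lambda rec: rec[1])[0]
        tasks.put(mutate(best, rng))          # the mutated copy joins the population
    tasks.join()

    for _t in threads:                        # stop the workers
        tasks.put(None)
    for t in threads:
        t.join()
    return memory

def toy_mutate(g, rng):
    """Toy mutation: replace one position of a tuple with a random value."""
    i = rng.randrange(len(g))
    return g[:i] + (rng.randrange(7),) + g[i + 1:]

def toy_evaluate(g):
    """Toy fitness: the sum of the entries."""
    return sum(g)

table = run_search((6, 6, 6, 6), toy_mutate, toy_evaluate,
                   n_init=20, n_steps=100, n_workers=4)
print(len(table), max(f for _, f in table))
```
<br />
The bounded queue makes the controller wait whenever every worker is busy, while the controller never waits for a particular genotype to finish during the evolutionary phase; it simply samples from whatever has been evaluated so far.<br />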
<br />
=Experiments and results=<br />
<br />
==Experimental setup==<br />
<br />
Instead of looking for a complete NN model, the search framework introduced in [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_representations section 2] is applied to look for the best performing architectures of a small neural network module called the convolutional cell. Using small modules as building blocks to form a larger and more complex model is an approach that has proven successful in previous cases such as the Inception architecture. Additionally, this approach allowed the authors to evaluate cell candidates efficiently and scale to larger and more complex models faster.<br />
<br />
In total, three models were implemented as hosts for the experimental cells: the first two use the CIFAR-10 dataset and the third uses the ImageNet dataset. The search framework is run only on the first host model to look for the best performing cells ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2]); once found, these cells were inserted into the second and third host models to evaluate overall performance on the respective datasets ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3]).<br />
<br />
The terms training time step, initialization time step, and evolutionary time step will be used to describe some parts of the experiments. Be aware that these three terms have different meanings; however, each term will be properly defined when introduced.<br />
<br />
==Architecture search on CIFAR-10==<br />
<br />
The overall goal in this stage is to find the best performing cells. The search framework is run using the small CIFAR-10 depicted in Figure 2 as host model for the cells; therefore, during the searching process, only the cells change while the rest of the host model’s structure remains the same. In the context of the evolutionary search algorithm, a cell is also called a candidate or a genotype. Additionally, on every time step during the search process, the three cells in the model will share the same structure and consequently every time a new candidate architecture is evaluated the three cells will simultaneously adopt the new candidate’s architecture.<br />
<br />
[[File:smallcifar10.PNG | 350px|thumb|center|Figure 2. Small CIFAR-10 model]]<br />
<br />
To begin the architecture searching process an initial population of genotypes is required. Random mutations are applied over a trivial genotype to generate a candidate and grow the seminal population. This is called an initialization step and is repeated 200 times to produce an equivalent number of candidates. Creating these 200 candidates with random structures is equivalent to running a random search over a constrained architecture space. <br />
<br />
Then, the evolutionary search algorithm takes over and runs from time step 201 up to time step 7000; these are called evolutionary time steps. On each evolutionary time step, a group of genotypes equivalent to 5% of the current population is selected randomly and sent to the tournament for fitness computation. To perform a fitness evaluation, each candidate cell is inserted into the three predefined positions within the small CIFAR-10 host model. Then, for each candidate cell, the host model is trained with stochastic gradient descent for 5000 training steps with a decreasing learning rate. Because a standard deviation of up to 0.2% was observed when evaluating the exact same model, the overall fitness is computed as the average of four training-evaluation runs. This variance is attributed to the optimization process. Finally, a random mutation is applied to a copy of the best cell within the group to create a new genotype that is added to the current population.<br />
<br />
The fitness of each evaluated genotype is recorded in the shared tabular memory file to avoid recalculation in case the same genotype is selected again in a future evolutionary time step.<br />
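<br />
The averaging and the reuse of recorded fitness values can be sketched as follows. The names <code>make_fitness</code> and <code>toy_train_and_evaluate</code> are hypothetical, the dictionary stands in for the shared memory file, and the toy evaluation simply returns a fixed accuracy plus seeded noise instead of training a network.<br />
<br />
```python
import random
import statistics

def make_fitness(train_and_evaluate, runs=4):
    """Wrap a noisy evaluation so that each genotype is scored only once, as
    the mean over `runs` independent training-evaluation runs."""
    table = {}   # stands in for the shared tabular memory file

    def fitness(genotype):
        # Genotypes must be hashable (e.g. tuples) to serve as table keys.
        if genotype not in table:
            scores = [train_and_evaluate(genotype, run) for run in range(runs)]
            table[genotype] = statistics.mean(scores)
        return table[genotype]

    return fitness

calls = []

def toy_train_and_evaluate(genotype, run):
    """Hypothetical stand-in for training: a fixed accuracy plus seeded noise."""
    calls.append((genotype, run))
    noise = random.Random(run).gauss(0.0, 0.002)
    return 0.89 + noise

fitness = make_fitness(toy_train_and_evaluate)
first = fitness((1, 2, 3))
second = fitness((1, 2, 3))     # served from the table, no retraining
print(first == second, len(calls))
```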
<br />
The search framework is run for 7000 time steps (200 initialization time steps followed by evolutionary time steps) for each one of three different types of cell architecture, namely hierarchical representation, flat representation, and flat representation with constrained parameters. <br />
<br />
* A cell that follows a hierarchical representation has NN connections at three different levels; at the bottom level it has only primitive operations, at the second level it contains motifs with four-nodes and at the third level it has only one motif with five-nodes.<br />
<br />
* A cell that follows a flat representation has 11 nodes with only primitive operations between them. These cells look similar to level 2 motifs but instead of having four nodes they have 11 and therefore many more pairs of nodes and operations.<br />
<br />
* For a cell that follows a flat representation with constrained parameters, the total number of parameters used by its operations cannot exceed the total number of parameters used by the cells that follow a hierarchical representation.<br />
<br />
Figure 3 shows the current fitness achieved by the best performing cell from each one of the three types of cells when plugged into the small CIFAR-10 model. Even though the fitness grows rapidly after the first 200 (initialization) time steps, it tends to plateau between 89% and 90%. Overall, cells that follow a flat representation without a restriction on the number of parameters tend to perform better than those following a hierarchical structure. This could be because the flat representation allows more flexibility when adding connections between nodes, especially between distant ones. Unfortunately, the authors do not describe the architecture of the best performing flat cell.<br />
<br />
[[File:currentfitness.PNG | 300px|thumb|center|Figure 3. Current fitness]]<br />
<br />
Figure 4 presents the maximum fitness reached by any cell seen by the search framework for each one of the three types of cells. The fitness at time step 200 is, therefore, equivalent to the best model obtained by a random search over 200 architectures from each type of cell.<br />
<br />
[[File:maxfitness.PNG | 300px|thumb|center|Figure 4. Maximum fitness]]<br />
<br />
The total number of parameters used by each genotype at any given time step is shown in Figure 5. It suggests that flat representations tend to add more connections over time and most likely those connections correspond to convolutional operations which in turn require more parameters than other primitive operations.<br />
<br />
[[File:numparameters.PNG | 300px|thumb|center|Figure 5. Number of parameters]]<br />
<br />
Each time step (either initialization or evolutionary) in the search framework takes one GPU one hour, which covers the four training and evaluation rounds for a single candidate. The authors therefore used 200 GPUs simultaneously to complete 7000 time steps in 35 hours. Considering the three types of cell (hierarchical, flat, and parameter-constrained flat), approximately 20000 GPU-hours would be required to replicate the experiment.<br />
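<br />
The figures in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative, and the calculation assumes perfect parallel efficiency across the 200 GPUs.<br />
<br />
```python
time_steps = 7000          # 200 initialization + 6800 evolutionary time steps
gpu_hours_per_step = 1     # four training-evaluation rounds for one candidate
num_gpus = 200
cell_types = 3             # hierarchical, flat, parameter-constrained flat

gpu_hours_per_search = time_steps * gpu_hours_per_step
wall_clock_hours = gpu_hours_per_search / num_gpus
total_gpu_hours = gpu_hours_per_search * cell_types

print(wall_clock_hours)    # 35.0
print(total_gpu_hours)     # 21000
```
<br />
The total of 21000 GPU-hours is consistent with the rounded figure of approximately 20000 quoted above.<br />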
<br />
==Architecture evaluation on CIFAR-10 and ImageNet==<br />
<br />
Once the evolutionary search finds the best-fitted cells, they are plugged into the two larger host models to evaluate their performance in those more complex architectures. The first large model (Figure 6) is targeted at image classification on the CIFAR-10 dataset and the second model (Figure 7) at image classification on the ImageNet dataset. Although all the parameters in these two larger host models are trained from scratch, including those within the cells, the cell architectures are not changed, since their structure was found to be the best during the evolutionary search.<br />
<br />
The large CIFAR-10 model is trained with stochastic gradient descent for 80K training steps with a decreasing learning rate. To account for the non-negligible standard deviation found when evaluating the exact same model, the percentage of error is determined as the average of five training-evaluation runs.<br />
<br />
[[File:largecifar10.PNG | 500px|thumb|center|Figure 6. Large CIFAR-10 model]]<br />
<br />
The ImageNet model is trained with stochastic gradient descent for 200K training steps with a decreasing learning rate. For this model, neither standard deviation nor multiple training-evaluation runs were reported.<br />
<br />
[[File:imagenetmodel.PNG | 600px|thumb|center|Figure 7. ImageNet model]]<br />
<br />
In [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2] three types of cells were described: hierarchical, flat, and parameter-constrained flat. For the hierarchical type of cells, the percentage of error in both large models is reported in Table 1 for four different cases: a cell with random architecture, the best-fitted cell from 200 random architectures, the best-fitted cell from 7000 random architectures, and the best-fitted cell after 7000 evolutionary steps. On the other hand, for the flat and parameter-constrained flat types of architecture, only some of the mentioned four cases are reported in Table 1.<br />
<br />
[[File:comparisoncells.PNG | 750px|thumb|center|Table 1. Comparison between types of cells and searching method]]<br />
<br />
According to the results in Table 1, for both large host models, the hierarchical cell found by the evolutionary search algorithm achieved the lowest errors with 3.75% in CIFAR-10, 20.3% top-1 error and 5.2% top-5 error in ImageNet. The errors reported in both datasets are calculated by using the trained large models on test sets of images never seen before during any of the previous stages. Even though the cell that follows a hierarchical representation achieved the lowest error, the ones showing the lowest standard deviations are those following a flat representation.<br />
<br />
The performance achieved by the large CIFAR-10 host model using the best cell is then compared against other classifiers in Table 2. As an additional improvement, the authors increased the number of channels in its first convolutional layer from 64 to 128. It is worth noting that this first convolutional layer is not part of the cell obtained during the evolutionary search process; instead, it is part of the original host model. The results are grouped into three categories depending on how the classifiers involved in the comparison were created, from top to bottom: handcrafted, reinforcement learning, and evolutionary algorithms.<br />
<br />
[[File:comparisonlargecifar10.PNG | 500px|thumb|center|Table 2. Comparison against other classifiers on CIFAR-10]]<br />
<br />
The classification error achieved by the ImageNet host model when using the best cell is also compared against some high-performing image classifiers in the literature, and the results are presented in Table 3. Although the classification error scored by the architecture introduced in this paper is not significantly lower than those obtained by state-of-the-art classifiers, it shows outstanding results considering that it is not a hand-engineered structure.<br />
<br />
[[File:comparisonimagenet.PNG | 500px|thumb|center|Table 3. Comparison against other classifiers on ImageNet]]<br />
<br />
A visualization of the evolved hierarchical cell is shown below. The detailed visualizations of each motif can be seen in Appendix A of the paper. It can be noted that motif 4 directly links the input and output, and itself contains (among other operations) an identity mapping from input to output. Many other such 'skip connections' can be seen.<br />
<br />
[[File:WF_SecCont_03_hier_vis.png]]<br />
<br />
=Conclusion=<br />
<br />
A new evolutionary framework is introduced for searching neural network architectures over search spaces defined by flat and hierarchical representations of a convolutional cell, which uses smaller operations instead of larger ones as building blocks. Experiments show that the proposed framework achieves competitive results against state-of-the-art classifiers on the CIFAR-10 and ImageNet datasets.<br />
<br />
Also, compared to contemporary RL-based architecture search approaches, the proposed approach is generally faster with comparable performance.<br />
<br />
=Critique=<br />
<br />
While the method introduced in this paper achieves a lower error than other evolutionary methods, its results are not significantly better than those obtained by handcrafted design or reinforcement learning. A more in-depth analysis considering the number of parameters and required computational resources would be necessary to accurately compare the listed methods. I believe the authors could have said more about the advantages over reinforcement learning. <br />
<br />
The paper does not provide enough reasons why the authors chose these two specific search algorithms. A more efficient search may be available, which could lead to better performance. Since the performance of the algorithm is not significantly better than that of previous handcrafted architectures, this is a possible technical improvement.<br />
<br />
In [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3] it is not clear why the results for the four different cases that are reported for the hierarchical cells in Table 1 are not reported for the ones following a flat representation, considering that the flat cells showed a better performance during the evolutionary search. Recall that the four cases are: a cell with random architecture, the best-fitted cell from 200 random architectures, the best-fitted cell from 7000 random architectures, and the best-fitted cell after 7000 evolutionary steps.<br />
<br />
It seems contradictory that the flat cells, which clearly performed better than the hierarchical ones during the architecture search ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_search_on_CIFAR-10 section 4.2]), are not the ones scoring the lowest error when evaluated on the two large host models ([https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Representations_for_Efficient_Architecture_Search#Architecture_evaluation_on_CIFAR-10_and_ImageNet section 4.3]).<br />
<br />
= References =<br />
<br />
# Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, Koray Kavukcuoglu, https://arxiv.org/abs/1711.00436.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering&diff=42371DCN plus: Mixed Objective And Deep Residual Coattention for Question Answering2018-12-10T23:44:48Z<p>Msminhas: Editorial</p>
<hr />
<div>== Introduction ==<br />
Question Answering (QA) is one of the challenging computer science tasks that require an understanding of natural language and the ability to reason efficiently. To accurately answer the question, the model must first have a detailed understanding of the context the question is being asked from. Because the questions are usually very detailed, having only a shallow knowledge of the context would lead to poor and unacceptable performance. Moreover, the model should gather all the information provided in the question and match it with its knowledge from the context. Generating the answer is another interesting task. Depending on the dataset the model is meant for, the output of the model might be in a completely different form.<br />
In the past years, QA datasets have improved significantly. Previous datasets were really simple and usually did not simulate real-world question-answer pairs. For example, the Children's Book Test was one of the popular datasets used for QA for a long time, but the real task for this dataset was just to fill empty spaces in given sentences with the appropriate words. During the past years, the importance of QA tasks and their practical uses encouraged many to gather and crowdsource more useful and realistic datasets. The Stanford Question Answering Dataset (SQuAD), Microsoft Machine Reading Comprehension Dataset (MS MARCO), and Visual Question Answering Dataset (VQA) are only a few examples of the currently advanced datasets.<br />
As a result of these advancements, many researchers are focusing on improving the performance of question answering models on these datasets. Deep neural networks were able to outperform human accuracy on a few of these datasets, but in many cases, there is still a gap between the state of the art and human performance. Previously, Dynamic Coattention Networks (DCN) proved to be efficient on SQuAD, achieving state-of-the-art performance at the time. In this work, DCN is modified further to improve its accuracy by proposing a mixed objective that combines cross-entropy loss with self-critical policy learning. Moreover, the rewards used are based on word overlap, to address the misalignment between the evaluation metric and the optimization objective.<br />
<br />
==Overview of previous work==<br />
Most current QA models are built from different modules that are usually stacked on top of each other. Improving one of the modules would improve the overall performance of the model. Thus, to evaluate the efficiency of an improvement, researchers usually take a previously submitted model and replace the corresponding module with their own improved one. This is mostly because QA is an interesting discipline and has practical uses.<br />
<br />
The state-of-the-art approaches to this problem can be divided into three categories. <br />
<br />
1. Neural Models for question answering: Models like coattention, bidirectional attention flow and self-matching attention build codependent representations of the question and the document. After building these representations, the models predict the answer by generating the start position and the end position corresponding to the estimated answer span. The generation process utilizes a pointer network. Another approach uses a dynamic decoder that iteratively proposes answers by alternating between the start position and end position estimates, which in some cases allows it to recover from initial mistakes in predictions. <br />
<br />
2. Neural Attention Models: Models like self-attention have been applied to language modeling and sentiment analysis. A deep version of the same, called deep self-attention networks, attained state-of-the-art results in machine translation. For an intuitive understanding of attention-based neural network models, refer to the following links: [https://youtu.be/SysgYptB198 Attention Models Intuition 1] [https://youtu.be/quoGRI-1l0A Attention Models Intuition 2]. Coattention, bidirectional attention and self-matching attention are some of the methods that build codependent representations of the question and the document. <br />
<br />
3. Reinforcement learning in NLP: Hierarchical RL techniques have been proposed for generating text in a simulated wayfinding domain. Deep Q-networks (DQN) have been used to learn policies in text-based games using game rewards as feedback. A neural conversational model has been proposed that is trained using policy gradient methods and whose reward function consists of heuristics for ease of answering, information flow, and semantic coherence. General actor-critic temporal-difference methods for sequence prediction have also been explored, performing metric optimization on language modeling and machine translation. Direct word overlap metric optimization has also been applied to summarization and machine translation.<br />
<br />
==Important Terms==<br />
<br />
#Embedding layer: This layer maps each word (or image, in the case of visual QA) to a vector space. There are many options to choose from for the embedding layer. While pre-trained GloVe or Word2Vec embeddings showed promising results on many tasks, most models use a combination of GloVe and character-level embeddings. The character-level embeddings are especially useful when dealing with out-of-vocabulary words. In the case of images, the embeddings are usually generated using pre-trained ResNets. Using different embedding layers for images has been shown to change the overall performance of the model drastically.<br />
#Contextual layer: The purpose of this layer is to add more features to each word embedding based on the surrounding words and the context. This layer is not present in many models including the DCN.<br />
#Attention layer: There has been a lot of investigation into attention mechanisms in recent years. These works, mostly inspired by Bahdanau et al. (2014), try either to modify the basic matrix-based attention mechanism or to develop innovative ones. The sole purpose of the attention mechanism is to make the model able to understand a context based on information gathered from somewhere else. For example, in image-based QA, the attention layer helps the model understand the question based on the information provided in the image, such as object classes. This way, the model can realize what parts of the question are more important. This model uses '''co-attention layers''' (Xiong, 2017). Given two input sources (text and question), internal representations are built conditioned on one of the sources. In a way, this can be thought of as retaining (attending to) the parts of the input that are relevant to the other source. From the text, only parts that are 'useful' to the question are kept, while from the question, parts that are useful for the text are retained. The intuition stems from the fact that it is easier to answer a question from a text when the question is known beforehand than when the question is only available at the end. In the former case, only information relevant to the question is kept, while in the latter case, all information from the text needs to be kept.<br />
#Output layer: This is the final layer of all models, generating the answer to the question based on the information provided by all the previous layers.<br />
<br />
==DCN+ structure==<br />
The DCN+ is an improvement on the previous DCN model. The overall structure of the model is the same as before. The first improvement is on the coattention module. By introducing a deep residual coattention encoder, the output of the attention layer becomes more feature-rich. The second improvement is achieved by mixing the previous cross-entropy loss with reinforcement learning rewards from self-critical policy learning. DCN+ has a decoder module that is only applicable to the SQuAD dataset since the decoder only predicts an answer span from the given context.<br />
<br />
===Deep residual coattention encoder===<br />
The previous coattention module was unable to grasp complex information based on the context and the question. Recent studies showed that stacked attention mechanisms are outperforming the single layer attention modules. In DCN+, the coattention module is stacked to make it able to self-attend to the context and grasp more information. The second modification is to use residual connectors when merging the coattention output from each layer.<br />
<br />
[[File:Coattention.png|700px|centre]]<br />
<br />
Let <math>L^D \in R^{m×d}</math> and <math>L^Q \in R^{n×d}</math> denote the word embeddings for the context and the question respectively. Here, <math>d, m, n</math> are the embedding vector size, document word count, and question word count respectively. The model uses a bidirectional LSTM as the contextual layer with shared weights. Also, an additional sentinel token is added at the end of the document and question to make it possible for the model to distinguish between the document and question. <math>E^D</math> and <math>E^Q</math> are the outputs of the encoder (contextual) layer.<br />
<br />
\begin{align}<br />
E_1^D = BiLSTM_1(L^D) \in R^{(h×(m+1))}<br />
\end{align}<br />
\begin{align}<br />
E_1^Q = tanh(W \, BiLSTM_1(L^Q)) \in R^{h×(n+1)}<br />
\end{align}<br />
<br />
Here <math>h</math> is the hidden size of the LSTM. The affinity matrix is created based on the output of the encoder. The affinity matrix is the matrix that has been used in attention modules since the introduction of attention. By performing a column-wise softmax function on the affinity matrix, a vector would be generated that represents the importance of each question token, based on the model's understanding of the context. Similarly, if a row-wise softmax function is applied to the affinity matrix, the output vector would represent the importance of each context word, based on the question. By multiplying these vectors with the outputs of the encoder layer, question-aware context and context-aware question representations would be created.<br />
<br />
\begin{align}<br />
A = {(E_1^D)}^T E_1^Q \in R^{(m+1)×(n+1)}<br />
\end{align}<br />
\begin{align}<br />
{S_1^D} = E_1^Q softmax(A^T) \in R^{h×(m+1)}<br />
\end{align}<br />
\begin{align}<br />
{S_1^Q} = E_1^D softmax(A) \in R^{h×(n+1)}<br />
\end{align}<br />
<br />
To make the question-aware context representation even deeper and more feature-rich, an output (called the co-attention context, <math> C_1^D </math>) of the first co-attention layer is fed directly into the decoder using a residual connection. <br />
<br />
\begin{align}<br />
{C_1^D} = S_1^Q softmax(A^T) \in R^{h×m}<br />
\end{align}<br />
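<br />
The first co-attention layer can be sketched in NumPy as follows. This is only a minimal illustration of the equations above, not the authors' implementation: the function names <code>softmax_cols</code> and <code>coattention</code>, the toy sizes, and the random inputs standing in for the encoder outputs are all assumptions made for the example. The softmax is taken over each column, as described in the text.<br />

```python
import numpy as np

def softmax_cols(x):
    """Column-wise softmax: every column of the result sums to 1."""
    x = x - x.max(axis=0, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=0, keepdims=True)

def coattention(E_D, E_Q):
    """One co-attention layer (hypothetical helper following the equations above).

    E_D: h x (m+1) document encoding, E_Q: h x (n+1) question encoding,
    both including the sentinel column.
    """
    A = E_D.T @ E_Q                  # affinity matrix, (m+1) x (n+1)
    S_D = E_Q @ softmax_cols(A.T)    # document summary, h x (m+1)
    S_Q = E_D @ softmax_cols(A)      # question summary, h x (n+1)
    C_D = S_Q @ softmax_cols(A.T)    # co-attention context, h x (m+1)
    return S_D, S_Q, C_D[:, :-1]     # drop the sentinel column of C_D

# Toy sizes and random encodings, assumed purely for illustration.
rng = np.random.default_rng(0)
h, m, n = 4, 6, 3
E_D = rng.standard_normal((h, m + 1))
E_Q = rng.standard_normal((h, n + 1))
S_D, S_Q, C_D = coattention(E_D, E_Q)
print(S_D.shape, S_Q.shape, C_D.shape)
```

The shapes of the three outputs match the dimensions stated in the equations, which is a quick way to check that the matrix products are applied in the right order.<br />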
<br />
Note that the model drops the dimension corresponding to the sentinel vector. The summaries also get encoded after this stage, using two bidirectional LSTMs with shared variables.<br />
<br />
\begin{align}<br />
{E_2^D} = BiLSTM_2(S_1^D) \in R^{2h×m}<br />
\end{align}<br />
\begin{align}<br />
{E_2^Q} = BiLSTM_2(S_1^Q) \in R^{2h×n}<br />
\end{align}<br />
<br />
Finally, <math>E_2^D</math> and <math>E_2^Q</math> are fed into the second co-attention layer. Similar to the first co-attention layer, three outputs are produced, <math>S_2^D, S_2^Q, C_2^D </math>. However, <math>S_2^Q</math> is not used. These co-attention modules can easily be stacked to create a deeper attention mechanism. <br />
<br />
The outputs of the second co-attention layer are concatenated with residual connections from <math>C_1^D, S_1^D, E_2^D</math>. The final output of the model is obtained by passing the concatenated representation through another bidirectional LSTM:<br />
<br />
\begin{align}<br />
U = BiLSTM(concat(E_1^D;E_2^D;S_1^D;S_2^D;C_1^D;C_2^D)) \in R^{2h×m}<br />
\end{align}<br />
<br />
===Mixed objective using self-critical policy learning===<br />
DCN produces a distribution over the start and end positions of the answer span. Because of the dynamic nature of the decoder module, it estimates separate distributions over the start and end positions of the answer at each decoding step.<br />
<br />
\begin{align}<br />
l_{ce}(\theta) = - \sum_{t} (log \ p_t^{start}(s|s_{t-1},e_{t-1};\theta) + log \ p_t^{end}(e|s_{t-1},e_{t-1};\theta))<br />
\end{align}<br />
<br />
In the above equation, <math>s</math> and <math>e</math> denote the respective start and end points of the ground truth answer. <math>s_t</math> and <math>e_t</math> denote the greedy estimates of the start and end positions at the <math>t</math>th decoding time step. Similarly, <math>p_t^{start} \in R^m</math> and <math>p_t^{end} \in R^m</math> denote the distributions of the start and end positions respectively. The problem with the above loss function is that it does not take into account the F1 metric used to evaluate the model. There are two metrics for estimating a QA model's accuracy. The first metric is the exact match, which is a binary score. If the answer string differs from the ground truth answer by even a single character, the exact match score is zero. The second metric is the F1 score, which is basically the degree of overlap between the predicted answer and the ground truth. <br />
For example, suppose there are two answer spans in a context, <math>A</math> and <math>B</math>, and neither of them matches the ground truth positions. If A has an exact string match with the ground truth answer but B does not, the cross-entropy loss would still penalize both of them equally. However, if we include F1 scores in our calculations, the loss function would penalize B and not A. <br />
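<br />
The two metrics can be illustrated with a small Python sketch. It assumes whitespace tokenization and omits the answer normalization (lowercasing, stripping punctuation and articles) that a full evaluation script would typically apply; the function names are chosen for this example only.<br />

```python
from collections import Counter

def exact_match(prediction, truth):
    """Binary score: 1.0 only if the two answer strings are identical."""
    return float(prediction == truth)

def f1_score(prediction, truth):
    """Token-level F1: harmonic mean of precision and recall of the word overlap."""
    pred_tokens = prediction.split()
    truth_tokens = truth.split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

truth = "in the late 1990s"
prediction = "late 1990s"
print(exact_match(prediction, truth), round(f1_score(prediction, truth), 3))
# prints: 0.0 0.667
```

The partially correct prediction gets no credit under exact match but a substantial F1 score, which is the kind of distinction the cross-entropy loss over positions cannot make.<br />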
<br />
The main problem with including F1 score directly into cost functions is that it is non-differentiable. A trick from (Sutton et al.,1999; Schulman et al., 2015) is used to approximate the expected gradient. <br />
For this, DCN+ uses a self-critical reinforcement learning objective.<br />
<br />
\begin{align}<br />
l_{rl}(\theta) = -E_{\hat{\tau} \sim p_\tau} [R(s,e,\hat{s}_T,\hat{e}_T;\theta)]<br />
\end{align}<br />
<br />
\begin{align}<br />
\approx -E_{\hat{\tau} \sim p_\tau} [F_1 (ans(\hat{s}_T, \hat{e}_T), ans(s, e)) - F_1(ans(s_T, e_T), ans(s, e))]<br />
\end{align}<br />
<br />
Here <math>\hat{s} \sim p_t^{start}</math> and <math>\hat{e} \sim p_t^{end}</math> denote the start and end positions sampled from the estimated distributions at the <math>t</math>th decoding step. <math>\hat{\tau}</math> is the sequence of sampled start and end positions during all <math>T</math> decoder steps, <math>R</math> is the expected reward, and <math>F_1</math> is the F1 score between the predicted answer and the expected answer. Rather than using the raw F1 score as the reward, a baseline is subtracted from it. Previous studies show that using a baseline for the reward reduces the variance of gradient estimates and facilitates convergence. DCN+ uses a self-critic: the baseline is the F1 score produced during greedy inference by the current model.<br />
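<br />
A minimal sketch of the self-critical reward for a single decoding step is given below. It is a simplification under several assumptions: the real decoder runs for <math>T</math> steps, gradients would be computed by an automatic differentiation framework rather than NumPy, and the names <code>span_f1</code> and <code>self_critical_loss</code> are hypothetical. The surrogate loss shown (negative reward times the log-probability of the sampled span) is the usual policy-gradient form and is only meant to show how the greedy baseline enters the reward.<br />

```python
import numpy as np
from collections import Counter

def span_f1(tokens, pred, truth):
    """F1 overlap between two (start, end) spans of the same token list."""
    p = tokens[pred[0]:pred[1] + 1]
    t = tokens[truth[0]:truth[1] + 1]
    overlap = sum((Counter(p) & Counter(t)).values())
    if overlap == 0:  # also covers empty spans where end < start
        return 0.0
    precision, recall = overlap / len(p), overlap / len(t)
    return 2 * precision * recall / (precision + recall)

def self_critical_loss(tokens, p_start, p_end, truth, rng):
    """Single-step surrogate loss with a greedy (self-critical) baseline."""
    # Sampled span: drawn from the predicted distributions.
    s_hat = int(rng.choice(len(p_start), p=p_start))
    e_hat = int(rng.choice(len(p_end), p=p_end))
    # Greedy span: what the current model would output at inference time.
    s_greedy, e_greedy = int(np.argmax(p_start)), int(np.argmax(p_end))
    reward = (span_f1(tokens, (s_hat, e_hat), truth)
              - span_f1(tokens, (s_greedy, e_greedy), truth))
    log_prob = np.log(p_start[s_hat]) + np.log(p_end[e_hat])
    return -reward * log_prob, reward

# Toy context and distributions, assumed for illustration.
tokens = "she became popular in the late 1990s".split()
p_start = np.array([0.05, 0.05, 0.05, 0.5, 0.05, 0.25, 0.05])
p_end = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.7])
loss, reward = self_critical_loss(tokens, p_start, p_end, (3, 6),
                                  np.random.default_rng(0))
print(loss, reward)
```

A sampled span that beats the greedy span gets a positive reward, so minimizing the loss raises its probability; a sampled span that is worse than the greedy one is pushed down.<br />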
<br />
[[File:loss.png|700px|centre]]<br />
<br />
==Dataset==<br />
The SQuAD dataset (Reference: Stanford NLP Group) was used in training the network. The SQuAD 1.1 dataset contains 100 000 questions, based on a set of Wikipedia articles. These questions are designed to be answered by a segment of text from the article. The solution to each question is represented by a start location and the text of the answer. An example question-answer pair of the SQuAD 2.0 dataset is: Q: "When did Beyonce start becoming popular?", A: (text: "in the late 1990s", start: 269).<br />
<br />
The SQuAD 2.0 dataset augments the SQuAD 1.1 collection with 50 000 unanswerable questions, designed in an adversarial manner. Samples were generated by crowdworkers in both cases.<br />
<br />
==Experiments==<br />
To achieve optimal performance, the hyperparameters and training environment are fine-tuned. The hyperparameters of DCN are duplicated. The model was trained and evaluated using the Stanford Question Answering Dataset (SQuAD). For tokenizing the documents, the Stanford CoreNLP reversible tokenizer has been used. For word embeddings, pre-trained GloVe vectors (trained on 840B Common Crawl), as well as the character n-gram embeddings by Hashimoto et al. (2017), are used. Furthermore, these embeddings are then concatenated with context vectors (CoVe) trained on WMT. Words which are not found in the vocabulary have their embedding and context vectors set to zero. The optimizer has been set to Adam, and dropout is also applied to word embeddings, zeroing a word embedding with a probability of 0.075. PyTorch is used to build the model.<br />
<br />
==Results==<br />
At the time of submission, the model was able to achieve state-of-the-art results on the SQuAD, outperforming the second model on the leaderboard by 2.0% both on the exact match and F1 scores. It is worth mentioning that a 5% improvement was also achieved with respect to the original DCN model.<br />
<br />
[[File:dcn_resutls1.png|700px|centre]]<br />
<br />
In general, DCN+ was able to achieve a consistent performance improvement in almost every question category.<br />
<br />
[[File:dcn_results2.png|700px|centre]]<br />
<br />
The training curves for DCN+ with reinforcement learning and DCN+ without reinforcement learning are shown in Figure 4 to illustrate the effectiveness of the proposed mixed objective. <br />
<br />
[[File:dcn+.png|700px|centre]]<br />
<br />
===Ablation Study===<br />
An analysis of the significance of each part of the model found that the deep residual coattention contributed the most to the overall performance. The second highest contributor was the mixed objective. The sparse mixture of experts layer in the decoder also provided some minor contributions to improving the overall performance.<br />
<br />
==Summary and Critiques==<br />
<br />
This paper introduces a novel model for the task of question answering in which the cross-entropy loss commonly used for such problems is combined with self-critical policy learning. The rewards are obtained from word overlap to address the misalignment between the evaluation metric and the optimization objective. This paper improves the state of the art on a popular question answering dataset. The critical drawback of this paper is that it only shows experimental improvements on one question answering dataset. Previous works in the same field have considered performance on at least three different comprehensive question answering datasets. This paper is only an incremental improvement over the previous DCN algorithm, which was released a year earlier. For the policy learning objective, the authors consider the task as a multi-task learning problem where the dual losses are linearly combined. The authors should have used a weighted combination instead, as the positional match objective using cross entropy is far more important than the word overlap objective with the ground truth. Additionally, some methods adopted by the authors are not intuitive and not much explanation is given for them. For example, it is not very clear why F1 scores have been used as RL rewards as opposed to other distance objectives commonly used in previous works in the same field, such as cross entropy. The authors mention a common problem in using reinforcement learning in NLP problems. NLP domains are discontinuous and discrete domains which the agents have to repeatedly explore to find a good policy. RL is very data hungry, but NLP domains don't offer sufficient datasets for exploration in most cases. The paper says that it treats the optimization problem as a multi-task learning problem to get around the exploration problem. It is not clear how this is achieved. <br />
<br />
==Other Sources==<br />
# An easy to understand blog on the base DCN model can be found at [https://einstein.ai/research/blog/state-of-the-art-deep-learning-model-for-question-answering].<br />
# Tensorflow Source code for this model can be found at [https://github.com/andrejonasson/dynamic-coattention-network-plus]<br />
<br />
<br />
<br />
==References==<br />
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly<br />
learning to align and translate. In ICLR, 2015.<br />
<br />
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau,<br />
Aaron C. Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In<br />
ICLR, 2017.<br />
<br />
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain<br />
questions. In ACL, 2017.<br />
<br />
Nina Dethlefs and Heriberto Cuayáhuitl. Combining hierarchical reinforcement learning and bayesian networks for natural language generation in situated dialogue. In Proceedings of the 13th European Workshop on Natural Language Generation, pp. 110–120. Association for Computational Linguistics, 2011.
<br />
Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient<br />
estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530, 2001.<br />
<br />
Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. A joint many-task<br />
model: Growing a neural network for multiple NLP tasks. In EMNLP, 2017.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
<br />
Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. CoRR, abs/1705.07115, 2017.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
<br />
Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In NIPS, 1999.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. In EMNLP, 2016.
<br />
Rui Liu, Junjie Hu, Wei Wei, Zi Yang, and Eric Nyberg. Structural embedding of syntactic trees for<br />
machine comprehension. In ACL, 2017.<br />
<br />
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention<br />
for visual question answering. In NIPS, 2016.<br />
<br />
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In ACL, 2014.<br />
<br />
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In NIPS, 2017.<br />
<br />
Microsoft Asia Natural Language Computing Group. R-net: Machine reading comprehension with self-matching networks. 2017.<br />
<br />
Karthik Narasimhan, Tejas D. Kulkarni, and Regina Barzilay. Language understanding for textbased games using deep reinforcement learning. In EMNLP, 2015.<br />
<br />
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. CoRR, abs/1705.04304, 2017.<br />
<br />
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for machine comprehension of text. In EMNLP, 2016.<br />
<br />
John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In NIPS, 2015.<br />
<br />
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017.<br />
<br />
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.<br />
<br />
Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1047–1055. ACM, 2017.<br />
<br />
Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.<br />
<br />
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017. <br />
<br />
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.<br />
<br />
Shuohang Wang and Jing Jiang. Machine comprehension using match-lstm and answer pointer. In ICLR, 2017.<br />
<br />
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural qa as simple as possible but not simpler. In CoNLL, 2017.<br />
<br />
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.<br />
<br />
Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question<br />
answering. In ICLR, 2017.<br />
<br />
Stanford NLP Group. Squad 2.0: The Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuAD-explorer/. Accessed October 24, 2018.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs&diff=42370stat946F18/Beyond Word Importance Contextual Decomposition to Extract Interactions from LSTMs2018-12-10T23:41:07Z<p>Msminhas: Editorial</p>
<hr />
<div>== Introduction ==<br />
The main reason behind the recent success of Long Short-Term Memory Networks (LSTM) and deep neural networks has been their ability to model complex and non-linear interactions. Our inability to fully comprehend these relationships has led to these state-of-the-art models being regarded as black boxes. It is not always possible to know how a prediction was made, where it came from, or how to understand the workings underneath. The paper "Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs" by W. James Murdoch, Peter J. Liu, and Bin Yu proposes an interpretation algorithm called Contextual Decomposition (CD) for analyzing individual predictions made by LSTMs without any change to the underlying original model. The problem of sentiment analysis is chosen for the evaluation of the model, with a core focus of this work being the explainability of a prediction. <br />
<br />
<br />
Contextual Decomposition is the method introduced in this paper. It extracts information about which words contributed the maximum and minimum amounts towards the LSTM prediction, and also how they were combined in order to yield the final prediction. The LSTM output is mathematically decomposed and the contributions of different parts of the sentence are disambiguated at each step. In the application domain, this paper shows how the contextual decomposition method is used to successfully extract positive and negative negations from an LSTM. This paper also shows that the prior interpretation methods have document-level information built into them in complex, unspecified ways. For example, in the prior work, strongly negative phrases contained within positive reviews are viewed as neutral, or even positive.<br />
<br />
==Intuition of the paper==<br />
<br />
If we consider a sentence in an Amazon review stating "This bag was good" it is very clearly a positive sentence, where the word "good" contributes maximum towards the positivity. On the other hand, if the review reads "This bag was not good" this becomes a negative review. Note that the negative review is not highly influenced by any individual word; most of the influence stems from an interaction between two words "not" and "good". This interaction modeled by the authors gives them an extra degree of freedom. Thus, they can have a better interpretation of the model. In this paper, the authors focus on this interaction between words for studying model efficiency and explainability. (Reference, author's talk in: https://www.youtube.com/watch?v=GjpGAyJenCM).<br />
<br />
==Overview of previous work==<br />
<br />
There has been research conducted towards developing methods to understand the evaluations provided by LSTMs. Some of them are in line with the work done in this particular paper while others have followed some different approaches.<br />
<br />
Approaches similar to the one provided in this paper - All of the approaches in this category have tried to look into computing just the word-level importance scores with varying evaluation methods.<br />
#Murdoch & Szlam (2017): introduced a decomposition of the LSTM's output embedding and learned the importance score of certain words and phrases from those words (sum of the importance of words). A classifier is then built that searches for these learned phrases that are important and predicts the associated class which is then compared with the output of the LSTMs for validation.<br />
#Li et al. (2016): (Leave one out) They measured the change in the log probability predicted by the LSTM when a word vector is replaced with zero. The evaluation is entirely anecdotal. They additionally provided an RL model to find the minimal set of words that must be erased to change the model's decision (although this is irrelevant to interpretability).<br />
#Sundararajan et al. 2017 (Integrated Gradients): a general gradient based technique to learn importance evaluated theoretically and empirically. Built up as an improvement to methods which were trying to do something quite similar. It is tested on image, text and chemistry models.<br />
<br />
Decomposition-based approaches for CNN:<br />
#Bach et al. 2015: proposed a solution to the problem of understanding classification decisions of CNNs by pixel-wise decomposition. Pixel contributions are visualized as heat maps.<br />
#Shrikumar et al. 2017 (DeepLift): an algorithm based on a method similar to backpropagation to learn the importance score of the inputs for a given output. The algorithm does not depend on the gradient, so importance scores can still be learned when the gradient is zero during backpropagation.<br />
<br />
Focussing on analyzing the gate activations:<br />
#Karpathy et al. (2015) worked with character-generating LSTMs and studied the activation and firing of certain hidden units for meaningful attributes, e.g. a cell being activated to keep track of open parentheses or quotes.<br />
#Strobelt et al. 2016: built a visual tool for understanding and analyzing raw gate activations.<br />
<br />
Attention-based models:<br />
#Bahdanau et al. (2014): These are a different class of models which use attention modules (with varying architectures) to help the neural network decide which parts of the input it should look at more closely or give more importance to. As an interpretation technique, attention has two problems. Firstly, it is only an indirect indicator of importance and does not provide directionality. Secondly, it has not been evaluated, empirically or otherwise, as an interpretation technique, although it has been used in many other applications and architectures to solve a variety of problems.<br />
<br />
==Long Short-Term Memory Networks==<br />
Over the past few years, the LSTM has become a core component of neural NLP systems and sequence modeling systems in general. LSTMs are a special kind of Recurrent Neural Network (RNN) which in many cases work better than the standard RNN because they mitigate the vanishing gradient problem. Put simply, they are much better at learning long-term dependencies. Like a standard RNN, LSTMs are made up of chains of repeating modules. The difference is that the modules are a little more complicated. Instead of having a single tanh layer as in an RNN, they have four layers (three gates and a candidate update) interacting in a special way. Additionally, they have a cell state which runs through the entire chain of the network. It helps in managing the information from the previous cells in the chain.<br />
<br />
Let's now define it more formally and mathematically. Given a sequence of word embeddings <math>x_1, ..., x_T \in R^{d_1}</math>, a cell and state vector <math>c_t, h_t \in R^{d_2}</math> are computed for each element by iteratively applying the below equations, with initializations <math>h_0 = c_0 = 0</math>.<br />
<br />
\begin{align}<br />
o_t = \sigma(W_ox_t + V_oh_{t-1} + b_o)<br />
\end{align}<br />
\begin{align}<br />
f_t = \sigma(W_fx_t + V_fh_{t-1} + b_f)<br />
\end{align}<br />
\begin{align}<br />
i_t = \sigma(W_ix_t + V_ih_{t-1} + b_i)<br />
\end{align}<br />
\begin{align}<br />
g_t = tanh(W_gx_t + V_gh_{t-1} + b_g)<br />
\end{align}<br />
\begin{align}<br />
c_t = f_t \odot c_{t-1} + i_t \odot g_t<br />
\end{align}<br />
\begin{align}<br />
h_t = o_t \odot tanh(c_t)<br />
\end{align}<br />
<br />
Where <math>W_o, W_i, W_f , W_g \in R^{{d_2} \times {d_1}} , V_o, V_f , V_i , V_g \in R^{{d_2} \times {d_2}}, b_o, b_f, b_i, b_g \in R^{d_2} </math> and <math> \odot </math> denotes element-wise multiplication. <math> o_t, f_t </math> and <math> i_t </math> are often referred to as output, forget and input gates, respectively, due to the fact that their values are bounded between 0 and 1, and that they are used in element-wise multiplication. Intuitively, the forget gate controls how much of the previous memory (information) is forgotten, the input gate controls whether or not to let new input in, <math> g_t </math> determines what is added to the memory, and the output gate controls how much of the current information (at the current time step) flows out.<br />
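<br />
The update equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the parameter dictionary, the function names and the random initialization are assumptions made for the example, and the weights are stored with shape <math>d_2 \times d_1</math> so that <math>W x_t</math> is defined.<br />
<br />
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update; p is a hypothetical dict holding W_*, V_*, b_* arrays."""
    o = sigmoid(p["W_o"] @ x_t + p["V_o"] @ h_prev + p["b_o"])   # output gate
    f = sigmoid(p["W_f"] @ x_t + p["V_f"] @ h_prev + p["b_f"])   # forget gate
    i = sigmoid(p["W_i"] @ x_t + p["V_i"] @ h_prev + p["b_i"])   # input gate
    g = np.tanh(p["W_g"] @ x_t + p["V_g"] @ h_prev + p["b_g"])   # candidate update
    c = f * c_prev + i * g          # new cell state
    h = o * np.tanh(c)              # new hidden state
    return h, c

def init_params(d1, d2, seed=0):
    """Random parameters for illustration only (not a training-ready init)."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "ofig":
        p["W_" + gate] = rng.normal(scale=0.5, size=(d2, d1))
        p["V_" + gate] = rng.normal(scale=0.5, size=(d2, d2))
        p["b_" + gate] = np.zeros(d2)
    return p

def run_lstm(xs, p):
    """Apply lstm_step over a sequence, starting from h_0 = c_0 = 0."""
    d2 = p["b_o"].shape[0]
    h, c = np.zeros(d2), np.zeros(d2)
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
    return h, c
```
<br />
Because <math>o_t</math> lies in (0, 1) and tanh lies in (-1, 1), every entry of the returned hidden state has magnitude below 1.<br />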
<br />
A visualization of an LSTM can be seen below (Reference, Nvidia Corporation: Long Short-Term Memory (LSTM)). Note that both the input and recurrent elements are seen 'entering' the cell in various locations. The output, and recurrent elements can be seen 'exiting' the cell at the top of the image.<br />
<br />
[[File: WF_SecCont_01_lstm.png|600px|center]]<br />
<br />
After processing the full sequence of words, the final state <math>h_T</math> is fed to a multinomial logistic regression, to return a probability distribution over C classes.<br />
<br />
\begin{align}<br />
p_j = SoftMax(Wh_T)_j = \frac{\exp(W_jh_T)}{\sum_{k=1}^C\exp(W_kh_T) }<br />
\end{align}<br />
<br />
==Contextual Decomposition(CD) of LSTM==<br />
CD decomposes the output of the LSTM into a sum of two contributions:<br />
# those resulting solely from the given phrase<br />
# those involving other factors<br />
<br />
One important thing that is crucial to understand is that this method does not affect the architecture or the predictive accuracy of the model in any way. It just takes the trained model and tries to break it down into the two components mentioned above. It takes in a particular phrase that the user wants to understand or the entire sentence and returns the vectors with the contributions.<br />
<br />
Now let's define this more formally. Let the arbitrary input phrase be <math>x_q, ..., x_r</math>, where <math>1 \leq q \leq r \leq T </math>, where T represents the length of the sentence. CD decomposes the output and cell state (<math>c_t, h_t </math>) of each cell into a sum of 2 contributions as shown in the equations below.<br />
<br />
\begin{align}<br />
h_t = \beta_t + \gamma_t<br />
\end{align}<br />
\begin{align}<br />
c_t = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
In the decomposition <math>\beta_t </math> corresponds to the contributions given to <math> h_t </math> solely from the given phrase while <math> \gamma_t </math> denotes contributions coming at least in part from the other factors. Similarly, <math>\beta_t^c </math> and <math> \gamma_t^c </math> represent the contributions given to <math> c_t </math> solely from the given phrase and at least in part from the other factors, respectively.<br />
<br />
Using this decomposition the softmax function can be represented as follows<br />
\begin{align}<br />
p = SoftMax(W\beta_T + W\gamma_T)<br />
\end{align}<br />
<br />
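The vector <math>W\beta_T</math> can therefore be treated as a score for the contribution of the phrase to the LSTM's prediction. As this score corresponds to the input to a logistic regression, it may be interpreted in the same way as a standard logistic regression coefficient.<br />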
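<br />
As a small illustration of how the final decomposition is read off, the sketch below splits the logits into a phrase part <math>W\beta_T</math> and a remainder <math>W\gamma_T</math>. The function names are hypothetical, and the inputs are assumed to come from a CD pass over a trained model.<br />
<br />
```python
import numpy as np

def softmax(z):
    z = z - np.max(z)               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def phrase_scores(W, beta_T, gamma_T):
    """Split the logits W h_T into a phrase part and a remainder."""
    phrase_logits = W @ beta_T      # contribution attributed to the phrase
    other_logits = W @ gamma_T      # contribution involving other factors
    probs = softmax(phrase_logits + other_logits)
    return phrase_logits, other_logits, probs
```
<br />
The returned probabilities are identical to those of the undecomposed model, since <math>W\beta_T + W\gamma_T = Wh_T</math>.<br />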
<br />
===Disambiguation Interaction between gates===<br />
<br />
In the equations for the calculation of <math>i_t </math> and <math>g_t </math> in the LSTM, we use the input at that time step, <math>x_t</math>, as well as the output of the previous state, <math>h_{t-1}</math>. Therefore when <math>i_t \odot g_t</math> is calculated, the contributions made by <math>x_t</math> to <math>i_t</math> interact with the contributions made by <math>h_{t-1}</math> to <math>g_t</math> and vice versa. This insight is used to construct the decomposition.<br />
<br />
At this stage we need to assume that the non-linear operations at the gates can be represented in a linear fashion. How this is done is explained in the section on linearizing activation functions. Writing the input-gate equation as a linear sum of contributions from its inputs, we have<br />
\begin{align}<br />
i_t &= \sigma(W_ix_t + V_ih_{t-1} + b_i) \\<br />
& = L_\sigma(W_ix_t) + L_\sigma(V_ih_{t-1}) + L_\sigma(b_i)<br />
\end{align}<br />
<br />
The important thing to notice now is that after using this linearization, the products between gates also become linear sums of contributions from the 2 factors mentioned above. Expanding these products, we can tell whether each term resulted solely from the phrase (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\beta_{t-1})</math>), solely from the other factors (<math>L_\sigma(V_i\gamma_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1})</math>) or as an interaction between the phrase and other factors (<math>L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1})</math>).<br />
<br />
Just as gradient values can be calculated recursively in LSTMs, the decompositions are computed recursively, with the initializations <math>\beta_0 = \beta_0^c = \gamma_0 = \gamma_0^c = 0</math>. The derivations vary a little depending on whether the current time step is contained within the phrase (<math> q \leq t \leq r </math>) or not (<math> t < q </math> or <math> t > r</math>). In this summary, we derive the equations for the former case. It is important to understand that, given any word or phrase within a sentence, the algorithm makes a full pass over the LSTM to compute the 2 different contributions.<br />
<br />
So essentially what we are going to do now is linearize each of the gates, expand the products of sums of these gates, and finally group the resulting terms depending on which type of interaction they represent (solely from the phrase, solely from other factors, or a combination of both).<br />
<br />
Terms are determined to be derived solely from the specific phrase if they involve products from some combination of <math>\beta_{t-1}, \beta_{t-1}^c, x_t</math> and <math> b_i </math> or <math> b_g </math> (but not both). In the other case, when t is not within the phrase, products involving <math> x_t </math> are treated as not deriving from the phrase. The equations for this other case are given in the appendix of the original paper.<br />
<br />
\begin{align}<br />
f_t\odot c_{t-1} &= (L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_f\gamma_{t-1}) + L_\sigma(b_f)) \odot (\beta_{t-1}^c + \gamma_{t-1}^c) \\<br />
& = ([L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(b_f)] \odot \beta_{t-1}^c) + (L_\sigma(V_f\gamma_{t-1}) \odot \beta_{t-1}^c + f_t \odot \gamma_{t-1}^c) \\<br />
& = \beta_t^f + \gamma_t^f<br />
\end{align}<br />
<br />
Similarly<br />
<br />
\begin{align}<br />
i_t\odot g_t &= [L_\sigma(W_ix_t) + L_\sigma(V_i\beta_{t-1}) + L_\sigma(V_i\gamma_{t-1}) + L_\sigma(b_i)] \\<br />
& \odot [L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(V_g\gamma_{t-1}) + L_{tanh}(b_g)] \\<br />
& = (L_\sigma(W_ix_t) \odot [L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g)] \\ <br />
&+ L_\sigma(V_i\beta_{t-1}) \odot [L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1}) + L_{tanh}(b_g)] \\ <br />
&+ L_\sigma(b_i) \odot [L_{tanh}(W_gx_t) + L_{tanh}(V_g\beta_{t-1})]) \\ <br />
&+ (L_\sigma(V_i\gamma_{t-1}) \odot g_t + i_t \odot L_{tanh}(V_g\gamma_{t-1}) - L_\sigma(V_i\gamma_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1}) \\ <br />
&+ L_\sigma(b_i) \odot L_{tanh}(b_g)) \\<br />
& = \beta_t^u + \gamma_t^u<br />
\end{align}<br />
<br />
Thus we can represent <math>c_t</math> as<br />
<br />
\begin{align}<br />
c_t &= \beta_t^f + \gamma_t^f + \beta_t^u + \gamma_t^u \\<br />
& = \beta_t^f + \beta_t^u + \gamma_t^f + \gamma_t^u \\<br />
& = \beta_t^c + \gamma_t^c<br />
\end{align}<br />
<br />
So once we have the decomposition of <math> c_t </math>, then we can rather simply calculate the transformation of <math> h_t </math> by linearizing the <math> tanh</math> function. Again at this point, we just assume that a linearizing function for <math> tanh </math> exists. Similar to the decomposition of the forget and input gate we can decompose the output gate as well but empirically it was found that it did not produce improved results. So finally <math> h_t </math> can be written as<br />
<br />
\begin{align}<br />
h_t &= o_t \odot tanh(c_t) \\<br />
& = o_t \odot [L_{tanh}(\beta_t^c) + L_{tanh}(\gamma_t^c)] \\<br />
& = o_t \odot L_{tanh}(\beta_t^c) + o_t \odot L_{tanh}(\gamma_t^c) \\<br />
& = \beta_t + \gamma_t<br />
\end{align}<br />
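<br />
The recursion derived above can be sketched end to end in NumPy. This is a hedged illustration under stated assumptions, not the authors' code: the parameter dictionary and all function names are hypothetical, phrase indices are 0-based, the linearization averages over orderings in which the bias comes first (the restriction this summary reports as working better), and the out-of-phrase case simply moves the <math>x_t</math> pieces to the <math>\gamma</math> side.<br />
<br />
```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linearize(act, terms, bias=None):
    """Split act(bias + sum(terms)) into one piece per term by averaging
    differences of partial sums over all orderings (bias always first).
    Without a bias the pieces sum to act(total) - act(0), so that variant
    is only used with tanh here."""
    start = np.zeros_like(terms[0]) if bias is None else bias
    pieces = [np.zeros_like(terms[0]) for _ in terms]
    perms = list(itertools.permutations(range(len(terms))))
    for perm in perms:
        partial = start
        for k in perm:
            pieces[k] += act(partial + terms[k]) - act(partial)
            partial = partial + terms[k]
    pieces = [piece / len(perms) for piece in pieces]
    return pieces if bias is None else pieces + [act(bias)]

def split_gate(pieces, in_phrase):
    """Group [W x_t, V beta, V gamma, b] pieces: phrase part, bias part, full gate."""
    wx, vb, vg, b = pieces
    phrase = (vb + wx) if in_phrase else vb
    return phrase, b, wx + vb + vg + b

def cd_pass(xs, p, q, r):
    """Return (beta_T, gamma_T) for the phrase xs[q..r] (0-based, inclusive)."""
    d2 = p["b_o"].shape[0]
    beta, gamma, beta_c, gamma_c = (np.zeros(d2) for _ in range(4))
    for t, x in enumerate(xs):
        in_phrase = q <= t <= r
        gate = {}
        for n, act in (("f", sigmoid), ("i", sigmoid), ("g", np.tanh)):
            terms = [p["W_" + n] @ x, p["V_" + n] @ beta, p["V_" + n] @ gamma]
            gate[n] = split_gate(linearize(act, terms, p["b_" + n]), in_phrase)
        (f_ph, f_b, f), (i_ph, i_b, i), (g_ph, g_b, g) = gate["f"], gate["i"], gate["g"]
        # Forget gate: phrase and bias pieces acting on the phrase part of c_{t-1}.
        beta_f = (f_ph + f_b) * beta_c
        gamma_f = f * (beta_c + gamma_c) - beta_f
        # Input gate: products of phrase/bias pieces, excluding bias times bias.
        beta_u = i_ph * (g_ph + g_b) + i_b * g_ph
        gamma_u = i * g - beta_u
        # The output gate is left undecomposed, as in the text.
        o = sigmoid(p["W_o"] @ x + p["V_o"] @ (beta + gamma) + p["b_o"])
        beta_c, gamma_c = beta_f + beta_u, gamma_f + gamma_u
        tanh_beta, tanh_gamma = linearize(np.tanh, [beta_c, gamma_c])
        beta, gamma = o * tanh_beta, o * tanh_gamma
    return beta, gamma

def lstm_forward(xs, p):
    """Plain LSTM forward pass, used only to check beta_T + gamma_T = h_T."""
    d2 = p["b_o"].shape[0]
    h, c = np.zeros(d2), np.zeros(d2)
    for x in xs:
        pre = {n: p["W_" + n] @ x + p["V_" + n] @ h + p["b_" + n] for n in "ofig"}
        c = sigmoid(pre["f"]) * c + sigmoid(pre["i"]) * np.tanh(pre["g"])
        h = sigmoid(pre["o"]) * np.tanh(c)
    return h

def random_params(d1, d2, seed=0):
    """Random stand-in for a trained model (illustration only)."""
    rng = np.random.default_rng(seed)
    p = {}
    for n in "ofig":
        p["W_" + n] = rng.normal(size=(d2, d1))
        p["V_" + n] = rng.normal(size=(d2, d2))
        p["b_" + n] = rng.normal(size=d2)
    return p
```
<br />
In this sketch the <math>\gamma</math> terms are obtained as the remainder of each product, which is algebraically the same grouping as in the equations above. It also makes the defining property easy to check: for any phrase, the two returned vectors sum to the hidden state of the ordinary forward pass.<br />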
<br />
===Linearizing activation functions ===<br />
<br />
This section will explain the big assumption that we took earlier about the linearizing functions <math> L_{\sigma} </math> and <math> L_{tanh} </math>. For arbitrary { <math> y_1, ..., y_N </math> } <math> \in R </math>, the problem that we intend to solve is essentially<br />
\begin{align}<br />
tanh(\sum_{i=1}^Ny_i) = \sum_{i=1}^NL_{tanh}(y_i)<br />
\end{align}<br />
<br />
In cases where {<math> y_i </math>} follow a natural ordering, the linearization of Murdoch & Szlam (2017), which uses the difference of partial sums, could be used. It is given by the equation below<br />
\begin{align}<br />
L^{'}_{tanh}(y_k) = tanh(\sum_{j=1}^ky_j) - tanh(\sum_{j=1}^{k-1}y_j)<br />
\end{align}<br />
<br />
But in our case the terms do not follow any particular ordering, e.g. while calculating <math> i_t </math> we could write it as a sum of <math> W_ix_t, V_ih_{t-1}, b_i </math> or <math> b_i, V_ih_{t-1}, W_ix_t </math>. Thus, we average over all the possible orderings. Let <math>\pi_1, ..., \pi_{M_N} </math> denote the set of all permutations of <math>1, ..., N</math>, then the score could be given as below<br />
<br />
\begin{align}<br />
L_{tanh}(y_k) = \frac{1}{M_N}\sum_{i=1}^{M_N}[tanh(\sum_{j=1}^{\pi_i^{-1}(k)}y_{\pi_i(j)}) - tanh(\sum_{j=1}^{\pi_i^{-1}(k) - 1}y_{\pi_i(j)})]<br />
\end{align}<br />
<br />
We can similarly derive <math> L_{\sigma} </math>. An important empirical observation to note here is that in the case when one of the terms of the decomposition is a bias, improvements were seen when restricting to permutations where the bias was the first term.<br />
In our case, the value of N only ranges from 2 to 4, which makes the linearization take very simple forms. An example of a case where N=2 is shown below.<br />
<br />
\begin{align}<br />
L_{tanh}(y_1) = \frac{1}{2}([tanh(y_1) - tanh(0)] + [tanh(y_2 + y_1) - tanh(y_2)])<br />
\end{align}<br />
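<br />
As a quick check of the N=2 case (a short derivation added for illustration, using only <math>tanh(0) = 0</math>), the two orderings give the pieces below, and they sum to the original activation, which is exactly the property required of the linearization:<br />
\begin{align}<br />
L_{tanh}(y_1) &= \frac{1}{2}(tanh(y_1) + tanh(y_1 + y_2) - tanh(y_2)) \\<br />
L_{tanh}(y_2) &= \frac{1}{2}(tanh(y_2) + tanh(y_1 + y_2) - tanh(y_1)) \\<br />
L_{tanh}(y_1) + L_{tanh}(y_2) &= tanh(y_1 + y_2)<br />
\end{align}<br />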
<br />
==Experiments==<br />
As mentioned earlier, the empirical validation of CD is done on the task of sentiment analysis. The paper verifies the following 3 tasks with the experiments:<br />
# It should work on the standard problem of word-level importance scores<br />
# It should behave for words as well as phrases especially in situations involving compositionality.<br />
# It should be able to extract instances of positive and negative negation.<br />
<br />
An important fact worth mentioning again is that the primary objective of the paper is to produce meaningful interpretations on a pre-trained LSTM model rather than achieving the state-of-the-art results on the task of sentiment analysis. Therefore standard practices are used for tuning the models. The models are implemented in Torch using default parameters for weight initializations. The code can be found at "https://github.com/jamie-murdoch/ContextualDecomposition". The model was trained using Adam with the learning rate of 0.001 and using early stopping on the validation set. Additionally, a bag of words linear model was used.<br />
<br />
All the experiments were performed on the Stanford Sentiment Treebank (SST) [Socher et al., 2013] dataset and the Yelp Polarity (YP) [Zhang et al., 2015] dataset. SST is a standard NLP benchmark which consists of movie reviews ranging from 2 to 52 words long. The SST dataset has one key feature that makes it well suited to this task: in addition to review-level labels, it also provides labels for each phrase in the binarized constituency parse tree. This enables us to examine whether the model can identify negative phrases within a positive review, or vice versa. The word embeddings used in the LSTM are pretrained GloVe vectors of length 300, and the hidden representation of the LSTM has size 168. The LSTM model attained an accuracy of 87.2% whereas the logistic regression model with bag-of-words features attained an accuracy of 83.2%. In the case of YP, the task is binary sentiment classification. Only reviews of at most 40 words were considered. The LSTM model attained a 4.6% error as compared to the 5.7% error for the regression model.<br />
<br />
<br />
===Baselines===<br />
The interpretations are compared with 4 state-of-the-art baselines for interpretability.<br />
# Cell Decomposition(Murdoch & Szlam, 2017), <br />
# Integrated Gradients (Sundararajan et al., 2017),<br />
# Leave One Out (Li et al., 2016),<br />
# Gradient times input [gradient of the output probability with respect to the word embeddings is computed which is finally reported as a dot product with the word vector]<br />
<br />
To obtain phrase scores for word-based baselines integrated gradients, cell decomposition, and gradients, the paper sums the scores of the words contained within the phrase.<br />
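<br />
To make the word-based baselines concrete, the sketch below computes gradient-times-input scores and sums them into a phrase score. It is an illustration under assumptions: the gradient is approximated by central finite differences so that any black-box probability function can be used, and <code>toy_prob</code> is a made-up stand-in for a trained model.<br />
<br />
```python
import numpy as np

def gradient_times_input(prob, word_vecs, eps=1e-6):
    """Score word k as (d prob / d x_k) . x_k, using finite differences."""
    scores = np.zeros(word_vecs.shape[0])
    for k in range(word_vecs.shape[0]):
        grad = np.zeros(word_vecs.shape[1])
        for d in range(word_vecs.shape[1]):
            up, down = word_vecs.copy(), word_vecs.copy()
            up[k, d] += eps
            down[k, d] -= eps
            grad[d] = (prob(up) - prob(down)) / (2 * eps)
        scores[k] = grad @ word_vecs[k]
    return scores

def phrase_score(word_scores, q, r):
    """Phrase score for a word-based method: sum over words q..r (0-based)."""
    return float(np.sum(word_scores[q:r + 1]))

def toy_prob(word_vecs, w=np.array([1.0, -1.0])):
    """Hypothetical model: logistic regression on the summed word vectors."""
    return 1.0 / (1.0 + np.exp(-(word_vecs.sum(axis=0) @ w)))
```
<br />
Because the phrase score is a plain sum of word scores, such baselines cannot assign a phrase anything beyond what its individual words receive.<br />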
<br />
<br />
===Unigram(Word) Scores===<br />
Logistic regression (LR) coefficients, while being sufficiently accurate for prediction, are considered the gold standard for interpretability. For the task of sentiment analysis, the importance of the words is given by their coefficient values. Thus we would expect the CD scores extracted from an LSTM to have a meaningful relationship with the logistic regression coefficients. This comparison is done using scatter plots (Figure 4), in which each point represents a word, together with the Pearson correlation coefficient between the LR coefficients and the importance scores extracted from the LSTM. For SST, CD and Integrated Gradients, with correlations of 0.76 and 0.72, respectively, are substantially better than other methods, which have correlations of at most 0.51. On Yelp, the gap is not as big, but CD is still very competitive, having correlation 0.52 with other methods ranging from 0.34 to 0.56. The complete results are shown in Table 4.<br />
[[File:Dhruv Table4.png|600px|centre]]<br />
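<br />
For reference, the Pearson correlation used in this comparison can be computed as in the sketch below. The function name is hypothetical, and its two arguments are assumed to hold, for the same list of words, the LR coefficients and the importance scores produced by an interpretation method.<br />
<br />
```python
import numpy as np

def pearson(lr_coefs, word_scores):
    """Pearson correlation between two equally long score vectors."""
    a = np.asarray(lr_coefs, dtype=float)
    b = np.asarray(word_scores, dtype=float)
    a_c, b_c = a - a.mean(), b - b.mean()      # center both vectors
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))
```
<br />
A value near 1 means the method ranks and scales words much as the logistic regression does.<br />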
<br />
===Benefits===<br />
Having verified reasonably strong results in the base case, the paper then proceeds to show the benefits of CD.<br />
====Identifying Dissenting Subphrases====<br />
First, the paper shows that the existing methods are not able to recognize sub-phrases with a different sentiment inside a phrase (a phrase is considered to be of at most 5 words). For example, consider the phrase "used to be my favorite". The word "favorite" is strongly positive, which is also shown by its high logistic regression coefficient. Nonetheless, the existing methods identify "favorite" as being highly negative or neutral in this context. However, as shown in Table 1, CD is able to correctly identify it as strongly positive, and the subphrase "used to be" as highly negative. The ability to model such interactions is itself the main reason for using LSTMs over other methods in text comprehension. Thus, it is quite important that an interpretation algorithm is able to properly uncover how these interactions are being handled. A search across the datasets is done to find similar cases where a negative phrase contains a positive sub-phrase and vice versa. Phrases are scored using the logistic regression over n-gram features and included if their overall score is over 1.5.<br />
[[File:Dhruv Table1.png|600px|centre]]<br />
Note that for an effective interpretation algorithm, the distributions of scores for these positive and negative dissenting subphrases should be clearly separated, with the positive subphrases having positive scores and vice versa. However, as shown in Figure 2, this is not the case with the previous interpretation algorithms.<br />
<br />
====Examining High-Level Compositionality====<br />
The paper now studies the cases where a sizable portion of a review (between one-third and two-thirds) has a different polarity from the final sentiment. An example is shown in Table 2. SST contains phrase-level sentiment labels too. Therefore the authors conduct a search in SST for reviews in which a sizable phrase is of the opposite polarity to the review-level label. Figure 3 shows the distribution of the resulting positive and negative phrases for different attribution methods. A successful interpretation method would have a sizeable gap between these two distributions. Notice that the previous methods fail to satisfy this criterion. The paper additionally provides a two-sample Kolmogorov-Smirnov one-sided test statistic to quantify this difference in performance. This statistic is a common measure of the difference between distributions, with values ranging from 0 to 1. As shown in Figure 3, CD gets a score of 0.74 while the other models achieve a score of 0 (Cell decomposition), 0.33 (Integrated Gradients), 0.58 (Leave One Out) and 0.61 (gradient). Leave one out and gradient perform relatively better than the other 2 baselines here, but they were the weakest performers in the unigram scores. This inconsistency in the other methods' performance further supports the case for CD.<br />
[[File:Dhruv Table2.png|600px|centre]]<br />
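<br />
The one-sided two-sample Kolmogorov-Smirnov statistic mentioned above can be sketched as follows. The direction is an assumption made for this example: the statistic is the largest amount by which the empirical CDF of the negative-phrase scores exceeds that of the positive-phrase scores.<br />
<br />
```python
import numpy as np

def ks_one_sided(neg_scores, pos_scores):
    """max over x of F_neg(x) - F_pos(x), with empirical CDFs F_neg and F_pos."""
    neg = np.sort(np.asarray(neg_scores, dtype=float))
    pos = np.sort(np.asarray(pos_scores, dtype=float))
    grid = np.concatenate([neg, pos])           # CDFs only change at sample points
    cdf_neg = np.searchsorted(neg, grid, side="right") / len(neg)
    cdf_pos = np.searchsorted(pos, grid, side="right") / len(pos)
    return float(np.max(cdf_neg - cdf_pos))
```
<br />
The value is 1 when every negative phrase scores below every positive phrase and 0 when the two samples coincide.<br />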
<br />
====Capturing Negation====<br />
The paper also shows empirically how LSTMs capture negations in a sentence. To search for negations, the following list of negation words was used: not, n’t, lacks, nobody, nor, nothing, neither, never, none, nowhere, remotely. Again using the phrase labels present in SST, the authors search over the training set for instances of negation. Both positive and negative negations are identified. For a given negation phrase, they extract a negation interaction by computing the CD score of the entire phrase and subtracting the CD scores of the phrase being negated and the negation term itself. The resulting score can be interpreted as an n-gram feature. Apart from CD, only leave one out is capable of producing such interaction scores. The distribution of extracted scores is presented in Figure 1. For CD there is a clear distinction between positive and negative negations. Leave one out is able to capture some of the interactions, but has a noticeable overlap between positive and negative negations around zero, indicating a high rate of false negatives.<br />
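<br />
The negation interaction described above is a simple difference of three scores. In the sketch below, <code>cd_score</code> is a hypothetical callable mapping a tuple of tokens to a scalar CD score, and the numbers in the toy table are made up purely to show the arithmetic.<br />
<br />
```python
def negation_interaction(cd_score, negation_term, negated_phrase):
    """CD(whole phrase) - CD(negated phrase) - CD(negation term)."""
    whole = cd_score(tuple(negation_term) + tuple(negated_phrase))
    return whole - cd_score(tuple(negated_phrase)) - cd_score(tuple(negation_term))

# Made-up scores, for illustration only.
TOY_SCORES = {("not", "good"): -2.0, ("good",): 1.5, ("not",): -0.2}

def toy_cd_score(tokens):
    return TOY_SCORES[tuple(tokens)]
```
<br />
With the toy table, the interaction for "not good" is -2.0 - 1.5 + 0.2 = -3.3, that is, the part of the phrase score that neither word explains on its own.<br />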
<br />
====Identifying Similar Phrases====<br />
A key aspect of the CD algorithm is that it yields <math> \beta_t </math>, which is essentially a dense embedding vector for a word or a phrase. The authors compute the average AVG(<math> \beta_t </math>) and then use a similarity measure in the embedding space (e.g. cosine similarity) to find phrases or words similar to a given phrase or word. The results, shown in Table 3, are qualitatively sensible for 3 different kinds of interactions: positive negation, negative negation and modification, as well as for positive and negative words.<br />
[[File:Dhruv Table3.png|600px|centre]]<br />
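<br />
The similarity search can be sketched as below. The assumptions are that each phrase comes with a list of <math>\beta</math> vectors that are averaged into one embedding, and that candidates are ranked by cosine similarity; the function names are hypothetical.<br />
<br />
```python
import numpy as np

def phrase_embedding(betas):
    """Average a list of beta vectors into a single phrase embedding."""
    return np.mean(np.asarray(betas, dtype=float), axis=0)

def nearest_phrases(query, candidates, k=1):
    """Return the k candidate names with the highest cosine similarity to query."""
    names = list(candidates)
    mat = np.stack([np.asarray(candidates[n], dtype=float) for n in names])
    sims = mat @ query / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]               # indices of the largest similarities
    return [(names[j], float(sims[j])) for j in order]
```
<br />
Cosine similarity ignores vector length, so phrases are matched by the direction of their contribution rather than its magnitude.<br />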
<br />
==Conclusions==<br />
The paper provides an algorithm called Contextual Decomposition (CD) to interpret predictions made by LSTMs without modifying their architecture. It takes in a trained LSTM and breaks its output down into components that quantify how much each part of the input contributes to its decision. In both NLP and in general applications CD produces importance scores for words (single variables in general), phrases (several variables together) and word interactions (variable interactions). The paper compares the algorithm with state-of-the-art baselines and shows that it performs favorably. It also shows that CD is capable of identifying phrases of varying sentiment and extracting meaningful word (or variable) interactions. It shows the shortcomings of the traditional word-based interpretability approaches for understanding LSTMs and advances the state-of-the-art.<br />
<br />
==Critique/Future Work==<br />
While the method is novel in the sense that it moves past the traditional approach of looking only at word-level importance scores, it only looks at one specific architecture applied to a very simple problem. The authors do not discuss future directions in the paper itself, but a discussion took place during the oral presentation of the paper at ICLR 2018. The important points are as follows:<br />
#We could look at interpreting a more complex model, for example seq2seq. The author affirmed that the method could be extended for such purposes, although the computational complexity would increase since multiple outputs would be predicted in this case.<br />
#We could also look at whether this approach could be generalized to completely different architectures such as CNNs. As of now, given a new model, we need to manually work out the math for that specific model. Could we develop a general approach? The author pointed out that they are working towards using this approach to interpret CNNs. A later related approach interprets neural networks with nearest neighbors to provide a metric that helps to create feature importance values (Reference, Eric Wallace, Shi Feng, Jordan Boyd-Graber: Interpreting Neural Networks With Nearest Neighbors).<br />
* It would be an exciting prospect for future work to compare the output of the algorithms with human given scores on a small subset of words.<br />
<br />
==References==<br />
W. James Murdoch, Peter J. Liu, Bin Yu. Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs. ICLR 2018<br />
<br />
Sebastian Bach, Alexander Binder, Gregoire Montavon, Frederick Klauschen, Klaus-Robert Muller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.<br />
<br />
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8): 1735–1780, 1997.<br />
<br />
Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.<br />
<br />
Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016. URL http://arxiv.org/abs/1612.08220.<br />
<br />
W James Murdoch and Arthur Szlam. Automatic rule extraction from long short-term memory networks. ICLR, 2017.<br />
<br />
Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.<br />
<br />
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017<br />
<br />
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.<br />
<br />
Hendrik Strobelt, Sebastian Gehrmann, Bernd Huber, Hanspeter Pfister, and Alexander M Rush. Visual analysis of hidden state dynamics in recurrent neural networks. arXiv preprint arXiv:1606.07461, 2016.<br />
<br />
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. CoRR, abs/1703.01365, 2017. URL http://arxiv.org/abs/1703.01365<br />
<br />
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657, 2015<br />
<br />
Jamie Murdoch, Beyond Word Importance: Contextual Decomposition for Interpreting LSTMs. https://www.youtube.com/watch?v=GjpGAyJenCM<br />
<br />
Eric Wallace, Shi Feng, Jordan Boyd-Graber: Interpreting Neural Networks With Nearest Neighbors, URL https://arxiv.org/abs/1809.02847<br />
<br />
Nvidia Corporation: Long Short-Term Memory (LSTM), URL https://developer.nvidia.com/discover/lstm, Accessed: October 21, 2018<br />
<br />
==Appendix==<br />
<br />
[[File:Dhruv Figure4.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure2.png|600px|right]]<br />
<br />
<br />
[[File:Dhruv Figure3.png|600px|left]]<br />
<br />
<br />
[[File:Dhruv Figure1.png|600px|right]]</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-encoders&diff=42369Wasserstein Auto-encoders2018-12-10T23:25:31Z<p>Msminhas: Editorial</p>
<hr />
<div>The first version of this work was published in 2017 and this version (which is the third revision) was presented at ICLR 2018. Source code for the first version is available [https://github.com/tolstikhin/wae here].<br />
<br />
=Introduction=<br />
Early successes in the field of representation learning were based on supervised approaches, which used large labeled datasets to achieve impressive results. On the other hand, popular unsupervised generative modeling methods mainly consisted of probabilistic approaches focusing on low dimensional data. In recent years, models have been proposed which try to combine these two approaches. One such popular method is the variational auto-encoder (VAE). VAEs are theoretically elegant but have the major drawback of generating blurry sample images when used for modeling natural images. In comparison, generative adversarial networks (GANs) produce much sharper sample images but have their own list of problems, which include the lack of an encoder, difficulty in training, and the "mode collapse" problem. Mode collapse refers to the inability of the model to capture all the variability in the true data distribution. Recently, there has been a lot of activity around finding and evaluating numerous GAN architectures and combining VAEs and GANs, but a model which combines the best of both GANs and VAEs is yet to be discovered.<br />
<br />
The work done in this paper builds upon the theoretical work done in Bousquet et al. [2017] [4]. The authors tackle generative modeling using optimal transport (OT). The OT cost is a measure of distance between probability distributions.<br />
<br />
To be more specific on the OT:<br />
<br />
Given a cost function <math>c : X \times Y \to \mathbb{R}</math>, they seek a minimizer of <math> C(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y}{c(x, y)\,d\pi(x, y)}</math>, where <math>\Pi(\mu, \nu)</math> denotes the set of joint distributions on <math>X \times Y</math> with marginals <math>\mu</math> and <math>\nu</math>.<br />
<br />
The measures <math>\pi \in \Pi(\mu, \nu)</math> are called transport plans or transference plans, and those achieving the infimum are called optimal transport plans. The classical interpretation of this problem is that of minimizing the total cost <math>C(\mu, \nu)</math> of transporting the mass distribution <math>\mu</math> to the mass distribution <math>\nu</math>, where the cost of transporting one unit of mass from the point <math>x \in X</math> to the point <math>y \in Y</math> is given by the cost function <math>c(x, y)</math>.<br />
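<br />
For intuition, the OT cost can be computed by brute force in a very small discrete case. The sketch below assumes two uniform distributions supported on the same number of points, where an optimal plan can be taken to be a one-to-one matching, so it is enough to try every permutation. It is only meant for a handful of points, and the function name is hypothetical.<br />
<br />
```python
import itertools

def ot_cost_uniform(xs, ys, cost):
    """OT cost between uniform measures on xs and ys (equal sizes),
    found by trying every one-to-one matching of the points."""
    n = len(xs)
    best = min(
        sum(cost(xs[i], ys[perm[i]]) for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n                 # each point carries mass 1/n

def squared_cost(x, y):
    return (x - y) ** 2
```
<br />
For example, moving the points 0, 1, 2 onto 1, 2, 3 under the squared cost gives a cost of 1, achieved by shifting every point one unit to the right.<br />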
<br />
One beneficial feature of the OT cost is that it induces a much weaker topology than other costs, including the f-divergences associated with the original GAN algorithms. <br />
This particular feature is crucial in applications where the data is usually supported on low dimensional manifolds in the input space. The problem with stronger notions of distances such as f-divergences is that they often max out and provide no useful gradients for training. In comparison, the OT cost has been claimed to behave much more nicely [5, 8]. Despite the preceding claim, the implementation, which is similar to GANs, still requires the addition of a constraint or a regularization term into the objective function.<br />
<br />
==Original Contributions==<br />
Let <math>P_X</math> be the true but unknown data distribution, and let <math>P_G</math> be the latent variable model specified by the prior distribution <math>P_Z</math> of latent codes <math>Z \in \mathcal{Z}</math> and the generative model <math>P_G(X|Z)</math> of the data points <math>X \in \mathcal{X}</math> given <math>Z</math>. The goal in this paper is to minimize the OT cost <math>W_c(P_X, P_G)</math>.<br />
<br />
The main contributions are given below:<br />
<br />
* A new class of auto-encoders called Wasserstein Auto-Encoders (WAE). WAEs minimize a penalized form of the optimal transport cost <math>W_c(P_X, P_G)</math> for any cost function <math>c</math>. As is the case with VAEs, the WAE objective function is made up of two terms: the c-reconstruction cost and a regularizer term <math>\mathcal{D}_Z(P_Z, Q_Z)</math> which penalizes the discrepancy between two distributions in <math>\mathcal{Z}</math>: <math>P_Z</math> and <math>Q_Z</math>. <math>Q_Z</math> is the distribution of encoded points, i.e. <math>Q_Z := \mathbb{E}_{P_X}[Q(Z|X)]</math>. Note that when <math>c</math> is the squared cost and the regularizer term is the GAN objective, WAE is equivalent to the adversarial auto-encoders described in [2].<br />
<br />
* Experimental results of using WAE on MNIST and CelebA datasets with squared cost <math>c(x, y) = ||x - y||_2^2</math>. The results of these experiments show that WAEs have the good features of VAEs such as stable training, encoder-decoder architecture, and a nice latent manifold structure while simultaneously improving the quality of the generated samples.<br />
<br />
* Two different regularizers. One based on GANs and adversarial training in the latent space <math>\mathcal{Z}</math>. The other one is based on "Maximum Mean Discrepancy" which is known to have high performance when matching high dimensional standard normal distributions. This second regularizer also makes training a fully adversary-free min-min optimization problem and gets rid of the problem of tuning the GAN. During GAN training, the mode can often collapse, the model is sensitive to hyper-parameters, and the loss is uninterpretable since it fluctuates during training. WAE solves such problems and is much more developer-friendly. Most important of all, the loss in WAE is interpretable, making it easier to decide when to terminate the training.<br />
<br />
* The final contribution is the mathematical analysis used to derive the WAE objective function. In particular, the analysis shows that in the case of generative models, the primal form of <math>W_c(P_X, P_G)</math> is equivalent to a problem involving the optimization of a probabilistic encoder <math>Q(Z|X)</math>.<br />
<br />
The paper provides an ostensibly simple recipe for implementing a non-blurry generative auto-encoder. It offers what looks like an elegant and logical way of using the Wasserstein distance to set up the VAE/GAN problem.<br />
<br />
The paper gives three instructive VAE/GAN model comparisons, unifying them thematically: Adversarial Autoencoders (AAE), Adversarial Variational Bayes (AVB), and the original Variational Autoencoders (VAE). These generalizations arise in the case of random decoders; the paper introduces the idea with deterministic decoders and then extends it to random decoders, with a play on the regularizer of the VAE, which these papers replace with a GAN.<br />
<br />
=Proposed Method=<br />
<br />
The method proposed by the authors uses a novel auto-encoder architecture to minimize the optimal transport cost <math>W_c(P_X, P_G)</math>. In the optimization problem that follows, the decoder tries to accurately reconstruct the data points as measured by the cost function <math>c</math>. The encoder tries to achieve two conflicting goals at the same time: (1) match the distribution of the encoded data points <math>Q_Z := \mathbb{E}_{P_X}[Q(Z|X)]</math> to the prior distribution <math>P_Z</math> as measured by the divergence <math>\mathcal{D}_Z(P_Z, Q_Z)</math>, and (2) make sure that the encoded latent vectors contain enough information for the reconstruction of the data points to be of high quality. The figure below illustrates this:<br />
<br />
[[File:ka2khan_figure_1.png|800px|thumb|center|Figure 1]]<br />
<br />
Figure 1: Both VAE and WAE have objectives which are composed of two terms: the reconstruction cost and a regularizer which penalizes the divergence between <math>P_Z</math> and <math>Q_Z</math>. VAE forces <math>Q(Z|X = x)</math> to match <math>P_Z</math> for each of the different training examples <math>x</math> drawn from <math>P_X</math>. As shown in the figure above, every red ball representing <math>Q(Z|X = x)</math> is forced to match <math>P_Z</math>, depicted as whitish triangles. This causes the red balls to intersect and results in reconstruction problems. On the other hand, WAE forces the mixture <math>Q_Z := \int{Q(Z|X)\ dP_X}</math> to match <math>P_Z</math>, as shown in the figure above. This gives the encoded latent codes a better chance of staying far apart from each other. As a consequence, higher reconstruction quality is achieved.<br />
<br />
==Preliminaries and Notations==<br />
<br />
The authors use calligraphic letters to denote sets (for example, <math>\mathcal{X}</math>), capital letters for random variables (for example, <math>X</math>), and lower case letters for their values (for example, <math>x</math>). Probability distributions are also denoted with capital letters (for example, <math>P(X)</math>) and the corresponding densities are denoted with lowercase letters (for example, <math>p(x)</math>).<br />
<br />
Several measures of difference between probability distributions are also used by the authors. These include f-divergences given by <math>D_f(p_X||p_G) := \int{f(\frac{p_X(x)}{p_G(x)})p_G(x)}dx\ \text{where}\ f:(0, \infty) \rightarrow \mathcal{R}</math> is any convex function satisfying <math>f(1) = 0</math>. Other divergences used include the KL divergence (<math>D_{KL}</math>) and the Jensen-Shannon divergence (<math>D_{JS}</math>), both of which are f-divergences for particular choices of <math>f</math>.<br />
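<br />
To make the f-divergence definition concrete, here is a minimal sketch for discrete distributions with strictly positive probabilities, where the integral becomes a sum. The two choices of <math>f</math> shown are the standard ones that recover the KL and Jensen-Shannon divergences; the function names and the example distributions are assumptions made for illustration.<br />
<br />
```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x f(p(x) / q(x)) * q(x) for discrete distributions
    with strictly positive probabilities on a common finite support."""
    return sum(f(px / qx) * qx for px, qx in zip(p, q))

def f_kl(t):
    # f(t) = t log t is convex with f(1) = 0 and yields D_KL(p || q).
    return t * math.log(t)

def f_js(t):
    # This convex f with f(1) = 0 yields the Jensen-Shannon divergence.
    return 0.5 * (t * math.log(2 * t / (1 + t)) + math.log(2 / (1 + t)))

# Hypothetical distributions on a three-point support.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

kl_direct = sum(px * math.log(px / qx) for px, qx in zip(p, q))
print(f_divergence(p, q, f_kl), kl_direct)   # the two values should agree
print(f_divergence(p, q, f_js))
```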
<br />
==Optimal Transport and its Dual Formulations==<br />
<br />
A rich class of distance measures between probability distributions is motivated by the optimal transport problem. One formulation of the optimal transport problem is Kantorovich's formulation, given by:<br />
<br />
<center><math><br />
W_c(P_X, P_G) := \underset{\Gamma \in \mathcal{P}(X \sim P_X ,Y \sim P_G)}{\inf} \mathbb{E}_{(X,Y) \sim \Gamma}[c(X,Y)],<br />
\text{where} \ c(x, y): \mathcal{X} \times \mathcal{X} \rightarrow \mathcal{R_{+}}<br />
</math></center><br />
<br />
is any measurable cost function, and <math>\mathcal{P}(X \sim P_X, Y \sim P_G)</math> is the set of all joint distributions of <math>(X, Y)</math> with marginals <math>P_X\ \text{and}\ P_G</math> respectively.<br />
<br />
A particularly interesting case is when <math>(\mathcal{X}, d)</math> is a metric space and <math>c(x, y) = d^p(x, y)\ \text{for}\ p \geq 1</math>. In this case <math>W_p</math>, the <math>p</math>-th root of <math>W_c</math>, is called the p-Wasserstein distance.<br />
<br />
When <math>c(x, y) = d(x, y)</math> the following Kantorovich-Rubinstein duality holds:<br />
<br />
<math>W_1(P_X, P_G) = \underset{f \in \mathcal{F}_L}{\sup} \mathbb{E}_{X \sim P_X}[f(X)] - \mathbb{E}_{Y \sim P_G}[f(Y)]</math><br />
where <math>\mathcal{F}_L</math> is the class of all bounded 1-Lipschitz functions on <math>(\mathcal{X}, d)</math>.<br />
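<br />
The duality states that <math>W_1</math> is the largest value of <math>\mathbb{E}_{X \sim P_X}[f(X)] - \mathbb{E}_{Y \sim P_G}[f(Y)]</math> over 1-Lipschitz functions <math>f</math>, so every such <math>f</math> gives a lower bound. The sketch below checks this on a toy one-dimensional example in which <math>P_G</math> is <math>P_X</math> shifted by 2. It assumes the well-known fact that in one dimension <math>W_1</math> between equal-size samples is obtained by matching sorted values; the sample values and candidate functions are made up for illustration, and the candidates are only required to be 1-Lipschitz on the bounded range of the samples.<br />
<br />
```python
def wasserstein1_sorted(xs, ys):
    """W_1 between two equal-size empirical measures on the real line.
    In one dimension the optimal coupling matches sorted samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def dual_value(f, xs, ys):
    """E_{X ~ P_X}[f(X)] - E_{Y ~ P_G}[f(Y)] for empirical samples."""
    return sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)

xs = [0.0, 1.0, 2.0, 3.0]      # samples standing in for P_X
ys = [x + 2.0 for x in xs]     # P_G is P_X shifted to the right by 2

w1 = wasserstein1_sorted(xs, ys)
candidates = {
    "f(x) = -x": lambda x: -x,
    "f(x) = x": lambda x: x,
    "f(x) = |x - 2.5|": lambda x: abs(x - 2.5),
    "f(x) = min(x, 1)": lambda x: min(x, 1.0),
}
for name, f in candidates.items():
    # Each 1-Lipschitz candidate gives a value no larger than W_1.
    print(name, dual_value(f, xs, ys), "<=", w1)
```
<br />
For a pure shift, the linear function <math>f(x) = -x</math> attains the supremum, while the other candidates give strictly smaller values.<br />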
<br />
==Application to Generative Models: Wasserstein auto-encoders==<br />
The intuition behind modern generative models like VAEs and GANs is that they try to minimize specific distance measures between the data distribution <math>P_X</math> and the model <math>P_G</math>. Unfortunately, with the current knowledge and tools, it is usually really hard or even impossible to calculate most of the standard discrepancy measures especially when <math>P_X</math> is not known and <math>P_G</math> is parametrized by deep neural networks. Having said that, there are certain tricks available which can be employed to get around that difficulty.<br />
<br />
For the minimization of the KL-divergence <math>D_{KL}(P_X, P_G)</math>, or equivalently the maximization of the marginal log-likelihood <math>\mathbb{E}_{P_X}[\log P_G(X)]</math>, one can use the famous variational lower bound, which provides a theoretically grounded framework. This has been used quite successfully by VAEs. In the general case of minimizing an f-divergence <math>D_f(P_X, P_G)</math>, using its dual formulation along with f-GANs and adversarial training is viable. Finally, the OT cost <math>W_c(P_X, P_G)</math> can be minimized by using the Kantorovich-Rubinstein duality expressed as an adversarial objective. The Wasserstein-GAN implements this idea.<br />
<br />
In this paper, the authors focus on latent variable models <math>P_G</math> defined by a two-step procedure. First, a code <math>Z</math> is sampled from a fixed distribution <math>P_Z</math> on a latent space <math>\mathcal{Z}</math>. Second, <math>Z</math> is mapped to the image <math>X \in \mathcal{X} = \mathcal{R}^d</math> with a (possibly random) transformation. This gives a density of the form,<br />
<br />
<center><math><br />
p_G(x) := \int\limits_{\mathcal{Z}}{p_G(x|z)p_z(z)}dz,\ \forall x \in \mathcal{X}, <br />
</math></center><br />
<br />
provided all the probabilities involved are properly defined. In order to keep things simple, the authors focus on non-random decoders, i.e., the generative models <math>P_G(X|Z)</math> deterministically map <math>Z</math> to <math>X = G(Z)</math> using a fixed map <math>G: \mathcal{Z} \rightarrow \mathcal{X}</math>. Similar results hold for random decoders, as shown by the authors in Appendix B.1.<br />
<br />
Working under the model defined in the preceding paragraph, the authors find that the OT cost takes a much simpler form, as the transportation plan factors through the map <math>G</math>: instead of finding a coupling <math>\Gamma</math> between two random variables in the <math>\mathcal{X}</math> space, one distributed according to <math>P_X</math> and the other according to <math>P_G</math>, it is enough to find a conditional distribution <math>Q(Z|X)</math> such that its <math>Z</math> marginal, <math>Q_Z(Z) := \mathbb{E}_{X \sim P_X}[Q(Z|X)]</math>, is the same as the prior distribution <math>P_Z</math>. This is formalized by the theorem below, which was proven by the authors in [4].<br />
<br />
'''Theorem 1.''' For <math>P_G</math> defined as above with deterministic <math>P_G(X|Z)</math> and any function <math>G:\mathcal{Z} \rightarrow \mathcal{X}</math>,<br />
<br />
<math><br />
\underset{\Gamma \in \mathcal{P}(X \sim P_X ,Y \sim P_G)}{\inf} \mathbb{E}_{(X,Y) \sim \Gamma}[c(X,Y)] = \underset{Q: Q_Z = P_Z}{\inf} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))]<br />
</math><br />
<br />
where <math>Q_Z</math> is the marginal distribution of <math>Z</math> when <math>X \sim P_X</math> and <math>Z \sim Q(Z|X)</math>.<br />
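<br />
The statement of Theorem 1 can be checked by hand on a very small discrete example. In the sketch below, which is an illustrative construction and not an experiment from the paper, <math>P_X</math> is uniform on two points, the prior <math>P_Z</math> is uniform on two codes, and the decoder <math>G</math> is a fixed lookup table. Both sides of the equality are minimized over a grid: the left side over couplings of <math>P_X</math> and <math>P_G</math>, the right side over encoders whose aggregate <math>Q_Z</math> equals <math>P_Z</math>.<br />
<br />
```python
def cost(x, y):
    """Squared cost c(x, y) on the real line."""
    return (x - y) ** 2

# P_X is uniform on {0.1, 0.9}; P_Z is uniform on the codes {0, 1}.
G = {0: 0.0, 1: 1.0}   # deterministic decoder, so P_G is uniform on {0.0, 1.0}

grid = [i / 100 for i in range(101)]

# Left-hand side: couplings of P_X and P_G. With two points per side, a
# coupling is fixed by t = Gamma(x=0.1, y=0.0), which ranges over [0, 1/2].
def coupling_cost(t):
    return (t * cost(0.1, 0.0) + (0.5 - t) * cost(0.1, 1.0)
            + (0.5 - t) * cost(0.9, 0.0) + t * cost(0.9, 1.0))

lhs = min(coupling_cost(0.5 * s) for s in grid)

# Right-hand side: encoders Q(Z|X) with Q_Z = P_Z. Writing q = Q(Z=1 | X=0.1),
# the constraint Q_Z(Z=1) = 1/2 forces Q(Z=1 | X=0.9) = 1 - q.
def encoder_cost(q):
    r = 1.0 - q
    return (0.5 * ((1 - q) * cost(0.1, G[0]) + q * cost(0.1, G[1]))
            + 0.5 * ((1 - r) * cost(0.9, G[0]) + r * cost(0.9, G[1])))

rhs = min(encoder_cost(q) for q in grid)
print(lhs, rhs)   # the two minima should coincide
```
<br />
In this example the minimizing encoder is deterministic: it sends 0.1 to code 0 and 0.9 to code 1, which corresponds to the coupling that matches each data point with its nearest generated point.<br />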
<br />
According to the authors, the result above allows optimization over random encoders <math>Q(Z|X)</math> instead of over all couplings of <math>X</math> and <math>Y</math>. Both problems are still constrained. To find a numerical solution, the authors relax the constraint on <math>Q_Z</math> by adding a regularizer term to the objective. This gives the WAE objective:<br />
<br />
<math><br />
D_{WAE}(P_X, P_G) := \underset{Q(Z|X) \in \mathcal{Q}}{\inf} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))] + \lambda \cdot \mathcal{D}_Z(Q_Z, P_Z)<br />
</math><br />
<br />
where <math>\mathcal{Q}</math> is any nonparametric set of probabilistic encoders, <math>\mathcal{D}_Z</math> is an arbitrary divergence between <math>Q_Z</math> and <math>P_Z</math>, and <math>\lambda > 0</math> is a hyperparameter. As is the case with VAEs, the authors propose using deep neural networks to parameterize both the encoders <math>Q</math> and the decoders <math>G</math>. Note that, unlike VAEs, WAE allows for non-random encoders that deterministically map inputs to their latent codes.<br />
<br />
The authors propose two different regularizers <math>\mathcal{D}_Z(Q_Z, P_Z)</math>, described below.<br />
<br />
===GAN-based <math>\mathcal{D}_z</math>===<br />
One option is to use <math>\mathcal{D}_Z(Q_Z, P_Z) = \mathcal{D}_{JS}(Q_Z, P_Z)</math> along with adversarial training for estimation. In particular, a discriminator (adversary) is used in the latent space <math>\mathcal{Z}</math> to separate "true" points sampled from <math>P_Z</math> from "fake" ones sampled from <math>Q_Z</math>. This leads to the WAE-GAN described in Algorithm 1 listed below. Even though WAE-GAN still uses max-min optimization, one positive feature is that it moves the adversary from the input (pixel) space <math>\mathcal{X}</math> to the latent space <math>\mathcal{Z}</math>. Additionally, the true latent space distribution <math>P_Z</math> might have a nice shape with a single mode (for a Gaussian prior), making the matching task much easier than matching an unknown, complex, and possibly multi-modal distribution, as is usually the case in GANs. This observation also motivates the second penalty.<br />
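<br />
The role of the latent discriminator can be sketched with two small loss functions. This is a minimal illustration under stated assumptions: the discriminator is represented only by its output probabilities, and the encoder penalty uses the common non-saturating form <math>-\log D(\tilde{z})</math>; the names <code>discriminator_loss</code> and <code>encoder_penalty</code> are hypothetical and are not taken from the authors' code.<br />
<br />
```python
import math

def discriminator_loss(d_prior, d_encoded):
    """Negative log-likelihood of the latent discriminator.

    d_prior:   D(z) for codes z sampled from the prior P_Z ("true" points)
    d_encoded: D(z~) for codes z~ produced by the encoder ("fake" points)
    Both lists hold probabilities in (0, 1) and have the same length.
    """
    n = len(d_prior)
    return -sum(math.log(a) + math.log(1.0 - b)
                for a, b in zip(d_prior, d_encoded)) / n

def encoder_penalty(d_encoded):
    """Adversarial penalty seen by the encoder (assumed non-saturating form):
    small when the discriminator mistakes encoded codes for prior samples."""
    return -sum(math.log(b) for b in d_encoded) / len(d_encoded)

# A discriminator that cannot tell the two samples apart outputs 0.5 everywhere.
confused = [0.5] * 4
print(discriminator_loss(confused, confused))
print(encoder_penalty(confused))
```
<br />
In a full training loop the encoder penalty would be multiplied by <math>\lambda</math> and added to the reconstruction cost, while the discriminator would be updated separately to reduce its own loss.<br />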
<br />
===MMD-based <math>\mathcal{D}_z</math>===<br />
For a positive-definite reproducing kernel <math>k: \mathcal{Z} \times \mathcal{Z} &rarr; \mathcal{R}</math>, the maximum mean discrepancy (MMD) is defined as:<br />
<br />
<center><math><br />
MMD_k(P_Z, Q_Z) = \left \Vert \int \limits_{\mathcal{Z}} {k(z, \cdot)dP_Z(z)} - \int \limits_{\mathcal{Z}} {k(z, \cdot)dQ_Z(z)} \right \|_{\mathcal{H}_k}<br />
</math>,</center><br />
<br />
where <math>\mathcal{H}_k</math> is the RKHS (reproducing kernel Hilbert space) of real-valued functions mapping <math>\mathcal{Z}</math> to <math>\mathcal{R}</math>. If <math>k</math> is characteristic then <math>MMD_k</math> defines a metric and can be used as a distance measure. The authors propose to use <math>\mathcal{D}_Z(P_Z, Q_Z) = MMD_k(P_Z, Q_Z)</math>. MMD also has an unbiased U-statistic estimator which can be used along with stochastic gradient descent (SGD) methods. This gives WAE-MMD, as described in Algorithm 2 listed below. Note that MMD is known to perform well when matching high dimensional standard normal distributions, so this penalty is expected to work well when the prior <math>P_Z</math> is Gaussian.<br />
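<br />
The sample-based MMD penalty can be sketched in a few lines. The code below estimates the squared MMD with the usual unbiased estimator, which drops the diagonal terms from the within-sample sums. The summary above does not specify a kernel, so the inverse multiquadratic kernel used here, its scale parameter, and the toy Gaussian samples are all assumptions made for illustration.<br />
<br />
```python
import random

def imq_kernel(a, b, scale=1.0):
    """Inverse multiquadratic kernel k(a, b) = scale / (scale + ||a - b||^2)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return scale / (scale + sq_dist)

def mmd2_unbiased(prior, encoded, kernel=imq_kernel):
    """Unbiased estimate of the squared MMD between P_Z and Q_Z from two
    samples of equal size n (diagonal terms excluded within each sample)."""
    n = len(prior)
    assert n == len(encoded) and n > 1
    within_prior = sum(kernel(prior[i], prior[j])
                       for i in range(n) for j in range(n) if i != j)
    within_encoded = sum(kernel(encoded[i], encoded[j])
                         for i in range(n) for j in range(n) if i != j)
    between = sum(kernel(a, b) for a in prior for b in encoded)
    return ((within_prior + within_encoded) / (n * (n - 1))
            - 2.0 * between / (n * n))

random.seed(0)

def gaussian_sample(n, dim, mean=0.0):
    """n points with independent N(mean, 1) coordinates."""
    return [[random.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

prior = gaussian_sample(100, 2)
matched = gaussian_sample(100, 2)             # codes that follow the prior
shifted = gaussian_sample(100, 2, mean=3.0)   # codes far from the prior
print(mmd2_unbiased(prior, matched), mmd2_unbiased(prior, shifted))
```
<br />
Because the estimator is unbiased, it can be slightly negative when the two samples come from the same distribution. The estimate for the shifted codes should be clearly larger than the one for the matched codes.<br />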
<br />
[[File:ka2khan_figure_2.png|800px|thumb|center|Algorithms- WAE-GAN on left and WAE-MMD on right]]<br />
<br />
=Related Work=<br />
==Literature on auto-encoders==<br />
Classical unregularized auto-encoders have an objective function which only tries to minimize the reconstruction cost. This results in distinct data points being encoded into distinct zones distributed chaotically across the latent space <math>\mathcal{Z}</math>. The latent space <math>\mathcal{Z}</math> in this scenario contains huge "holes" for which the decoder <math>P_G(X|Z)</math> has never been trained. In general, encoders trained this way do not provide very useful representations, and sampling from the latent space <math>\mathcal{Z}</math> becomes a difficult task [12].<br />
<br />
VAEs [1] minimize a variational bound on the KL-divergence <math>D_{KL}(P_X, P_G)</math>, which consists of the reconstruction cost and the regularizer <math>\mathbb{E}_{P_X}[D_{KL}(Q(Z|X), P_Z)]</math>. The regularizer penalizes the difference between the encoding of each training image and the prior <math>P_Z</math>. But this penalty still does not guarantee that the overall encoded distribution matches the prior distribution, which is what the WAE regularizer targets. In addition, VAEs require non-degenerate (i.e. non-deterministic) Gaussian encoders along with random decoders. A later paper [11] proposed a method which allows the use of non-Gaussian encoders with VAEs. In contrast, WAE minimizes <math>W_{c}(P_X, P_G)</math> and allows both probabilistic and deterministic encoder-decoder pairs.<br />
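<br />
The per-example nature of the VAE regularizer can be made concrete with the well-known closed form of the KL divergence between a diagonal Gaussian and a standard normal distribution. The sketch below assumes a standard normal prior and a diagonal Gaussian encoder; the encoder outputs are hypothetical values chosen for illustration.<br />
<br />
```python
import math

def kl_to_standard_normal(mean, std):
    """KL( N(mean, diag(std^2)) || N(0, I) ) for a diagonal Gaussian encoder."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mean, std))

# The VAE regularizer averages this quantity over training points x, so every
# single Q(Z|X=x) is pushed towards the prior, not just their mixture Q_Z.
encoder_outputs = [
    ([0.0, 0.0], [1.0, 1.0]),    # already equal to the prior: zero penalty
    ([2.0, -1.0], [0.5, 0.5]),   # an informative, narrow code: large penalty
]
penalty = sum(kl_to_standard_normal(m, s)
              for m, s in encoder_outputs) / len(encoder_outputs)
print(penalty)
```
<br />
A narrow, well-separated code is exactly what helps reconstruction, yet it is what this term penalizes most. This is the tension that the WAE formulation relaxes by penalizing only the aggregate distribution of codes.<br />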
<br />
When parameters are appropriately defined, WAE generalizes AAE in two ways: it can use any cost function in the input space, and it can use any discrepancy measure <math>\mathcal{D}_Z</math> in the latent space <math>\mathcal{Z}</math>, not only the adversarial one.<br />
<br />
There has also been work on regularized auto-encoders called InfoVAE [14], which has an objective similar to that of [4] but is derived using different motivations and arguments.<br />
<br />
WAEs explicitly define the cost function <math>c(x,y)</math>, whereas VAEs rely on an implicit one through a negative log likelihood term. In theory this term can induce an arbitrary cost function, but in practice it can require estimating a normalizing constant that may differ across values of <math>z</math>.<br />
<br />
==Literature on Optimal Transport (OT)==<br />
[15] provides methods for computing the OT cost for large-scale data using SGD and sampling. The WGAN [5] proposes a generative model which minimizes the 1-Wasserstein distance <math>W_1(P_X, P_G)</math>. The WGAN algorithm does not provide an encoder and cannot be easily applied to an arbitrary cost <math>W_c</math>. The model proposed in [5] uses the dual form, whereas the model proposed in this paper uses the primal form. The primal form allows the use of any cost function <math>c</math> and naturally comes with an encoder. <br />
<br />
In order to compute <math>W_c(P_X, P_G)</math> or <math>W_1(P_X, P_G)</math>, the model needs to handle various non-trivial constraints. Various methods have been proposed in the literature ([5], [2], [8], [16], [15], [17], [18]) to avoid this difficulty.<br />
<br />
==Literature on GANs==<br />
Many of the GAN variations proposed in the literature come without an encoder. Examples include WGAN and f-GAN. These models are deficient in cases where latent codes need to be reconstructed in order to use the learned manifold.<br />
<br />
Numerous models have been proposed in the literature which try to combine the adversarial training of GANs with auto-encoder architectures. Some examples are [19], [20], [21], and [22]. There has also been work in which reproducing kernels are used in the context of GANs ([23], [24]).<br />
<br />
=Experiments=<br />
Experiments were used to empirically evaluate the proposed WAE model. <br />
<br />
'''Experimental setup'''<br />
<br />
For the experimental setup, the authors used a Gaussian prior <math> \small P_Z</math> and the squared cost function <math> \small c(x,y) = ||x - y||_2^2</math> for data points.<br />
Deterministic encoder-decoder pairs were used. The authors conducted experiments using the following two real-world datasets: (1) MNIST [27], made up of 70k images, and (2) CelebA [28], consisting of approximately 203k images. For test reconstructions and interpolations, a pair of held-out images <math>(x,y)</math> from the test set is auto-encoded (separately) to produce <math>(z_x, z_y)</math> in the latent space.<br />
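<br />
One simple way to use a pair of latent codes <math>(z_x, z_y)</math> is to decode points on the straight line between them. The sketch below assumes linear interpolation, which this summary does not state explicitly, and uses a hypothetical <code>toy_decoder</code> as a stand-in for the trained decoder <math>G</math>.<br />
<br />
```python
def interpolate_codes(z_x, z_y, steps=5):
    """Evenly spaced points on the straight line from z_x to z_y (inclusive)."""
    assert steps >= 2
    path = []
    for k in range(steps):
        t = k / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_x, z_y)])
    return path

def toy_decoder(z):
    # Hypothetical stand-in for the trained decoder G; a real decoder would
    # return an image rather than a rescaled copy of the code.
    return [2.0 * v for v in z]

z_x, z_y = [0.0, 1.0], [1.0, -1.0]   # made-up latent codes of two test images
for z in interpolate_codes(z_x, z_y):
    print(z, toy_decoder(z))
```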
<br />
'''Training Details - MNIST'''<br />
<br />
The authors used mini-batches of size 100 and trained the models for 100 epochs. They used λ = 10 and a prior variance of 1. For the encoder-decoder pair, they initially set the Adam learning rate to α = 0.01, and for the adversary in WAE-GAN to α = 0.005. After 30 epochs they decreased both by a factor of 2, and after the first 50 epochs by a further factor of 5. Both encoder and decoder used fully convolutional architectures with 4x4 convolutional filters.<br />
<br />
'''Training Details - CelebA'''<br />
<br />
The authors took 140x140 center crops of the CelebA images and then resized them to 64x64 resolution. They again used mini-batches of size 100 and trained the models for up to 250 epochs. All reported WAE models were trained for 55 epochs and VAE for 68 epochs. For WAE-MMD they used λ = 100 and for WAE-GAN λ = 1. Both used a prior variance of 2.<br />
<br />
For WAE-MMD the learning rate of Adam was initially set to α = 0.01. For WAE-GAN the learning rate of Adam for the encoder-decoder pair was initially set to α = 0.003 and for the adversary to 0.01. All learning rates were decreased by a factor of 2 after 30 epochs, by a further factor of 5 after the first 50 epochs, and finally by an additional factor of 10 after the first 100 epochs.<br />
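<br />
The step schedule described above can be written as a small function. This is a sketch under two assumptions: epochs are counted from 0, and "after 30 epochs" is taken to mean that epoch index 30 is the first one to use the reduced rate. The base learning rate in the example is the WAE-MMD value quoted in this summary.<br />
<br />
```python
def learning_rate(epoch, base_lr):
    """Step schedule: halve after 30 epochs, divide by a further 5 after 50
    epochs, and by a further 10 after 100 epochs (epochs counted from 0)."""
    lr = base_lr
    if epoch >= 30:
        lr /= 2.0
    if epoch >= 50:
        lr /= 5.0
    if epoch >= 100:
        lr /= 10.0
    return lr

for epoch in (0, 29, 30, 50, 100):
    print(epoch, learning_rate(epoch, base_lr=0.01))
```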
<br />
The main evaluation criteria were to see if the WAE model can simultaneously achieve: <br />
<br />
<ol><br />
<li>accurate reconstruction of the data points</li><br />
<li>reasonable geometry of the latent manifold</li><br />
<li>generation of high quality random samples</li><br />
</ol><br />
<br />
For the model to generalize well (1) and (2) should be met on both the training and test data set.<br />
<br />
The proposed model achieves reasonably good results, as highlighted in the figures given below:<br />
<br />
[[File:ka2khan_figure_3.png|800px|thumb|center|Using CelebA dataset]]<br />
<br />
[[File:ka2khan_figure_4.png|800px|thumb|center|Using CelebA dataset, FID (Fréchet Inception Distance [32]): smaller is better, sharpness: larger is better]]<br />
<br />
=Conclusion=<br />
The authors proposed a new class of algorithms for building a generative model called Wasserstein Autoencoders based on optimal transport cost. They related the newly proposed model to the existing probabilistic modeling techniques. They empirically evaluated the proposed models using two real-world datasets. They compared the results obtained using their proposed model with the results obtained using VAEs on the same dataset to show that the proposed models generate sample images of higher quality in addition to being easier to train and having good reconstruction quality of the data points.<br />
<br />
The authors state that in future work they will further explore the criteria for matching the encoding distribution <math>Q_Z</math> to the prior distribution <math>P_Z</math>, evaluate whether it is feasible to adversarially train the cost function <math>c</math> in the input space <math>\mathcal{X}</math>, and carry out a theoretical analysis of the dual formulations for WAE-GAN and WAE-MMD.<br />
<br />
=Future Work=<br />
Following the work of this paper, another generative model based on the concept of optimal transport was introduced in [34]. Optimal transport measures the distance between probability distributions as the cost of transporting one distribution onto the other (hence the name). The new model, called "Sliced-Wasserstein Autoencoders" (SWAE), is simple to implement and provides the capabilities of Wasserstein Autoencoders.<br />
<br />
([https://openreview.net/forum?id=HkL7n1-0b]) The results on the MNIST and CelebA datasets look convincing, though the paper could include additional evaluation comparing the adversarial loss with the straightforward MMD metric, and could discuss their pros and cons. Given the challenges in evaluating and comparing closely related auto-encoder solutions, the authors could also design demonstrative experiments for cases where the Wasserstein distance helps, and perhaps for cases that expose its limitations.<br />
<br />
=Critique=<br />
<br />
Although this paper presents some empirical tests of its method, it would benefit from clearer notation and from details of the architectures used in the experiments. It would also benefit from comparisons between its results and those of similar work. As pointed out by a reviewer, the closest work to this paper is the adversarial variational Bayes framework by Mescheder et al., which also attempts to unify VAEs and GANs. Although the authors describe the conceptual differences and advantages over that approach, it would be beneficial to include some comparisons in the results section.<br />
Moreover, the performance of the algorithm is not a significant improvement over previous VAE algorithms. Performance could be characterized better if the authors ran empirical tests on more datasets. However, the methodology is flexible and unifies several other types of algorithms, which is a major benefit that does not compromise the stability of training.<br />
<br />
=References=<br />
[1] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.<br />
<br />
[2] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In ICLR, 2016.<br />
<br />
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.<br />
<br />
[4] O. Bousquet, S. Gelly, I. Tolstikhin, C. J. Simon-Gabriel, and B. Schölkopf. From optimal transport to generative modeling: the VEGAN cookbook, 2017.<br />
<br />
[5] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017.<br />
<br />
[6] C. Villani. Topics in Optimal Transportation. AMS Graduate Studies in Mathematics, 2003.<br />
<br />
[7] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.<br />
<br />
[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs, 2017.<br />
<br />
[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.<br />
<br />
[10] F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.<br />
<br />
[11] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks, 2017.<br />
<br />
[12] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, 35, 2013.<br />
<br />
[13] M. D. Hoffman and M. Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop on Advances in Approximate Bayesian Inference, 2016.<br />
<br />
[14] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders, 2017.<br />
<br />
[15] A. Genevay, M. Cuturi, G. Peyré, and F. R. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3432–3440, 2016. <br />
<br />
[16] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.<br />
<br />
[17] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: geometry and kantorovich formulation. arXiv preprint arXiv:1508.05216, 2015.<br />
<br />
[18] Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-transport problems and a new hellinger-kantorovich distance between positive measures. arXiv preprint arXiv:1508.07941, 2015.<br />
<br />
[19] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.<br />
<br />
[20] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.<br />
<br />
[21] D. Ulyanov, A. Vedaldi, and V. Lempitsky. It takes (only) two: Adversarial generator-encoder networks, 2017.<br />
<br />
[22] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks, 2017.<br />
<br />
[23] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015. <br />
<br />
[24] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.<br />
<br />
[25] R. Reddi, A. Ramdas, A. Singh, B. Poczos, and L. Wasserman. On the high-dimensional power of a linear-time two sample test under mean-shift alternatives. In AISTATS, 2015.<br />
<br />
[26] C. L. Li, W. C. Chang, Y. Cheng, Y. Yang, and B. Poczos. Mmd gan: Towards deeper understanding of moment matching network, 2017.<br />
<br />
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86(11), pages 2278–2324, 1998.<br />
<br />
[28] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.<br />
<br />
[29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014.<br />
<br />
[30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.<br />
<br />
[31] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.<br />
<br />
[32] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.<br />
<br />
[33] B. Poole, A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for GANs, 2016.<br />
<br />
[34] S. Kolouri, C. E. Martin, and G. K. Rohde. Sliced-wasserstein autoencoder: An embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings&diff=42368a neural representation of sketch drawings2018-12-10T23:17:06Z<p>Msminhas: Editorial</p>
<hr />
<div><br />
== Introduction ==<br />
In this paper, the authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.<br />
<br />
Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. People, however, learn to draw using sequences of strokes as opposed to the simultaneous generation of pixels. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do. <br />
<br />
The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).<br />
<br />
=== Terminology ===<br />
Pixel images, also referred to as raster or bitmap images, are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, and .bmp. <br />
<br />
Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images. <br />
<br />
For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images. <br />
<br />
For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation.<br />
<br />
== Related Work ==<br />
Earlier work has used similar approaches to generate images, such as portrait drawing by Paul the Robot [26, 28] and reinforcement learning approaches [28] that discover a set of paint brush strokes that best represent a given input photograph. These methods mimic digitized photographs rather than generate new drawings. There are also some neural network-based approaches, but they mostly deal with pixel images; little work has been done on vector image generation. There are models that use Hidden Markov Models [25] or Mixture Density Networks [2] to generate human sketches, continuous data points (modeling Chinese characters as a sequence of pen stroke actions) or vectorized Kanji characters [9,29].<br />
<br />
Neural network-based approaches are able to generate a latent space representation of vector images that follows a Gaussian distribution. The latent representation produced by these networks is trained to match the Gaussian distribution by minimizing a given loss function. Using this idea, previous works used a sequence-to-sequence model with a Variational Autoencoder (VAE) to model sentences in a latent space, and probabilistic program induction to model the Omniglot dataset. VAEs differ from regular autoencoders in that there is an intermediate “sampling step” between the encoder and decoder. Simply connecting the two would not guarantee that the encoded parameters can be viewed as parameters of a normal distribution representing a latent space. In VAEs, the output of the encoder is passed to an intermediate step that treats it as the parameters of a normal distribution and draws a sample from it. In this way, the encoding is penalized as if it were the parameters of some normal distribution.<br />
<br />
One of the limiting factors that the authors mention in the field of generative vector drawings is the lack of publicly available datasets. Previous datasets such as the Sketch dataset, with 20k vector sketches, were explored for feature extraction techniques. The Sketchy dataset, consisting of 70k vector sketches along with pixel images, was used for large-scale exploration of human sketches. The ShadowDraw system, which used 30k raster images along with extracted vectorized features, is an interactive system that predicts what a finished drawing looks like based on a set of incomplete brush strokes from the user while the sketch is being drawn. In all of these cases, the datasets are comparatively small. The dataset proposed in this work is much larger and has been made publicly available, and it is one of the major contributions of this paper.<br />
<br />
== Major Contributions ==<br />
This paper makes the following major contributions:<br />
* The authors outline a framework for both unconditional and conditional generation of vector images composed of a sequence of lines. The recurrent neural network-based generative model is capable of producing sketches of common objects in a vector format.<br />
* The paper develops a training procedure unique to vector images to make the training more robust.<br />
* The paper makes available a large dataset of hand-drawn vector images to encourage further development of generative modeling for vector images, and releases an implementation of the model as an open source project.<br />
<br />
== Methodology ==<br />
=== Dataset ===<br />
QuickDraw is a dataset with 50 million vector drawings collected by an online game [https://quickdraw.withgoogle.com/# Quick Draw!], where the players are required to draw objects belonging to a particular object class in less than 20 seconds. It contains hundreds of classes; each class has 70k training samples, 2.5k validation samples, and 2.5k test samples.<br />
<br />
Each sample is represented as a sequence of pen stroke actions. The origin is the initial coordinate of the drawing, and the sketch is a list of points. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math> where <math>\Delta x</math> and <math>\Delta y</math> are the offsets in the x and y directions from the previous point. The parameters <math>p_{1}, p_{2}, p_{3}</math> represent three possible pen states as a binary one-hot vector: <math>p_{1}</math> indicates that the pen is touching the paper, <math>p_{2}</math> indicates that the pen will be lifted after the current point, and <math>p_{3}</math> indicates that the drawing has ended.<br />
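<br />
The format above can be illustrated with a minimal sketch. The input layout (a list of strokes, each a list of absolute <math>(x, y)</math> points) and the function name <code>to_stroke5</code> are assumptions made for illustration; they are not taken from the paper or its released code.<br />

```python
def to_stroke5(strokes):
    """Convert strokes of absolute (x, y) points into (dx, dy, p1, p2, p3) rows.

    `strokes` is a hypothetical input layout: a list of strokes, each a list
    of absolute (x, y) points in drawing order.
    """
    rows = []
    prev_x, prev_y = strokes[0][0]  # origin: first point of the drawing
    for stroke in strokes:
        for k, (x, y) in enumerate(stroke):
            last_in_stroke = (k == len(stroke) - 1)
            # p1 = pen stays on the paper, p2 = pen is lifted after this point
            p1, p2 = (0, 1) if last_in_stroke else (1, 0)
            rows.append((x - prev_x, y - prev_y, p1, p2, 0))
            prev_x, prev_y = x, y
    rows.append((0, 0, 0, 0, 1))  # p3 = 1 marks the end of the drawing
    return rows


rows = to_stroke5([[(0, 0), (3, 4)], [(5, 5), (5, 8)]])
for row in rows:
    print(row)
```

Each row stores an offset from the previous point, and exactly one of the three pen-state entries is 1 in every row.<br />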
<br />
=== Sketch-RNN ===<br />
[[File:sketchfig2.png|700px|center]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder (VAE).<br />
<br />
==== Encoder ====<br />
The encoder is a bidirectional RNN. The input is a sketch sequence denoted by <math>S =\{S_0, S_1, ... S_{N_{s}}\}</math> and a reversed sketch sequence denoted by <math>S_{reverse} = \{S_{N_{s}},S_{N_{s}-1}, ... S_0\}</math>. The final hidden states of the two encoded sequences <math>(h_{ \rightarrow}, h_{ \leftarrow})</math> are concatenated to form a final hidden state <math>h</math>,<br />
<br />
\begin{split}<br />
&h_{ \rightarrow} = encode_{ \rightarrow }(S), \\<br />
&h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), \\<br />
&h = [h_{\rightarrow}; h_{\leftarrow}].<br />
\end{split}<br />
<br />
Then the authors project <math>h</math> into two vectors <math>\mu</math> and <math>\hat{\sigma}</math> of size <math>N_{z}</math>. The projection is performed using a fully connected layer. These two vectors are the parameters of the latent space Gaussian distribution that will estimate the distribution of the input data. Because standard deviations cannot be negative, an exponential function is used to convert <math>\hat{\sigma}</math> into a strictly positive standard deviation <math>\sigma</math>. Next, a random variable with mean <math>\mu</math> and standard deviation <math>\sigma</math> is constructed by scaling a normalized IID Gaussian, <math>\mathcal{N}(0,I)</math>,<br />
<br />
\begin{split}<br />
& \mu = W_\mu h + b_\mu, \\<br />
& \hat \sigma = W_\sigma h + b_\sigma, \\<br />
& \sigma = \exp\left( \frac{\hat \sigma}{2}\right), \\<br />
& z = \mu + \sigma \odot \mathcal{N}(0,I). <br />
\end{split}<br />
<br />
<br />
Note that <math>z</math> is not deterministic but a random vector that can be conditioned on an input sketch sequence.<br />
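<br />
The sampling step above can be sketched in plain Python. This is only an illustration of the four equations: the function name <code>sample_latent</code>, the list-of-lists weight layout, and the toy weights are assumptions, not the authors' implementation.<br />

```python
import math
import random


def sample_latent(h, W_mu, b_mu, W_sigma, b_sigma, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), following the equations above."""

    def affine(W, b, v):
        # One fully connected layer: W v + b, with W given as a list of rows.
        return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

    mu = affine(W_mu, b_mu, h)
    sigma_hat = affine(W_sigma, b_sigma, h)
    sigma = [math.exp(s / 2.0) for s in sigma_hat]  # always strictly positive
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return mu, sigma_hat, sigma, z


# Toy example with N_z = 2 (all weights are made up for illustration).
h = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
mu, sigma_hat, sigma, z = sample_latent(h, identity, [0.0, 0.0], zeros, [0.0, 0.0])
```

Because the randomness enters only through <code>eps</code>, repeated calls with the same <math>h</math> give different values of <math>z</math> around the same mean.<br />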
<br />
==== Decoder ====<br />
The decoder is an autoregressive RNN. The initial hidden and cell states are generated using <math>[h_0;c_0] = \tanh(W_z z + b_z)</math>. Here, <math>c_0</math> is used only if applicable (e.g., if an LSTM decoder is used). <math>S_0</math> is defined as <math>(0,0,1,0,0)</math> (the pen is touching the paper at location 0, 0).<br />
<br />
For each step <math>i</math> in the decoder, the input <math>x_i</math> is the concatenation of the previous point <math>S_{i-1}</math> and the latent vector <math>z</math>. The outputs of the RNN decoder <math>y_i</math> are parameters for a probability distribution that will generate the next point <math>S_i</math>. <br />
<br />
The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model the ground truth data <math>(p_1, p_2, p_3)</math> as a categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2\ \text{and}\ q_3</math> sum up to 1,<br />
<br />
\begin{align*}<br />
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi_j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), \quad \text{where } \sum_{j=1}^{M}\Pi_j = 1<br />
\end{align*}<br />
<br />
Here <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is a bivariate normal distribution with means <math>\mu_x, \mu_y</math>, standard deviations <math>\sigma_x, \sigma_y</math> and correlation parameter <math>\rho_{xy}</math>. There are <math>M</math> such distributions. <math>\Pi</math> is a categorical distribution vector of length <math>M</math> whose entries are the mixture weights of the Gaussian mixture model.<br />
<br />
The output vector <math>y_i</math> is obtained by applying a fully connected layer to the hidden state of the RNN.<br />
<br />
\begin{split}<br />
&x_i = [S_{i-1}; z], \\<br />
&[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), \\<br />
&y_i = W_y h_i + b_y, \\<br />
&y_i \in \mathbb{R}^{6M+3}. \\<br />
\end{split}<br />
<br />
The output consists of the parameters of the probability distribution of the next data point.<br />
<br />
\begin{align*}<br />
[(\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_1\ (\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_2\ ...\ (\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_M\ (\hat{q}_1\ \hat{q}_2\ \hat{q}_3)] = y_i<br />
\end{align*}<br />
<br />
<math>\exp</math> and <math>\tanh</math> operations are applied to ensure that the standard deviations are non-negative and the correlation value is between -1 and 1.<br />
<br />
\begin{align*}<br />
\sigma_x = \exp (\hat \sigma_x),\ <br />
\sigma_y = \exp (\hat \sigma_y),\ <br />
\rho_{xy} = \tanh(\hat \rho_{xy}). <br />
\end{align*}<br />
<br />
The categorical probabilities <math>(q_1, q_2, q_3)</math> for the pen states <math>(p_1, p_2, p_3)</math> and the mixture weights <math>\Pi_k</math> are obtained by applying a softmax to the corresponding outputs:<br />
<br />
\begin{align*}<br />
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}}, \quad k \in \left\{1,2,3\right\}, \\<br />
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}}, \quad k \in \left\{1,...,M\right\}.<br />
\end{align*}<br />
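<br />
As a rough illustration of how the raw output <math>y_i \in \mathbb{R}^{6M+3}</math> is turned into valid distribution parameters, the sketch below applies the <math>\exp</math>, <math>\tanh</math> and softmax transforms described above. The function name <code>split_output</code>, the dictionary keys, and the assumed memory layout (the <math>M</math> groups of six values first, then the three pen logits) are illustrative assumptions.<br />

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_output(y, M):
    """Split y (length 6M + 3) into mixture weights, components and pen probabilities.

    Assumed layout: M groups of (pi_hat, mu_x, mu_y, sigma_x_hat, sigma_y_hat,
    rho_hat), followed by the three pen-state logits.
    """
    if len(y) != 6 * M + 3:
        raise ValueError("expected a vector of length 6M + 3")
    groups = [y[6 * j: 6 * j + 6] for j in range(M)]
    pi = softmax([g[0] for g in groups])
    comps = [
        {
            "mu_x": g[1],
            "mu_y": g[2],
            "sigma_x": math.exp(g[3]),  # exp keeps standard deviations positive
            "sigma_y": math.exp(g[4]),
            "rho": math.tanh(g[5]),     # tanh keeps the correlation in (-1, 1)
        }
        for g in groups
    ]
    q = softmax(y[6 * M:])
    return pi, comps, q


pi, comps, q = split_output([0.0] * 15, M=2)
```

With an all-zero output vector, the mixture weights and pen probabilities are uniform, every standard deviation is 1, and every correlation is 0.<br />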
<br />
It is hard for the model to decide when to stop drawing because the probabilities of the three events <math>(p_1, p_2, p_3)</math> are very unbalanced. Researchers in the past have used different weights for each pen event probability, but the authors found this approach inelegant and inadequate. Instead, they define a hyperparameter <math>N_{max}</math>, the length of the longest sketch in the training set, and set <math>S_i</math> to <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>.<br />
<br />
During sampling, an outcome <math>S_i^{'}</math> is generated at each time step and fed as the input for the next time step. The process stops when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is therefore not deterministic but a random sequence conditioned on <math>z</math>. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.<br />
<br />
\begin{align*}<br />
\hat q_k \rightarrow \frac{\hat q_k}{\tau}, <br />
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau}, <br />
\sigma_x^2 \rightarrow \sigma_x^2\tau, <br />
\sigma_y^2 \rightarrow \sigma_y^2\tau. <br />
\end{align*}<br />
<br />
The temperature <math>\tau</math> typically ranges from 0 to 1. In the limit <math>\tau \rightarrow 0</math> the output becomes deterministic, as the sample will consist of the points at the peak of the probability density function.<br />
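<br />
The temperature adjustment can be sketched as follows. Since the variances are scaled by <math>\tau</math>, the standard deviations are scaled by <math>\sqrt{\tau}</math>. The function name <code>apply_temperature</code> and the restriction to <math>\tau > 0</math> (the deterministic case is only reached as a limit) are assumptions of this sketch.<br />

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(pi_hat, q_hat, sigma_x, sigma_y, tau):
    """Rescale logits and standard deviations by the temperature tau (> 0)."""
    if tau <= 0:
        raise ValueError("tau must be positive; tau -> 0 is only a limit")
    pi = softmax([v / tau for v in pi_hat])  # logits divided by tau
    q = softmax([v / tau for v in q_hat])
    scale = math.sqrt(tau)                   # sigma^2 -> sigma^2 * tau
    return pi, q, [s * scale for s in sigma_x], [s * scale for s in sigma_y]


pi, q, sx, sy = apply_temperature([1.0, 0.0], [0.0, 0.0, 0.0], [2.0], [4.0], tau=0.25)
```

Lowering <math>\tau</math> concentrates the categorical distributions on their largest logit and shrinks the Gaussian components, so samples stay closer to the most likely point.<br />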
<br />
=== Unconditional Generation ===<br />
As a special case, the decoder RNN alone can be trained as a standalone autoregressive model without latent variables. In this case, the initial states are 0 and the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. Figure 3 shows sketches generated unconditionally while varying the temperature parameter from <math>\tau = 0.2</math> at the top (in blue) to <math>\tau = 0.9</math> at the bottom (in red).<br />
<br />
[[File:sketchfig3.png|700px|center]]<br />
<br />
=== Training ===<br />
The training process is the same as for a Variational Autoencoder. The loss function is the sum of the Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> is computed from the generated parameters of the pdf and the training data <math>S</math>. It is the sum of <math>L_s</math> and <math>L_p</math>, which are the log losses of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>, respectively.<br />
<br />
\begin{align*}<br />
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log(\sum_{j = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x_i,\Delta y_i | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})), <br />
\end{align*}<br />
\begin{align*}<br />
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), \\<br />
L_R = L_s + L_p.<br />
\end{align*}<br />
<br />
<br />
Both terms are normalized by <math>N_{max}</math>.<br />
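<br />
A small sketch of the two reconstruction terms is given below, assuming the standard bivariate normal density and zero-based indexing over time steps. The function names and the tuple layout of the predictions are illustrative assumptions; predicted probabilities are assumed to be strictly positive so that the logarithms are defined.<br />

```python
import math


def bivariate_normal_pdf(dx, dy, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Density of a bivariate normal with correlation rho at (dx, dy)."""
    zx = (dx - mu_x) / sigma_x
    zy = (dy - mu_y) / sigma_y
    z = zx * zx + zy * zy - 2.0 * rho * zx * zy
    one_minus_rho2 = 1.0 - rho * rho
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2)
    return math.exp(-z / (2.0 * one_minus_rho2)) / norm


def reconstruction_loss(targets, preds, n_s, n_max):
    """Return (L_s, L_p).

    targets: n_max rows (dx, dy, p1, p2, p3), padded with (0, 0, 0, 0, 1).
    preds:   n_max tuples (pi, comps, q), where comps is a list of
             (mu_x, mu_y, sigma_x, sigma_y, rho) tuples.
    """
    l_s = 0.0
    l_p = 0.0
    for i in range(n_max):
        dx, dy, *pen = targets[i]
        pi, comps, q = preds[i]
        if i < n_s:  # the offset term is summed only over real points
            mix = sum(w * bivariate_normal_pdf(dx, dy, *c) for w, c in zip(pi, comps))
            l_s -= math.log(mix)
        # the pen-state term is summed over all N_max steps
        l_p -= sum(p * math.log(qk) for p, qk in zip(pen, q))
    return l_s / n_max, l_p / n_max


targets = [(0.0, 0.0, 1, 0, 0), (0.0, 0.0, 0, 0, 1)]
preds = [
    ([1.0], [(0.0, 0.0, 1.0, 1.0, 0.0)], [0.5, 0.25, 0.25]),
    ([1.0], [(0.0, 0.0, 1.0, 1.0, 0.0)], [0.25, 0.25, 0.5]),
]
l_s, l_p = reconstruction_loss(targets, preds, n_s=1, n_max=2)
l_r = l_s + l_p
```

Note that the padded steps beyond <math>N_s</math> contribute to <math>L_p</math> but not to <math>L_s</math>, which is how the model is pushed to learn when to stop drawing.<br />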
<br />
<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an i.i.d. Gaussian vector with zero mean and unit variance.<br />
<br />
\begin{align*}<br />
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))<br />
\end{align*}<br />
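<br />
The KL term can be sketched in a few lines. This sketch assumes that the expression above is summed over the <math>N_z</math> components of <math>\mu</math> and <math>\hat \sigma</math>, which the vector notation leaves implicit; the function name <code>kl_loss</code> is illustrative.<br />

```python
import math


def kl_loss(mu, sigma_hat):
    """KL divergence term, read as a sum over latent components divided by 2 * N_z."""
    n_z = len(mu)
    total = sum(1.0 + s - m * m - math.exp(s) for m, s in zip(mu, sigma_hat))
    return -total / (2.0 * n_z)


# The term vanishes when the encoder outputs a standard normal (mu = 0, sigma_hat = 0).
at_prior = kl_loss([0.0, 0.0], [0.0, 0.0])
shifted = kl_loss([1.0, 1.0], [0.0, 0.0])
```

The value is zero exactly when every component has zero mean and unit variance, and it grows as the encoded distribution moves away from the prior.<br />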
<br />
The overall loss is weighted as:<br />
<br />
\begin{align*}<br />
Loss = L_R + w_{KL} L_{KL}<br />
\end{align*}<br />
<br />
As <math>w_{KL} \rightarrow 0</math>, the model approaches a pure autoencoder: it sacrifices the ability to enforce a prior over the latent space and gains better reconstruction loss metrics. For unconditional generation, where the model is the standalone decoder, there is no <math>L_{KL}</math> term and only <math>L_{R}</math> is optimized.<br />
<br />
While the aforementioned loss function could be used, it was found that annealing the KL term (as shown below) in the loss function produces better results.<br />
<br />
<center><math><br />
\eta_{step} = 1 - (1 - \eta_{min})R^{step}<br />
</math></center><br />
<br />
<center><math><br />
Loss_{train} = L_R + w_{KL} \eta_{step} \max(L_{KL}, KL_{min})<br />
</math></center><br />
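<br />
The annealing schedule can be sketched directly from the two formulas above. The numeric values used for <math>\eta_{min}</math>, <math>R</math> and <math>KL_{min}</math> below are placeholders chosen for illustration, not settings reported in this summary.<br />

```python
def eta(step, eta_min, R):
    """Annealing factor: starts at eta_min and approaches 1 as step grows (0 < R < 1)."""
    return 1.0 - (1.0 - eta_min) * R ** step


def train_loss(l_r, l_kl, step, w_kl, eta_min, R, kl_min):
    """Annealed training loss with a floor KL_min on the KL term."""
    return l_r + w_kl * eta(step, eta_min, R) * max(l_kl, kl_min)


# Illustrative values only.
early = train_loss(l_r=1.0, l_kl=0.1, step=0, w_kl=1.0, eta_min=0.01, R=0.9999, kl_min=0.2)
late = train_loss(l_r=1.0, l_kl=0.1, step=10**6, w_kl=1.0, eta_min=0.01, R=0.9999, kl_min=0.2)
```

Early in training the KL term is almost switched off, so the model can first focus on reconstruction; the floor <math>KL_{min}</math> means that the KL term stops contributing a gradient once <math>L_{KL}</math> falls below it.<br />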
<br />
As shown in Figure 4, the <math>L_{R} </math> metric for the standalone decoder model is actually an upper bound for different models using a latent vector. The reason is that the unconditional model does not have access to the entire sketch it needs to generate.<br />
<br />
[[File:s.png|600px|thumb|center|Figure 4. Tradeoff between <math>L_{R} </math> and <math>L_{KL} </math>, for two models trained on single class datasets (left).<br />
Validation Loss Graph for models trained on the Yoga dataset using various <math>w_{KL} </math>. (right)]]<br />
<br />
== Experiments ==<br />
The authors experiment with the sketch-rnn model using different settings and record both losses. They use a Long Short-Term Memory (LSTM) model as the encoder and a HyperLSTM as the decoder. HyperLSTM is a type of RNN cell that excels at sequence generation tasks. The ability of HyperLSTM to spontaneously augment its own weights enables it to adapt to many different regimes in a large, diverse dataset. They also run experiments on multi-class datasets. The results are as follows.<br />
<br />
[[File:sketchtable1.png|700px|center]]<br />
<br />
The trade-off between <math>L_R</math> and <math>L_{KL}</math> is clearly visible in this table. Furthermore, <math>L_R</math> decreases as <math>w_{KL} </math> is halved.<br />
<br />
=== Conditional Reconstruction ===<br />
The authors compare reconstructed sketches against a given input sketch for different <math>\tau</math> values. With the high <math>\tau</math> values on the right, the reconstructed sketches are more random.<br />
<br />
[[File:sketchfig5.png|700px|center]]<br />
<br />
They also experiment on inputting a sketch from a different class. The output will still keep some features from the class that the model is trained on.<br />
<br />
=== Latent Space Interpolation ===<br />
The authors visualize the reconstructed sketches while interpolating between latent vectors using different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are more coherently interpolated.<br />
<br />
[[File:sketchfig6.png|700px|center]]<br />
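<br />
Interpolating between two latent vectors can be sketched as below. Spherical interpolation is one common choice for Gaussian latent spaces; this summary does not state which scheme the authors used, so the choice of <code>slerp</code> here is an assumption, with linear interpolation as the fallback for degenerate inputs.<br />

```python
import math


def lerp(z0, z1, t):
    """Linear interpolation between two latent vectors."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def slerp(z0, z1, t):
    """Spherical interpolation; falls back to lerp for (anti-)parallel or zero vectors."""
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    if n0 == 0.0 or n1 == 0.0:
        return lerp(z0, z1, t)
    cos_omega = sum(a * b for a, b in zip(z0, z1)) / (n0 * n1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))  # angle between the vectors
    s = math.sin(omega)
    if s < 1e-8:
        return lerp(z0, z1, t)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


# Ten evenly spaced latent vectors between two (made-up) encodings.
z_a, z_b = [1.0, 0.0], [0.0, 1.0]
path = [slerp(z_a, z_b, k / 9.0) for k in range(10)]
```

Each vector on the path would then be passed to the decoder to produce one sketch of the interpolation sequence.<br />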
<br />
=== Sketch Drawing Analogies ===<br />
Since the latent vector <math>z</math> encodes conceptual features of a sketch, those features can also be used to augment other sketches that do not have them. This is possible when models are trained to low <math>L_{KL}</math> values. The authors perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches based on the resulting latent vectors.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches === <br />
This model is able to complete an unfinished sketch by first encoding it into a hidden state <math>h</math> using the decoder RNN and then using <math>h</math> as the initial hidden state to generate the remaining points. The authors train decoder-only models on individual classes and set <math>\tau = 0.8</math> to complete samples. Figure 7 shows the results.<br />
<br />
[[File:sketchfig7.png|700px|center]]<br />
<br />
== Limitations ==<br />
<br />
Although sketch-rnn can model a large variety of sketch drawings, there are several limitations in the current approach. For most single-class datasets, sketch-rnn is capable of modeling sequences of around 300 data points. The model becomes increasingly difficult to train beyond this length. For the authors' dataset, the Ramer-Douglas-Peucker algorithm is used to simplify the strokes of the sketch data to fewer than 200 data points.<br />
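<br />
For readers unfamiliar with the Ramer-Douglas-Peucker algorithm mentioned above, a minimal recursive sketch is shown below. It is a textbook version using the distance to the line through the two endpoints; the tolerance parameter <code>epsilon</code> and the example values are illustrative and are not the settings used by the authors.<br />

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:  # a and b coincide
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)


def rdp(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [first, last]  # every interior point is close enough to drop
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right  # avoid duplicating the split point


stroke = [(0, 0), (1, 0.05), (2, -0.05), (3, 0), (4, 3), (5, 6)]
simplified = rdp(stroke, epsilon=0.5)
```

A larger <code>epsilon</code> removes more points, so it controls the trade-off between sequence length and fidelity to the original stroke.<br />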
<br />
For more complicated classes of images, such as mermaids or lobsters, the reconstruction loss metrics are not as good compared to simpler classes such as ants, faces or firetrucks. The models trained on these more challenging image classes tend to draw smoother, more circular line segments that do not resemble individual sketches, but rather resemble an averaging of many sketches in the training set. This smoothness may be analogous to the blurriness effect produced by a Variational Autoencoder that is trained on pixel images. Depending on the use case of the model, smooth circular lines can be viewed as aesthetically pleasing and a desirable property.<br />
<br />
While both conditional and unconditional models are capable of training on datasets of several classes, sketch-rnn is ineffective at modeling a large number of classes simultaneously. The generated samples will be incoherent, with different classes shown in the same sketch.<br />
<br />
== Applications and Future Work ==<br />
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them to find interesting intersections between different drawings or objects, or generating many similar but different designs. In the simplest use, pattern designers can apply sketch-rnn to generate a large number of similar but unique designs for textile or wallpaper prints. Creative designers can also come up with abstract designs that enable them to resonate more with their target audience.<br />
<br />
This model may also find its place in teaching students how to draw. Even with the simple sketches in QuickDraw, the authors of this work have become much more proficient at drawing animals, insects, and various sea creatures after conducting these experiments.<br />
When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetically pleasing one. Latent vector augmentation could also help to create a better drawing by incorporating user-rating data during training.<br />
<br />
The authors conclude by providing the following future directions to this work:<br />
# Investigate using user-rating data to augment the latent vector in the direction that maximizes the aesthetics of the drawing.<br />
# Look into combining variations of sequence-generation models with unsupervised, cross-domain pixel image generation models.<br />
<br />
Combining this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches is an especially exciting prospect.<br />
<br />
The authors also mention the opposite direction, converting a photograph of an object into an unrealistic but similar-looking sketch of the object composed of a minimal number of lines, as a more interesting problem.<br />
<br />
Moreover, it would be interesting to see how varying the loss function affects the resulting drawings. Some exotic form of loss function may change the way that the network behaves, which could lead to various applications.<br />
<br />
== Conclusion ==<br />
The paper presents a methodology to model sketch drawings using recurrent neural networks. It introduces the sketch-rnn model, which can encode and decode sketches, generate new sketches, and complete unfinished ones. In addition, the authors demonstrate how to interpolate between latent vectors of sketches from different classes, and how to use the latent space to augment sketches or generate similar-looking sketches. Furthermore, they show the importance of enforcing a prior distribution on the latent vector for generating coherent sketches during interpolation. Finally, they create a large dataset of sketch drawings for future research.<br />
<br />
== Critique ==<br />
This paper presents both a novel large dataset of sketches and a new RNN architecture to generate new sketches. Although it is exciting to read about, many improvements could be made.<br />
<br />
* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing generated sketches, which is clear and straightforward but not very efficient. It would be helpful if the authors presented a metric to evaluate how well the sketches are generated, rather than printing them out and relying on human judgment. The authors did not present an evaluation of the algorithms either. They provide <math>L_R</math> and <math>L_{KL}</math> for reference; however, a lower loss does not necessarily mean better performance. Training loss alone likely does not capture the quality of a sketch.<br />
<br />
* The authors have not provided training details such as learning rate, training time, and parameter size.<br />
<br />
* The algorithm lacks comparison to the prior state of the art on standard metrics, which makes its novelty unclear. Using strokes as inputs is a novel and innovative move; however, the paper does not provide a baseline or any comparison with other methods or algorithms. Some other research using similar, smaller datasets is mentioned in the paper. It would be helpful if the authors used some basic or existing methods as a baseline and compared them with the new algorithm.<br />
<br />
* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.<br />
<br />
* The authors did not present a complexity analysis or a deeper mathematical analysis of the algorithms in the paper. The paper also does not include a comparison against previous results using more standard metrics. Therefore, it lacks some algorithmic contribution. It would be better to include more formal analysis on the algorithmic side.<br />
<br />
* The authors proposed a few future applications for the model; however, the current output does not yet seem very close to their descriptions. Still, I believe this is a very good beginning: with the release of the sketch dataset, it should attract more scholars to research and improve on it.<br />
<br />
* As the authors acknowledge, the model becomes increasingly difficult to train as the length of the sketch sequences increases.<br />
<br />
== References == <br />
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.<br />
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.<br />
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.<br />
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.<br />
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.<br />
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.<br />
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.<br />
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.<br />
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.<br />
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.<br />
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.<br />
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.<br />
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. 2016. URL https://quickdraw.withgoogle.com/.<br />
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.<br />
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.<br />
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.<br />
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.<br />
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.<br />
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.<br />
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.<br />
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.<br />
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.<br />
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.<br />
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.<br />
# T. White. Sampling Generative Networks. [https://arxiv.org/abs/1609.04468 ArXiv e-prints], September 2016.<br />
# Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.<br />
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=42367DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-12-10T22:54:51Z<p>Msminhas: Editorial</p>
<hr />
<div>=Introduction=<br />
<br />
It has been commonly believed that one major advantage of neural networks is their capability of modeling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.<br />
<br />
With the growth in available computational power, neural networks have been able to solve many complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. In several areas, such as computational social science, interpretability is of utmost importance. Since it is not well understood how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models such as linear regression and decision trees, which are much more interpretable. This paper presents one way of bringing interpretability to a neural network.<br />
<br />
Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example, post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].<br />
<br />
In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix. This approach is efficient because it avoids searching over an exponential solution space of interaction candidates by making an approximation of hidden unit importance at the first hidden layer via all weights above and doing a 2D traversal of the input weight matrix.<br />
<br />
Note that this paper considers only one specific type of neural network, the feedforward neural network. Based on the methodology discussed here, the authors suggest that interpretation methods can also be built for other types of networks.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
* Conduct individual tests for all feature combinations, as in ANOVA and Additive Groves. Two-way ANOVA has been a standard method of performing pairwise interaction detection; it conducts a hypothesis test for each interaction candidate using F-statistics (Wonnacott & Wonnacott, 1972). Additive Groves is another method that conducts individual tests for interactions and hence faces the same computational difficulties; however, it is special because the interactions it detects are not constrained to any functional form.<br />
* Define all interaction forms of interest, then later find the important ones.<br />
<br />
- The paper's goal is to detect interactions without compromising the functional forms. The proposed method also accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.<br />
<br />
2. Interpretability: A lot of work has also been done in this area, and it can be divided into the following broad categories:<br />
* Feature Importance through Decomposition: Methods like Input Gradient (Sundararajan et al., 2017) learn the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al. (2017), Murdoch (2017) and Murdoch (2018) study the interpretability of LSTMs by looking at phrase-level and word-level importance scores. Bach et al. (2015) and Shrikumar et al. (2016, DeepLift) study pixel importance in CNNs.<br />
* Studying Visualizations in Models: Karpathy et al. (2015) worked with character-generating LSTMs and studied activation and firing in certain hidden units for meaningful attributes. Yosinski et al. (2015) study feature map visualizations, providing one tool for visualizing live activations on each layer of a trained CNN and another for visualizing features via regularized optimization.<br />
* Attention-Based Models: Bahdanau et al. (2014). These are a different class of models that use attention modules (with different architectures) to help the neural network decide which parts of the input it should look at more closely or give more importance to. Looking at the results of this type of model, an indirect sense of interpretability can be gauged.<br />
* Sum-Product Networks (Hoifung Poon and Pedro Domingos, 2011): A deep architecture that provides clear semantics. At its core, it is a probabilistic model with two types of nodes: sum nodes and product nodes. The sum nodes model mixtures of distributions and the product nodes model joint distributions. It can be trained using gradient descent as well as other methods. The main advantage of the Sum-Product Network is its clear semantics, which let people interpret exactly how the network makes decisions. Therefore, it has better interpretability than most current deep architectures.<br />
<br />
The approach in this paper is to extract non-additive interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive into methodology, we are going to define a few notations here. Most of them will be trivial.<br />
<br />
1. Vector: Vectors are defined with bold-lowercases, '''v, w'''<br />
<br />
2. Matrix: Matrices are defined with bold-uppercases, '''V, W'''<br />
<br />
3. Integer Set: For some integer <math>p \in \mathbb{Z}</math>, we define <math>[p] := \{1,2,3,...,p\}</math><br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects on the output. Therefore, we define an 'interaction' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, for a function like <math>x_1x_2 + \sin(x_3 + x_4 + x_5)</math>, we have the interactions <math>\{x_1, x_2\}</math> and <math>\{x_3, x_4, x_5\}</math>. We say that the latter is a 3-way interaction.<br />
<br />
Note that from the definition above, we can naturally deduce that a d-way interaction can exist only if all of its (d-1)-way sub-interactions exist. For example, the 3-way interaction above implies the 2-way interactions <math>\{3,4\}</math>, <math>\{4,5\}</math> and <math>\{3,5\}</math>.<br />
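<br />
As a small illustration of this property, the following Python sketch enumerates the (d-1)-way sub-interactions that a given d-way interaction implies. The function name <code>sub_interactions</code> is hypothetical and is not taken from the paper or its code.<br />

```python
from itertools import combinations


def sub_interactions(interaction):
    """Return all (d-1)-way subsets of a d-way interaction (a set of feature indices)."""
    d = len(interaction)
    # combinations() over the sorted indices yields each (d-1)-subset exactly once
    return [set(c) for c in combinations(sorted(interaction), d - 1)]


# The 3-way interaction from the example above
print(sub_interactions({3, 4, 5}))
```

For the interaction <math>\{3,4,5\}</math> this lists the three pairs named in the text.<br />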
<br />
One thing that we need to keep in mind is that for models like neural networks, most of the interactions are happening within hidden layers. This means that we need a proper way of measuring interaction strength.<br />
<br />
The key observation is that for any kind of interaction, there is some hidden unit in some hidden layer that has all of the interacting features as ancestors. In graph-theoretical language, the network can be viewed as an associated directed graph, and for any interaction <math>\Gamma \subseteq [p]</math>, there exists at least one vertex that has all of the features of <math>\Gamma</math> as ancestors. The statement can be made rigorous as follows:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
The statement above guarantees that interaction strengths can be measured at any hidden layer. For example, if we want to study interactions at some specific hidden layer, we now know that the corresponding vertices exist between that hidden layer and the output layer. Therefore, all we need is an appropriate measure that summarizes the information between those two layers.<br />
<br />
Before doing so, consider a neural network with a single hidden layer. For any single hidden unit, there can be as many as <math>2^{\|W_{i,:}\|_0}</math> interactions. This means that the search space might be too large for multi-layered networks, so we need a reasonable way of approximating it. The authors achieve fast interaction detection by limiting the search to interactions created at the first hidden layer. The figure below illustrates an interaction within a fully connected feedforward neural network, where the box contains later layers in the network.<br />
<br />
[[File:network1.PNG|500px|center]]<br />
<br />
==Measuring influence in hidden layers==<br />
As discussed above, in order to consider the interaction between units in any layer, we need to think about their outgoing paths. However, for a fully-connected multi-layer neural network, the search space might be too large to explore. Therefore, we use an upper bound on the gradient along the outgoing paths. To represent the influence of the outgoing paths at hidden layer <math>l</math>, we define the cumulative impact of the weights between the output layer and layer <math>l+1</math>. We define the aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in \mathbb{R}^{p_l}</math>, where <math>p_l</math> is the number of hidden units in layer <math>l</math>.<br />
Moreover, the aggregated weight bounds the magnitude of the gradient of the output with respect to the corresponding hidden unit, so it acts as a Lipschitz constant. The gradient has been an important quantity for measuring the influence of features, especially when we consider that the derivative with respect to the input layer gives the direction normal to decision boundaries.<br />
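<br />
The aggregated weights can be sketched in a few lines of NumPy. The definition itself is only shown as an image above, so this sketch assumes the form <math>z^{(l)} = |w^y|^T |W^{(L)}| \cdots |W^{(l+1)}|</math>, i.e. a product of element-wise absolute weight matrices from the output down to layer <math>l+1</math>. The function name, argument names and array shapes are illustrative assumptions.<br />

```python
import numpy as np


def aggregated_weights(weights, w_out, l=1):
    """Assumed form: z^(l) = |w_out|^T |W^(L)| ... |W^(l+1)|.

    weights: list [W1, ..., WL], where W_k has shape (p_k, p_{k-1})
    w_out:   output weight vector of shape (p_L,)
    Returns a vector of shape (p_l,), one value per hidden unit of layer l.
    """
    z = np.abs(w_out)
    # weights[l:] holds W^(l+1), ..., W^(L); multiply from the top layer downwards
    for W in reversed(weights[l:]):
        z = z @ np.abs(W)
    return z


# Toy network: 2 inputs -> 3 hidden units -> 2 hidden units -> 1 output
W1 = np.array([[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]])
W2 = np.array([[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]])
w_out = np.array([1.0, -2.0])
z1 = aggregated_weights([W1, W2], w_out, l=1)
```

Because only absolute values are multiplied, every entry of the result is non-negative, which fits its role as an upper bound on influence.<br />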
<br />
==Quantifying influence==<br />
For hidden unit <math>i</math> at the first hidden layer, which is the layer closest to the input layer, we define the influence strength of some interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.<br />
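<br />
The product described above can be sketched as follows, assuming the strength at unit <math>i</math> is the aggregated weight <math>z_i^{(1)}</math> times <math>\mu</math> applied to the absolute incoming weights of the candidate features. The function and argument names are hypothetical, and the choice of <math>\mu</math> is passed in as a plain NumPy reduction.<br />

```python
import numpy as np


def interaction_strength(W1, z1, unit, features, mu=np.min):
    """Strength of the candidate interaction `features` at first-layer hidden unit `unit`.

    W1: first-layer weight matrix of shape (p_1, p)
    z1: aggregated weights of the first hidden layer, shape (p_1,)
    mu: reduction over the absolute incoming weights (e.g. np.min, np.max, np.mean)
    """
    incoming = np.abs(W1[unit, list(features)])
    return z1[unit] * mu(incoming)


W1 = np.array([[3.0, -2.0, 0.5]])
z1 = np.array([2.0])
s_min = interaction_strength(W1, z1, unit=0, features=(0, 1), mu=np.min)
s_max = interaction_strength(W1, z1, unit=0, features=(0, 1), mu=np.max)
s_avg = interaction_strength(W1, z1, unit=0, features=(0, 1), mu=np.mean)
```

Swapping <code>mu</code> changes how strongly a single weak incoming weight reduces the score of a candidate interaction.<br />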
<br />
Now based on the specifications above, the authors suggest the following algorithm for searching for influential interactions between input layer units:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
It was pointed out that restricting the search to the first hidden layer might miss some important feature interactions; however, the authors state that it is not straightforward to incorporate hidden units at intermediate layers to get better interaction detection performance.<br />
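<br />
A minimal Python sketch of a greedy ranking of this kind is given below. It assumes that, for each first-layer hidden unit, the candidate interactions are the sets of the top-j input features by absolute weight (for j = 2, ..., p), and that strengths for the same candidate are summed across hidden units. Names such as <code>greedy_ranking</code> are illustrative and not taken from the authors' code.<br />

```python
from collections import defaultdict

import numpy as np


def greedy_ranking(W1, z1, mu=np.min):
    """Rank interaction candidates found at the first hidden layer.

    W1: first-layer weights, shape (p_1, p); z1: aggregated weights, shape (p_1,)
    Returns a list of (feature_index_tuple, strength), strongest first.
    """
    p_1, p = W1.shape
    abs_w = np.abs(W1)
    strengths = defaultdict(float)
    for i in range(p_1):
        # Input features of unit i, ordered by decreasing weight magnitude
        order = np.argsort(-abs_w[i])
        for j in range(2, p + 1):
            top = order[:j]
            candidate = tuple(sorted(int(k) for k in top))
            # Accumulate the strength of this candidate across hidden units
            strengths[candidate] += z1[i] * mu(abs_w[i, top])
    return sorted(strengths.items(), key=lambda kv: kv[1], reverse=True)


W1 = np.array([[3.0, 2.0, 0.0], [0.0, 1.0, 4.0]])
z1 = np.array([1.0, 3.0])
ranking = greedy_ranking(W1, z1)
```

Each hidden unit contributes only p-1 candidates, so the search is a 2D traversal of the input weight matrix rather than an enumeration of all feature subsets.<br />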
<br />
=Cut-off Model=<br />
Using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to decide which of the ranked interactions are true interactions, we build a cut-off model, which is a generalized additive model (GAM), as below,<br />
<br />
<center><math><br />
c_K(\mathbf{x}) = \sum_{i=1}^{p}g_i(x_i) + \sum_{i=1}^{K}{g_i}^\prime(x_{\mathcal{I}_i})<br />
</math></center><br />
<br />
In the above model, each <math>g_i</math> and <math>g_i'</math> is a feedforward neural network, and <math>\mathcal{I}_i</math> denotes the <math>i</math>-th interaction in the ranking. <math>g_i(\cdot)</math> captures the main effects, while <math>g_i'(\cdot)</math> captures the interactions. We keep adding interactions until the performance reaches a plateau.<br />
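<br />
The additive structure of the cut-off model can be sketched as below. In the paper the component functions are small feedforward networks; here plain Python callables stand in for them, purely as an assumption made to keep the sketch short, and all names are hypothetical.<br />

```python
def cutoff_model(x, main_effects, interaction_effects):
    """Evaluate c_K(x) = sum_i g_i(x_i) + sum_k g'_k(x restricted to interaction k).

    main_effects:        list of p callables, one per input feature
    interaction_effects: list of K pairs (feature_indices, callable)
    """
    total = sum(g(x[i]) for i, g in enumerate(main_effects))
    for features, g in interaction_effects:
        # Each interaction term only sees the features of its own interaction
        total += g(*(x[k] for k in features))
    return total


x = [1.0, 2.0, 3.0]
main = [lambda v: v] * 3                  # stand-ins for the univariate networks
inter = [((0, 1), lambda a, b: a * b)]    # stand-in for one interaction network (K = 1)
y = cutoff_model(x, main, inter)
```

Growing <code>inter</code> one ranked interaction at a time and monitoring validation performance mirrors the plateau criterion described above.<br />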
<br />
=Experiment=<br />
For the experiment, the authors compare three neural network models with traditional statistical interaction detection algorithms. The first model is MLP, and the second is MLP-M, which is an MLP with additional univariate networks at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. In the experiments that the authors performed, all the networks that modeled feature interactions consisted of four hidden layers containing 140, 100, 60, and 20 units respectively, whereas all the individual univariate networks contained three hidden layers with 10 units each. All of these networks used ReLU activation and back-propagation for training. The MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
The authors study their interaction detection framework on both simulated and real-world experiments. For the simulated experiments, the authors test on 10 synthetic functions, as shown in Table I.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
The authors use four real-world datasets, of which two are regression datasets, and the other two are binary classification datasets. The datasets are a mixture of common prediction tasks in the cal housing and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of very-high order interaction detection in the letter dataset.<br />
<br />
The authors also report the results of comparisons between the models. As the tables below show, the neural network based models perform better on average. Compared to traditional methods like ANOVA, the proposed MLP and MLP-M methods show a 20% increase in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly captures the most influential pairwise interactions.<br />
<br />
=Higher-order interaction detection=<br />
The authors use their greedy interaction ranking algorithm to perform higher-order interaction detection without an exponential search of interaction candidates.<br />
[[File:higher-order_interaction_detection.png|700px|center]]<br />
<br />
=Limitations=<br />
Even though the MLP methods showed superior performance in the above synthetic experiment, the method still has some limitations. For example, for a function like <math>x_1x_2 + x_2x_3 + x_1x_3</math>, the neural network fails to distinguish interlinked pairwise interactions from a single higher-order interaction. Moreover, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are present in most interaction detection algorithms. <br />
<br />
In the case of detecting pairwise interactions, interlinked pairwise interactions are often mistaken by the algorithm for more complex interactions. This means that the higher-order interaction algorithm fails to separate interlinked pairwise interactions encoded in the neural network. Another issue is that it sometimes detects abrupt interactions or misses interactions as a result of correlations between features.<br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets are too small or too noisy, which often happens in practical settings.<br />
<br />
=Conclusion=<br />
Here we presented a method for detecting interactions using MLPs. Compared to other state-of-the-art methods like Additive Groves (AG), its performance is competitive while the computational power required is far lower. Therefore, the method should be useful for practitioners with comparatively limited computational resources. Moreover, the NID algorithm successfully reduces the size of the computation. Most importantly, the algorithm lets users of neural networks bring interpretability into their model usage, which improves usability for practitioners outside the machine learning and deep learning areas.<br />
<br />
For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks. Also, it was pointed out that the method depends heavily on L1-regularized neural network training, and that a group lasso penalty may work better.<br />
<br />
=Critique=<br />
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
3. Greedy algorithm is implemented but nothing is mentioned about the speed of this algorithm which is definitely not fast. So, this has the potential to be a weak point of the study.<br />
<br />
=Reference=<br />
<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006<br />
<br />
[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.<br />
<br />
[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.<br />
<br />
[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.<br />
<br />
[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.<br />
<br />
[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.<br />
<br />
[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.<br />
<br />
[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.</div>Msminhashttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling&diff=42366Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling2018-12-10T22:39:46Z<p>Msminhas: Editorial</p>
<hr />
<div>This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here]<br />
<br />
= Introduction and Motivation =<br />
<br />
In recent years, reinforcement learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to outperform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states is finite. Deep CNNs are not directly applicable to large, non-convex continuous action spaces. To address this, the authors conduct a policy search with an efficient stochastic continuous action search on top of policy samples generated from a deep CNN. The deep CNN still discretizes the state space and the action space. However, in the stochastic continuous action search, the restriction of the deterministic discretization is lifted and a local search procedure is conducted in a physical simulator with continuous action samples. In this way, the benefits of both deep neural networks and physical simulators can be realized.<br />
<br />
Interacting with the real world (e.g., a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems, because deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement. <br />
<br />
This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.<br />
<br />
Curling is chosen as a domain to test the network on. Curling was chosen due to its large action space, the potential for complicated strategies, and the need for precise interactions.<br />
<br />
== Curling ==<br />
<br />
Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections provide background on the game play and on potential challenges and concerns for learning algorithms. A terminology section follows.<br />
<br />
=== Game play ===<br />
<br />
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play and must touch or be inside concentric rings (12 feet diameter and smaller) in order to score points. At the end of each end, the team with the rock closest to the center of the house scores one point for each of its rocks that is closer to the center than the opposing team's closest rock.<br />
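<br />
The scoring of an end can be sketched as follows, assuming the standard rule that only the team with the closest rock scores, with one point per rock that beats the opponent's best rock. The function name and the input format (distances from the center for rocks already in the house) are assumptions made for illustration; they are not taken from the paper's simulator.<br />

```python
def score_end(team_a, team_b):
    """Score one end from the distances (to the center) of each team's rocks in the house.

    Returns (winner, points), or (None, 0) for a blank end.
    """
    best_a = min(team_a, default=float("inf"))
    best_b = min(team_b, default=float("inf"))
    if best_a == best_b:
        # No rocks in the house, or an exact tie; treated as a blank end in this sketch
        return (None, 0)
    if best_a < best_b:
        # Count team A rocks closer than team B's best rock
        return ("A", sum(d < best_b for d in team_a))
    return ("B", sum(d < best_a for d in team_b))


result = score_end([0.5, 1.0, 3.0], [2.0])
```

In this example team A has two rocks closer to the center than team B's only rock, so A scores two points and B scores none.<br />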
<br />
When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).<br />
<br />
Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:<br />
<br />
* Draw: Throw a rock to a target location<br />
* Freeze: Draw a rock up against another rock<br />
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions<br />
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)<br />
<br />
=== Challenges for AI ===<br />
<br />
Curling offers many challenges for AI based on its physics and rules. This section lists a few concerns.<br />
<br />
The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.<br />
<br />
Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.<br />
<br />
Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.<br />
<br />
Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.<br />
<br />
=== Terminology ===<br />
<br />
* End: A round of the game<br />
* House: The end of the sheet of ice, which contains<br />
* Hammer: The team that throws the last rock of an end 'has the hammer'<br />
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.<br />
* Back Line: thick line drawn just behind the house. Rocks that pass this line are removed from play.<br />
<br />
<br />
== Related Work ==<br />
<br />
=== AlphaGo Lee ===<br />
<br />
AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol. <br />
<br />
<br />
Go game:<br />
* Start with 19x19 empty board<br />
* One player takes black stones and the other takes white stones<br />
* Two players take turns to put stones on the board<br />
* Once the stone has been placed, the stones cannot be moved anymore<br />
* Rules:<br />
** If one connected part is completely surrounded by the opponent's stones, remove it from the board<br />
** Ko rule: Forbids a board play to repeat a board position<br />
* End when there are no valuable moves. <br />
* Count the territory of both players. The objective of the game is to capture more territory than your opponent. The player with the black stones plays first. However, the black player needs to give 7.5 points to the white player (called Komi) as a tradeoff. There are some variations on how many points the player with the black stones should give, based on different rules in different Asian countries.<br />
* This game used to be a huge challenge for artificial intelligence for two reasons. One is that the search space is extremely large. It is estimated to be on the order of <math>10^{172}</math>, which is more than the number of atoms in the universe and much larger than the number of game states in chess (<math>10^{47}</math>). The other reason is that there was no good heuristic function for evaluating a position in Go, so the traditional alpha-beta pruning algorithm does not perform well. For AlphaGo Lee, the CNN plays the role of a good heuristic function, which results in a huge performance improvement of the AI.<br />
[[File:go.JPG|700px|center]]<br />
<br />
Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.<br />
<br />
The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert game play data and improved through self-play.<br />
<br />
The value