http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=X832zhan&feedformat=atomstatwiki - User contributions [US]2024-03-28T11:12:47ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=41883DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-11-29T17:29:42Z<p>X832zhan: /* Conclusion */</p>
<hr />
<div>=Introduction=<br />
<br />
It has been commonly believed that one major advantage of neural networks is their capability of modeling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.<br />
<br />
With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. Within several areas, like eg: computation social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.<br />
<br />
Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].<br />
<br />
In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix<br />
<br />
Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
* Conduct individual tests for all features' combination such as ANOVA and Additive Groves.<br />
* Define all interaction forms of interest, then later finds the important ones.<br />
- The paper's goal is to detect interactions without compromising the functional forms. Our method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.<br />
<br />
2. Interpretability: A lot of work has also been done in this particular area and it can be divided it the following broad categories:<br />
* Feature Importance through Decomposition: Methods like Input Gradient(Sundararajan et al., 2017) learns the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al(2017), Murdoch(2017) and Murdoch(2018) study interpretability of LSTMs by looking at phrase and word level importance scores. Bach et al. 2015 and Shrikumar et al. 2016 (DeepLift) study pixel importance in CNNs.<br />
* Studying Visualizations in Models - Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. (Yosinski et al., 2015 studies feature map visualizations. <br />
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. Looking at the results of these type of model an indirect sense of interpretability can be gauged.<br />
<br />
The approach in this paper is to extract non-additive interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.<br />
<br />
1. Vector: Vectors are defined with bold-lowercases, '''v, w'''<br />
<br />
2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''<br />
<br />
3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}<br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.<br />
<br />
Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.<br />
<br />
One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.<br />
<br />
The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.<br />
}<br />
Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space. Moreover, the authors realized a fast interaction detection by limiting the search complexity of the task by only quantifying interactions created at the first hidden layer.<br />
[[File:network1.PNG|500px|center]]<br />
<br />
==Measuring influence in hidden layers==<br />
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.<br />
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.<br />
<br />
==Quantifying influence==<br />
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.<br />
<br />
Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
=Cut off Model=<br />
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut off model which is a generalized additive model (GAM) as below,<br />
<br />
[[File:gam1.PNG|900px|center]]<br />
<br />
From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.<br />
<br />
=Experiment=<br />
For the experiment, we are going to compare three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
For the experiment, we are going to test on 10 synthetic functions.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
And the author also reported the results of comparisons between the models. As you can see, neural network based models are performing better in average. Compare to the traditional methods liek ANOVA, MLP and MLP-M method shows 20% increases in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly catch the most influential pair-wise interactions.<br />
<br />
=Limitations=<br />
Even though for the above synthetic experiment MLP methods showed superior performances, the method still have some limitations. For example, fir the function like, <math>x_1x_2 + x_2x_3 + x_1x_3</math>, neural network fails to distinguish between interlinked interactions to single higher order interaction. Moreoever, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are presented most of interaction detection algorithms. <br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets that are too small or too noisy, which often occurs in practical settings. <br />
<br />
=Conclusion=<br />
Here we presented the method of detecting interactions using MLP. Compared to other state-of-the-art methods like Additive Groves (AG), the performances are competitive yet computational powers required is far less. Therefore, it is safe to claim that the method will be extremly useful for practitioners with (comparably) less computational powers. Moreover, the NIP algorithm successfully reduced the computation sizes. After all, the most important aspect of this algorithm is that now users of nueral networks can impose interpretability in the model usage, which will change the level of usability to another level for most of practitioners outside of those working in machine learning and deep learning areas.<br />
For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks.<br />
<br />
=Critique=<br />
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
=Reference=<br />
<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006<br />
<br />
[8] Sebastian Bach, Alexander Binder, Gre ́goire Montavon, Frederick Klauschen, Klaus-Robert Mu ̈ller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intel- ligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.<br />
<br />
[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.<br />
<br />
[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254– 1259, 1998.<br />
<br />
[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.<br />
<br />
[13]Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Vi- sualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.<br />
<br />
[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical sys- tems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.<br />
<br />
[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=41875DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-11-29T17:10:17Z<p>X832zhan: /* Interaction */</p>
<hr />
<div>=Introduction=<br />
<br />
It has been commonly believed that one major advantage of neural networks is their capability of modeling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.<br />
<br />
With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. Within several areas, like eg: computation social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.<br />
<br />
Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].<br />
<br />
A major advantage of neural networks is their ability to model complex statistical interactions between features for automatic feature learning. In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix<br />
<br />
Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
* Conduct individual tests for all features' combination such as ANOVA and Additive Groves.<br />
* Define all interaction forms of interest, then later finds the important ones.<br />
- The paper's goal is to detect interactions without compromising the functional forms. Our method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.<br />
<br />
2. Interpretability: A lot of work has also been done in this particular area and it can be divided it the following broad categories:<br />
* Feature Importance through Decomposition: Methods like Input Gradient(Sundararajan et al., 2017) learns the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al(2017), Murdoch(2017) and Murdoch(2018) study interpretability of LSTMs by looking at phrase and word level importance scores. Bach et al. 2015 and Shrikumar et al. 2016 (DeepLift) study pixel importance in CNNs.<br />
* Studying Visualizations in Models - Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. (Yosinski et al., 2015 studies feature map visualizations. <br />
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. Looking at the results of these type of model an indirect sense of interpretability can be gauged.<br />
<br />
The approach in this paper is to extract non-additive interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.<br />
<br />
1. Vector: Vectors are defined with bold-lowercases, '''v, w'''<br />
<br />
2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''<br />
<br />
3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}<br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.<br />
<br />
Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.<br />
<br />
One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.<br />
<br />
The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.<br />
}<br />
Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space. Moreover, the authors realized a fast interaction detection by limiting the search complexity of the task by only quantifying interactions created at the first hidden layer.<br />
[[File:network1.PNG|500px|center]]<br />
<br />
==Measuring influence in hidden layers==<br />
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.<br />
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.<br />
<br />
==Quantifying influence==<br />
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.<br />
<br />
Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
=Cut off Model=<br />
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut off model which is a generalized additive model (GAM) as below,<br />
<br />
[[File:gam1.PNG|900px|center]]<br />
<br />
From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.<br />
<br />
=Experiment=<br />
For the experiment, we are going to compare three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
For the experiment, we are going to test on 10 synthetic functions.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
And the author also reported the results of comparisons between the models. As you can see, neural network based models are performing better in average. Compare to the traditional methods liek ANOVA, MLP and MLP-M method shows 20% increases in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly catch the most influential pair-wise interactions.<br />
<br />
=Limitations=<br />
Even though for the above synthetic experiment MLP methods showed superior performances, the method still have some limitations. For example, fir the function like, <math>x_1x_2 + x_2x_3 + x_1x_3</math>, neural network fails to distinguish between interlinked interactions to single higher order interaction. Moreoever, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are presented most of interaction detection algorithms. <br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets that are too small or too noisy, which often occurs in practical settings. <br />
<br />
=Conclusion=<br />
Here we presented the method of detecting interactions using MLP. Compared to other state-of-the-art methods like Additive Groves (AG), the performances are competitive yet computational powers required is far less. Therefore, it is safe to claim that the method will be extremly useful for practitioners with (comparably) less computational powers. Moreover, the NIP algorithm successfully reduced the computation sizes. After all, the most important aspect of this algorithm is that now users of nueral networks can impose interpretability in the model usage, which will change the level of usability to another level for most of practitioners outside of those working in machine learning and deep learning areas.<br />
<br />
=Critique=<br />
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
=Reference=<br />
<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006<br />
<br />
[8] Sebastian Bach, Alexander Binder, Gre ́goire Montavon, Frederick Klauschen, Klaus-Robert Mu ̈ller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intel- ligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.<br />
<br />
[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.<br />
<br />
[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254– 1259, 1998.<br />
<br />
[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.<br />
<br />
[13]Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Vi- sualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.<br />
<br />
[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical sys- tems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.<br />
<br />
[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=41853DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-11-29T16:51:07Z<p>X832zhan: /* Related Work */</p>
<hr />
<div>=Introduction=<br />
<br />
It has been commonly believed that one major advantage of neural networks is their capability of modeling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.<br />
<br />
With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. Within several areas, like eg: computation social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.<br />
<br />
Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].<br />
<br />
A major advantage of neural networks is their ability to model complex statistical interactions between features for automatic feature learning. In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix<br />
<br />
Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
* Conduct individual tests for all features' combination such as ANOVA and Additive Groves.<br />
* Define all interaction forms of interest, then later finds the important ones.<br />
- The paper's goal is to detect interactions without compromising the functional forms. Our method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.<br />
<br />
2. Interpretability: A lot of work has also been done in this particular area and it can be divided it the following broad categories:<br />
* Feature Importance through Decomposition: Methods like Input Gradient(Sundararajan et al., 2017) learns the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al(2017), Murdoch(2017) and Murdoch(2018) study interpretability of LSTMs by looking at phrase and word level importance scores. Bach et al. 2015 and Shrikumar et al. 2016 (DeepLift) study pixel importance in CNNs.<br />
* Studying Visualizations in Models - Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. (Yosinski et al., 2015 studies feature map visualizations. <br />
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. Looking at the results of these type of model an indirect sense of interpretability can be gauged.<br />
<br />
The approach in this paper is to extract non-additive interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.<br />
<br />
1. Vector: Vectors are defined with bold-lowercases, '''v, w'''<br />
<br />
2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''<br />
<br />
3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}<br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.<br />
<br />
Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.<br />
<br />
One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.<br />
<br />
The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.<br />
}<br />
Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space.<br />
<br />
[[File:network1.PNG|500px|center]]<br />
<br />
==Measuring influence in hidden layers==<br />
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.<br />
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.<br />
<br />
==Quantifying influence==<br />
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.<br />
<br />
Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
=Cut off Model=<br />
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut off model which is a generalized additive model (GAM) as below,<br />
<br />
[[File:gam1.PNG|900px|center]]<br />
<br />
From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.<br />
<br />
=Experiment=<br />
For the experiment, we are going to compare three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
For the experiment, we are going to test on 10 synthetic functions.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
And the author also reported the results of comparisons between the models. As you can see, neural network based models are performing better in average. Compare to the traditional methods liek ANOVA, MLP and MLP-M method shows 20% increases in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly catch the most influential pair-wise interactions.<br />
<br />
=Limitations=<br />
Even though for the above synthetic experiment MLP methods showed superior performances, the method still have some limitations. For example, fir the function like, <math>x_1x_2 + x_2x_3 + x_1x_3</math>, neural network fails to distinguish between interlinked interactions to single higher order interaction. Moreoever, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are presented most of interaction detection algorithms. <br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets that are too small or too noisy, which often occurs in practical settings. <br />
<br />
=Conclusion=<br />
Here we presented the method of detecting interactions using MLP. Compared to other state-of-the-art methods like Additive Groves (AG), the performances are competitive yet computational powers required is far less. Therefore, it is safe to claim that the method will be extremly useful for practitioners with (comparably) less computational powers. Moreover, the NIP algorithm successfully reduced the computation sizes. After all, the most important aspect of this algorithm is that now users of nueral networks can impose interpretability in the model usage, which will change the level of usability to another level for most of practitioners outside of those working in machine learning and deep learning areas.<br />
<br />
=Critique=<br />
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
=Reference=<br />
<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006<br />
<br />
[8] Sebastian Bach, Alexander Binder, Gre ́goire Montavon, Frederick Klauschen, Klaus-Robert Mu ̈ller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intel- ligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.<br />
<br />
[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.<br />
<br />
[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254– 1259, 1998.<br />
<br />
[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.<br />
<br />
[13]Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Vi- sualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.<br />
<br />
[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical sys- tems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.<br />
<br />
[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=41852DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-11-29T16:50:24Z<p>X832zhan: /* Related Work */</p>
<hr />
<div>=Introduction=<br />
<br />
It has been commonly believed that one major advantage of neural networks is their capability of modeling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.<br />
<br />
With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. Within several areas, like eg: computation social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.<br />
<br />
Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].<br />
<br />
A major advantage of neural networks is their ability to model complex statistical interactions between features for automatic feature learning. In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix<br />
<br />
Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
* Conduct individual tests for all features' combination such as ANOVA and Additive Groves.<br />
* Define all interaction forms of interest, then later finds the important ones.<br />
- The paper's goal is to detect interactions without compromising the functional forms. Our method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.<br />
<br />
2. Interpretability: A lot of work has also been done in this particular area and it can be divided it the following broad categories:<br />
* Feature Importance through Decomposition: Methods like Input Gradient(Sundararajan et al., 2017) learns the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al(2017), Murdoch(2017) and Murdoch(2018) study interpretability of LSTMs by looking at phrase and word level importance scores. Bach et al. 2015 and Shrikumar et al. 2016 (DeepLift) study pixel importance in CNNs.<br />
* Studying Visualizations in Models - Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. (Yosinski et al., 2015 studies feature map visualizations. <br />
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. Looking at the results of these type of model an indirect sense of interpretability can be gauged.<br />
<br />
The approach in this paper is to extract interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.<br />
<br />
1. Vector: Vectors are defined with bold-lowercases, '''v, w'''<br />
<br />
2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''<br />
<br />
3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}<br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.<br />
<br />
Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.<br />
<br />
One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.<br />
<br />
The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.<br />
}<br />
Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space.<br />
<br />
[[File:network1.PNG|500px|center]]<br />
<br />
==Measuring influence in hidden layers==<br />
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.<br />
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.<br />
<br />
==Quantifying influence==<br />
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.<br />
<br />
Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
=Cut off Model=<br />
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut off model which is a generalized additive model (GAM) as below,<br />
<br />
[[File:gam1.PNG|900px|center]]<br />
<br />
From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.<br />
<br />
=Experiment=<br />
For the experiment, we are going to compare three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
For the experiment, we are going to test on 10 synthetic functions.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
And the author also reported the results of comparisons between the models. As you can see, neural network based models are performing better in average. Compare to the traditional methods liek ANOVA, MLP and MLP-M method shows 20% increases in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly catch the most influential pair-wise interactions.<br />
<br />
=Limitations=<br />
Even though for the above synthetic experiment MLP methods showed superior performances, the method still have some limitations. For example, fir the function like, <math>x_1x_2 + x_2x_3 + x_1x_3</math>, neural network fails to distinguish between interlinked interactions to single higher order interaction. Moreoever, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are presented most of interaction detection algorithms. <br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets that are too small or too noisy, which often occurs in practical settings. <br />
<br />
=Conclusion=<br />
Here we presented the method of detecting interactions using MLP. Compared to other state-of-the-art methods like Additive Groves (AG), the performances are competitive yet computational powers required is far less. Therefore, it is safe to claim that the method will be extremly useful for practitioners with (comparably) less computational powers. Moreover, the NIP algorithm successfully reduced the computation sizes. After all, the most important aspect of this algorithm is that now users of nueral networks can impose interpretability in the model usage, which will change the level of usability to another level for most of practitioners outside of those working in machine learning and deep learning areas.<br />
<br />
=Critique=<br />
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
=Reference=<br />
<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006<br />
<br />
[8] Sebastian Bach, Alexander Binder, Gre ́goire Montavon, Frederick Klauschen, Klaus-Robert Mu ̈ller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intel- ligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.<br />
<br />
[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.<br />
<br />
[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254– 1259, 1998.<br />
<br />
[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.<br />
<br />
[13]Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Vi- sualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.<br />
<br />
[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical sys- tems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.<br />
<br />
[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DETECTING_STATISTICAL_INTERACTIONS_FROM_NEURAL_NETWORK_WEIGHTS&diff=41847DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS2018-11-29T16:45:22Z<p>X832zhan: /* Introduction */</p>
<hr />
<div>=Introduction=<br />
<br />
It has been commonly believed that one major advantage of neural networks is their capability of modeling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.<br />
<br />
With the growth in the computational power available Neural Networks have been able to solve many of the complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. Within several areas, like eg: computation social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc. which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.<br />
<br />
Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].<br />
<br />
A major advantage of neural networks is their ability to model complex statistical interactions between features for automatic feature learning. In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix<br />
<br />
Note that in this paper, we only consider one specific types of neural network, Feed-Forward Neural Network. Based on the methodology discussed here, the authors suggest that we can build an interpretation methodology for other types of networks also.<br />
<br />
=Related Work=<br />
<br />
1. Interaction Detection approaches: <br />
* Conduct individual tests for all features' combination such as ANOVA and Additive Groves.<br />
* Define all interaction forms of interest, then later finds the important ones.<br />
- The paper's goal is to detect interactions without compromising the functional forms.<br />
<br />
2. Interpretability: A lot of work has also been done in this particular area and it can be divided it the following broad categories:<br />
* Feature Importance through Decomposition: Methods like Input Gradient(Sundararajan et al., 2017) learns the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al(2017), Murdoch(2017) and Murdoch(2018) study interpretability of LSTMs by looking at phrase and word level importance scores. Bach et al. 2015 and Shrikumar et al. 2016 (DeepLift) study pixel importance in CNNs.<br />
* Studying Visualizations in Models - Karpathy et al. (2015) worked with character generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. (Yosinski et al., 2015 studies feature map visualizations. <br />
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules(different architectures) to help focus the neural network to decide the parts of the input that it should look more closely or give more importance to. Looking at the results of these type of model an indirect sense of interpretability can be gauged.<br />
<br />
The approach in this paper is to extract interactions between variables from the neural network weights.<br />
<br />
=Notations=<br />
Before we dive in to methodology, we are going to define a few notations here. Most of them will be trivial.<br />
<br />
1. Vector: Vectors are defined with bold-lowercases, '''v, w'''<br />
<br />
2. Matrix: Matrice are defined with blod-uppercases, '''V, W'''<br />
<br />
3. Interger Set: For some interger p <math>\in</math> Z, we define [p] := {1,2,3,...,p}<br />
<br />
=Interaction=<br />
First of all, in order to explain the model, we need to be able to explain the interactions and their effects to output. Therefore, we define 'interacion' between variables as below. <br />
<br />
[[File:def_interaction.PNG|900px|center]]<br />
<br />
From the definition above, for a function like, <math>x_1x_2 + sin(x_3 + x_4 + x_5)</math>, we have <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math> interactions. And we say that the latter interaction to be 3-way interaction.<br />
<br />
Note that from the definition above, we can naturally deduce that d-way interaction can exist if and only if all of its (d-1) interactions exist. For example, 3-way interaction above shows that we have 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.<br />
<br />
One thing that we need to keep in mind is that for models like neural network, most of interactions are happening within hidden layers. This means that we needa proper way of measuring interaction strength.<br />
<br />
The key observation is that for any kinds of interaction, at a some hidden unit of some hidden layer, two interacting features the ancestors. In graph-theoretical language, interaction map can be viewed as an associated directed graph and for any interaction <math>\Gamma \in [p]</math>, there exists at least one vertix that has all of features of <math>\Gamma</math> as ancestors. The statement can be rigorized as the following:<br />
<br />
<br />
[[File:prop2.PNG|900px|center]]<br />
<br />
Now, the above mathematical statement gurantees us to measure interaction strengths at ANY hidden layers. For example, if we want to study about interactions at some specific hidden layer, now we now that there exists corresponding vertices between the hidden layer and output layer. Therefore all we need to do is now to find approprite measure which can summarize the information between those two layers.<br />
}<br />
Before doing so, let's think about a single-layered neural network. For any one hidden unit, we can have possibly, <math>2^{||W_i,:||}</math>, number of interactions. This means that our search space might be too huge for multi-layered networks. Therefore, we need a some descent way of approximate out search space.<br />
<br />
[[File:network1.PNG|500px|center]]<br />
<br />
==Measuring influence in hidden layers==<br />
As we discussed above, in order to consider interaction between units in any layers, we need to think about their out-going paths. However, we soon encountered the fact that for some fully-connected multi-layer neural network, the search space might be too huge to compare. Therefore, we use information about out-going paths gredient upper bond. To represent the influence of out-going paths at <math>l</math>-hidden layer, we define cumulative impact of weights between output layer and <math>l+1</math>. We define aggregated weights as, <br />
<br />
[[File:def3.PNG|900px|center]]<br />
<br />
<br />
Note that <math>z^{(l)} \in R^{(p_l)}</math> where <math>p_l</math> is the number of hidden units in <math>l</math>-layer.<br />
Moreover, this is the lipschitz constant of gredients. Gredient has been an import variable of measuring influence of features, especially when we consider that input layer's derivative computes the direction normal to decision boundaries.<br />
<br />
==Quantifying influence==<br />
For some <math>i</math> hidden unit at the first hidden layer, which is the closet layer to the input layer, we define the influence strength of some interaction as, <br />
<br />
[[File:measure1.PNG|900px|center]]<br />
<br />
The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer. <br />
<br />
For the function, <math>\mu</math>, any positive-real valued functions such as max, min and average can be candidates. The effects of those candidates will be tested later.<br />
<br />
Now based on the specifications above, the author suggested the algorithm for searching influential interactions between input layer units as follows:<br />
<br />
[[File:algorithm1.PNG|850px|center]]<br />
<br />
=Cut off Model=<br />
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut off model which is a generalized additive model (GAM) as below,<br />
<br />
[[File:gam1.PNG|900px|center]]<br />
<br />
From the above model, each <math>g</math> and <math>g^*</math> are Feed-Forward neural network. We are keep adding interactions until the performance reaches plateaus.<br />
<br />
=Experiment=<br />
For the experiment, we are going to compare three neural network model with traditional statistical interaction detecting algorithms. For the nueral network models, first model will be MLP, second model will be MLP-M, which is MLP with additional univariate network at the output. The last one is the cut-off model defined above, which is denoted by MLP-cutoff. MLP-M model is graphically represented below.<br />
<br />
[[File:output11.PNG|300px|center]]<br />
<br />
For the experiment, we are going to test on 10 synthetic functions.<br />
<br />
[[File:synthetic.PNG|900px|center]]<br />
<br />
And the author also reported the results of comparisons between the models. As you can see, neural network based models are performing better in average. Compare to the traditional methods liek ANOVA, MLP and MLP-M method shows 20% increases in performance.<br />
<br />
[[File:performance_mlpm.PNG|900px|center]]<br />
<br />
<br />
[[File:performance2_mlpm.PNG|900px|center]]<br />
<br />
The above result shows that MLP-M almost perfectly catch the most influential pair-wise interactions.<br />
<br />
=Limitations=<br />
Even though for the above synthetic experiment MLP methods showed superior performances, the method still have some limitations. For example, fir the function like, <math>x_1x_2 + x_2x_3 + x_1x_3</math>, neural network fails to distinguish between interlinked interactions to single higher order interaction. Moreoever, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are presented most of interaction detection algorithms. <br />
<br />
Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets that are too small or too noisy, which often occurs in practical settings. <br />
<br />
=Conclusion=<br />
Here we presented the method of detecting interactions using MLP. Compared to other state-of-the-art methods like Additive Groves (AG), the performances are competitive yet computational powers required is far less. Therefore, it is safe to claim that the method will be extremly useful for practitioners with (comparably) less computational powers. Moreover, the NIP algorithm successfully reduced the computation sizes. After all, the most important aspect of this algorithm is that now users of nueral networks can impose interpretability in the model usage, which will change the level of usability to another level for most of practitioners outside of those working in machine learning and deep learning areas.<br />
<br />
=Critique=<br />
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.<br />
<br />
2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.<br />
<br />
=Reference=<br />
<br />
[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013. <br />
<br />
[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.<br />
<br />
[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.<br />
<br />
[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016. <br />
<br />
[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018. <br />
<br />
[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.<br />
<br />
[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006<br />
<br />
[8] Sebastian Bach, Alexander Binder, Gre ́goire Montavon, Frederick Klauschen, Klaus-Robert Mu ̈ller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.<br />
<br />
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intel- ligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.<br />
<br />
[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.<br />
<br />
[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254– 1259, 1998.<br />
<br />
[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.<br />
<br />
[13]Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Vi- sualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.<br />
<br />
[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical sys- tems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.<br />
<br />
[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn&diff=41804learn what not to learn2018-11-29T06:19:13Z<p>X832zhan: /* Conclusion */</p>
<hr />
<div>=Introduction=<br />
In reinforcement learning, it is often difficult for an agent to learn when the action space is large, especially the difficulties from function approximation and exploration. In some cases many actions are irrelevant and it is sometimes easier for the algorithm to learn which action not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces based on action elimination by restricting the available actions in each state to a subset of the most likely ones. There is a core assumption that this method can be easier to predict which actions in each state are invalid or inferior and use that information to control. More specifically, it proposes a system that learns the approximation of a Q-function and concurrently learns to eliminate actions. The method utilizes an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). Then a machine learning model can be trained to generalize to unseen states. <br />
<br />
The paper focuses on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a Deep Q-Network (DQN) and an Action Elimination Network (AEN), both using the Convolutional Neural Networks (CNN) for Natural Language Processing (NLP) tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''<br />
<br />
The text-based game called "Zork", which lets players to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AEN algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.<br />
<br />
Below shows an example for the Zork interface:<br />
<br />
[[File:lnottol_fig1.png|500px|center]]<br />
<br />
All states and actions are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.<br />
<br />
=Related Work=<br />
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordability extraction and common sense. It also often introduce stochastic dynamics to increase randomness.<br />
<br />
Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embedding with neural networks.<br />
<br />
DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.<br />
<br />
RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.<br />
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordability extraction via inner products of pre-trained word embedding.<br />
<br />
=Action Elimination=<br />
<br />
The approach in the paper builds on the standard Reinforcement Learning formulation. At each time step <math>t</math>, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and observes next state <math display="inline">s_{t+1} </math> according to a transition kernel <math>P(s_{t+1}|s_t,a_t)</math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted cumulative return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]</math>, where <math> 0< \gamma <1 </math>. The Q-function is <math display="inline">Q^\pi(s,a)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s,a_0=a]</math>, and it can be optimized by Q-learning algorithm.<br />
<br />
After executing an action, the agent observes a binary elimination signal <math>e(s, a)</math> to determine which actions not to take. It equals 1 if action <math>a</math> may be eliminated in state <math>s</math> (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following definitions:<br />
<br />
'''Definition 1:''' <br />
<br />
Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.<br />
<br />
'''Definition 2:'''<br />
<br />
Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.<br />
<br />
'''Definition 3:'''<br />
<br />
Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.<br />
<br />
==Advantages of Action Elimination==<br />
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity. <br />
<br />
Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation errors. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping (i.e., only the Q-values of the valid state-action pairs) leading to faster convergence and better solution.<br />
<br />
Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. Assume that there are <math>A'</math> actions that should be eliminated and are <math>\epsilon</math>-optimal, i.e. their value is at least <math>V^*(s)-\epsilon</math>. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.<br />
<br />
Because it is difficult to embedde the elimination signal into the MDP, the authors use contextual multi-armed bandits to decouple the elimination signal from the MDP, which can correctly eliminate actions when applying standard Q learning into learning process.<br />
<br />
==Action elimination with contextual bandits==<br />
<br />
Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.<br />
<br />
<br />
According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math><br />
<br />
<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math><br />
<br />
Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:<br />
<br />
<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math><br />
<br />
with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.<br />
<br />
==Concurrent Learning==<br />
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space. <br />
<br />
If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.<br />
<br />
'''Proposition 1:'''<br />
<br />
Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2<br />
+1</math> times.<br />
<br />
Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.<br />
<br />
=Method=<br />
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. So the paper proposes to use the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. So batch-updates framework(Levine et al., 2017;Riquelme, Tucker, and Snoek, 2018) is used, where a new contextual bandit model is learned for every few steps that uses the last layer activation of the AEN as features.<br />
<br />
==Architecture of action elimination framework==<br />
<br />
[[File:lnottol_fig1b.png|300px|center]]<br />
<br />
After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent use it to learn two function approximation deep neural networks: A DQN and an AEN. AEN provides an admissible actions set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN(100 convolutional filters for AEN and 500 for DQN, with three different 1D kernels of length (1,2,3)), based on(Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero padded to a length of 50 descriptor + 15 inventory words and each word is embedded into continuous vectors using word2vec in <math display="inline">R^{300}</math>. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]<br />
<br />
==Psuedocode of the Algorithm==<br />
<br />
[[File:lnottol_fig2.png|750px|center]]<br />
<br />
AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm creates a linear contextual bandit model from it every L iterations with procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model.AENUpdate() then solved this model and plugin it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Target() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible actions set. For exploitation, it selects the action with highest Q-value by taking an argmax on Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The targets() procedure is estimating the value function by taking max over Q-values only among admissible actions, hence, reducing function approximation errors.<br />
<br />
=Experiment=<br />
==Zork domain==<br />
The world of Zork presents a rich environment with a large state and action space. <br />
Zork players describe their actions using natural language instructions. For example, "open the mailbox". Then their actions were processed by a sophisticated natural language parser. Based on the results, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points that are generated from the game's scoring system are given to the agent as the reward. For example, the player gets the points when solving the puzzles. Placing all treasures in the trophy will get 350 points. The elimination signal is given in two forms, "wrong parse" flag, and text feedback "you cannot take that". These two signals are grouped together into a single binary signal which then provided to the algorithm. <br />
<br />
Experiments begin with the two subdomains of Zork domains: Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates is upon completing the quest or after T steps are taken. The discounted factor for training is <math display="inline">\gamma=0.8</math> and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments. <br />
<br />
===Egg Quest===<br />
The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. The agent will get 100 points upon completing this task. For action space, there are 9 fixed actions for navigation, and a second subset which consisting <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200 (set A_1), N_{Take}=300 (set A_2)</math> has been tested separately.<br />
AE-DQN (blue) and a vanilla DQN agent (green) has been tested in this quest.<br />
<br />
[[File:AEF_zork_comparison.png|1200px|thumb|center|Performance of agents in the egg quest.]]<br />
<br />
Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents has performed well on sets a and c. However the AE-DQN agent has learned much faster than the DQN on set b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation to note is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.<br />
<br />
===Troll Quest===<br />
The goal of this quest is to find the troll. To do it the agent need to find the way to the house, use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> is 200 take actions and 15 necessary actions, 215 in total.<br />
<br />
[[File:AEF_troll_comparison.png|400px|thumb|center|Results in the Troll Quest.]]<br />
<br />
The red line above is an "optimal elimination" baseline which consists of only 35 actions(15 essential, and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and its improvement over DQN is more significant in the Troll Quest than the Egg quest. Also, it achieves compatible performance to the "optimal elimination" baseline.<br />
<br />
===Open Zork===<br />
Lastly, the "Open Zork" domain has been tested which only the environment reward has been used. 1M steps has been trained. Each trajectory terminates after T=200 steps. Two action sets have been used:<math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) that is required to solve the game. <math display="inline">A_4</math>, the "Open Zork" action set (1227) which composed of {Verb, Object} tuples for all the verbs and objects in the game.<br />
<br />
[[]]<br />
<br />
[[File:AEF_open_zork_comparison.png|600px|thumb|center|Results in "Open Zork".]]<br />
<br />
<br />
The above Figure shows the learning curve for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperform the DQN (blue) in terms of speed and cumulative reward.<br />
<br />
=Conclusion=<br />
In this paper, the authors proposed a Deep Reinforcement Learning model for sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions, using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.<br />
<br />
For future work the authors aim to investigate more sophisticated architectures and tackle learning shared representations for elimination and control which may boost performance on both tasks.<br />
<br />
They also hope to to investigate other mechanisms for action elimination, such as eliminating actions that result from low Q-values as in Even-Dar, Mannor, and Mansour, 2003.<br />
<br />
The authors also hope to generate elimination signals in real-world domains and achieve the purpose of eliminating the signal implicitly.<br />
<br />
=Critique=<br />
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the most famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO. <br />
<br />
=Reference=</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn&diff=41803learn what not to learn2018-11-29T06:14:45Z<p>X832zhan: /* Action Elimination */</p>
<hr />
<div>=Introduction=<br />
In reinforcement learning, it is often difficult for an agent to learn when the action space is large, especially the difficulties from function approximation and exploration. In some cases many actions are irrelevant and it is sometimes easier for the algorithm to learn which action not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces based on action elimination by restricting the available actions in each state to a subset of the most likely ones. There is a core assumption that this method can be easier to predict which actions in each state are invalid or inferior and use that information to control. More specifically, it proposes a system that learns the approximation of a Q-function and concurrently learns to eliminate actions. The method utilizes an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). Then a machine learning model can be trained to generalize to unseen states. <br />
<br />
The paper focuses on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a Deep Q-Network (DQN) and an Action Elimination Network (AEN), both using the Convolutional Neural Networks (CNN) for Natural Language Processing (NLP) tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''<br />
<br />
The text-based game called "Zork", which lets players to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AEN algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.<br />
<br />
Below shows an example for the Zork interface:<br />
<br />
[[File:lnottol_fig1.png|500px|center]]<br />
<br />
All states and actions are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.<br />
<br />
=Related Work=<br />
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordability extraction and common sense. It also often introduce stochastic dynamics to increase randomness.<br />
<br />
Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embedding with neural networks.<br />
<br />
DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.<br />
<br />
RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.<br />
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordability extraction via inner products of pre-trained word embedding.<br />
<br />
=Action Elimination=<br />
<br />
The approach in the paper builds on the standard Reinforcement Learning formulation. At each time step <math>t</math>, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and observes next state <math display="inline">s_{t+1} </math> according to a transition kernel <math>P(s_{t+1}|s_t,a_t)</math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted cumulative return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]</math>, where <math> 0< \gamma <1 </math>. The Q-function is <math display="inline">Q^\pi(s,a)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s,a_0=a]</math>, and it can be optimized by Q-learning algorithm.<br />
<br />
After executing an action, the agent observes a binary elimination signal <math>e(s, a)</math> to determine which actions not to take. It equals 1 if action <math>a</math> may be eliminated in state <math>s</math> (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following definitions:<br />
<br />
'''Definition 1:''' <br />
<br />
Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.<br />
<br />
'''Definition 2:'''<br />
<br />
Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.<br />
<br />
'''Definition 3:'''<br />
<br />
Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.<br />
<br />
==Advantages of Action Elimination==<br />
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity. <br />
<br />
Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation errors. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping (i.e., only the Q-values of the valid state-action pairs) leading to faster convergence and better solution.<br />
<br />
Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. Assume that there are <math>A'</math> actions that should be eliminated and are <math>\epsilon</math>-optimal, i.e. their value is at least <math>V^*(s)-\epsilon</math>. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.<br />
<br />
Because it is difficult to embedde the elimination signal into the MDP, the authors use contextual multi-armed bandits to decouple the elimination signal from the MDP, which can correctly eliminate actions when applying standard Q learning into learning process.<br />
<br />
==Action elimination with contextual bandits==<br />
<br />
Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.<br />
<br />
<br />
According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math><br />
<br />
<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math><br />
<br />
Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:<br />
<br />
<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math><br />
<br />
with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.<br />
<br />
==Concurrent Learning==<br />
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space. <br />
<br />
If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.<br />
<br />
'''Proposition 1:'''<br />
<br />
Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2<br />
+1</math> times.<br />
<br />
Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.<br />
<br />
=Method=<br />
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. So the paper proposes to use the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. So batch-updates framework(Levine et al., 2017;Riquelme, Tucker, and Snoek, 2018) is used, where a new contextual bandit model is learned for every few steps that uses the last layer activation of the AEN as features.<br />
<br />
==Architecture of action elimination framework==<br />
<br />
[[File:lnottol_fig1b.png|300px|center]]<br />
<br />
After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent use it to learn two function approximation deep neural networks: A DQN and an AEN. AEN provides an admissible actions set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN(100 convolutional filters for AEN and 500 for DQN, with three different 1D kernels of length (1,2,3)), based on(Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero padded to a length of 50 descriptor + 15 inventory words and each word is embedded into continuous vectors using word2vec in <math display="inline">R^{300}</math>. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]<br />
<br />
==Psuedocode of the Algorithm==<br />
<br />
[[File:lnottol_fig2.png|750px|center]]<br />
<br />
AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm creates a linear contextual bandit model from it every L iterations with procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model.AENUpdate() then solved this model and plugin it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Target() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible actions set. For exploitation, it selects the action with highest Q-value by taking an argmax on Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The targets() procedure is estimating the value function by taking max over Q-values only among admissible actions, hence, reducing function approximation errors.<br />
<br />
=Experiment=<br />
==Zork domain==<br />
The world of Zork presents a rich environment with a large state and action space. <br />
Zork players describe their actions using natural language instructions. For example, "open the mailbox". Then their actions were processed by a sophisticated natural language parser. Based on the results, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points that are generated from the game's scoring system are given to the agent as the reward. For example, the player gets the points when solving the puzzles. Placing all treasures in the trophy will get 350 points. The elimination signal is given in two forms, "wrong parse" flag, and text feedback "you cannot take that". These two signals are grouped together into a single binary signal which then provided to the algorithm. <br />
<br />
Experiments begin with the two subdomains of Zork domains: Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates is upon completing the quest or after T steps are taken. The discounted factor for training is <math display="inline">\gamma=0.8</math> and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments. <br />
<br />
===Egg Quest===<br />
The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. The agent will get 100 points upon completing this task. For action space, there are 9 fixed actions for navigation, and a second subset which consisting <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200 (set A_1), N_{Take}=300 (set A_2)</math> has been tested separately.<br />
AE-DQN (blue) and a vanilla DQN agent (green) has been tested in this quest.<br />
<br />
[[File:AEF_zork_comparison.png|1200px|thumb|center|Performance of agents in the egg quest.]]<br />
<br />
Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents has performed well on sets a and c. However the AE-DQN agent has learned much faster than the DQN on set b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation to note is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.<br />
<br />
===Troll Quest===<br />
The goal of this quest is to find the troll. To do it the agent need to find the way to the house, use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> is 200 take actions and 15 necessary actions, 215 in total.<br />
<br />
[[File:AEF_troll_comparison.png|400px|thumb|center|Results in the Troll Quest.]]<br />
<br />
The red line above is an "optimal elimination" baseline which consists of only 35 actions(15 essential, and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and its improvement over DQN is more significant in the Troll Quest than the Egg quest. Also, it achieves compatible performance to the "optimal elimination" baseline.<br />
<br />
===Open Zork===<br />
Lastly, the "Open Zork" domain has been tested which only the environment reward has been used. 1M steps has been trained. Each trajectory terminates after T=200 steps. Two action sets have been used:<math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) that is required to solve the game. <math display="inline">A_4</math>, the "Open Zork" action set (1227) which composed of {Verb, Object} tuples for all the verbs and objects in the game.<br />
<br />
[[]]<br />
<br />
[[File:AEF_open_zork_comparison.png|600px|thumb|center|Results in "Open Zork".]]<br />
<br />
<br />
The above Figure shows the learning curve for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperform the DQN (blue) in terms of speed and cumulative reward.<br />
<br />
=Conclusion=<br />
In this paper, the authors proposed a Deep Reinforcement Learning model for sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions, using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.<br />
<br />
For future work the authors aim to investigate more sophisticated architectures and tackle learning shared representations for elimination and control which may boost performance on both tasks.<br />
<br />
They also hope to to investigate other mechanisms for action elimination, such as eliminating actions that result from low Q-values as in Even-Dar, Mannor, and Mansour, 2003.<br />
<br />
=Critique=<br />
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the most famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO. <br />
<br />
=Reference=</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn&diff=41802learn what not to learn2018-11-29T06:01:40Z<p>X832zhan: /* Introduction */</p>
<hr />
<div>=Introduction=<br />
In reinforcement learning, it is often difficult for an agent to learn when the action space is large, especially the difficulties from function approximation and exploration. In some cases many actions are irrelevant and it is sometimes easier for the algorithm to learn which action not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces based on action elimination by restricting the available actions in each state to a subset of the most likely ones. There is a core assumption that this method can be easier to predict which actions in each state are invalid or inferior and use that information to control. More specifically, it proposes a system that learns the approximation of a Q-function and concurrently learns to eliminate actions. The method utilizes an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). Then a machine learning model can be trained to generalize to unseen states. <br />
<br />
The paper focuses on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a Deep Q-Network (DQN) and an Action Elimination Network (AEN), both using the Convolutional Neural Networks (CNN) for Natural Language Processing (NLP) tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''<br />
<br />
The text-based game called "Zork", which lets players to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AEN algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.<br />
<br />
Below shows an example for the Zork interface:<br />
<br />
[[File:lnottol_fig1.png|500px|center]]<br />
<br />
All states and actions are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.<br />
<br />
=Related Work=<br />
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordability extraction and common sense. It also often introduce stochastic dynamics to increase randomness.<br />
<br />
Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embedding with neural networks.<br />
<br />
DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.<br />
<br />
RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.<br />
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordability extraction via inner products of pre-trained word embedding.<br />
<br />
=Action Elimination=<br />
<br />
The approach in the paper builds on the standard Reinforcement Learning formulation. At each time step <math>t</math>, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and observes next state <math display="inline">s_{t+1} </math> according to a transition kernel <math>P(s_{t+1}|s_t,a_t)</math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted cumulative return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]</math>. <br />
<br />
After executing an action, the agent observes a binary elimination signal <math>e(s, a)</math> to determine which actions not to take. It equals 1 if action <math>a</math> may be eliminated in state <math>s</math> (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following definitions:<br />
<br />
'''Definition 1:''' <br />
<br />
Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.<br />
<br />
'''Definition 2:'''<br />
<br />
Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.<br />
<br />
'''Definition 3:'''<br />
<br />
Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.<br />
<br />
==Advantages of Action Elimination==<br />
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity. <br />
<br />
Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation errors. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping (i.e., only the Q-values of the valid state-action pairs) leading to faster convergence and better solution.<br />
<br />
Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. Assume that there are <math>A'</math> actions that should be eliminated and are <math>\epsilon</math>-optimal, i.e. their value is at least <math>V^*(s)-\epsilon</math>. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.<br />
<br />
==Action elimination with contextual bandits==<br />
<br />
Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.<br />
<br />
<br />
According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math><br />
<br />
<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math><br />
<br />
Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:<br />
<br />
<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math><br />
<br />
with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.<br />
<br />
==Concurrent Learning==<br />
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space. <br />
<br />
If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.<br />
<br />
'''Proposition 1:'''<br />
<br />
Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2<br />
+1</math> times.<br />
<br />
Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.<br />
<br />
=Method=<br />
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. So the paper proposes to use the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. So batch-updates framework(Levine et al., 2017;Riquelme, Tucker, and Snoek, 2018) is used, where a new contextual bandit model is learned for every few steps that uses the last layer activation of the AEN as features.<br />
<br />
==Architecture of action elimination framework==<br />
<br />
[[File:lnottol_fig1b.png|300px|center]]<br />
<br />
After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent use it to learn two function approximation deep neural networks: A DQN and an AEN. AEN provides an admissible actions set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN(100 convolutional filters for AEN and 500 for DQN, with three different 1D kernels of length (1,2,3)), based on(Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero padded to a length of 50 descriptor + 15 inventory words and each word is embedded into continuous vectors using word2vec in <math display="inline">R^{300}</math>. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]<br />
<br />
==Psuedocode of the Algorithm==<br />
<br />
[[File:lnottol_fig2.png|750px|center]]<br />
<br />
AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm creates a linear contextual bandit model from it every L iterations with procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model.AENUpdate() then solved this model and plugin it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Target() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible actions set. For exploitation, it selects the action with highest Q-value by taking an argmax on Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The targets() procedure is estimating the value function by taking max over Q-values only among admissible actions, hence, reducing function approximation errors.<br />
<br />
=Experiment=<br />
==Zork domain==<br />
The world of Zork presents a rich environment with a large state and action space. <br />
Zork players describe their actions using natural language instructions. For example, "open the mailbox". Then their actions were processed by a sophisticated natural language parser. Based on the results, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points that are generated from the game's scoring system are given to the agent as the reward. For example, the player gets the points when solving the puzzles. Placing all treasures in the trophy will get 350 points. The elimination signal is given in two forms, "wrong parse" flag, and text feedback "you cannot take that". These two signals are grouped together into a single binary signal which then provided to the algorithm. <br />
<br />
Experiments begin with the two subdomains of Zork domains: Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates is upon completing the quest or after T steps are taken. The discounted factor for training is <math display="inline">\gamma=0.8</math> and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments. <br />
<br />
===Egg Quest===<br />
The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. The agent will get 100 points upon completing this task. For action space, there are 9 fixed actions for navigation, and a second subset which consisting <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200 (set A_1), N_{Take}=300 (set A_2)</math> has been tested separately.<br />
AE-DQN (blue) and a vanilla DQN agent (green) has been tested in this quest.<br />
<br />
[[File:AEF_zork_comparison.png|1200px|thumb|center|Performance of agents in the egg quest.]]<br />
<br />
Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents has performed well on sets a and c. However the AE-DQN agent has learned much faster than the DQN on set b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation to note is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.<br />
<br />
===Troll Quest===<br />
The goal of this quest is to find the troll. To do it the agent need to find the way to the house, use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> is 200 take actions and 15 necessary actions, 215 in total.<br />
<br />
[[File:AEF_troll_comparison.png|400px|thumb|center|Results in the Troll Quest.]]<br />
<br />
The red line above is an "optimal elimination" baseline which consists of only 35 actions(15 essential, and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and its improvement over DQN is more significant in the Troll Quest than the Egg quest. Also, it achieves compatible performance to the "optimal elimination" baseline.<br />
<br />
===Open Zork===<br />
Lastly, the "Open Zork" domain has been tested which only the environment reward has been used. 1M steps has been trained. Each trajectory terminates after T=200 steps. Two action sets have been used:<math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) that is required to solve the game. <math display="inline">A_4</math>, the "Open Zork" action set (1227) which composed of {Verb, Object} tuples for all the verbs and objects in the game.<br />
<br />
[[]]<br />
<br />
[[File:AEF_open_zork_comparison.png|600px|thumb|center|Results in "Open Zork".]]<br />
<br />
<br />
The above Figure shows the learning curve for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperform the DQN (blue) in terms of speed and cumulative reward.<br />
<br />
=Conclusion=<br />
In this paper, the authors proposed a Deep Reinforcement Learning model for sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions, using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.<br />
<br />
For future work the authors aim to investigate more sophisticated architectures and tackle learning shared representations for elimination and control which may boost performance on both tasks.<br />
<br />
They also hope to to investigate other mechanisms for action elimination, such as eliminating actions that result from low Q-values as in Even-Dar, Mannor, and Mansour, 2003.<br />
<br />
=Critique=<br />
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the most famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO. <br />
<br />
=Reference=</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41797DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-29T05:00:34Z<p>X832zhan: /* THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE */</p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size ''' <br />
<br />
Link: [[https://arxiv.org/pdf/1711.00489.pdf]]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
Nowadays, it is a common practice to not have a singular steady learning rate for the learning phase of the neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minima it is beneficial for us to take large steps towards it as it would require a lesser number of steps to reach but as we approach it our step size should decrease otherwise we may just keep oscillating around the minima. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the parameter updates required to reach the minima, thus leading to greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used in deep learning training process due to finding minima that generalizes well(Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow and takes lots of time. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to try to speed up this optimization process by taking bigger steps, and hence reduce the number of parameter updates in training a model by using large batch training, which can be divided across many machines. <br />
<br />
However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) believed that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N number of training samples, and B batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and optimum fluctuation scale g for a maximum test set accuracy.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate with the same number of training epochs in decreasing the scale of random fluctuations, but with remarkably less number of parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math> or even more reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math> although the later decreases the test accuracy. This has been demonstrated by several experiments on the ImageNet and CIFAR-10 datasets using ResNet-50 and Inception-ResNet-V2 architectures respectively.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum <br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.<br />
<br />
These equations indicate that the learning rate must decay during training, and second equation is only available when the batch size is constant. To change the batch size, Smith and Le (2017) proposed to interpret SGD as integrating this stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents cost function, w represents the parameters, and η represents the Gaussian random noise. Furthermore, they proved that noise scale g controls the magnitude of random fluctuations in the training dynamics by this formula: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and B is the batch size. As we usually have <math> B \ll N </math>, we can define <math> g \approx \epsilon \frac{N}{B} </math>. This explains why when the learning rate decreases, g decreases. In addition, increasing the batch size, has the same effect and makes g decays. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, then the conventional way of decaying the learning rate is followed.<br />
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.<br />
<br />
Smith and Le (2017) found that there is an optimal batch size which corresponds to optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> that corresponds to maximum test set accuracy. This means that gradient noise is helpful as it makes SGD escape sharp minima, which does not generalize well. <br />
<br />
Simulated Annealing is a famous technique in non-convex optimization. Starting with noise in the training process helps us to discover a wide range of parameters then once we are near the optimum value, noise is reduced to fine tune our final parameters. However, more and more researches like to use the sharper decay schedules like cosine decay or step-function drops. In physical sciences, slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps to converge to the global minimum, which is sharp. But decaying the temperature in discrete steps can make the system stuck in a local minimum, which lead to higher cost and lower curvature. The authors think that deep learning has the same intuition.<br />
.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' <math> \epsilon_eff = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) included momentum to the equation of the vanilla SGD noise scale that was defined above to be: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which is the same as the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze momentum update equations below:<br />
<br />
<math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
<br />
We can see that the Accumulation variable A, which is initially set to 0, then increases exponentially to reach its steady state value during <math> \frac{B}{N(1-m)} </math> training epochs while <math> \Delta w </math> is suppressed that can reduce the rate of convergence. Moreover, at high momentum, we have three challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients. <br />
<br />
3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
4- In the early stage, large batch size will lead to the instabilities.<br />
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Schedules used as in the below figure:''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the beginning, the learning rate is constant and batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule that will be used if there is a hardware limit affecting a maximum batch size limit.<br />
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the below figure: in the left figure (2a), we can observe that for the training set, the three learning curves are exactly the same while in figure 2b, increasing the batch size has a huge advantage of reducing the number of parameter updates.<br />
This concludes that noise scale is the one that needs to be decayed and not the learning rate itself<br />
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To make sure that these results are the same for the test set as well, in figure 3, we can see that the three learning curves are exactly the same for SGD with momentum, and Nesterov momentum<br />
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check for other optimizers as well. the below figure shows the same experiment as in figure 3, which is the three learning curves for test set, but for vanilla SGD and Adam, and showing <br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increase during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increase during training, and an increased momentum coefficient of 0.98.<br />
<br />
The results of all training schedules, which are presented in the below figure, are documented in the following table:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The below table shows the number of parameter updates and accuracy for different set of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while modifying the number of TPU devices, batch size, learning rate, and then calculating the time to complete 90 epochs, and measuring the accuracy, and performed the following experiments below:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted Stochastic gradient descent as stochastic differential equation, which the paper built on this idea to include decaying learning rate.<br />
<br />
- Mandt et al. (2017) analyzed how SGD perform in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017) and successfully trained ResNet-50 on ImageNet in one hour after discovering the proportionality relationship between batch size and learning rate.<br />
<br />
- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is appling different learning rates to train ImageNet in 14 minutes and 74.9% accuracy. <br />
<br />
- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing batch size during training has the same benefits of decaying the learning rate in addition to reducing the number of parameter updates, which corresponds to faster training time. Experiments were performed on different image datasets and various optimizers with different training schedules to prove this result. The paper proposed to increase increase the learning rate and momentum parameter m, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates, but slightly less test set accuracy as mentioned in details in the experiments’ section. In summary, on ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the great findings of this paper is that literature parameters were used, and no hyper parameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- Had several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results does not seem to be too surprising in comparison with other recent papers in the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas. However, it lacks of mathematical and theoretical analysis beyond the idea. Since the experiment is primary on image dataset and it does not provide sufficient theories, the paper itself presents limited applicability to other types. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates.arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and J¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and DavidMBlei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41796DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-29T04:54:06Z<p>X832zhan: /* SIMULATED ANNEALING AND THE GENERALIZATION GAP */</p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size ''' <br />
<br />
Link: [[https://arxiv.org/pdf/1711.00489.pdf]]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
Nowadays, it is a common practice to not have a singular steady learning rate for the learning phase of the neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minima it is beneficial for us to take large steps towards it as it would require a lesser number of steps to reach but as we approach it our step size should decrease otherwise we may just keep oscillating around the minima. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the parameter updates required to reach the minima, thus leading to greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used in deep learning training process due to finding minima that generalizes well(Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow and takes lots of time. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to try to speed up this optimization process by taking bigger steps, and hence reduce the number of parameter updates in training a model by using large batch training, which can be divided across many machines. <br />
<br />
However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) believed that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N number of training samples, and B batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and optimum fluctuation scale g for a maximum test set accuracy.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate with the same number of training epochs in decreasing the scale of random fluctuations, but with remarkably less number of parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math> or even more reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math> although the later decreases the test accuracy. This has been demonstrated by several experiments on the ImageNet and CIFAR-10 datasets using ResNet-50 and Inception-ResNet-V2 architectures respectively.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum <br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.<br />
<br />
These equations indicate that the learning rate must decay during training, and second equation is only available when the batch size is constant. To change the batch size, Smith and Le (2017) proposed to interpret SGD as integrating this stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents cost function, w represents the parameters, and η represents the Gaussian random noise. Furthermore, they proved that noise scale g controls the magnitude of random fluctuations in the training dynamics by this formula: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and B is the batch size. As we usually have <math> B \ll N </math>, we can define <math> g \approx \epsilon \frac{N}{B} </math>. This explains why when the learning rate decreases, g decreases. In addition, increasing the batch size, has the same effect and makes g decays. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, then the conventional way of decaying the learning rate is followed.<br />
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.<br />
<br />
Smith and Le (2017) found that there is an optimal batch size which corresponds to optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> that corresponds to maximum test set accuracy. This means that gradient noise is helpful as it makes SGD escape sharp minima, which does not generalize well. <br />
<br />
Simulated Annealing is a famous technique in non-convex optimization. Starting with noise in the training process helps us to discover a wide range of parameters then once we are near the optimum value, noise is reduced to fine tune our final parameters. However, more and more researches like to use the sharper decay schedules like cosine decay or step-function drops. In physical sciences, slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps to converge to the global minimum, which is sharp. But decaying the temperature in discrete steps can make the system stuck in a local minimum, which lead to higher cost and lower curvature. The authors think that deep learning has the same intuition.<br />
.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' <math> \epsilon_eff = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) included momentum to the equation of the vanilla SGD noise scale that was defined above to be: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which is the same as the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze momentum update equations below:<br />
<br />
<math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
<br />
We can see that the Accumulation variable A, which is initially set to 0, then increases exponentially to reach its steady state value during <math> \frac{B}{N(1-m)} </math> training epochs while <math> \Delta w </math> is suppressed. Moreover, at high momentum, we have three challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients. <br />
<br />
3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Schedules used as in the below figure:''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the beginning, the learning rate is constant and batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule that will be used if there is a hardware limit affecting a maximum batch size limit.<br />
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the below figure: in the left figure (2a), we can observe that for the training set, the three learning curves are exactly the same while in figure 2b, increasing the batch size has a huge advantage of reducing the number of parameter updates.<br />
This concludes that noise scale is the one that needs to be decayed and not the learning rate itself<br />
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To make sure that these results are the same for the test set as well, in figure 3, we can see that the three learning curves are exactly the same for SGD with momentum, and Nesterov momentum<br />
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check for other optimizers as well. the below figure shows the same experiment as in figure 3, which is the three learning curves for test set, but for vanilla SGD and Adam, and showing <br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increase during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increase during training, and an increased momentum coefficient of 0.98.<br />
<br />
The results of all training schedules, which are presented in the below figure, are documented in the following table:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The below table shows the number of parameter updates and accuracy for different set of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while modifying the number of TPU devices, batch size, learning rate, and then calculating the time to complete 90 epochs, and measuring the accuracy, and performed the following experiments below:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted Stochastic gradient descent as stochastic differential equation, which the paper built on this idea to include decaying learning rate.<br />
<br />
- Mandt et al. (2017) analyzed how SGD perform in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017) and successfully trained ResNet-50 on ImageNet in one hour after discovering the proportionality relationship between batch size and learning rate.<br />
<br />
- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is appling different learning rates to train ImageNet in 14 minutes and 74.9% accuracy. <br />
<br />
- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing batch size during training has the same benefits of decaying the learning rate in addition to reducing the number of parameter updates, which corresponds to faster training time. Experiments were performed on different image datasets and various optimizers with different training schedules to prove this result. The paper proposed to increase increase the learning rate and momentum parameter m, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates, but slightly less test set accuracy as mentioned in details in the experiments’ section. In summary, on ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the great findings of this paper is that literature parameters were used, and no hyper parameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- Had several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results does not seem to be too surprising in comparison with other recent papers in the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas. However, it lacks of mathematical and theoretical analysis beyond the idea. Since the experiment is primary on image dataset and it does not provide sufficient theories, the paper itself presents limited applicability to other types. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates.arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and J¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and DavidMBlei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON%27T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE&diff=41794DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE2018-11-29T04:38:30Z<p>X832zhan: /* STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION */</p>
<hr />
<div>Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size ''' <br />
<br />
Link: [[https://arxiv.org/pdf/1711.00489.pdf]]<br />
<br />
Summarized by: Afify, Ahmed [ID: 20700841]<br />
<br />
==INTUITION==<br />
Nowadays, it is a common practice to not have a singular steady learning rate for the learning phase of the neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minima it is beneficial for us to take large steps towards it as it would require a lesser number of steps to reach but as we approach it our step size should decrease otherwise we may just keep oscillating around the minima. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the parameter updates required to reach the minima, thus leading to greater parallelism and shorter training times.<br />
<br />
== INTRODUCTION ==<br />
Although stochastic gradient descent (SGD) is widely used in deep learning training process due to finding minima that generalizes well(Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow and takes lots of time. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to try to speed up this optimization process by taking bigger steps, and hence reduce the number of parameter updates in training a model by using large batch training, which can be divided across many machines. <br />
<br />
However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) believed that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N number of training samples, and B batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and optimum fluctuation scale g for a maximum test set accuracy.<br />
<br />
In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate with the same number of training epochs in decreasing the scale of random fluctuations, but with remarkably less number of parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math> or even more reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math> although the later decreases the test accuracy. This has been demonstrated by several experiments on the ImageNet and CIFAR-10 datasets using ResNet-50 and Inception-ResNet-V2 architectures respectively.<br />
<br />
== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==<br />
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)<br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum <br />
<br />
<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.<br />
<br />
These equations indicate that the learning rate must decay during training, and second equation is only available when the batch size is constant. To change the batch size, Smith and Le (2017) proposed to interpret SGD as integrating this stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents cost function, w represents the parameters, and η represents the Gaussian random noise. Furthermore, they proved that noise scale g controls the magnitude of random fluctuations in the training dynamics by this formula: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and B is the batch size. As we usually have <math> B \ll N </math>, we can define <math> g \approx \epsilon \frac{N}{B} </math>. This explains why when the learning rate decreases, g decreases. In addition, increasing the batch size, has the same effect and makes g decays. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, then the conventional way of decaying the learning rate is followed.<br />
<br />
== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==<br />
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.<br />
<br />
'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.<br />
<br />
Smith and Le (2017) found that there is an optimal batch size which corresponds to optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> that corresponds to maximum test set accuracy. This means that gradient noise is helpful as it makes SGD escape sharp minima, which does not generalize well. <br />
<br />
Simulated Annealing is a famous technique in non-convex optimization. Starting with noise in the training process helps us to discover a wide range of parameters then once we are near the optimum value, noise is reduced to fine tune our final parameters. For instance, in physical sciences, decaying the temperature in discrete steps can make the system stuck in a local minimum while slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps to converge to the global minimum.<br />
<br />
== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==<br />
'''The Effective Learning Rate''' <math> \epsilon_eff = \frac{\epsilon}{1-m} </math><br />
<br />
Smith and Le (2017) included momentum to the equation of the vanilla SGD noise scale that was defined above to be: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which is the same as the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased. <br />
<br />
To understand the reasons behind this, we need to analyze momentum update equations below:<br />
<br />
<math><br />
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw} <br />
</math><br />
<br />
<math><br />
\Delta w = -A\epsilon<br />
</math><br />
<br />
We can see that the Accumulation variable A, which is initially set to 0, then increases exponentially to reach its steady state value during <math> \frac{B}{N(1-m)} </math> training epochs while <math> \Delta w </math> is suppressed. Moreover, at high momentum, we have three challenges:<br />
<br />
1- Additional epochs are needed to catch up with the accumulation.<br />
<br />
2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients. <br />
<br />
3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.<br />
<br />
== EXPERIMENTS ==<br />
=== SIMULATED ANNEALING IN A WIDE RESNET ===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Schedules used as in the below figure:''' <br />
<br />
- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant<br />
<br />
- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.<br />
<br />
- Hybrid: At the beginning, the learning rate is constant and batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule that will be used if there is a hardware limit affecting a maximum batch size limit.<br />
<br />
[[File:Paper_40_Fig_1.png | 800px|center]]<br />
<br />
As shown in the below figure: in the left figure (2a), we can observe that for the training set, the three learning curves are exactly the same while in figure 2b, increasing the batch size has a huge advantage of reducing the number of parameter updates.<br />
This concludes that noise scale is the one that needs to be decayed and not the learning rate itself<br />
[[File:Paper_40_Fig_2.png | 800px|center]] <br />
<br />
To make sure that these results are the same for the test set as well, in figure 3, we can see that the three learning curves are exactly the same for SGD with momentum, and Nesterov momentum<br />
[[File:Paper_40_Fig_3.png | 800px|center]]<br />
<br />
To check for other optimizers as well. the below figure shows the same experiment as in figure 3, which is the three learning curves for test set, but for vanilla SGD and Adam, and showing <br />
[[File:Paper_40_Fig_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent<br />
<br />
=== INCREASING THE EFFECTIVE LEARNING RATE===<br />
<br />
'''Dataset:''' CIFAR-10 (50,000 training images)<br />
<br />
'''Network Architecture:''' “16-4” wide ResNet<br />
<br />
'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120<br />
<br />
'''Training Schedules:''' <br />
<br />
Four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.<br />
<br />
Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. <br />
<br />
Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step. <br />
<br />
Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increase during training.<br />
<br />
Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increase during training, and an increased momentum coefficient of 0.98.<br />
<br />
The results of all training schedules, which are presented in the below figure, are documented in the following table:<br />
<br />
[[File:Paper_40_Table_1.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_5.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates<br />
<br />
=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===<br />
<br />
'''A) Experiment Goal:''' Control Batch Size<br />
<br />
'''Dataset:''' ImageNet (1.28 million training images)<br />
<br />
The paper modified the setup of Goyal et al. (2017), and used the following configuration:<br />
<br />
'''Network Architecture:''' Inception-ResNet-V2 <br />
<br />
'''Training Parameters:''' <br />
<br />
90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192<br />
<br />
Two training schedules were used:<br />
<br />
“Decaying learning rate”, where batch size is fixed and the learning rate is decayed<br />
<br />
“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.<br />
<br />
[[File:Paper_40_Table_2.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_6.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.<br />
<br />
'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient<br />
<br />
'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10. <br />
<br />
The below table shows the number of parameter updates and accuracy for different set of training parameters:<br />
<br />
[[File:Paper_40_Table_3.png | 800px|center]]<br />
<br />
[[File:Paper_40_Fig_7.png | 800px|center]]<br />
<br />
'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.<br />
<br />
=== TRAINING IMAGENET IN 30 MINUTES===<br />
<br />
'''Dataset:''' ImageNet (Already introduced in the previous section)<br />
<br />
'''Network Architecture:''' ResNet-50<br />
<br />
The paper replicated the setup of Goyal et al. (2017) while modifying the number of TPU devices, batch size, learning rate, and then calculating the time to complete 90 epochs, and measuring the accuracy, and performed the following experiments below:<br />
<br />
[[File:Paper_40_Table_4.png | 800px|center]]<br />
<br />
'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.<br />
<br />
== RELATED WORK ==<br />
Main related work mentioned in the paper is as follows:<br />
<br />
- Smith & Le (2017) interpreted Stochastic gradient descent as stochastic differential equation, which the paper built on this idea to include decaying learning rate.<br />
<br />
- Mandt et al. (2017) analyzed how SGD perform in Bayesian posterior sampling.<br />
<br />
- Keskar et al. (2016) focused on the analysis of noise once the training is started.<br />
<br />
- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017) and successfully trained ResNet-50 on ImageNet in one hour after discovering the proportionality relationship between batch size and learning rate.<br />
<br />
- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is appling different learning rates to train ImageNet in 14 minutes and 74.9% accuracy. <br />
<br />
- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.<br />
<br />
== CONCLUSIONS ==<br />
Increasing batch size during training has the same benefits of decaying the learning rate in addition to reducing the number of parameter updates, which corresponds to faster training time. Experiments were performed on different image datasets and various optimizers with different training schedules to prove this result. The paper proposed to increase increase the learning rate and momentum parameter m, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates, but slightly less test set accuracy as mentioned in details in the experiments’ section. In summary, on ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the great findings of this paper is that literature parameters were used, and no hyper parameter tuning was needed.<br />
<br />
== CRITIQUE ==<br />
'''Pros:'''<br />
<br />
- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.<br />
<br />
- Several experiments were performed on different optimizers such as SGD and Adam.<br />
<br />
- Had several comparisons with previous experimental setups.<br />
<br />
'''Cons:'''<br />
<br />
- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization. <br />
<br />
- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.<br />
<br />
- Special hardware is needed for large batch training, which is not always feasible.<br />
<br />
- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.<br />
<br />
- Although the main idea of the paper is interesting, its results does not seem to be too surprising in comparison with other recent papers in the subject.<br />
<br />
- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.<br />
<br />
- The paper presents interesting ideas. However, it lacks of mathematical and theoretical analysis beyond the idea. Since the experiment is primary on image dataset and it does not provide sufficient theories, the paper itself presents limited applicability to other types. <br />
<br />
== REFERENCES ==<br />
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.<br />
<br />
- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates.arXiv preprint arXiv:1612.05086, 2016.<br />
<br />
- L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.arXiv preprint arXiv:1606.04838, 2016.<br />
<br />
- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.<br />
<br />
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.<br />
<br />
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.<br />
<br />
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.<br />
<br />
- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.<br />
<br />
- Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.<br />
<br />
- Sepp Hochreiter and J¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.<br />
<br />
- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.<br />
<br />
- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.<br />
<br />
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.<br />
<br />
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.<br />
<br />
- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.<br />
<br />
- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
- Stephan Mandt, Matthew D Hoffman, and DavidMBlei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.<br />
<br />
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.<br />
<br />
- Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.<br />
<br />
- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.<br />
<br />
- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.<br />
<br />
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.<br />
<br />
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.<br />
<br />
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.<br />
<br />
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.<br />
<br />
- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.<br />
<br />
- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.<br />
<br />
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.<br />
<br />
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fix_your_classifier:_the_marginal_value_of_training_the_last_weight_layer&diff=41793Fix your classifier: the marginal value of training the last weight layer2018-11-29T04:20:58Z<p>X832zhan: /* Possible Caveats */</p>
<hr />
<div>=Introduction=<br />
<br />
Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have<br />
a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.<br />
<br />
=Brief Overview=<br />
<br />
In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (upto a global scale constant). They argue that with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix the inference could be made faster as well.<br />
<br />
=Previous Work=<br />
<br />
Training NN models and using them for inference requires large amounts of memory and computational resources; thus, extensive amount of research has been done lately to reduce the size of networks which are as follows:<br />
<br />
* Weight sharing and specification (Han et al., 2015)<br />
<br />
* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)<br />
<br />
* Low-rank approximations to speed up CNN (Tai et al., 2015)<br />
<br />
* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)<br />
<br />
Some of the past works have also put forward the fact that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many of the classification tasks. However, the authors' proposal in the current paper is quite reversed.<br />
<br />
=Background=<br />
<br />
Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at later stage of the network, presumably to allow classification based on global features of an image.<br />
<br />
== Shortcomings of the final classification layer and its solution ==<br />
<br />
Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).<br />
<br />
It has been shown previously that these layers could be easily compressed and reduced after a model was trained by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized with the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), that lead to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works showed that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer was introduced and the objective regarded only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, where they kept the original initialization of the classification layer fixed with no negative impact on performance on the CIFAR-10 dataset.<br />
<br />
=Proposed Method=<br />
<br />
The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors have suggested that parameters used for the final classification transform are completely redundant, and can be replaced with a predetermined linear transform. This holds for even in large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).<br />
<br />
==Using a fixed classifier==<br />
<br />
Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math> where <math>F</math> is assumed to be a deep neural network with input z and parameters θ, e.g., a convolutional network, trained by backpropagation.<br />
<br />
In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math> ,where <math>W</math> and <math>b</math> are also trained by back-propagation.<br />
<br />
For input <math>x</math> of <math>N</math> length, and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of <math>N ×<br />
C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation<br />
<br />
<math><br />
v_i = \frac{e^{y_i}}{\sum_{j}^{C}{e^{y_j}}}, i &isin; </math> { <math> {1, . . . , C} </math> }<br />
<br />
and reducing the expected negative log likelihood with respect to ground-truth target <math> t &isin; </math> { <math> {1, . . . , C} </math> },<br />
by minimizing the loss function:<br />
<br />
<math><br />
L(x, t) = − log {v_t} = −{w_t}·{x} − b_t + log ({\sum_{j}^{C}e^{w_j . x + b_j}})<br />
</math><br />
<br />
where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.<br />
<br />
==Choosing the projection matrix==<br />
<br />
To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math> Q &isin; R^{N×C} </math>, such that <math> &forall; i &ne; j : q_i · q_j = 0 </math> and <math> || q_i ||_{2} = 1 </math>, where <math>q_i</math> is the <math>i</math>th column of <math>Q</math>. This is ensured by a simple random sampling and singular-value decomposition<br />
<br />
As the rows of classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, we find it beneficial<br />
to also restrict the representation of <math>x</math> by normalizing it to reside on the <math>n</math>-dimensional sphere:<br />
<br />
<center><math><br />
\hat{x} = \frac{x}{||x||_{2}}<br />
</math></center><br />
<br />
This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, it has now an issue that <math>q_i · \hat{x} </math> is bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive, and the network is affected by the inability to re-scale its input. This issue is amended with a fixed scale <math>T</math> applied to softmax inputs <math>f(y) = softmax(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. So, the authors propose to introduce a single scalar parameter <math>\alpha</math> to learn the softmax scale, effectively functioning as an inverse of the softmax temperature <math>\frac{1}{T}</math>. The normalized weights and an additional scale coefficient are also used, specially using a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b &isin; R^{C}</math> is kept the same and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:<br />
<br />
<center><br />
<math><br />
v_i=\frac{e^{\alpha q_i &middot; \hat{x} + b_i}}{\sum_{j}^{C} e^{\alpha q_j &middot; \hat{x} + b_j}}, i &isin; </math> { <math> {1,...,C} </math>}<br />
</center><br />
<br />
and the loss to be minimized is:<br />
<br />
<center><math><br />
L(x, t) = -\alpha q_t &middot; \frac{x}{||x||_{2}} + b_t + log (\sum_{i=1}^{C} exp((\alpha q_i &middot; \frac{x}{||x||_{2}} + b_i)))<br />
</math></center><br />
<br />
where <math>x</math> is the final representation obtained by the network for a specific sample, and <math> t &isin; </math> { <math> {1, . . . , C} </math> } is the ground-truth label for that sample. The behaviour of the parameter <math> \alpha </math> over time, which is logarithmic in nature and has the same behavior exhibited by the norm of a learned classifier, is shown in<br />
[[Media: figure1_log_behave.png| Figure 1]].<br />
<br />
<center>[[File:figure1_log_behave.png]]</center><br />
<br />
When <math> -1 \le q_i · \hat{x} \le 1 </math>, a possible cosine angle loss is <br />
<br />
<center>[[File:caloss.png]]</center><br />
<br />
But its final validation accuracy has slight decrease, compared to original models.<br />
<br />
==Using a Hadmard matrix==<br />
<br />
To recall, Hadmard matrix (Hedayat et al., 1978) <math> H </math> is an <math> n × n </math> matrix, where all of its entries are either +1 or −1.<br />
Furthermore, <math> H </math> is orthogonal, such that <math> HH^{T} = nI_n </math> where <math>I_n</math> is the identity matrix. Instead of using the entire Hadmard matrix <math>H</math>, a truncated version, <math> \hat{H} &isin; </math> {<math> {-1, 1}</math>}<math>^{C \times N}</math> where all <math>C</math> rows are orthogonal as the final classification layer is such that:<br />
<br />
<center><math><br />
y = \hat{H} \hat{x} + b<br />
</math></center><br />
<br />
This usage allows two main benefits:<br />
* A deterministic, low-memory and easily generated matrix that can be used for classification.<br />
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.<br />
<br />
Here, <math>n</math> must be a multiple of 4, but it can be easily truncated to fit normally defined networks. Also, as the classifier weights are fixed to need only 1-bit precision, it is now possible to focus our attention on the features preceding it.<br />
<br />
=Experimental Results=<br />
<br />
The authors have evaluated their proposed model on the following datasets:<br />
<br />
==CIFAR-10/100==<br />
<br />
===About the dataset===<br />
<br />
CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of same size, but contains 100 different classes.<br />
<br />
===Training Details===<br />
<br />
The authors trained a residual network ( He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].<br />
<br />
<center>[[File: figure1_resnet_cifar10.png]]</center><br />
<br />
These results demonstrate that although the training error is considerably lower for the network with learned classifier, both models achieve the same classification accuracy on the validation set. The authors conjecture is that with the new fixed parameterization, the network can no longer increase the<br />
norm of a given sample’s representation - thus learning its label requires more effort. As this may happen for specific seen samples - it affects only training error.<br />
<br />
The authors also compared using a fixed scale variable <math>\alpha </math> at different values vs. the learned parameter. Results for <math> \alpha = </math> {0.1, 1, 10} are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error and as can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha </math>= 1 or 10 will suffice) at the expense of another hyper-parameter to seek. In all the further experiments the scaling parameter <math> \alpha </math> was regularized with the same weight decay coefficient used on original classifier.<br />
<br />
<center>[[File: figure3_alpha_resnet_cifar.png]]</center><br />
<br />
The authors then train the model on CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with depth of 100 layers and k = 12. The higher number of classes caused the number of parameters to grow and encompassed about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained equally good as the original model, and the same training curve was observed as earlier.<br />
<br />
==IMAGENET==<br />
<br />
===About the dataset===<br />
<br />
The Imagenet dataset introduced by Deng et al. (2009) spans over 1000 visual classes, and over 1.2 million samples. This is supposedly a more challenging dataset to work on as compared to CIFAR-10/100.<br />
<br />
===Experiment Details===<br />
<br />
The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and Densenet169 model (Huang et al., 2017) as described in the original work. Using a fixed classifier removed approximately 2-million parameters were from the model, accounting for about 8% and 12 % of the model parameters respectively. The experiments revealed similar trends as observed on CIFAR-10.<br />
<br />
For a more stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed to be used in low memory and limited computing platforms and has parameters making up the majority of the model. They were able to reduce the parameters to 0.86 million as compared to 0.96 million parameters in the final layer of the original model. Again, the proposed modification in the original model gave similar convergence results on validation accuracy.<br />
<br />
The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].<br />
<br />
<center>[[File: table1_fixed_results.png]]</center><br />
<br />
==Language Modelling==<br />
<br />
The authors also experimented with fix-classifiers on language modelling as it also requires classification of all possible tokens available in the task vocabulary. They trained a recurrent model with 2-layers of LSTM (Hochreiter & Schmidhuber, 1997) and embedding + hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using same settings as in Merity et al. (2017). However, using a random orthogonal transform yielded poor results compared to learned embedding. This was suspected to be due to semantic relationships captured in the embedding layer of language models, which is not the case in image classification task. The intuition was further confirmed by the much better results when pre-trained embeddings using word2vec algorithm by Mikolov et al. (2013) or PMI factorization as suggested by Levy & Goldberg (2014), were used.<br />
<br />
<br />
=Discussion=<br />
<br />
==Implications and use cases==<br />
<br />
With the increasing number of classes in the benchmark datasets, computational demands for the final classifier will increase as well. In order to understand the problem better, the authors observe the work by Sun et al. (2017), which introduced JFT-300M - an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016), with a 2048 sized representation led to a model with over 36M parameters meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty in distributing so many parameters over the training servers involving a non-trivial overhead during synchronization of the model for update. The authors claim that the fixed-classifier would help considerably in this kind of scenario - where using a fixed classifier removes the need to do any gradient synchronization for the final layer. Furthermore, introduction of Hadamard matrix removes the need to save the transformation altogether, thereby, making it more efficient and allowing considerable memory and computational savings.<br />
<br />
==Possible Caveats==<br />
<br />
The good performance of fixed-classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when when the ratio between learned features and number of classes is small – that is, when <math> C > N</math>. However, they tested their method in such cases and their model performed well and provided good results.<br />
Another factor that can affect the performance of their model using a fixed classifier is when the classes are highly correlated. In that case, the fixed classifier actually cannot support correlated classes and thus, the network could have some difficulty to learn. For a language model, word classes tend to have highly correlated instances, which also lead to difficult learning process.<br />
<br />
==Future Work==<br />
<br />
<br />
The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there is no need in a per-sample normalization.<br />
<br />
Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.<br />
<br />
A related paper was published that claims that fixing most of the parameters of the neural network achieves comparable results with learning all of them [A. Rosenfeld and J. K. Tsotsos]<br />
<br />
=Conclusion=<br />
<br />
In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change lead to little or almost no decline in the performance of the architecture. Furthermore, using a Hadmard matrix as classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on large amount of transformation coefficients.<br />
<br />
Another possible scope of research that could be pointed out for future could be to find new efficient methods to create pre-defined word embeddings, which require huge amount of parameters that can possibly be avoided when learning a new task. Therefore, more emphasis should be given to the representations learned by the non-linear parts of the neural networks - upto the final classifier, as it seems highly redundant.<br />
<br />
=Critique=<br />
<br />
The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).<br />
<br />
=References=<br />
<br />
The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.<br />
<br />
Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.<br />
<br />
Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.<br />
<br />
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a” siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.<br />
<br />
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.<br />
<br />
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.<br />
<br />
Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.<br />
<br />
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.<br />
<br />
Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.<br />
<br />
A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6<br />
(6):1184–1238, 1978.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural computation, 9(8): 1735–1780, 1997.<br />
<br />
Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.<br />
<br />
Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.<br />
<br />
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.<br />
<br />
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.<br />
<br />
Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.<br />
<br />
Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.<br />
<br />
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.<br />
<br />
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.<br />
<br />
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.<br />
<br />
Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to ´ document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.<br />
<br />
Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.<br />
<br />
Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.<br />
<br />
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.<br />
<br />
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.<br />
<br />
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.<br />
<br />
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed tations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.<br />
<br />
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.<br />
Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.<br />
<br />
Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017,<br />
pp. 157, 2017.<br />
<br />
Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.<br />
<br />
Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.<br />
<br />
Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.<br />
<br />
Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.<br />
<br />
Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.<br />
<br />
Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.<br />
<br />
Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.<br />
<br />
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.<br />
<br />
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.<br />
<br />
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.<br />
<br />
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.<br />
<br />
Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with lowrank regularization. arXiv preprint arXiv:1511.06067, 2015.<br />
<br />
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.<br />
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.<br />
<br />
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.<br />
<br />
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.<br />
<br />
Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.<br />
<br />
A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:caloss.png&diff=41785File:caloss.png2018-11-29T02:37:48Z<p>X832zhan: X832zhan uploaded a new version of File:caloss.png</p>
<hr />
<div></div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:caloss.png&diff=41784File:caloss.png2018-11-29T02:37:13Z<p>X832zhan: </p>
<hr />
<div></div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fix_your_classifier:_the_marginal_value_of_training_the_last_weight_layer&diff=41783Fix your classifier: the marginal value of training the last weight layer2018-11-29T02:37:00Z<p>X832zhan: /* Choosing the projection matrix */</p>
<hr />
<div>=Introduction=<br />
<br />
Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have<br />
a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.<br />
<br />
=Brief Overview=<br />
<br />
In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (upto a global scale constant). They argue that with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix the inference could be made faster as well.<br />
<br />
=Previous Work=<br />
<br />
Training NN models and using them for inference requires large amounts of memory and computational resources; thus, extensive amount of research has been done lately to reduce the size of networks which are as follows:<br />
<br />
* Weight sharing and specification (Han et al., 2015)<br />
<br />
* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)<br />
<br />
* Low-rank approximations to speed up CNN (Tai et al., 2015)<br />
<br />
* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)<br />
<br />
Some of the past works have also put forward the fact that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many of the classification tasks. However, the authors' proposal in the current paper is quite reversed.<br />
<br />
=Background=<br />
<br />
Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at later stage of the network, presumably to allow classification based on global features of an image.<br />
<br />
== Shortcomings of the final classification layer and its solution ==<br />
<br />
Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).<br />
<br />
It has been shown previously that these layers could be easily compressed and reduced after a model was trained by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized with the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), that lead to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works showed that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer was introduced and the objective regarded only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, where they kept the original initialization of the classification layer fixed with no negative impact on performance on the CIFAR-10 dataset.<br />
<br />
=Proposed Method=<br />
<br />
The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors have suggested that parameters used for the final classification transform are completely redundant, and can be replaced with a predetermined linear transform. This holds for even in large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).<br />
<br />
==Using a fixed classifier==<br />
<br />
Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math> where <math>F</math> is assumed to be a deep neural network with input z and parameters θ, e.g., a convolutional network, trained by backpropagation.<br />
<br />
In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math> ,where <math>W</math> and <math>b</math> are also trained by back-propagation.<br />
<br />
For input <math>x</math> of <math>N</math> length, and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of <math>N ×<br />
C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation<br />
<br />
<math><br />
v_i = \frac{e^{y_i}}{\sum_{j}^{C}{e^{y_j}}}, i &isin; </math> { <math> {1, . . . , C} </math> }<br />
<br />
and reducing the expected negative log likelihood with respect to ground-truth target <math> t &isin; </math> { <math> {1, . . . , C} </math> },<br />
by minimizing the loss function:<br />
<br />
<math><br />
L(x, t) = − log {v_t} = −{w_t}·{x} − b_t + log ({\sum_{j}^{C}e^{w_j . x + b_j}})<br />
</math><br />
<br />
where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.<br />
<br />
==Choosing the projection matrix==<br />
<br />
To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math> Q &isin; R^{N×C} </math>, such that <math> &forall; i &ne; j : q_i · q_j = 0 </math> and <math> || q_i ||_{2} = 1 </math>, where <math>q_i</math> is the <math>i</math>th column of <math>Q</math>. This is ensured by a simple random sampling and singular-value decomposition<br />
<br />
As the rows of classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, we find it beneficial<br />
to also restrict the representation of <math>x</math> by normalizing it to reside on the <math>n</math>-dimensional sphere:<br />
<br />
<center><math><br />
\hat{x} = \frac{x}{||x||_{2}}<br />
</math></center><br />
<br />
This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, it has now an issue that <math>q_i · \hat{x} </math> is bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive, and the network is affected by the inability to re-scale its input. This issue is amended with a fixed scale <math>T</math> applied to softmax inputs <math>f(y) = softmax(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. So, the authors propose to introduce a single scalar parameter <math>\alpha</math> to learn the softmax scale, effectively functioning as an inverse of the softmax temperature <math>\frac{1}{T}</math>. The normalized weights and an additional scale coefficient are also used, specially using a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b &isin; R^{C}</math> is kept the same and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:<br />
<br />
<center><br />
<math><br />
v_i=\frac{e^{\alpha q_i &middot; \hat{x} + b_i}}{\sum_{j}^{C} e^{\alpha q_j &middot; \hat{x} + b_j}}, i &isin; </math> { <math> {1,...,C} </math>}<br />
</center><br />
<br />
and the loss to be minimized is:<br />
<br />
<center><math><br />
L(x, t) = -\alpha q_t &middot; \frac{x}{||x||_{2}} + b_t + log (\sum_{i=1}^{C} exp((\alpha q_i &middot; \frac{x}{||x||_{2}} + b_i)))<br />
</math></center><br />
<br />
where <math>x</math> is the final representation obtained by the network for a specific sample, and <math> t &isin; </math> { <math> {1, . . . , C} </math> } is the ground-truth label for that sample. The behaviour of the parameter <math> \alpha </math> over time, which is logarithmic in nature and has the same behavior exhibited by the norm of a learned classifier, is shown in<br />
[[Media: figure1_log_behave.png| Figure 1]].<br />
<br />
<center>[[File:figure1_log_behave.png]]</center><br />
<br />
When <math> -1 \le q_i · \hat{x} \le 1 </math>, a possible cosine angle loss is <br />
<br />
<center>[[File:caloss.png]]</center><br />
<br />
But its final validation accuracy has slight decrease, compared to original models.<br />
<br />
==Using a Hadmard matrix==<br />
<br />
To recall, Hadmard matrix (Hedayat et al., 1978) <math> H </math> is an <math> n × n </math> matrix, where all of its entries are either +1 or −1.<br />
Furthermore, <math> H </math> is orthogonal, such that <math> HH^{T} = nI_n </math> where <math>I_n</math> is the identity matrix. Instead of using the entire Hadmard matrix <math>H</math>, a truncated version, <math> \hat{H} &isin; </math> {<math> {-1, 1}</math>}<math>^{C \times N}</math> where all <math>C</math> rows are orthogonal as the final classification layer is such that:<br />
<br />
<center><math><br />
y = \hat{H} \hat{x} + b<br />
</math></center><br />
<br />
This usage allows two main benefits:<br />
* A deterministic, low-memory and easily generated matrix that can be used for classification.<br />
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.<br />
<br />
Here, <math>n</math> must be a multiple of 4, but it can be easily truncated to fit normally defined networks. Also, as the classifier weights are fixed to need only 1-bit precision, it is now possible to focus our attention on the features preceding it.<br />
<br />
=Experimental Results=<br />
<br />
The authors have evaluated their proposed model on the following datasets:<br />
<br />
==CIFAR-10/100==<br />
<br />
===About the dataset===<br />
<br />
CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of same size, but contains 100 different classes.<br />
<br />
===Training Details===<br />
<br />
The authors trained a residual network ( He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].<br />
<br />
<center>[[File: figure1_resnet_cifar10.png]]</center><br />
<br />
These results demonstrate that although the training error is considerably lower for the network with learned classifier, both models achieve the same classification accuracy on the validation set. The authors conjecture is that with the new fixed parameterization, the network can no longer increase the<br />
norm of a given sample’s representation - thus learning its label requires more effort. As this may happen for specific seen samples - it affects only training error.<br />
<br />
The authors also compared using a fixed scale variable <math>\alpha </math> at different values vs. the learned parameter. Results for <math> \alpha = </math> {0.1, 1, 10} are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error and as can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha </math>= 1 or 10 will suffice) at the expense of another hyper-parameter to seek. In all the further experiments the scaling parameter <math> \alpha </math> was regularized with the same weight decay coefficient used on original classifier.<br />
<br />
<center>[[File: figure3_alpha_resnet_cifar.png]]</center><br />
<br />
The authors then train the model on CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with depth of 100 layers and k = 12. The higher number of classes caused the number of parameters to grow and encompassed about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained equally good as the original model, and the same training curve was observed as earlier.<br />
<br />
==IMAGENET==<br />
<br />
===About the dataset===<br />
<br />
The Imagenet dataset introduced by Deng et al. (2009) spans over 1000 visual classes, and over 1.2 million samples. This is supposedly a more challenging dataset to work on as compared to CIFAR-10/100.<br />
<br />
===Experiment Details===<br />
<br />
The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and Densenet169 model (Huang et al., 2017) as described in the original work. Using a fixed classifier removed approximately 2-million parameters were from the model, accounting for about 8% and 12 % of the model parameters respectively. The experiments revealed similar trends as observed on CIFAR-10.<br />
<br />
For a more stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed to be used in low memory and limited computing platforms and has parameters making up the majority of the model. They were able to reduce the parameters to 0.86 million as compared to 0.96 million parameters in the final layer of the original model. Again, the proposed modification in the original model gave similar convergence results on validation accuracy.<br />
<br />
The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].<br />
<br />
<center>[[File: table1_fixed_results.png]]</center><br />
<br />
==Language Modelling==<br />
<br />
The authors also experimented with fix-classifiers on language modelling as it also requires classification of all possible tokens available in the task vocabulary. They trained a recurrent model with 2-layers of LSTM (Hochreiter & Schmidhuber, 1997) and embedding + hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using same settings as in Merity et al. (2017). However, using a random orthogonal transform yielded poor results compared to learned embedding. This was suspected to be due to semantic relationships captured in the embedding layer of language models, which is not the case in image classification task. The intuition was further confirmed by the much better results when pre-trained embeddings using word2vec algorithm by Mikolov et al. (2013) or PMI factorization as suggested by Levy & Goldberg (2014), were used.<br />
<br />
<br />
=Discussion=<br />
<br />
==Implications and use cases==<br />
<br />
With the increasing number of classes in the benchmark datasets, computational demands for the final classifier will increase as well. In order to understand the problem better, the authors observe the work by Sun et al. (2017), which introduced JFT-300M - an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016), with a 2048 sized representation led to a model with over 36M parameters meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty in distributing so many parameters over the training servers involving a non-trivial overhead during synchronization of the model for update. The authors claim that the fixed-classifier would help considerably in this kind of scenario - where using a fixed classifier removes the need to do any gradient synchronization for the final layer. Furthermore, introduction of Hadamard matrix removes the need to save the transformation altogether, thereby, making it more efficient and allowing considerable memory and computational savings.<br />
<br />
==Possible Caveats==<br />
<br />
The good performance of fixed-classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when when the ratio between learned features and number of classes is small – that is, when <math> C > N</math>. However, they tested their method in such cases and their model performed well and provided good results.<br />
Another factor that can affect the performance of their model using a fixed classifier is when the classes are highly correlated. In that case, the fixed classifier actually cannot support correlated classes and thus, the network could have some difficulty to learn.<br />
<br />
==Future Work==<br />
<br />
<br />
The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there is no need in a per-sample normalization.<br />
<br />
Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.<br />
<br />
A related paper was published that claims that fixing most of the parameters of the neural network achieves comparable results with learning all of them [A. Rosenfeld and J. K. Tsotsos]<br />
<br />
=Conclusion=<br />
<br />
In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change lead to little or almost no decline in the performance of the architecture. Furthermore, using a Hadmard matrix as classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on large amount of transformation coefficients.<br />
<br />
Another possible scope of research that could be pointed out for future could be to find new efficient methods to create pre-defined word embeddings, which require huge amount of parameters that can possibly be avoided when learning a new task. Therefore, more emphasis should be given to the representations learned by the non-linear parts of the neural networks - upto the final classifier, as it seems highly redundant.<br />
<br />
=Critique=<br />
<br />
The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).<br />
<br />
=References=<br />
<br />
The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.<br />
<br />
Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.<br />
<br />
Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.<br />
<br />
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a” siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.<br />
<br />
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.<br />
<br />
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.<br />
<br />
Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.<br />
<br />
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.<br />
<br />
Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.<br />
<br />
A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6<br />
(6):1184–1238, 1978.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural computation, 9(8): 1735–1780, 1997.<br />
<br />
Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.<br />
<br />
Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.<br />
<br />
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.<br />
<br />
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.<br />
<br />
Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.<br />
<br />
Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.<br />
<br />
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.<br />
<br />
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.<br />
<br />
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.<br />
<br />
Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to ´ document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.<br />
<br />
Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.<br />
<br />
Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.<br />
<br />
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.<br />
<br />
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.<br />
<br />
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.<br />
<br />
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed tations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.<br />
<br />
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.<br />
Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.<br />
<br />
Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017,<br />
pp. 157, 2017.<br />
<br />
Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.<br />
<br />
Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.<br />
<br />
Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.<br />
<br />
Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.<br />
<br />
Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.<br />
<br />
Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.<br />
<br />
Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.<br />
<br />
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.<br />
<br />
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.<br />
<br />
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.<br />
<br />
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.<br />
<br />
Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with lowrank regularization. arXiv preprint arXiv:1511.06067, 2015.<br />
<br />
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.<br />
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.<br />
<br />
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.<br />
<br />
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.<br />
<br />
Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.<br />
<br />
A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fix_your_classifier:_the_marginal_value_of_training_the_last_weight_layer&diff=41782Fix your classifier: the marginal value of training the last weight layer2018-11-29T02:30:41Z<p>X832zhan: /* Choosing the projection matrix */</p>
<hr />
<div>=Introduction=<br />
<br />
Deep neural networks have become a widely used model for machine learning, achieving state-of-the-art results on many tasks. The most common task these models are used for is to perform classification, as in the case of convolutional neural networks (CNNs) used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have<br />
a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources.<br />
<br />
=Brief Overview=<br />
<br />
In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (upto a global scale constant). They argue that with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix the inference could be made faster as well.<br />
<br />
=Previous Work=<br />
<br />
Training NN models and using them for inference requires large amounts of memory and computational resources; thus, extensive amount of research has been done lately to reduce the size of networks which are as follows:<br />
<br />
* Weight sharing and specification (Han et al., 2015)<br />
<br />
* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)<br />
<br />
* Low-rank approximations to speed up CNN (Tai et al., 2015)<br />
<br />
* Quantization of weights, activations and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)<br />
<br />
Some of the past works have also put forward the fact that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many of the classification tasks. However, the authors' proposal in the current paper is quite reversed.<br />
<br />
=Background=<br />
<br />
Convolutional neural networks (CNNs) are commonly used to solve a variety of spatial and temporal tasks. CNNs are usually composed of a stack of convolutional parameterized layers, spatial pooling layers and fully connected layers, separated by non-linear activation functions. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at later stage of the network, presumably to allow classification based on global features of an image.<br />
<br />
== Shortcomings of the final classification layer and its solution ==<br />
<br />
Despite the enormous number of trainable parameters these layers added to the model, they are known to have a rather marginal impact on the final performance of the network (Zeiler & Fergus, 2014).<br />
<br />
It has been shown previously that these layers could be easily compressed and reduced after a model was trained by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized with the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), that lead to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works showed that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer was introduced and the objective regarded only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, where they kept the original initialization of the classification layer fixed with no negative impact on performance on the CIFAR-10 dataset.<br />
<br />
=Proposed Method=<br />
<br />
The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors have suggested that parameters used for the final classification transform are completely redundant, and can be replaced with a predetermined linear transform. This holds for even in large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).<br />
<br />
==Using a fixed classifier==<br />
<br />
Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math> where <math>F</math> is assumed to be a deep neural network with input z and parameters θ, e.g., a convolutional network, trained by backpropagation.<br />
<br />
In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math> ,where <math>W</math> and <math>b</math> are also trained by back-propagation.<br />
<br />
For input <math>x</math> of <math>N</math> length, and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of <math>N ×<br />
C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation<br />
<br />
<math><br />
v_i = \frac{e^{y_i}}{\sum_{j}^{C}{e^{y_j}}}, i &isin; </math> { <math> {1, . . . , C} </math> }<br />
<br />
and reducing the expected negative log likelihood with respect to ground-truth target <math> t &isin; </math> { <math> {1, . . . , C} </math> },<br />
by minimizing the loss function:<br />
<br />
<math><br />
L(x, t) = − log {v_t} = −{w_t}·{x} − b_t + log ({\sum_{j}^{C}e^{w_j . x + b_j}})<br />
</math><br />
<br />
where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.<br />
<br />
==Choosing the projection matrix==<br />
<br />
To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math> Q &isin; R^{N×C} </math>, such that <math> &forall; i &ne; j : q_i · q_j = 0 </math> and <math> || q_i ||_{2} = 1 </math>, where <math>q_i</math> is the <math>i</math>th column of <math>Q</math>. This is ensured by a simple random sampling and singular-value decomposition<br />
<br />
As the rows of classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, we find it beneficial<br />
to also restrict the representation of <math>x</math> by normalizing it to reside on the <math>n</math>-dimensional sphere:<br />
<br />
<center><math><br />
\hat{x} = \frac{x}{||x||_{2}}<br />
</math></center><br />
<br />
This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, it has now an issue that <math>q_i · \hat{x} </math> is bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive, and the network is affected by the inability to re-scale its input. This issue is amended with a fixed scale <math>T</math> applied to softmax inputs <math>f(y) = softmax(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. So, the authors propose to introduce a single scalar parameter <math>\alpha</math> to learn the softmax scale, effectively functioning as an inverse of the softmax temperature <math>\frac{1}{T}</math>. The normalized weights and an additional scale coefficient are also used, specially using a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b &isin; R^{C}</math> is kept the same and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:<br />
<br />
<center><br />
<math><br />
v_i=\frac{e^{\alpha q_i &middot; \hat{x} + b_i}}{\sum_{j}^{C} e^{\alpha q_j &middot; \hat{x} + b_j}}, i &isin; </math> { <math> {1,...,C} </math>}<br />
</center><br />
<br />
and the loss to be minimized is:<br />
<br />
<center><math><br />
L(x, t) = -\alpha q_t &middot; \frac{x}{||x||_{2}} + b_t + log (\sum_{i=1}^{C} exp((\alpha q_i &middot; \frac{x}{||x||_{2}} + b_i)))<br />
</math></center><br />
<br />
where <math>x</math> is the final representation obtained by the network for a specific sample, and <math> t &isin; </math> { <math> {1, . . . , C} </math> } is the ground-truth label for that sample. The behaviour of the parameter <math> \alpha </math> over time, which is logarithmic in nature, is shown in<br />
[[Media: figure1_log_behave.png| Figure 1]].<br />
<br />
<center>[[File:figure1_log_behave.png]]</center><br />
<br />
==Using a Hadmard matrix==<br />
<br />
To recall, Hadmard matrix (Hedayat et al., 1978) <math> H </math> is an <math> n × n </math> matrix, where all of its entries are either +1 or −1.<br />
Furthermore, <math> H </math> is orthogonal, such that <math> HH^{T} = nI_n </math> where <math>I_n</math> is the identity matrix. Instead of using the entire Hadmard matrix <math>H</math>, a truncated version, <math> \hat{H} &isin; </math> {<math> {-1, 1}</math>}<math>^{C \times N}</math> where all <math>C</math> rows are orthogonal as the final classification layer is such that:<br />
<br />
<center><math><br />
y = \hat{H} \hat{x} + b<br />
</math></center><br />
<br />
This usage allows two main benefits:<br />
* A deterministic, low-memory and easily generated matrix that can be used for classification.<br />
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.<br />
<br />
Here, <math>n</math> must be a multiple of 4, but it can be easily truncated to fit normally defined networks. Also, as the classifier weights are fixed to need only 1-bit precision, it is now possible to focus our attention on the features preceding it.<br />
<br />
=Experimental Results=<br />
<br />
The authors have evaluated their proposed model on the following datasets:<br />
<br />
==CIFAR-10/100==<br />
<br />
===About the dataset===<br />
<br />
CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of same size, but contains 100 different classes.<br />
<br />
===Training Details===<br />
<br />
The authors trained a residual network ( He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].<br />
<br />
<center>[[File: figure1_resnet_cifar10.png]]</center><br />
<br />
These results demonstrate that although the training error is considerably lower for the network with learned classifier, both models achieve the same classification accuracy on the validation set. The authors conjecture is that with the new fixed parameterization, the network can no longer increase the<br />
norm of a given sample’s representation - thus learning its label requires more effort. As this may happen for specific seen samples - it affects only training error.<br />
<br />
The authors also compared using a fixed scale variable <math>\alpha </math> at different values vs. the learned parameter. Results for <math> \alpha = </math> {0.1, 1, 10} are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error and as can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha </math>= 1 or 10 will suffice) at the expense of another hyper-parameter to seek. In all the further experiments the scaling parameter <math> \alpha </math> was regularized with the same weight decay coefficient used on original classifier.<br />
<br />
<center>[[File: figure3_alpha_resnet_cifar.png]]</center><br />
<br />
The authors then train the model on CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with depth of 100 layers and k = 12. The higher number of classes caused the number of parameters to grow and encompassed about 4% of the whole model. However, validation accuracy for the fixed-classifier model remained equally good as the original model, and the same training curve was observed as earlier.<br />
<br />
==IMAGENET==<br />
<br />
===About the dataset===<br />
<br />
The Imagenet dataset introduced by Deng et al. (2009) spans over 1000 visual classes, and over 1.2 million samples. This is supposedly a more challenging dataset to work on as compared to CIFAR-10/100.<br />
<br />
===Experiment Details===<br />
<br />
The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and Densenet169 model (Huang et al., 2017) as described in the original work. Using a fixed classifier removed approximately 2-million parameters were from the model, accounting for about 8% and 12 % of the model parameters respectively. The experiments revealed similar trends as observed on CIFAR-10.<br />
<br />
For a more stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed to be used in low memory and limited computing platforms and has parameters making up the majority of the model. They were able to reduce the parameters to 0.86 million as compared to 0.96 million parameters in the final layer of the original model. Again, the proposed modification in the original model gave similar convergence results on validation accuracy.<br />
<br />
The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].<br />
<br />
<center>[[File: table1_fixed_results.png]]</center><br />
<br />
==Language Modelling==<br />
<br />
The authors also experimented with fix-classifiers on language modelling as it also requires classification of all possible tokens available in the task vocabulary. They trained a recurrent model with 2-layers of LSTM (Hochreiter & Schmidhuber, 1997) and embedding + hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using same settings as in Merity et al. (2017). However, using a random orthogonal transform yielded poor results compared to learned embedding. This was suspected to be due to semantic relationships captured in the embedding layer of language models, which is not the case in image classification task. The intuition was further confirmed by the much better results when pre-trained embeddings using word2vec algorithm by Mikolov et al. (2013) or PMI factorization as suggested by Levy & Goldberg (2014), were used.<br />
<br />
<br />
=Discussion=<br />
<br />
==Implications and use cases==<br />
<br />
With the increasing number of classes in the benchmark datasets, computational demands for the final classifier will increase as well. In order to understand the problem better, the authors observe the work by Sun et al. (2017), which introduced JFT-300M - an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016), with a 2048 sized representation led to a model with over 36M parameters meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty in distributing so many parameters over the training servers involving a non-trivial overhead during synchronization of the model for update. The authors claim that the fixed-classifier would help considerably in this kind of scenario - where using a fixed classifier removes the need to do any gradient synchronization for the final layer. Furthermore, introduction of Hadamard matrix removes the need to save the transformation altogether, thereby, making it more efficient and allowing considerable memory and computational savings.<br />
<br />
==Possible Caveats==<br />
<br />
The good performance of fixed-classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when when the ratio between learned features and number of classes is small – that is, when <math> C > N</math>. However, they tested their method in such cases and their model performed well and provided good results.<br />
Another factor that can affect the performance of their model using a fixed classifier is when the classes are highly correlated. In that case, the fixed classifier actually cannot support correlated classes and thus, the network could have some difficulty to learn.<br />
<br />
==Future Work==<br />
<br />
<br />
The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there is no need in a per-sample normalization.<br />
<br />
Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.<br />
<br />
A related paper was published that claims that fixing most of the parameters of the neural network achieves comparable results with learning all of them [A. Rosenfeld and J. K. Tsotsos]<br />
<br />
=Conclusion=<br />
<br />
In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change lead to little or almost no decline in the performance of the architecture. Furthermore, using a Hadmard matrix as classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on large amount of transformation coefficients.<br />
<br />
Another possible scope of research that could be pointed out for future could be to find new efficient methods to create pre-defined word embeddings, which require huge amount of parameters that can possibly be avoided when learning a new task. Therefore, more emphasis should be given to the representations learned by the non-linear parts of the neural networks - upto the final classifier, as it seems highly redundant.<br />
<br />
=Critique=<br />
<br />
The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).<br />
<br />
=References=<br />
<br />
The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.<br />
<br />
Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.<br />
<br />
Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.<br />
<br />
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a” siamese” time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.<br />
<br />
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.<br />
<br />
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.<br />
<br />
Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.<br />
<br />
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.<br />
<br />
Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.<br />
<br />
A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6<br />
(6):1184–1238, 1978.<br />
<br />
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. ¨ Neural computation, 9(8): 1735–1780, 1997.<br />
<br />
Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.<br />
<br />
Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.<br />
<br />
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.<br />
<br />
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.<br />
<br />
Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.<br />
<br />
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.<br />
<br />
Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.<br />
<br />
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.<br />
<br />
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.<br />
<br />
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.<br />
<br />
Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to ´ document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.<br />
<br />
Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.<br />
<br />
Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.<br />
<br />
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.<br />
<br />
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.<br />
<br />
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.<br />
<br />
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed tations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.<br />
<br />
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.<br />
Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.<br />
<br />
Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017,<br />
pp. 157, 2017.<br />
<br />
Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.<br />
<br />
Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.<br />
<br />
Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.<br />
<br />
Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.<br />
<br />
Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.<br />
<br />
Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.<br />
<br />
Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.<br />
<br />
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.<br />
<br />
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.<br />
<br />
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.<br />
<br />
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.<br />
<br />
Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with lowrank regularization. arXiv preprint arXiv:1511.06067, 2015.<br />
<br />
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.<br />
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.<br />
<br />
Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.<br />
<br />
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.<br />
<br />
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.<br />
<br />
Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.<br />
<br />
A. Rosenfeld and J. K. Tsotsos, “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” arXiv preprint arXiv:1802.00844, 2018.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ShakeDrop_Regularization&diff=41781ShakeDrop Regularization2018-11-29T02:20:44Z<p>X832zhan: /* Experiments */</p>
<hr />
<div>=Introduction=<br />
Current state of the art techniques for object classification are deep neural networks based on the residual block, first published by (He et al., 2016). This technique has been the foundation of several improved networks, including Wide ResNet (Zagoruyko & Komodakis, 2016), PyramdNet (Han et al., 2017) and ResNeXt (Xie et al., 2017). They have been further improved by regularization, such as Stochastic Depth (ResDrop) (Huang et al., 2016) and Shake-Shake (Gastaldi, 2017), which can avoid some problem like vanishing gradients. Shake-Shake applied to ResNeXt has achieved one of the lowest error rates on the CIFAR-10 and CIFAR-100 datasets. However, it is only applicable to multi-branch architectures and is not memory efficient since it requires two branches of residual blocks to apply. To address this problem, ShakeDrop regularization that can realize a similar disturbance to Shake-Shake on a single residual block is proposed. Moreover, they use ResDrop to stabilize the learning process. This paper seeks to formulate a general expansion of Shake-Shake that can be applied to any residual block based network.<br />
<br />
=Existing Methods=<br />
<br />
'''Deep Approaches'''<br />
<br />
'''ResNet''', was the first use of residual blocks, a foundational feature in many modern state of the art convolution neural networks. They can be formulated as <math>G(x) = x + F(x)</math> where <math>x</math> and <math>G(x)</math> are the input and output of the residual block, and <math>F(x)</math> is the output of the residual branch on the residual block. A residual block typically performs a convolution operation and then passes the result plus its input onto the next block.<br />
<br />
[[File:ResidualBlock.png|600px|centre|thumb|An example of a simple residual block from Deep Residual Learning for Image Recognition by He et al., 2016]]<br />
<br />
ResNet is constructed out of a large number of these residual blocks sequentially stacked. It is interesting to note that having too many layers can cause overfitting, as pointed out by He et al. (2016) with the high error rates for the 1,202-layer ResNet on CIFAR datasets. Another paper (Veit et al., 2016) empirically showed that the cause of the high error rates can be mostly attributed to specific residual blocks whose channels increase greatly.<br />
<br />
'''PyramidNet''' is an important iteration that built on ResNet and WideResNet by gradually increasing channels on each residual block. The residual block is similar to those used in ResNet. It has been used to generate some of the first successful convolution neural networks with very large depth, at 272 layers. Amongst unmodified residual network architectures, it performs the best on the CIFAR datasets.<br />
<br />
[[File:ResidualBlockComparison.png|900px|centre|thumb|A simple illustration of different residual blocks from Deep Pyramidal Residual Networks by Han et al., 2017]]<br />
<br />
<br />
'''Non-Deep Approaches'''<br />
<br />
'''Wide ResNet''' modified ResNet by increasing channels in each layer, having a wider and shallower structure. Similarly to PyramidNet, this architecture avoids some of the pitfalls in the original formulation of ResNet.<br />
<br />
'''ResNeXt''' achieved performance beyond that of Wide ResNet with only a small increase in the number of parameters. It can be formulated as <math>G(x) = x + F_1(x)+F_2(x)</math>. In this case, <math>F_1(x)</math> and <math>F_2(x)</math> are the outputs of two paired convolution operations in a single residual block. The number of branches is not limited to 2, and will control the result of this network.<br />
<br />
<br />
[[File:SimplifiedResNeXt.png|600px|centre|thumb|Simplified ResNeXt Convolution Block. Yamada et al., 2018]]<br />
<br />
<br />
'''Regularization Methods'''<br />
<br />
'''Stochastic Depth''' helped address the issue of vanishing gradients in ResNet. It works by randomly dropping residual blocks. On the <math>l^{th}</math> residual block the Stochastic Depth process is given as <math>G(x)=x+b_lF(x)</math> where <math>b_l \in \{0,1\}</math> is a Bernoulli random variable with probability <math>p_l</math>. Using a constant value for <math>p_l</math> didn't work well, so instead a linear decay rule <math>p_l = 1 - \frac{l}{L}(1-p_L)</math> was used. In this equation, <math>L</math> is the number of layers, and <math>p_L</math> is the initial parameter. <br />
<br />
'''Shake-Shake''' is a regularization method that specifically improves the ResNeXt architecture. It can be given as <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math>, where <math>\alpha \in [0,1]</math> is a random coefficient. <math>\alpha</math> is used during the forward pass, and another identically distributed random parameter <math>\beta</math> is used in the backward pass. This caused one of the two paired convolution operations to be dropped, and further improved ResNeXt.<br />
<br />
[[File:Paper 32.jpg|600px|centre|thumb| Shake-Shake (ResNeXt + Shake-Shake) (Gastaldi, 2017), in which some processing layers omitted for conciseness.]]<br />
<br />
=Proposed Method=<br />
This paper seeks to generalize the method proposed in Shake-Shake to be applied to any residual structure network. Shake-Shake. The initial formulation of 1-branch shake is <math>G(x) = x + \alpha F(x)</math>. In this case, <math>\alpha</math> is a coefficient that disturbs the forward pass, but is not necessarily constrained to be [0,1]. Another corresponding coefficient <math>\beta</math> is used in the backwards pass. Applying this simple adaptation of Shake-Shake on a 110-layer version of PyramidNet with <math>\alpha \in [0,1]</math> and <math>\beta \in [0,1]</math> performs abysmally, with an error rate of 77.99%.<br />
<br />
This failure is a result of the setup causing too much perturbation. A trick is needed to promote learning with large perturbations, to preserve the regularization effect. The idea of the authors is to borrow from ResDrop and combine that with Shake-Shake. This works by randomly deciding whether to apply 1-branch shake. This creates in effect two networks, the original network without a regularization component, and a regularized network. When mixing up two networks, we expected the following effects: When the non regularized network is selected, learning is promoted; when the perturbed network is selected, learning is disturbed. Achieving good performance requires a balance between the two. <br />
<br />
'''ShakeDrop''' is given as <br />
<br />
<div align="center"><br />
<math>G(x) = x + (b_l + \alpha - b_l \alpha)F(x)</math>,<br />
</div><br />
<br />
where <math>b_l</math> is a Bernoulli random variable following the linear decay rule used in Stochastic Depth. An alternative presentation is <br />
<br />
<div align="center"><br />
<math><br />
G(x) = \begin{cases}<br />
x + F(x) ~~ \text{if } b_l = 1 \\<br />
x + \alpha F(x) ~~ \text{otherwise}<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
If <math>b_l = 1</math> then ShakeDrop is equivalent to the original network, otherwise it is the network + 1-branch Shake. The authors also found that the linear decay rule of ResDrop works well, compared with the uniform rule. Regardless of the value of <math>\beta</math> on the backwards pass, network weights will be updated.<br />
<br />
=Experiments=<br />
<br />
'''Parameter Search'''<br />
<br />
The authors experiments began with a hyperparameter search utilizing ShakeDrop on pyramidal networks. The PyramidNet used was made up of a total of 110 layers which included a convolutional layer and a final fully connected layer. It had 54 additive pyramidal residual blocks and the final residual block had 286 channels. The results of this search are presented below. <br />
<br />
[[File:ShakeDropHyperParameterSearch.png|600px|centre|thumb|Average Top-1 errors (%) of “PyramidNet + ShakeDrop” with several ranges of parameters of 4 runs at the final (300th) epoch on CIFAR-100 dataset in the “Batch” level. In some settings, it is equivalent to PyramidNet and PyramidDrop. Borrowed from ShakeDrop Regularization by Yamada et al., 2018.]]<br />
<br />
The setting that are used throughout the rest of the experiments are then <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>. Cases H and F outperform PyramidNet, suggesting that the strong perturbations imposed by ShakeDrop are functioning as intended. However, fully applying the perturbations in the backwards pass appears to destabilize the network, resulting in performance that is worse than standard PyramidNet.<br />
<br />
[[File:ParameterUpdateShakeDrop.png|400px|centre]]<br />
<br />
Following this initial parameter decision, the authors tested 4 different strategies for parameter update among "Batch" (same coefficients for all images in minibatch for each residual block), "Image" (same scaling coefficients for each image for each residual block), "Channel" (same scaling coefficients for each element for each residual block), and "Pixel" (same scaling coefficients for each element for each residual block). While Pixel was the best in terms of error rate, it is not very memory efficient, so Image was selected as it had the second best performance without the memory drawback.<br />
<br />
'''Comparison with Regularization Methods'''<br />
<br />
For these experiments, there are a few modifications that were made to assist with training. For ResNeXt, the EraseRelu formulation has each residual block ends in batch normalization. The Wide ResNet also is compared between vanilla with batch normalization and without. Batch normalization keeps the outputs of residual blocks in a certain range, as otherwise <math>\alpha</math> and <math>\beta</math> could cause perturbations that are too large, causing divergent learning. There is also a comparison of ResDrop/ShakeDrop Type A (where the regularization unit is inserted before the add unit for a residual branch) and after (where the regularization unit is inserted after the add unit for a residual branch). <br />
<br />
These experiments are performed on the CIFAR-100 dataset.<br />
<br />
[[File:ShakeDropArchitectureComparison1.png|800px|centre|thumb|]]<br />
<br />
[[File:ShakeDropArchitectureComparison2.png|800px|centre|thumb|]]<br />
<br />
[[File:ShakeDropArchitectureComparison3.png|800px|centre|thumb|]]<br />
<br />
For a final round of testing, the training setup was modified to incorporate other techniques used in state of the art methods. For most of the tests, the learning rate for the 300 epoch version started at 0.1 and decayed by a factor of 0.1 1/2 & 3/4 of the way through training. The alternative was cosine annealing, based on the presentation by Loshchilov and Hutter in their paper SGDR: Stochastic Gradient Descent with Warm Restarts. This is indicated in the Cos column, with a check indicating cosine annealing. <br />
<br />
[[File:CosineAnnealing.png|400px|centre|thumb|]]<br />
<br />
The Reg column indicates the regularization method used, either none, ResDrop (RD), Shake-Shake (SS), or ShakeDrop (SD). Fianlly, the Fil Column determines the type of data augmentation used, either none, cutout (CO) (DeVries & Taylor, 2017b), or Random Erasing (RE) (Zhong et al., 2017). <br />
<br />
[[File:ShakeDropComparison.png|800px|centre|thumb|Top-1 Errors (%) at final epoch on CIFAR-10/100 datasets]]<br />
<br />
'''State-of-the-Art Comparisons'''<br />
<br />
A direct comparison with state of the art methods is favorable for this new method. <br />
<br />
# Fair comparison of ResNeXt + Shake-Shake with PyramidNet + ShakeDrop gives an improvement of 0.19% on CIFAR-10 and 1.86% on CIFAR-100. Under these conditions, the final error rate is then 2.67% for CIFAR-10 and 13.99% for CIFAR-100.<br />
# Fair comparison of ResNeXt + Shake-Shake + Cutout with PyramidNet + ShakeDrop + Random Erasing gives an improvement of 0.25% on CIFAR-10 and 3.01% on CIFAR 100. Under these conditions, the final error rate is then 2.31% for CIFAR-10 and 12.19% for CIFAR-100.<br />
# Comparison with the state-of-the-arts, PyramidNet + ShakeDrop gives an improvement of 0.25% on CIFAR-10 than ResNeXt + Shake-Shake + Cutout, PyramidNet + ShakeDrop gives an improvement of 2.85% on CIFAR-100 than Coupled Ensemble.<br />
<br />
=Conclusion=<br />
<br />
This paper proposed a new stochastic regularization method, ShakeDrop, which outperforms previous state of the art methods while maintaining similar memory efficiency. It demonstrates that heavily perturbing a network can help to overcome issues with overfitting. It is also an effective way to regularize residual networks for image classification. The method was tested by CIFAR-10/100 and Tiny ImageNet datasets and showed great performance.<br />
<br />
=References=<br />
[Yamada et al., 2018] Yamada Y, Iwamura M, Kise K. ShakeDrop regularization. arXiv preprint arXiv:1802.02375. 2018 Feb 7.<br />
<br />
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.<br />
<br />
[Zagoruyko & Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proc. BMVC, 2016.<br />
<br />
[Han et al., 2017] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proc. CVPR, 2017a.<br />
<br />
[Xie et al., 2017] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 2017.<br />
<br />
[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382v3, 2016.<br />
<br />
[Gastaldi, 2017] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485v2, 2017.<br />
<br />
[Loshilov & Hutter, 2016] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
[DeVries & Taylor, 2017b] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.<br />
<br />
[Zhong et al., 2017] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.<br />
<br />
[Dutt et al., 2017] Anuvabh Dutt, Denis Pellerin, and Georges Qunot. Coupled ensembles of neural networks. arXiv preprint 1709.06053v1, 2017.<br />
<br />
[Veit et al., 2016] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems 29, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ShakeDrop_Regularization&diff=41780ShakeDrop Regularization2018-11-29T02:12:52Z<p>X832zhan: /* Proposed Method */</p>
<hr />
<div>=Introduction=<br />
Current state of the art techniques for object classification are deep neural networks based on the residual block, first published by (He et al., 2016). This technique has been the foundation of several improved networks, including Wide ResNet (Zagoruyko & Komodakis, 2016), PyramdNet (Han et al., 2017) and ResNeXt (Xie et al., 2017). They have been further improved by regularization, such as Stochastic Depth (ResDrop) (Huang et al., 2016) and Shake-Shake (Gastaldi, 2017), which can avoid some problem like vanishing gradients. Shake-Shake applied to ResNeXt has achieved one of the lowest error rates on the CIFAR-10 and CIFAR-100 datasets. However, it is only applicable to multi-branch architectures and is not memory efficient since it requires two branches of residual blocks to apply. To address this problem, ShakeDrop regularization that can realize a similar disturbance to Shake-Shake on a single residual block is proposed. Moreover, they use ResDrop to stabilize the learning process. This paper seeks to formulate a general expansion of Shake-Shake that can be applied to any residual block based network.<br />
<br />
=Existing Methods=<br />
<br />
'''Deep Approaches'''<br />
<br />
'''ResNet''', was the first use of residual blocks, a foundational feature in many modern state of the art convolution neural networks. They can be formulated as <math>G(x) = x + F(x)</math> where <math>x</math> and <math>G(x)</math> are the input and output of the residual block, and <math>F(x)</math> is the output of the residual branch on the residual block. A residual block typically performs a convolution operation and then passes the result plus its input onto the next block.<br />
<br />
[[File:ResidualBlock.png|600px|centre|thumb|An example of a simple residual block from Deep Residual Learning for Image Recognition by He et al., 2016]]<br />
<br />
ResNet is constructed out of a large number of these residual blocks sequentially stacked. It is interesting to note that having too many layers can cause overfitting, as pointed out by He et al. (2016) with the high error rates for the 1,202-layer ResNet on CIFAR datasets. Another paper (Veit et al., 2016) empirically showed that the cause of the high error rates can be mostly attributed to specific residual blocks whose channels increase greatly.<br />
<br />
'''PyramidNet''' is an important iteration that built on ResNet and WideResNet by gradually increasing channels on each residual block. The residual block is similar to those used in ResNet. It has been used to generate some of the first successful convolution neural networks with very large depth, at 272 layers. Amongst unmodified residual network architectures, it performs the best on the CIFAR datasets.<br />
<br />
[[File:ResidualBlockComparison.png|900px|centre|thumb|A simple illustration of different residual blocks from Deep Pyramidal Residual Networks by Han et al., 2017]]<br />
<br />
<br />
'''Non-Deep Approaches'''<br />
<br />
'''Wide ResNet''' modified ResNet by increasing channels in each layer, having a wider and shallower structure. Similarly to PyramidNet, this architecture avoids some of the pitfalls in the original formulation of ResNet.<br />
<br />
'''ResNeXt''' achieved performance beyond that of Wide ResNet with only a small increase in the number of parameters. It can be formulated as <math>G(x) = x + F_1(x)+F_2(x)</math>. In this case, <math>F_1(x)</math> and <math>F_2(x)</math> are the outputs of two paired convolution operations in a single residual block. The number of branches is not limited to 2, and will control the result of this network.<br />
<br />
<br />
[[File:SimplifiedResNeXt.png|600px|centre|thumb|Simplified ResNeXt Convolution Block. Yamada et al., 2018]]<br />
<br />
<br />
'''Regularization Methods'''<br />
<br />
'''Stochastic Depth''' helped address the issue of vanishing gradients in ResNet. It works by randomly dropping residual blocks. On the <math>l^{th}</math> residual block the Stochastic Depth process is given as <math>G(x)=x+b_lF(x)</math> where <math>b_l \in \{0,1\}</math> is a Bernoulli random variable with probability <math>p_l</math>. Using a constant value for <math>p_l</math> didn't work well, so instead a linear decay rule <math>p_l = 1 - \frac{l}{L}(1-p_L)</math> was used. In this equation, <math>L</math> is the number of layers, and <math>p_L</math> is the initial parameter. <br />
<br />
'''Shake-Shake''' is a regularization method that specifically improves the ResNeXt architecture. It can be given as <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math>, where <math>\alpha \in [0,1]</math> is a random coefficient. <math>\alpha</math> is used during the forward pass, and another identically distributed random parameter <math>\beta</math> is used in the backward pass. This caused one of the two paired convolution operations to be dropped, and further improved ResNeXt.<br />
<br />
[[File:Paper 32.jpg|600px|centre|thumb| Shake-Shake (ResNeXt + Shake-Shake) (Gastaldi, 2017), in which some processing layers omitted for conciseness.]]<br />
<br />
=Proposed Method=<br />
This paper seeks to generalize the method proposed in Shake-Shake to be applied to any residual structure network. Shake-Shake. The initial formulation of 1-branch shake is <math>G(x) = x + \alpha F(x)</math>. In this case, <math>\alpha</math> is a coefficient that disturbs the forward pass, but is not necessarily constrained to be [0,1]. Another corresponding coefficient <math>\beta</math> is used in the backwards pass. Applying this simple adaptation of Shake-Shake on a 110-layer version of PyramidNet with <math>\alpha \in [0,1]</math> and <math>\beta \in [0,1]</math> performs abysmally, with an error rate of 77.99%.<br />
<br />
This failure is a result of the setup causing too much perturbation. A trick is needed to promote learning with large perturbations, to preserve the regularization effect. The idea of the authors is to borrow from ResDrop and combine that with Shake-Shake. This works by randomly deciding whether to apply 1-branch shake. This creates in effect two networks, the original network without a regularization component, and a regularized network. When mixing up two networks, we expected the following effects: When the non regularized network is selected, learning is promoted; when the perturbed network is selected, learning is disturbed. Achieving good performance requires a balance between the two. <br />
<br />
'''ShakeDrop''' is given as <br />
<br />
<div align="center"><br />
<math>G(x) = x + (b_l + \alpha - b_l \alpha)F(x)</math>,<br />
</div><br />
<br />
where <math>b_l</math> is a Bernoulli random variable following the linear decay rule used in Stochastic Depth. An alternative presentation is <br />
<br />
<div align="center"><br />
<math><br />
G(x) = \begin{cases}<br />
x + F(x) ~~ \text{if } b_l = 1 \\<br />
x + \alpha F(x) ~~ \text{otherwise}<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
If <math>b_l = 1</math> then ShakeDrop is equivalent to the original network, otherwise it is the network + 1-branch Shake. The authors also found that the linear decay rule of ResDrop works well, compared with the uniform rule. Regardless of the value of <math>\beta</math> on the backwards pass, network weights will be updated.<br />
<br />
=Experiments=<br />
<br />
'''Parameter Search'''<br />
<br />
The authors experiments began with a hyperparameter search utilizing ShakeDrop on pyramidal networks. The PyramidNet used was made up of a total of 110 layers which included a convolutional layer and a final fully connected layer. It had 54 additive pyramidal residual blocks and the final residual block had 286 channels. The results of this search are presented below. <br />
<br />
[[File:ShakeDropHyperParameterSearch.png|600px|centre|thumb|Average Top-1 errors (%) of “PyramidNet + ShakeDrop” with several ranges of parameters of 4 runs at the final (300th) epoch on CIFAR-100 dataset in the “Batch” level. In some settings, it is equivalent to PyramidNet and PyramidDrop. Borrowed from ShakeDrop Regularization by Yamada et al., 2018.]]<br />
<br />
The setting that are used throughout the rest of the experiments are then <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>. Cases H and F outperform PyramidNet, suggesting that the strong perturbations imposed by ShakeDrop are functioning as intended. However, fully applying the perturbations in the backwards pass appears to destabilize the network, resulting in performance that is worse than standard PyramidNet.<br />
<br />
[[File:ParameterUpdateShakeDrop.png|400px|centre]]<br />
<br />
Following this initial parameter decision, the authors tested 4 different strategies for parameter update among "Batch" (same coefficients for all images in minibatch for each residual block), "Image" (same scaling coefficients for each image for each residual block), "Channel" (same scaling coefficients for each element for each residual block), and "Pixel" (same scaling coefficients for each element for each residual block). While Pixel was the best in terms of error rate, it is not very memory efficient, so Image was selected as it had the second best performance without the memory drawback.<br />
<br />
'''Comparison with Regularization Methods'''<br />
<br />
For these experiments, there are a few modifications that were made to assist with training. For ResNeXt, the EraseRelu formulation has each residual block ends in batch normalization. The Wide ResNet also is compared between vanilla with batch normalization and without. Batch normalization keeps the outputs of residual blocks in a certain range, as otherwise <math>\alpha</math> and <math>\beta</math> could cause perturbations that are too large, causing divergent learning. There is also a comparison of ResDrop/ShakeDrop Type A (where the regularization unit is inserted before the add unit for a residual branch) and after (where the regularization unit is inserted after the add unit for a residual branch). <br />
<br />
These experiments are performed on the CIFAR-100 dataset.<br />
<br />
[[File:ShakeDropArchitectureComparison1.png|800px|centre|thumb|]]<br />
<br />
[[File:ShakeDropArchitectureComparison2.png|800px|centre|thumb|]]<br />
<br />
[[File:ShakeDropArchitectureComparison3.png|800px|centre|thumb|]]<br />
<br />
For a final round of testing, the training setup was modified to incorporate other techniques used in state of the art methods. For most of the tests, the learning rate for the 300 epoch version started at 0.1 and decayed by a factor of 0.1 1/2 & 3/4 of the way through training. The alternative was cosine annealing, based on the presentation by Loshchilov and Hutter in their paper SGDR: Stochastic Gradient Descent with Warm Restarts. This is indicated in the Cos column, with a check indicating cosine annealing. <br />
<br />
[[File:CosineAnnealing.png|400px|centre|thumb|]]<br />
<br />
The Reg column indicates the regularization method used, either none, ResDrop (RD), Shake-Shake (SS), or ShakeDrop (SD). Fianlly, the Fil Column determines the type of data augmentation used, either none, cutout (CO) (DeVries & Taylor, 2017b), or Random Erasing (RE) (Zhong et al., 2017). <br />
<br />
[[File:ShakeDropComparison.png|800px|centre|thumb|Top-1 Errors (%) at final epoch on CIFAR-10/100 datasets]]<br />
<br />
'''State-of-the-Art Comparisons'''<br />
<br />
A direct comparison with state of the art methods is favorable for this new method. <br />
<br />
# Fair comparison of ResNeXt + Shake-Shake with PyramidNet + ShakeDrop gives an improvement of 0.19% on CIFAR-10 and 1.86% on CIFAR-100. Under these conditions, the final error rate is then 2.67% for CIFAR-10 and 13.99% for CIFAR-100.<br />
# Fair comparison of ResNeXt + Shake-Shake + Cutout with PyramidNet + ShakeDrop + Random Erasing gives an improvement of 0.25% on CIFAR-10 and 3.01% on CIFAR 100. Under these conditions, the final error rate is then 2.31% for CIFAR-10 and 12.19% for CIFAR-100.<br />
<br />
=Conclusion=<br />
<br />
This paper proposed a new stochastic regularization method, ShakeDrop, which outperforms previous state of the art methods while maintaining similar memory efficiency. It demonstrates that heavily perturbing a network can help to overcome issues with overfitting. It is also an effective way to regularize residual networks for image classification. The method was tested by CIFAR-10/100 and Tiny ImageNet datasets and showed great performance.<br />
<br />
=References=<br />
[Yamada et al., 2018] Yamada Y, Iwamura M, Kise K. ShakeDrop regularization. arXiv preprint arXiv:1802.02375. 2018 Feb 7.<br />
<br />
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.<br />
<br />
[Zagoruyko & Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proc. BMVC, 2016.<br />
<br />
[Han et al., 2017] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proc. CVPR, 2017a.<br />
<br />
[Xie et al., 2017] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 2017.<br />
<br />
[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382v3, 2016.<br />
<br />
[Gastaldi, 2017] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485v2, 2017.<br />
<br />
[Loshilov & Hutter, 2016] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
[DeVries & Taylor, 2017b] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.<br />
<br />
[Zhong et al., 2017] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.<br />
<br />
[Dutt et al., 2017] Anuvabh Dutt, Denis Pellerin, and Georges Qunot. Coupled ensembles of neural networks. arXiv preprint 1709.06053v1, 2017.<br />
<br />
[Veit et al., 2016] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems 29, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ShakeDrop_Regularization&diff=41779ShakeDrop Regularization2018-11-29T01:58:46Z<p>X832zhan: /* Introduction */</p>
<hr />
<div>=Introduction=<br />
Current state of the art techniques for object classification are deep neural networks based on the residual block, first published by (He et al., 2016). This technique has been the foundation of several improved networks, including Wide ResNet (Zagoruyko & Komodakis, 2016), PyramdNet (Han et al., 2017) and ResNeXt (Xie et al., 2017). They have been further improved by regularization, such as Stochastic Depth (ResDrop) (Huang et al., 2016) and Shake-Shake (Gastaldi, 2017), which can avoid some problem like vanishing gradients. Shake-Shake applied to ResNeXt has achieved one of the lowest error rates on the CIFAR-10 and CIFAR-100 datasets. However, it is only applicable to multi-branch architectures and is not memory efficient since it requires two branches of residual blocks to apply. To address this problem, ShakeDrop regularization that can realize a similar disturbance to Shake-Shake on a single residual block is proposed. Moreover, they use ResDrop to stabilize the learning process. This paper seeks to formulate a general expansion of Shake-Shake that can be applied to any residual block based network.<br />
<br />
=Existing Methods=<br />
<br />
'''Deep Approaches'''<br />
<br />
'''ResNet''', was the first use of residual blocks, a foundational feature in many modern state of the art convolution neural networks. They can be formulated as <math>G(x) = x + F(x)</math> where <math>x</math> and <math>G(x)</math> are the input and output of the residual block, and <math>F(x)</math> is the output of the residual branch on the residual block. A residual block typically performs a convolution operation and then passes the result plus its input onto the next block.<br />
<br />
[[File:ResidualBlock.png|600px|centre|thumb|An example of a simple residual block from Deep Residual Learning for Image Recognition by He et al., 2016]]<br />
<br />
ResNet is constructed out of a large number of these residual blocks sequentially stacked. It is interesting to note that having too many layers can cause overfitting, as pointed out by He et al. (2016) with the high error rates for the 1,202-layer ResNet on CIFAR datasets. Another paper (Veit et al., 2016) empirically showed that the cause of the high error rates can be mostly attributed to specific residual blocks whose channels increase greatly.<br />
<br />
'''PyramidNet''' is an important iteration that built on ResNet and WideResNet by gradually increasing channels on each residual block. The residual block is similar to those used in ResNet. It has been used to generate some of the first successful convolution neural networks with very large depth, at 272 layers. Amongst unmodified residual network architectures, it performs the best on the CIFAR datasets.<br />
<br />
[[File:ResidualBlockComparison.png|900px|centre|thumb|A simple illustration of different residual blocks from Deep Pyramidal Residual Networks by Han et al., 2017]]<br />
<br />
<br />
'''Non-Deep Approaches'''<br />
<br />
'''Wide ResNet''' modified ResNet by increasing channels in each layer, having a wider and shallower structure. Similarly to PyramidNet, this architecture avoids some of the pitfalls in the original formulation of ResNet.<br />
<br />
'''ResNeXt''' achieved performance beyond that of Wide ResNet with only a small increase in the number of parameters. It can be formulated as <math>G(x) = x + F_1(x)+F_2(x)</math>. In this case, <math>F_1(x)</math> and <math>F_2(x)</math> are the outputs of two paired convolution operations in a single residual block. The number of branches is not limited to 2, and will control the result of this network.<br />
<br />
<br />
[[File:SimplifiedResNeXt.png|600px|centre|thumb|Simplified ResNeXt Convolution Block. Yamada et al., 2018]]<br />
<br />
<br />
'''Regularization Methods'''<br />
<br />
'''Stochastic Depth''' helped address the issue of vanishing gradients in ResNet. It works by randomly dropping residual blocks. On the <math>l^{th}</math> residual block the Stochastic Depth process is given as <math>G(x)=x+b_lF(x)</math> where <math>b_l \in \{0,1\}</math> is a Bernoulli random variable with probability <math>p_l</math>. Using a constant value for <math>p_l</math> didn't work well, so instead a linear decay rule <math>p_l = 1 - \frac{l}{L}(1-p_L)</math> was used. In this equation, <math>L</math> is the number of layers, and <math>p_L</math> is the initial parameter. <br />
<br />
'''Shake-Shake''' is a regularization method that specifically improves the ResNeXt architecture. It can be given as <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math>, where <math>\alpha \in [0,1]</math> is a random coefficient. <math>\alpha</math> is used during the forward pass, and another identically distributed random parameter <math>\beta</math> is used in the backward pass. This caused one of the two paired convolution operations to be dropped, and further improved ResNeXt.<br />
<br />
[[File:Paper 32.jpg|600px|centre|thumb| Shake-Shake (ResNeXt + Shake-Shake) (Gastaldi, 2017), in which some processing layers omitted for conciseness.]]<br />
<br />
=Proposed Method=<br />
This paper seeks to generalize the method proposed in Shake-Shake to be applied to any residual structure network. Shake-Shake. The initial formulation of 1-branch shake is <math>G(x) = x + \alpha F(x)</math>. In this case, <math>\alpha</math> is a coefficient that disturbs the forward pass, but is not necessarily constrained to be [0,1]. Another corresponding coefficient <math>\beta</math> is used in the backwards pass. Applying this simple adaptation of Shake-Shake on a 110-layer version of PyramidNet with <math>\alpha \in [0,1]</math> and <math>\beta \in [0,1]</math> performs abysmally, with an error rate of 77.99%.<br />
<br />
This failure is a result of the setup causing too much perturbation. A trick is needed to promote learning with large perturbations, to preserve the regularization effect. The idea of the authors is to borrow from ResDrop and combine that with Shake-Shake. This works by randomly deciding whether to apply 1-branch shake. This creates in effect two networks, the original network without a regularization component, and a regularized network. When the non regularized network is selected, learning is promoted, when the perturbed network is selected, learning is disturbed. Achieving good performance requires a balance between the two. <br />
<br />
'''ShakeDrop''' is given as <br />
<br />
<div align="center"><br />
<math>G(x) = x + (b_l + \alpha - b_l \alpha)F(x)</math>,<br />
</div><br />
<br />
where <math>b_l</math> is a Bernoulli random variable following the linear decay rule used in Stochastic Depth. An alternative presentation is <br />
<br />
<div align="center"><br />
<math><br />
G(x) = \begin{cases}<br />
x + F(x) ~~ \text{if } b_l = 1 \\<br />
x + \alpha F(x) ~~ \text{otherwise}<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
If <math>b_l = 1</math> then ShakeDrop is equivalent to the original network, otherwise it is the network + 1-branch Shake. Regardless of the value of <math>\beta</math> on the backwards pass, network weights will be updated.<br />
<br />
=Experiments=<br />
<br />
'''Parameter Search'''<br />
<br />
The authors experiments began with a hyperparameter search utilizing ShakeDrop on pyramidal networks. The PyramidNet used was made up of a total of 110 layers which included a convolutional layer and a final fully connected layer. It had 54 additive pyramidal residual blocks and the final residual block had 286 channels. The results of this search are presented below. <br />
<br />
[[File:ShakeDropHyperParameterSearch.png|600px|centre|thumb|Average Top-1 errors (%) of “PyramidNet + ShakeDrop” with several ranges of parameters of 4 runs at the final (300th) epoch on CIFAR-100 dataset in the “Batch” level. In some settings, it is equivalent to PyramidNet and PyramidDrop. Borrowed from ShakeDrop Regularization by Yamada et al., 2018.]]<br />
<br />
The setting that are used throughout the rest of the experiments are then <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>. Cases H and F outperform PyramidNet, suggesting that the strong perturbations imposed by ShakeDrop are functioning as intended. However, fully applying the perturbations in the backwards pass appears to destabilize the network, resulting in performance that is worse than standard PyramidNet.<br />
<br />
[[File:ParameterUpdateShakeDrop.png|400px|centre]]<br />
<br />
Following this initial parameter decision, the authors tested 4 different strategies for parameter update among "Batch" (same coefficients for all images in minibatch for each residual block), "Image" (same scaling coefficients for each image for each residual block), "Channel" (same scaling coefficients for each element for each residual block), and "Pixel" (same scaling coefficients for each element for each residual block). While Pixel was the best in terms of error rate, it is not very memory efficient, so Image was selected as it had the second best performance without the memory drawback.<br />
<br />
'''Comparison with Regularization Methods'''<br />
<br />
For these experiments, there are a few modifications that were made to assist with training. For ResNeXt, the EraseRelu formulation has each residual block ends in batch normalization. The Wide ResNet also is compared between vanilla with batch normalization and without. Batch normalization keeps the outputs of residual blocks in a certain range, as otherwise <math>\alpha</math> and <math>\beta</math> could cause perturbations that are too large, causing divergent learning. There is also a comparison of ResDrop/ShakeDrop Type A (where the regularization unit is inserted before the add unit for a residual branch) and after (where the regularization unit is inserted after the add unit for a residual branch). <br />
<br />
These experiments are performed on the CIFAR-100 dataset.<br />
<br />
[[File:ShakeDropArchitectureComparison1.png|800px|centre|thumb|]]<br />
<br />
[[File:ShakeDropArchitectureComparison2.png|800px|centre|thumb|]]<br />
<br />
[[File:ShakeDropArchitectureComparison3.png|800px|centre|thumb|]]<br />
<br />
For a final round of testing, the training setup was modified to incorporate other techniques used in state of the art methods. For most of the tests, the learning rate for the 300 epoch version started at 0.1 and decayed by a factor of 0.1 1/2 & 3/4 of the way through training. The alternative was cosine annealing, based on the presentation by Loshchilov and Hutter in their paper SGDR: Stochastic Gradient Descent with Warm Restarts. This is indicated in the Cos column, with a check indicating cosine annealing. <br />
<br />
[[File:CosineAnnealing.png|400px|centre|thumb|]]<br />
<br />
The Reg column indicates the regularization method used, either none, ResDrop (RD), Shake-Shake (SS), or ShakeDrop (SD). Fianlly, the Fil Column determines the type of data augmentation used, either none, cutout (CO) (DeVries & Taylor, 2017b), or Random Erasing (RE) (Zhong et al., 2017). <br />
<br />
[[File:ShakeDropComparison.png|800px|centre|thumb|Top-1 Errors (%) at final epoch on CIFAR-10/100 datasets]]<br />
<br />
'''State-of-the-Art Comparisons'''<br />
<br />
A direct comparison with state of the art methods is favorable for this new method. <br />
<br />
# Fair comparison of ResNeXt + Shake-Shake with PyramidNet + ShakeDrop gives an improvement of 0.19% on CIFAR-10 and 1.86% on CIFAR-100. Under these conditions, the final error rate is then 2.67% for CIFAR-10 and 13.99% for CIFAR-100.<br />
# Fair comparison of ResNeXt + Shake-Shake + Cutout with PyramidNet + ShakeDrop + Random Erasing gives an improvement of 0.25% on CIFAR-10 and 3.01% on CIFAR 100. Under these conditions, the final error rate is then 2.31% for CIFAR-10 and 12.19% for CIFAR-100.<br />
<br />
=Conclusion=<br />
<br />
This paper proposed a new stochastic regularization method, ShakeDrop, which outperforms previous state of the art methods while maintaining similar memory efficiency. It demonstrates that heavily perturbing a network can help to overcome issues with overfitting. It is also an effective way to regularize residual networks for image classification. The method was tested by CIFAR-10/100 and Tiny ImageNet datasets and showed great performance.<br />
<br />
=References=<br />
[Yamada et al., 2018] Yamada Y, Iwamura M, Kise K. ShakeDrop regularization. arXiv preprint arXiv:1802.02375. 2018 Feb 7.<br />
<br />
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.<br />
<br />
[Zagoruyko & Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proc. BMVC, 2016.<br />
<br />
[Han et al., 2017] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proc. CVPR, 2017a.<br />
<br />
[Xie et al., 2017] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 2017.<br />
<br />
[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382v3, 2016.<br />
<br />
[Gastaldi, 2017] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485v2, 2017.<br />
<br />
[Loshilov & Hutter, 2016] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.<br />
<br />
[DeVries & Taylor, 2017b] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.<br />
<br />
[Zhong et al., 2017] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.<br />
<br />
[Dutt et al., 2017] Anuvabh Dutt, Denis Pellerin, and Georges Qunot. Coupled ensembles of neural networks. arXiv preprint 1709.06053v1, 2017.<br />
<br />
[Veit et al., 2016] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems 29, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Autoregressive_Convolutional_Neural_Networks_for_Asynchronous_Time_Series&diff=41127stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series2018-11-23T06:04:13Z<p>X832zhan: /* Gating and weighting mechanisms */</p>
<hr />
<div>This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018.<br />
<br />
=Introduction=<br />
In this paper, the authors proposed a deep convolutional network architecture called Significance-Offset Convolutional Neural Network, for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive(AR) models and gating systems used in recurrent neural networks, and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series and UCI household electricity consumption dataset. This paper focused on time series with multivariate and noisy signals, especially the financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, same signal (e.g. price of stock) is obtained from different sources (e.g. financial news, investment bank, financial analyst etc.) in asynchronous moment of time. Each source has different different bias and noise.(Figure 1) The investment bank with more clients can update their information more precisely than the investment bank with fewer clients, then the significance of each past observations may depend on other factors that changes in time. Therefore, the traditional econometric models such as AR, VAR, VARMA[1] might not be sufficient. However, their relatively good performance could allow us to combine such linear econometric models with deep neural networks that can learn highly nonlinear relationships.<br />
<br />
The time series forecasting problem can be expressed as a conditional probability distribution below, we focused on modeling the predictors of future values of time series given their past: <br />
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div><br />
The predictability of financial dataset still remains an open problem and is discussed in various publications. ([2])<br />
<br />
[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same CDS2 throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]<br />
<br />
=Related Work=<br />
===Time series forecasting===<br />
From recent proceedings in main machine learning venues i.e. ICML, NIPS, AISTATS, UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. For this paper, the authors use coupling AR models and neural networks to achieve such combined models.<br />
<br />
Although deep neural networks have been applied into many fields and produced satisfactory results, there still are little literature on deep learning for time series forecasting. More recently, the papers include Sirignano (2016)[7] that used 4-layer perceptrons in modeling price change distributions in Limit Order Books, and Borovykh et al. (2017)[8] who applied more recent WaveNet architecture to several short univariate and bivariate time-series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented augmentation of LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through time gate. <br />
<br />
In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and phase LSTM) on AR-like artificial asynchronous and noisy time series, household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.<br />
<br />
===Gating and weighting mechanisms===<br />
Gating mechanisms for neural networks has ability to overcome the problem of vanishing gradient, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. This composition of functions may lead to popular recurrent architecture such as LSTM and GRU[11].<br />
<br />
The idea of the gating system is aimed to weight outputs of the intermediate layers within neural networks, and is most closely related to softmax gating used in MuFuRu(Multi-Function Recurrent Unit)[12], i.e.<br />
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x), p(x)=softmax(\widehat{p}(x)), </math>, where <math>(f^l)_{l=1}^L </math>are candidate outputs(composition operators in MuFuRu), <math>(\widehat{p}^l)_{l=1}^L </math>are linear functions of inputs. <br />
<br />
This idea is also successfully used in attention networks[13] such as image captioning and machine translation. In this paper, the method is similar as this. The difference is that modelling the functions as multi-layer CNNs. Another difference is that not using recurrent layers, which can enable the network to remember the parts of the sentence/image already translated/described.<br />
<br />
=Motivation=<br />
There are mainly five motivations they stated in the paper:<br />
#The forecasting problem in this paper has done almost independently by econometrics and machine learning communities. Unlike in machine learning, research in econometrics are more likely to explain variables rather than improving out-of-sample prediction power. These models tend to 'over-fit' on financial time series, their parameters are unstable and have poor performance on out-of-sample prediction.<br />
#Although Gaussian processes provide useful theoretical framework that is able to handle asynchronous data, they often follow heavy-tailed distribution for financial datasets.<br />
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series with higher order and have more past observations, the expectation of it <math display="inline">\mathbb{E}[X(t)|{X(t-m), m=1,...,M}]</math> may involve more complicated functions that in general may not allow closed-form expression.<br />
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously, such series at fixed frequency may lead to lose information or enlarge the dataset, which is shown in Figure 2(a). Therefore, the core of proposed architecture SOCNN represents separate dimensions as a single one with dimension and duration indicators as additional features(Figure 2(b)).<br />
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that LSTM may memorize the input values in each step and weight them at the output according to the durations, but this approach may lead to imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.<br />
<br />
[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keep- ing all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]<br />
<br />
<br />
=Model Architecture=<br />
Suppose there's a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>, we want to predict the conditional future values of a subset of elements of <math>x_n</math><br />
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | {x_{n-m}, m=1,2,...}], </math></div><br />
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.<br />
Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>. The estimator of <math>y_n</math> can be expressed as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m ,</math></div><br />
This is summation of the columns of the matrix in bracket, where<br />
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks. S is a fully convolutional network which is composed of convolutional layers only. <math>F</math> is in the form of<br />
<math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [off(x_{n-m}) + x_{n-m}^I)]_{m=1}^M </math> where <math> W \in \mathbb{R}^{d_I \times M}</math> and <math> off: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.<br />
#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T,...,a_{d_I}^T)^T)=(\sigma(a_1)^T,...\sigma(a_{d_I})^T)^T </math><br />
# <math>\otimes</math> is element-wise matrix multiplication.<br />
#<math>A.,_m</math> denotes the m-th column of a matrix A, and <math>\sum_{m=1}^M A.,_m=A(1,1,...,1)^T</math>.<br />
Since <math>\sum_{m=1}^M W.,_m=W(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S.,_m=S(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (off(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div><br />
This is the proposed network, Significance-Offset Convolutional Neural Network, <math>off</math> and <math>S</math> in the equation are corresponding to Offset and Significance in the name respectively.<br />
Figure 3 shows the scheme of network.<br />
<br />
[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]<br />
<br />
The form of <math>\hat{y}_n</math> forced to separate the temporal dependence (obtained in weights <math>W_m</math>). S is determined by its filters which capture local dependencies and are independent of the relative position in time, the predictors <math>off(x_{n-m})</math> are completely independent of position in time. An adjusted single regressor for the target variable is provided by each past observation through the offset network. Since in asynchronous sampling procedure, consecutive values of x come from different signals, and might be heterogenous, therefore adjustment of offset network is important.In addition, significance network provides data-dependent weight for each regressor and sums them up in an autoregressive manner.<br />
<br />
===Relation to asynchronous data===<br />
One common problem of time series is that durations are varying between consecutive observations, the paper states two ways to solve this problem<br />
#Data preprocessing: aligning the observations at some fixed frequency e.g. duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach will tend to loss of information and enlarge the size of the dataset and model complexity.<br />
#Add additional features: Treating duration or time of the observations as additional features, it is the core of SOCNN, which is shown in Figure 2(b).<br />
<br />
===Loss function===<br />
The output of the offset network is series of separate predictors of changes between corresponding observations <math>x_{n-m}^I</math> and the target value<math>y_n</math>, this is the reason why we use auxiliary loss function, which equals to mean squared error of such intermediate predictions:<br />
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||off(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div><br />
The total loss for the sample <math> \textbf{x}_n^{-M},y_n) </math> is then given by:<br />
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div><br />
where <math>\widehat{y}_n</math> was mentioned before, <math>\alpha \geq 0</math> is a constant.<br />
<br />
=Experiments=<br />
The paper evaluated SOCNN architecture on three datasets: artificial generated datasets, [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes sent by several market participants active in the credit derivatives market. Comparing its performance with simple CNN, single and multiplayer LSTM and 25-layer ResNet. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here]<br />
<br />
==Datasets==<br />
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.<br />
<br />
Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of 7 features, while durations between consecutive observations are ranged from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.<br />
<br />
Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction we want to predict the next quoted price from this given source and direction considering the last 60 quotes.<br />
<br />
==Training details==<br />
They applied grid search on some hyperparameters in order to get the significance of its components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For offset sub-network's depth, they use 1, 10,1 for artificial, electricity and quotes dataset respectively; and they compared the values of <math>\alpha</math> in {0,0.1,0.01}.<br />
<br />
They chose LeakyReLU as activation function for all networks:<br />
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div><br />
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.<br />
<br />
Table 1 presents the configuration of network hyperparameters used in comparison<br />
<br />
[[File:Junyi4.png | 400px|center|]]<br />
<br />
===Network Training===<br />
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio 3 to 1. The remaining 20% of data was used as a test set.<br />
<br />
All models were trained using Adam optimizer, because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.<br />
<br />
They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization in between each convolution and the following activation. <br />
<br />
At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.<br />
<br />
Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]<br />
<br />
The authors carried out the experiments on Tensorflow and Keras and used different GPU to optimize the model for different datasets.<br />
<br />
==Results==<br />
Table 2 shows all results performed from all datasets.<br />
[[File:Junyi5.png | 600px|center|]]<br />
We can see that SOCNN outperforms in all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN almost has the same results with LSTM. Phased LSTM and ResNet have performed really bad on artificial asynchronous dataset and quotes dataset respectively. Notice that having more than one layer of offset network would have negative impact on results. Also, the higher weights of auxiliary loss(<math>\alpha</math>considerably improved the test error on asynchronous dataset, see Table 3. However, for other datasets, its impact was negligible.<br />
[[File:Junyi6.png | 400px|center|]]<br />
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.<br />
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]<br />
<br />
Finally, we want to test the robustness of the proposed model SOCNN, adding noise terms to asynchronous 16 dataset and check how these networks perform. The result is shown in Figure 6.<br />
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]<br />
From Figure 6, the purple line and green line seems staying at the same position in training and testing process. SOCNN and single-layer LSTM are most robust compared to other networks, and least prone to overfitting.<br />
<br />
=Conclusion and Discussion=<br />
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines AR-like weighting mechanism and convolutional neural network. This new architecture is designed for high-noise asynchronous time series, and achieves outperformance in forecasting several asynchronous time series compared to popular convolutional and recurrent networks. <br />
<br />
The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension but needs further empirical studies is that we consider not just <math>1 \times 1</math> convolutional kernels on the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets.<br />
<br />
=Critiques=<br />
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.<br />
#The quote data cannot be reached, only two datasets available.<br />
#The 'Significance' network was described as critical to the model in paper, but they did not show how the performance of SOCNN with respect to the significance network.<br />
#The transform of the original data to asynchronous data is not clear.<br />
#The experiments on the main application are not reproducible because the data is proprietary.<br />
<br />
=Reference=<br />
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994. <br />
<br />
[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.<br />
<br />
[3] Petelin, D., Sˇindela ́ˇr, J., Pˇrikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.<br />
<br />
[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.<br />
<br />
[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.<br />
<br />
[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.<br />
<br />
[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.<br />
<br />
[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Condi- tional time series forecasting with convolutional neural networks, March 2017.<br />
<br />
[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learn- ing in finance, February 2016.<br />
<br />
[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Acceler- ating recurrent network training for long or event-based sequences. In Advances In Neural Information Process- ing Systems, pp. 3882–3890, 2016.<br />
<br />
[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Em- pirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.<br />
<br />
[12] Weissenborn, D. and Rockta ̈schel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.<br />
<br />
[13] Cho, K., Courville, A., and Bengio, Y. Describing multi- media content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.<br />
<br />
[14] Glorot, X. and Bengio, Y. Understanding the dif- ficulty of training deep feedforward neural net- works. In In Proceedings of the International Con- ference on Artificial Intelligence and Statistics (AIS- TATSaˆ10). Society for Artificial Intelligence and Statistics, 2010.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Autoregressive_Convolutional_Neural_Networks_for_Asynchronous_Time_Series&diff=41122stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series2018-11-23T05:55:45Z<p>X832zhan: /* Time series forecasting */</p>
<hr />
<div>This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018.<br />
<br />
=Introduction=<br />
In this paper, the authors proposed a deep convolutional network architecture called Significance-Offset Convolutional Neural Network, for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive(AR) models and gating systems used in recurrent neural networks, and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series and UCI household electricity consumption dataset. This paper focused on time series with multivariate and noisy signals, especially the financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, same signal (e.g. price of stock) is obtained from different sources (e.g. financial news, investment bank, financial analyst etc.) in asynchronous moment of time. Each source has different different bias and noise.(Figure 1) The investment bank with more clients can update their information more precisely than the investment bank with fewer clients, then the significance of each past observations may depend on other factors that changes in time. Therefore, the traditional econometric models such as AR, VAR, VARMA[1] might not be sufficient. However, their relatively good performance could allow us to combine such linear econometric models with deep neural networks that can learn highly nonlinear relationships.<br />
<br />
The time series forecasting problem can be expressed as a conditional probability distribution below, we focused on modeling the predictors of future values of time series given their past: <br />
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div><br />
The predictability of financial dataset still remains an open problem and is discussed in various publications. ([2])<br />
<br />
[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same CDS2 throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]<br />
<br />
=Related Work=<br />
===Time series forecasting===<br />
From recent proceedings in main machine learning venues i.e. ICML, NIPS, AISTATS, UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. For this paper, the authors use coupling AR models and neural networks to achieve such combined models.<br />
<br />
Although deep neural networks have been applied into many fields and produced satisfactory results, there still are little literature on deep learning for time series forecasting. More recently, the papers include Sirignano (2016)[7] that used 4-layer perceptrons in modeling price change distributions in Limit Order Books, and Borovykh et al. (2017)[8] who applied more recent WaveNet architecture to several short univariate and bivariate time-series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented augmentation of LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through time gate. <br />
<br />
In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and phase LSTM) on AR-like artificial asynchronous and noisy time series, household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.<br />
<br />
===Gating and weighting mechanisms===<br />
Gating mechanisms for neural networks has ability to overcome the problem of vanishing gradient, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. This composition of functions may lead to popular recurrent architecture such as LSTM and GRU[11].<br />
<br />
The idea of the gating system is aimed to weight outputs of the intermediate layers within neural networks, and is most closely related to softmax gating used in MuFuRu(Multi-Function Recurrent Unit)[12], i.e.<br />
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x), p(x)=softmax(\widehat{p}(x)), </math>, where <math>(f^l)_{l=1}^L </math>are candidate outputs(composition operators in MuFuRu), <math>(\widehat{p}^l)_{l=1}^L </math>are linear functions of inputs. This idea is also used in attention networks[13] such as image captioning and machine translation.<br />
<br />
=Motivation=<br />
There are mainly five motivations they stated in the paper:<br />
#The forecasting problem in this paper has done almost independently by econometrics and machine learning communities. Unlike in machine learning, research in econometrics are more likely to explain variables rather than improving out-of-sample prediction power. These models tend to 'over-fit' on financial time series, their parameters are unstable and have poor performance on out-of-sample prediction.<br />
#Although Gaussian processes provide useful theoretical framework that is able to handle asynchronous data, they often follow heavy-tailed distribution for financial datasets.<br />
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series with higher order and have more past observations, the expectation of it <math display="inline">\mathbb{E}[X(t)|{X(t-m), m=1,...,M}]</math> may involve more complicated functions that in general may not allow closed-form expression.<br />
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously, such series at fixed frequency may lead to lose information or enlarge the dataset, which is shown in Figure 2(a). Therefore, the core of proposed architecture SOCNN represents separate dimensions as a single one with dimension and duration indicators as additional features(Figure 2(b)).<br />
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that LSTM may memorize the input values in each step and weight them at the output according to the durations, but this approach may lead to imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.<br />
<br />
[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keep- ing all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]<br />
<br />
<br />
=Model Architecture=<br />
Suppose there's a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>, we want to predict the conditional future values of a subset of elements of <math>x_n</math><br />
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | {x_{n-m}, m=1,2,...}], </math></div><br />
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.<br />
Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>. The estimator of <math>y_n</math> can be expressed as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m ,</math></div><br />
This is summation of the columns of the matrix in bracket, where<br />
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks. S is a fully convolutional network which is composed of convolutional layers only. <math>F</math> is in the form of<br />
<math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [off(x_{n-m}) + x_{n-m}^I)]_{m=1}^M </math> where <math> W \in \mathbb{R}^{d_I \times M}</math> and <math> off: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.<br />
#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T,...,a_{d_I}^T)^T)=(\sigma(a_1)^T,...\sigma(a_{d_I})^T)^T </math><br />
# <math>\otimes</math> is element-wise matrix multiplication.<br />
#<math>A.,_m</math> denotes the m-th column of a matrix A, and <math>\sum_{m=1}^M A.,_m=A(1,1,...,1)^T</math>.<br />
Since <math>\sum_{m=1}^M W.,_m=W(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S.,_m=S(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (off(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div><br />
This is the proposed network, Significance-Offset Convolutional Neural Network, <math>off</math> and <math>S</math> in the equation are corresponding to Offset and Significance in the name respectively.<br />
Figure 3 shows the scheme of network.<br />
<br />
[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]<br />
<br />
The form of <math>\hat{y}_n</math> forced to separate the temporal dependence (obtained in weights <math>W_m</math>). S is determined by its filters which capture local dependencies and are independent of the relative position in time, the predictors <math>off(x_{n-m})</math> are completely independent of position in time. An adjusted single regressor for the target variable is provided by each past observation through the offset network. Since in asynchronous sampling procedure, consecutive values of x come from different signals, and might be heterogenous, therefore adjustment of offset network is important.In addition, significance network provides data-dependent weight for each regressor and sums them up in an autoregressive manner.<br />
<br />
===Relation to asynchronous data===<br />
One common problem of time series is that durations are varying between consecutive observations, the paper states two ways to solve this problem<br />
#Data preprocessing: aligning the observations at some fixed frequency e.g. duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach will tend to loss of information and enlarge the size of the dataset and model complexity.<br />
#Add additional features: Treating duration or time of the observations as additional features, it is the core of SOCNN, which is shown in Figure 2(b).<br />
<br />
===Loss function===<br />
The output of the offset network is series of separate predictors of changes between corresponding observations <math>x_{n-m}^I</math> and the target value<math>y_n</math>, this is the reason why we use auxiliary loss function, which equals to mean squared error of such intermediate predictions:<br />
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||off(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div><br />
The total loss for the sample <math> \textbf{x}_n^{-M},y_n) </math> is then given by:<br />
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div><br />
where <math>\widehat{y}_n</math> was mentioned before, <math>\alpha \geq 0</math> is a constant.<br />
<br />
=Experiments=<br />
The paper evaluated SOCNN architecture on three datasets: artificial generated datasets, [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes sent by several market participants active in the credit derivatives market. Comparing its performance with simple CNN, single and multiplayer LSTM and 25-layer ResNet. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here]<br />
<br />
==Datasets==<br />
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.<br />
<br />
Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of 7 features, while durations between consecutive observations are ranged from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.<br />
<br />
Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction we want to predict the next quoted price from this given source and direction considering the last 60 quotes.<br />
<br />
==Training details==<br />
They applied grid search on some hyperparameters in order to get the significance of its components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For offset sub-network's depth, they use 1, 10,1 for artificial, electricity and quotes dataset respectively; and they compared the values of <math>\alpha</math> in {0,0.1,0.01}.<br />
<br />
They chose LeakyReLU as activation function for all networks:<br />
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div><br />
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.<br />
<br />
Table 1 presents the configuration of network hyperparameters used in comparison<br />
<br />
[[File:Junyi4.png | 400px|center|]]<br />
<br />
===Network Training===<br />
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio 3 to 1. The remaining 20% of data was used as a test set.<br />
<br />
All models were trained using Adam optimizer, because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.<br />
<br />
They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization in between each convolution and the following activation. <br />
<br />
At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.<br />
<br />
Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]<br />
<br />
The authors carried out the experiments on Tensorflow and Keras and used different GPU to optimize the model for different datasets.<br />
<br />
==Results==<br />
Table 2 shows all results performed from all datasets.<br />
[[File:Junyi5.png | 600px|center|]]<br />
We can see that SOCNN outperforms in all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN almost has the same results with LSTM. Phased LSTM and ResNet have performed really bad on artificial asynchronous dataset and quotes dataset respectively. Notice that having more than one layer of offset network would have negative impact on results. Also, the higher weights of auxiliary loss(<math>\alpha</math>considerably improved the test error on asynchronous dataset, see Table 3. However, for other datasets, its impact was negligible.<br />
[[File:Junyi6.png | 400px|center|]]<br />
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.<br />
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]<br />
<br />
Finally, we want to test the robustness of the proposed model SOCNN, adding noise terms to asynchronous 16 dataset and check how these networks perform. The result is shown in Figure 6.<br />
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]<br />
From Figure 6, the purple line and green line seems staying at the same position in training and testing process. SOCNN and single-layer LSTM are most robust compared to other networks, and least prone to overfitting.<br />
<br />
=Conclusion and Discussion=<br />
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines AR-like weighting mechanism and convolutional neural network. This new architecture is designed for high-noise asynchronous time series, and achieves outperformance in forecasting several asynchronous time series compared to popular convolutional and recurrent networks. <br />
<br />
The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension but needs further empirical studies is that we consider not just <math>1 \times 1</math> convolutional kernels on the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets.<br />
<br />
=Critiques=<br />
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.<br />
#The quote data cannot be reached, only two datasets available.<br />
#The 'Significance' network was described as critical to the model in paper, but they did not show how the performance of SOCNN with respect to the significance network.<br />
#The transform of the original data to asynchronous data is not clear.<br />
#The experiments on the main application are not reproducible because the data is proprietary.<br />
<br />
=Reference=<br />
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994. <br />
<br />
[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.<br />
<br />
[3] Petelin, D., Sˇindela ́ˇr, J., Pˇrikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.<br />
<br />
[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.<br />
<br />
[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.<br />
<br />
[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.<br />
<br />
[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.<br />
<br />
[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Condi- tional time series forecasting with convolutional neural networks, March 2017.<br />
<br />
[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learn- ing in finance, February 2016.<br />
<br />
[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Acceler- ating recurrent network training for long or event-based sequences. In Advances In Neural Information Process- ing Systems, pp. 3882–3890, 2016.<br />
<br />
[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Em- pirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.<br />
<br />
[12] Weissenborn, D. and Rockta ̈schel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.<br />
<br />
[13] Cho, K., Courville, A., and Bengio, Y. Describing multi- media content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.<br />
<br />
[14] Glorot, X. and Bengio, Y. Understanding the dif- ficulty of training deep feedforward neural net- works. In In Proceedings of the International Con- ference on Artificial Intelligence and Statistics (AIS- TATSaˆ10). Society for Artificial Intelligence and Statistics, 2010.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Autoregressive_Convolutional_Neural_Networks_for_Asynchronous_Time_Series&diff=41121stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series2018-11-23T05:40:52Z<p>X832zhan: /* Network Training */</p>
<hr />
<div>This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018.<br />
<br />
=Introduction=<br />
In this paper, the authors proposed a deep convolutional network architecture called Significance-Offset Convolutional Neural Network, for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive(AR) models and gating systems used in recurrent neural networks, and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series and UCI household electricity consumption dataset. This paper focused on time series with multivariate and noisy signals, especially the financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, same signal (e.g. price of stock) is obtained from different sources (e.g. financial news, investment bank, financial analyst etc.) in asynchronous moment of time. Each source has different different bias and noise.(Figure 1) The investment bank with more clients can update their information more precisely than the investment bank with fewer clients, then the significance of each past observations may depend on other factors that changes in time. Therefore, the traditional econometric models such as AR, VAR, VARMA[1] might not be sufficient. However, their relatively good performance could allow us to combine such linear econometric models with deep neural networks that can learn highly nonlinear relationships.<br />
<br />
The time series forecasting problem can be expressed as a conditional probability distribution below, we focused on modeling the predictors of future values of time series given their past: <br />
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div><br />
The predictability of financial dataset still remains an open problem and is discussed in various publications. ([2])<br />
<br />
[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same CDS2 throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]<br />
<br />
=Related Work=<br />
===Time series forecasting===<br />
From recent proceedings in main machine learning venues i.e. ICML, NIPS, AISTATS, UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6].<br />
<br />
More recently, the papers include Sirignano (2016)[7] that used 4-layer perceptrons in modeling price change distributions in Limit Order Books, and Borovykh et al. (2017)[8] who applied more recent WaveNet architecture to several short univariate and bivariate time-series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented augmentation of LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through time gate.<br />
<br />
===Gating and weighting mechanisms===<br />
Gating mechanisms for neural networks has ability to overcome the problem of vanishing gradient, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. This composition of functions may lead to popular recurrent architecture such as LSTM and GRU[11].<br />
<br />
The idea of the gating system is aimed to weight outputs of the intermediate layers within neural networks, and is most closely related to softmax gating used in MuFuRu(Multi-Function Recurrent Unit)[12], i.e.<br />
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x), p(x)=softmax(\widehat{p}(x)), </math>, where <math>(f^l)_{l=1}^L </math>are candidate outputs(composition operators in MuFuRu), <math>(\widehat{p}^l)_{l=1}^L </math>are linear functions of inputs. This idea is also used in attention networks[13] such as image captioning and machine translation.<br />
<br />
=Motivation=<br />
There are mainly five motivations they stated in the paper:<br />
#The forecasting problem in this paper has done almost independently by econometrics and machine learning communities. Unlike in machine learning, research in econometrics are more likely to explain variables rather than improving out-of-sample prediction power. These models tend to 'over-fit' on financial time series, their parameters are unstable and have poor performance on out-of-sample prediction.<br />
#Although Gaussian processes provide useful theoretical framework that is able to handle asynchronous data, they often follow heavy-tailed distribution for financial datasets.<br />
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series with higher order and have more past observations, the expectation of it <math display="inline">\mathbb{E}[X(t)|{X(t-m), m=1,...,M}]</math> may involve more complicated functions that in general may not allow closed-form expression.<br />
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously, such series at fixed frequency may lead to lose information or enlarge the dataset, which is shown in Figure 2(a). Therefore, the core of proposed architecture SOCNN represents separate dimensions as a single one with dimension and duration indicators as additional features(Figure 2(b)).<br />
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that LSTM may memorize the input values in each step and weight them at the output according to the durations, but this approach may lead to imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.<br />
<br />
[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keep- ing all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]<br />
<br />
<br />
=Model Architecture=<br />
Suppose there's a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>, we want to predict the conditional future values of a subset of elements of <math>x_n</math><br />
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | {x_{n-m}, m=1,2,...}], </math></div><br />
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.<br />
Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>. The estimator of <math>y_n</math> can be expressed as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m ,</math></div><br />
This is summation of the columns of the matrix in bracket, where<br />
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks. S is a fully convolutional network which is composed of convolutional layers only. <math>F</math> is in the form of<br />
<math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [off(x_{n-m}) + x_{n-m}^I)]_{m=1}^M </math> where <math> W \in \mathbb{R}^{d_I \times M}</math> and <math> off: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.<br />
#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T,...,a_{d_I}^T)^T)=(\sigma(a_1)^T,...\sigma(a_{d_I})^T)^T </math><br />
# <math>\otimes</math> is element-wise matrix multiplication.<br />
#<math>A.,_m</math> denotes the m-th column of a matrix A, and <math>\sum_{m=1}^M A.,_m=A(1,1,...,1)^T</math>.<br />
Since <math>\sum_{m=1}^M W.,_m=W(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S.,_m=S(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:<br />
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (off(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div><br />
This is the proposed network, Significance-Offset Convolutional Neural Network, <math>off</math> and <math>S</math> in the equation are corresponding to Offset and Significance in the name respectively.<br />
Figure 3 shows the scheme of network.<br />
<br />
[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]<br />
<br />
The form of <math>\hat{y}_n</math> forced to separate the temporal dependence (obtained in weights <math>W_m</math>). S is determined by its filters which capture local dependencies and are independent of the relative position in time, the predictors <math>off(x_{n-m})</math> are completely independent of position in time. An adjusted single regressor for the target variable is provided by each past observation through the offset network. Since in asynchronous sampling procedure, consecutive values of x come from different signals, and might be heterogenous, therefore adjustment of offset network is important.In addition, significance network provides data-dependent weight for each regressor and sums them up in an autoregressive manner.<br />
<br />
===Relation to asynchronous data===<br />
One common problem of time series is that durations are varying between consecutive observations, the paper states two ways to solve this problem<br />
#Data preprocessing: aligning the observations at some fixed frequency e.g. duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach will tend to loss of information and enlarge the size of the dataset and model complexity.<br />
#Add additional features: Treating duration or time of the observations as additional features, it is the core of SOCNN, which is shown in Figure 2(b).<br />
<br />
===Loss function===<br />
The output of the offset network is series of separate predictors of changes between corresponding observations <math>x_{n-m}^I</math> and the target value<math>y_n</math>, this is the reason why we use auxiliary loss function, which equals to mean squared error of such intermediate predictions:<br />
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||off(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div><br />
The total loss for the sample <math> \textbf{x}_n^{-M},y_n) </math> is then given by:<br />
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div><br />
where <math>\widehat{y}_n</math> was mentioned before, <math>\alpha \geq 0</math> is a constant.<br />
<br />
=Experiments=<br />
The paper evaluated SOCNN architecture on three datasets: artificial generated datasets, [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes sent by several market participants active in the credit derivatives market. Comparing its performance with simple CNN, single and multiplayer LSTM and 25-layer ResNet. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here]<br />
<br />
==Datasets==<br />
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.<br />
<br />
Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of 7 features, while durations between consecutive observations are ranged from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.<br />
<br />
Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction we want to predict the next quoted price from this given source and direction considering the last 60 quotes.<br />
<br />
==Training details==<br />
They applied grid search on some hyperparameters in order to get the significance of its components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For offset sub-network's depth, they use 1, 10,1 for artificial, electricity and quotes dataset respectively; and they compared the values of <math>\alpha</math> in {0,0.1,0.01}.<br />
<br />
They chose LeakyReLU as activation function for all networks:<br />
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div><br />
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.<br />
<br />
Table 1 presents the configuration of network hyperparameters used in comparison<br />
<br />
[[File:Junyi4.png | 400px|center|]]<br />
<br />
===Network Training===<br />
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio 3 to 1. The remaining 20% of data was used as a test set.<br />
<br />
All models were trained using Adam optimizer, because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.<br />
<br />
They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization in between each convolution and the following activation. <br />
<br />
At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.<br />
<br />
Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]<br />
<br />
The authors carried out the experiments on Tensorflow and Keras and used different GPU to optimize the model for different datasets.<br />
<br />
==Results==<br />
Table 2 shows all results performed from all datasets.<br />
[[File:Junyi5.png | 600px|center|]]<br />
We can see that SOCNN outperforms in all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN almost has the same results with LSTM. Phased LSTM and ResNet have performed really bad on artificial asynchronous dataset and quotes dataset respectively. Notice that having more than one layer of offset network would have negative impact on results. Also, the higher weights of auxiliary loss(<math>\alpha</math>considerably improved the test error on asynchronous dataset, see Table 3. However, for other datasets, its impact was negligible.<br />
[[File:Junyi6.png | 400px|center|]]<br />
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.<br />
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]<br />
<br />
Finally, we want to test the robustness of the proposed model SOCNN, adding noise terms to asynchronous 16 dataset and check how these networks perform. The result is shown in Figure 6.<br />
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]<br />
From Figure 6, the purple line and green line seems staying at the same position in training and testing process. SOCNN and single-layer LSTM are most robust compared to other networks, and least prone to overfitting.<br />
<br />
=Conclusion and Discussion=<br />
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines AR-like weighting mechanism and convolutional neural network. This new architecture is designed for high-noise asynchronous time series, and achieves outperformance in forecasting several asynchronous time series compared to popular convolutional and recurrent networks. <br />
<br />
The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension but needs further empirical studies is that we consider not just <math>1 \times 1</math> convolutional kernels on the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets.<br />
<br />
=Critiques=<br />
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.<br />
#The quote data cannot be reached, only two datasets available.<br />
#The 'Significance' network was described as critical to the model in paper, but they did not show how the performance of SOCNN with respect to the significance network.<br />
#The transform of the original data to asynchronous data is not clear.<br />
#The experiments on the main application are not reproducible because the data is proprietary.<br />
<br />
=Reference=<br />
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994. <br />
<br />
[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.<br />
<br />
[3] Petelin, D., Sˇindela ́ˇr, J., Pˇrikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.<br />
<br />
[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.<br />
<br />
[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.<br />
<br />
[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.<br />
<br />
[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.<br />
<br />
[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Condi- tional time series forecasting with convolutional neural networks, March 2017.<br />
<br />
[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learn- ing in finance, February 2016.<br />
<br />
[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Acceler- ating recurrent network training for long or event-based sequences. In Advances In Neural Information Process- ing Systems, pp. 3882–3890, 2016.<br />
<br />
[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Em- pirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.<br />
<br />
[12] Weissenborn, D. and Rockta ̈schel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.<br />
<br />
[13] Cho, K., Courville, A., and Bengio, Y. Describing multi- media content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.<br />
<br />
[14] Glorot, X. and Bengio, Y. Understanding the dif- ficulty of training deep feedforward neural net- works. In In Proceedings of the International Con- ference on Artificial Intelligence and Statistics (AIS- TATSaˆ10). Society for Artificial Intelligence and Statistics, 2010.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Neural_Machine_Translation&diff=41004Unsupervised Neural Machine Translation2018-11-22T21:15:34Z<p>X832zhan: /* Supervised */</p>
<hr />
<div>This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho.<br />
<br />
= Introduction =<br />
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monolingual corpora without any alignment between sentences or documents. Monolingual corpora are text corpora that are made up of one language only. This contrasts with the usual Supervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however, in reality, there are a number of languages that lack parallel pairing, e.g. for German-Russian.<br />
<br />
Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.<br />
<br />
The general approach of the methodology is to:<br />
<br />
# Use monolingual corpora in the source and target languages to learn source and target word embeddings.<br />
# Align the 2 sets of word embeddings in the same latent space.<br />
Then iteratively perform:<br />
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.<br />
# Tune the decoder in each language by back-translating between the source and target language.<br />
<br />
= Background =<br />
<br />
===Word Embedding Alignment===<br />
<br />
The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpora to vector embeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2. <br />
<br />
Figure 1 shows an example of aligning the word embeddings in English and French.<br />
<br />
[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]<br />
<br />
Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.<br />
<br />
The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. <br />
<br />
===Other related work and inspirations===<br />
====STATISTICAL DECIPHERMENT FOR MACHINE TRANSLATION====<br />
There has been significant work in statistical deciphering technique to induce a machine translation model from monolingual data, which similar to the noisy-channel model used by SMT(Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext and models the distribution of the ciphertext. This approach is able to take advantage of the incorporation of syntactic knowledge of the languages. It shows that word embeddings implementation improves statistical decipherment in machine translation.<br />
<br />
====LOW-RESOURCE NEURAL MACHINE TRANSLATION====<br />
There are also proposals that use techniques other than direct parallel corpora to do neural machine translation(NMT). Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. For example, we want to translate German into Russian, but little direct-source for these two languages, we can use English as an intermediate language(German-English and English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs which have no direct data was given.<br />
<br />
Other works use monolingual data in combination with scarce parallel corpora. Creating a synthetic parallel corpus by backtranslating a monolingual corpus in the target language is one of simple but effective approach.<br />
<br />
The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.<br />
<br />
= Methodology =<br />
<br />
The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.<br />
<br />
The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.<br />
<br />
The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.<br />
<br />
Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:<br />
<br />
#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.<br />
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language. <br />
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background.<br />
<br />
[[File:Figure2_lwali.png|600px|center]]<br />
<br />
The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.<br />
<br />
===Denoising===<br />
<br />
Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both<br />
languages in a language-independent fashion, and then be decoded by the language dependent decoder.<br />
<br />
Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:<br />
<br />
# Construct <math>C(x)</math>, noisy version of <math>x</math>,<br />
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.<br />
<br />
The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.<br />
<br />
===Back-Translation===<br />
<br />
With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:<br />
<br />
# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,<br />
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,<br />
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.<br />
<br />
===Training===<br />
<br />
Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence. <br />
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.<br />
<br />
Optimizer choice and other hyperparameters can be found in the paper.<br />
<br />
=Experiments and Results=<br />
<br />
The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.<br />
<br />
The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.<br />
<br />
[[File:Table1_lwali.png|600px|center]]<br />
<br />
===Unsupervised===<br />
<br />
The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.<br />
<br />
The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise, back-translation would translate nonsensical sentences.<br />
<br />
For the BPE experiment, results show it helps in some language pairs but detract in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.<br />
<br />
===Semi-supervised===<br />
<br />
Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.<br />
<br />
Table1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.<br />
<br />
===Supervised===<br />
<br />
This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs plus the UN and the Gigaword corpus for French- English. Moreover, the authors use the same subsets of News Commentary alone to run the separate experiments in order to compare with the semi-supervised scenario.<br />
<br />
The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest to use larger models, longer training times, and incorporating several well-known NMT techniques.<br />
<br />
===Qualitative Analysis===<br />
<br />
[[File:Table2_lwali.png|600px|center]]<br />
<br />
Table 2 shows 4 examples of French to English translations, which shows that the high-quality translations are produces by the proposed system, and this system adequately models non-trivial translation relations. Example 1 and 2 show that the model is able to not only go beyond a literal word-by-word substitution but also model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high-quality translations of long and more complex sentences. However, in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures, which means that the proposed system has limitations. Specially, the authors points that the proposed model has difficulties to preserve some concrete details from source sentences.<br />
<br />
=Conclusions and Future Work=<br />
<br />
The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system and training using denoise and back-translation.<br />
<br />
Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:<br />
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.<br />
*Decouple the shared encoder into 2 independent encoders at some point during training<br />
*Progressively reduce the noise level<br />
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis<br />
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.<br />
<br />
= Critique =<br />
<br />
While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution. <br />
<br />
The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.<br />
<br />
Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results. <br />
<br />
The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.<br />
<br />
The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.<br />
<br />
* ([https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.<br />
<br />
= References =<br />
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."<br />
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".<br />
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."<br />
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."<br />
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."<br />
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."<br />
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."<br />
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Neural_Machine_Translation&diff=41003Unsupervised Neural Machine Translation2018-11-22T21:07:55Z<p>X832zhan: /* Qualitative Analysis */</p>
<hr />
<div>This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho.<br />
<br />
= Introduction =<br />
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monolingual corpora without any alignment between sentences or documents. Monolingual corpora are text corpora that are made up of one language only. This contrasts with the usual Supervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however, in reality, there are a number of languages that lack parallel pairing, e.g. for German-Russian.<br />
<br />
Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data.<br />
<br />
The general approach of the methodology is to:<br />
<br />
# Use monolingual corpora in the source and target languages to learn source and target word embeddings.<br />
# Align the 2 sets of word embeddings in the same latent space.<br />
Then iteratively perform:<br />
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.<br />
# Tune the decoder in each language by back-translating between the source and target language.<br />
<br />
= Background =<br />
<br />
===Word Embedding Alignment===<br />
<br />
The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpora to vector embeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2. <br />
<br />
Figure 1 shows an example of aligning the word embeddings in English and French.<br />
<br />
[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]<br />
<br />
Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.<br />
<br />
The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. <br />
<br />
===Other related work and inspirations===<br />
====STATISTICAL DECIPHERMENT FOR MACHINE TRANSLATION====<br />
There has been significant work in statistical deciphering technique to induce a machine translation model from monolingual data, which similar to the noisy-channel model used by SMT(Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext and models the distribution of the ciphertext. This approach is able to take advantage of the incorporation of syntactic knowledge of the languages. It shows that word embeddings implementation improves statistical decipherment in machine translation.<br />
<br />
====LOW-RESOURCE NEURAL MACHINE TRANSLATION====<br />
There are also proposals that use techniques other than direct parallel corpora to do neural machine translation(NMT). Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. For example, we want to translate German into Russian, but little direct-source for these two languages, we can use English as an intermediate language(German-English and English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs which have no direct data was given.<br />
<br />
Other works use monolingual data in combination with scarce parallel corpora. Creating a synthetic parallel corpus by backtranslating a monolingual corpus in the target language is one of simple but effective approach.<br />
<br />
The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.<br />
<br />
= Methodology =<br />
<br />
The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.<br />
<br />
The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.<br />
<br />
The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different by language.<br />
<br />
Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:<br />
<br />
#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.<br />
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language. <br />
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background.<br />
<br />
[[File:Figure2_lwali.png|600px|center]]<br />
<br />
The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.<br />
<br />
===Denoising===<br />
<br />
Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both<br />
languages in a language-independent fashion, and then be decoded by the language dependent decoder.<br />
<br />
Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:<br />
<br />
# Construct <math>C(x)</math>, noisy version of <math>x</math>,<br />
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.<br />
<br />
The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.<br />
<br />
===Back-Translation===<br />
<br />
With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:<br />
<br />
# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,<br />
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,<br />
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.<br />
<br />
The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.<br />
<br />
Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.<br />
<br />
===Training===<br />
<br />
Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence. <br />
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.<br />
<br />
Optimizer choice and other hyperparameters can be found in the paper.<br />
<br />
=Experiments and Results=<br />
<br />
The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.<br />
<br />
The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.<br />
<br />
[[File:Table1_lwali.png|600px|center]]<br />
<br />
===Unsupervised===<br />
<br />
The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.<br />
<br />
The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise, back-translation would translate nonsensical sentences.<br />
<br />
For the BPE experiment, results show it helps in some language pairs but detract in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.<br />
<br />
===Semi-supervised===<br />
<br />
Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.<br />
<br />
Table1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.<br />
<br />
===Supervised===<br />
<br />
This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014. <br />
<br />
The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance.<br />
<br />
===Qualitative Analysis===<br />
<br />
[[File:Table2_lwali.png|600px|center]]<br />
<br />
Table 2 shows 4 examples of French to English translations, which shows that the high-quality translations are produces by the proposed system, and this system adequately models non-trivial translation relations. Example 1 and 2 show that the model is able to not only go beyond a literal word-by-word substitution but also model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high-quality translations of long and more complex sentences. However, in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures, which means that the proposed system has limitations. Specially, the authors points that the proposed model has difficulties to preserve some concrete details from source sentences.<br />
<br />
=Conclusions and Future Work=<br />
<br />
The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system and training using denoise and back-translation.<br />
<br />
Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:<br />
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.<br />
*Decouple the shared encoder into 2 independent encoders at some point during training<br />
*Progressively reduce the noise level<br />
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis<br />
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.<br />
<br />
= Critique =<br />
<br />
While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution. <br />
<br />
The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.<br />
<br />
Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results. <br />
<br />
The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.<br />
<br />
The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.<br />
<br />
* ([https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.<br />
<br />
= References =<br />
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."<br />
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".<br />
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."<br />
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."<br />
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."<br />
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."<br />
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."<br />
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling&diff=40977Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling2018-11-22T19:43:03Z<p>X832zhan: /* Testing and Results */</p>
<hr />
<div>This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here]<br />
<br />
= Introduction and Motivation =<br />
<br />
In recent years, Reinforcement Learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNN's has allowed neural networks to out-perform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states are finite. <br />
<br />
Interacting with the real world (e.g.; a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems. This is due to the fact that deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement. <br />
<br />
This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on a discretion state and action spaces, and then a continuous action search is performed on these discrete results.<br />
<br />
Curling is chosen as a domain to test the network on. Curling was chosen due to its large action space, potential for complicated strategies, and need for precise interactions.<br />
<br />
== Curling ==<br />
<br />
Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each time to slide rocks closer to the target on the other end of the sheet than the other team. The next sections will provide a background on the gameplay, and potential challenges/concerns for learning algorithms. A terminology section follows.<br />
<br />
=== Gameplay ===<br />
<br />
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.<br />
<br />
When throwing a rock, the curling can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).<br />
<br />
Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:<br />
<br />
* Draw: Throw a rock to a target location<br />
* Freeze: Draw a rock up against another rock<br />
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions<br />
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)<br />
<br />
=== Challenges for AI ===<br />
<br />
Curling offers many challenges for curling based on its physics and rules. This section lists a few concerns.<br />
<br />
The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.<br />
<br />
Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.<br />
<br />
Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.<br />
<br />
Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.<br />
<br />
=== Terminology ===<br />
<br />
* End: A round of the game<br />
* House: The end of the sheet of ice, which contains<br />
* Hammer: The team that throws the last rock of an end 'has the hammer'<br />
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.<br />
* Back Line: think line drawn just behind the house. Rocks that pass this line are removed from play.<br />
<br />
<br />
== Related Work ==<br />
<br />
=== AlphaGo Lee ===<br />
<br />
AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol. Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.<br />
<br />
The use of both policy and value networks are reflected in this paper's work.<br />
<br />
=== AlphaGo Zero ===<br />
<br />
AlphaGo Zero (Silver et al., 2017, [6]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need of expert training.<br />
<br />
The unification of networks and self-play are also reflected in this paper.<br />
<br />
=== Curling Algorithms ===<br />
<br />
Some past algorithms have been proposed to deal with continuous action spaces. For example, (Yammamoto et al, 2015, [7]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states, with knowledge of paths of execution through past games. An MCTS called <math>KR-UCT</math> which is able to find effective selections and use kernel regression (KR) and kernel density estimation(KDE) to estimate rewards using neighborhood information has been applied to continuous action space by researchers. <br />
<br />
With bandit problem, scholars used hierarchical optimistic optimization(HOO) to create a cover tree and divide the action space into small ranges at different depths, where the most promising node will create fine granularity estimates.<br />
<br />
=== Curling Physics and Simulation ===<br />
<br />
Several references in the paper refer to the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. While modelling the changes in friction on ice is not possible, a fixe friction coefficient was predefined in the simulation. The behaviour of the stones was also modelled. Important parameters are trained from professional players. The authors used the same parameters in this paper.<br />
<br />
== General Background of Algorithms ==<br />
<br />
=== Policy and Value Functions ===<br />
<br />
A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.<br />
<br />
POLICY IMPROVEMENT: LEARNING ACTION POLICY<br />
<br />
Action policy <math> p_{\sigma}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. We can use policy gradient reinforcement learning to train action policy. It is updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step t,<br />
\[ \Delta \rho \propto \frac{\partial p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]<br />
where <math> r(s_t) </math> is the return.<br />
<br />
POLICY EVALUATION: LEARNING VALUE FUNCTIONS<br />
<br />
A value function is trained to estimate the value of a value of being in a certain state with parameter <math> \theta </math>. It is trained based on records of state-action-reward sets <math> (s, r(s)) </math> by using stochastic gradient de- scent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,<br />
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (ex: in curling, only 16 moves, or throw stones, are taken each end).<br />
<br />
MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as GO, and curling.<br />
<br />
Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node has been considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.<br />
<br />
MCTS begins with the '''selection''' phase, which involves traversing known states/actions. This involves expanding the tree by beginning at the root node, and selecting the child/score with the highest 'score'. From each successive node, a path down to a root node is explored in a similar fashion.<br />
<br />
The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (ie: the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.<br />
<br />
Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.<br />
<br />
The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (ie: win/lose) is update in the statistics of all parent nodes.<br />
<br />
A selection function known as Upper Confidence Bound (UCT) can be used for selecting which node to select. The formula for this equation is shown below [[https://www.baeldung.com/java-monte-carlo-tree-search source]]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, will grow when sibling nodes are expanded. This means that unexplored nodes will gradually increase their UCT score, and be selected in the future.<br />
<br />
[[File:mcts_uct_equation.png | 500px | centered]]<br />
<br />
Sources: 2,3,4<br />
<br />
=== Kernel Regression ===<br />
<br />
Kernel regression is a form of weighted averaging. Given two items of data, '''x''', each of which has a value '''y''' associated with them, the kernel functions outputs a weighting factor. An estimate of the value of a new, unseen point, is then calculated as the weighted average of values of surrounding points.<br />
<br />
A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).<br />
<br />
[[File:gaussian_kernel.png | 400 px]]<br />
<br />
[[File:kernel_regression.png | 350 px]]<br />
<br />
In this case, the combination of the two-act to weigh scores of samples closest to '''x''' more strongly.<br />
<br />
= Methods =<br />
<br />
== Variable Definitions ==<br />
<br />
The following variables are used often in the paper:<br />
<br />
* <math>s</math>: A state in the game, as described below as the input to the network.<br />
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game<br />
* <math>a_t</math>: The action taken in state <math>s_t</math><br />
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS<br />
* <math>n_{a_t}</math>: The number of visits to node a in MCTS<br />
* <math>v_{a_t}</math>: The MCTS value estimate of a node<br />
<br />
== Network Design ==<br />
<br />
The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.<br />
<br />
=== Shared Structure ===<br />
<br />
The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:<br />
<br />
[[File:curling_network_layers.png]]<br />
<br />
<br />
the input to this network is the following:<br />
* Location of stones<br />
* Order to tee (the center of the sheet)<br />
* A 32x32 grid of representation of the ice sheet, representing which stones are present in each grid cell.<br />
<br />
The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.<br />
<br />
=== Policy Network ===<br />
<br />
The policy head is created by adding 2 convolutional layers with 2 3x3 filters to the main body of the network. The output of the policy head is a distribution of probabilities of the actions to select the best shot out of a 32x32x2 set of actions. The actions represent target locations in the grid and spin direction of the stone.<br />
<br />
=== Value Network ===<br />
<br />
The valve head is created by adding a convolution layer with 1 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability of scores in the range of [-8,8], which are the possible scores at each end of a curling game.<br />
<br />
== Continuous Action Search ==<br />
<br />
The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.<br />
<br />
Actions in the continuous space are generated using an MCTS algorithm, with the following steps:<br />
<br />
=== Selection ===<br />
<br />
From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.<br />
<br />
[[File:curling_kernel_equations.png | 500px]]<br />
<br />
The UCB formula is then used to select an action to expand.<br />
<br />
The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.<br />
<br />
=== Expansion ===<br />
<br />
The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times.<br />
<br />
=== Simulation ===<br />
<br />
Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.<br />
<br />
=== Backpropogation ===<br />
<br />
Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.<br />
<br />
<br />
== Supervised Learning ==<br />
<br />
During supervised training, data is gathered from the program AyumuGAT'16 ([8]). This program is also based on both an MCTS algorithm, and a high-performance AI curling program. 400 000 state-action pairs were generated during this training.<br />
<br />
=== Policy Network ===<br />
<br />
The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.<br />
<br />
=== Value Network ===<br />
<br />
The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state d' steps ahead was generated, <math>s_{t+d}</math>. This process dealt with uncertainty by considering all actions in this rollout to have no uncertainty, and allowing uncertainty in the last action, ''a<sub>t+d-1</sub>''. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.<br />
<br />
=== Policy-Value Network ===<br />
<br />
The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:<br />
<br />
* Algorithm: stochastic gradient descent<br />
* Batch size: 256<br />
* Momentum: 0.9<br />
* L2 regularization: 0.0001<br />
* Training time: ~100 epochs<br />
* Learning rate: initialized at 0.01, reduced twice<br />
<br />
A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:<br />
<br />
[[File:curling_loss_function.png | 300px]]<br />
<br />
== Self-Play Reinforcement Learning ==<br />
<br />
After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.<br />
<br />
At a game state ''s<sub>t</sub>'':<br />
<br />
1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is en estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.<br />
<br />
2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.<br />
<br />
It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.<br />
<br />
The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.<br />
<br />
It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.<br />
<br />
== Long-Term Strategy Learning ==<br />
<br />
Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.<br />
<br />
The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.<br />
<br />
== Final Algorithms ==<br />
<br />
The authors make use of the following versions of their algorithm:<br />
<br />
=== KR-DL ===<br />
<br />
''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.<br />
<br />
=== KR-DRL ===<br />
<br />
''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (ie: initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUS and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.<br />
<br />
=== KR-DRL-MES ===<br />
<br />
''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.<br />
<br />
= Testing and Results =<br />
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [9]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.<br />
<br />
== Comparison of KR-DL-UCT and DL-UCT ==<br />
<br />
The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.<br />
<br />
As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first, and 100 taking the 2nd, shots. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only-DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.<br />
<br />
[[File:curling_KR_test.png | 400px]]<br />
<br />
== Matches ==<br />
<br />
Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'' also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform this. The figure below shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors.<br />
<br />
[[File:curling_ratings.png | 400px]]<br />
<br />
= Critique =<br />
<br />
== Strengths ==<br />
<br />
This algorithm out-performs other high-performance algorithms (including past competition champions).<br />
<br />
I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.<br />
<br />
The authors do seem to be able to adopt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.<br />
<br />
== Weaknesses ==<br />
<br />
I found this paper difficult to follow at times. One problem was that the algorithms were introduced first, and then how they were used as described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run, and what stage of the algorithm (ex: MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.<br />
<br />
While I think the comparing of different algorithms was done well, I believe it still lacked some good detail. There were one-off mentions in the paper which would have been nice to see as results. These include the statement that having a policy-value network in place of two networks lead to better performance.<br />
<br />
At this point, the algorithms used still rely on initialization by a pre-made program.<br />
<br />
There was little theoretical development or justification done in this paper.<br />
<br />
While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key points of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges but offered more high-fidelity simulations/training.<br />
<br />
While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.<br />
<br />
=References=<br />
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)<br />
# https://www.baeldung.com/java-monte-carlo-tree-search<br />
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/<br />
# https://int8.io/monte-carlo-tree-search-beginners-guide/<br />
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L.,Van Den Driessche, G., Schrittwieser, J., Antonoglou, I.,Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe,D., Nham, J., Kalchbrenner, N.,Sutskever, I., Lillicrap, T.,Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis,D. Mastering the game of go with deep neural networksand tree search. Nature, pp. 484–489, 2016.<br />
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou,I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L.,van den Driessche, G., Graepel, T., and Hassabis, D.Mastering the game of go without human knowledge.Nature, pp. 354–359, 2017.<br />
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.<br />
# Ohto, K. and Tanaka, T. A curling agent based on the montecarlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.<br />
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling&diff=40975Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling2018-11-22T19:35:38Z<p>X832zhan: /* Policy and Value Functions */</p>
<hr />
<div>This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here]<br />
<br />
= Introduction and Motivation =<br />
<br />
In recent years, Reinforcement Learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNN's has allowed neural networks to out-perform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states are finite. <br />
<br />
Interacting with the real world (e.g.; a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems. This is due to the fact that deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement. <br />
<br />
This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on a discretion state and action spaces, and then a continuous action search is performed on these discrete results.<br />
<br />
Curling is chosen as a domain to test the network on. Curling was chosen due to its large action space, potential for complicated strategies, and need for precise interactions.<br />
<br />
== Curling ==<br />
<br />
Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each time to slide rocks closer to the target on the other end of the sheet than the other team. The next sections will provide a background on the gameplay, and potential challenges/concerns for learning algorithms. A terminology section follows.<br />
<br />
=== Gameplay ===<br />
<br />
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.<br />
<br />
When throwing a rock, the curling can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).<br />
<br />
Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:<br />
<br />
* Draw: Throw a rock to a target location<br />
* Freeze: Draw a rock up against another rock<br />
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions<br />
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)<br />
<br />
=== Challenges for AI ===<br />
<br />
Curling offers many challenges for curling based on its physics and rules. This section lists a few concerns.<br />
<br />
The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.<br />
<br />
Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.<br />
<br />
Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.<br />
<br />
Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.<br />
<br />
=== Terminology ===<br />
<br />
* End: A round of the game<br />
* House: The end of the sheet of ice, which contains<br />
* Hammer: The team that throws the last rock of an end 'has the hammer'<br />
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.<br />
* Back Line: think line drawn just behind the house. Rocks that pass this line are removed from play.<br />
<br />
<br />
== Related Work ==<br />
<br />
=== AlphaGo Lee ===<br />
<br />
AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol. Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.<br />
<br />
The use of both policy and value networks are reflected in this paper's work.<br />
<br />
=== AlphaGo Zero ===<br />
<br />
AlphaGo Zero (Silver et al., 2017, [6]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need of expert training.<br />
<br />
The unification of networks and self-play are also reflected in this paper.<br />
<br />
=== Curling Algorithms ===<br />
<br />
Some past algorithms have been proposed to deal with continuous action spaces. For example, (Yammamoto et al, 2015, [7]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states, with knowledge of paths of execution through past games. An MCTS called <math>KR-UCT</math> which is able to find effective selections and use kernel regression (KR) and kernel density estimation(KDE) to estimate rewards using neighborhood information has been applied to continuous action space by researchers. <br />
<br />
With bandit problem, scholars used hierarchical optimistic optimization(HOO) to create a cover tree and divide the action space into small ranges at different depths, where the most promising node will create fine granularity estimates.<br />
<br />
=== Curling Physics and Simulation ===<br />
<br />
Several references in the paper refer to the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. While modelling the changes in friction on ice is not possible, a fixe friction coefficient was predefined in the simulation. The behaviour of the stones was also modelled. Important parameters are trained from professional players. The authors used the same parameters in this paper.<br />
<br />
== General Background of Algorithms ==<br />
<br />
=== Policy and Value Functions ===<br />
<br />
A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.<br />
<br />
POLICY IMPROVEMENT: LEARNING ACTION POLICY<br />
<br />
Action policy <math> p_{\sigma}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. We can use policy gradient reinforcement learning to train action policy. It is updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step t,<br />
\[ \Delta \rho \propto \frac{\partial p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]<br />
where <math> r(s_t) </math> is the return.<br />
<br />
POLICY EVALUATION: LEARNING VALUE FUNCTIONS<br />
<br />
A value function is trained to estimate the value of a value of being in a certain state with parameter <math> \theta </math>. It is trained based on records of state-action-reward sets <math> (s, r(s)) </math> by using stochastic gradient de- scent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,<br />
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]<br />
<br />
=== Monte Carlo Tree Search ===<br />
<br />
Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (ex: in curling, only 16 moves, or throw stones, are taken each end).<br />
<br />
MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as GO, and curling.<br />
<br />
Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node has been considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.<br />
<br />
MCTS begins with the '''selection''' phase, which involves traversing known states/actions. This involves expanding the tree by beginning at the root node, and selecting the child/score with the highest 'score'. From each successive node, a path down to a root node is explored in a similar fashion.<br />
<br />
The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (ie: the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.<br />
<br />
Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.<br />
<br />
The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (ie: win/lose) is update in the statistics of all parent nodes.<br />
<br />
A selection function known as Upper Confidence Bound (UCT) can be used for selecting which node to select. The formula for this equation is shown below [[https://www.baeldung.com/java-monte-carlo-tree-search source]]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, will grow when sibling nodes are expanded. This means that unexplored nodes will gradually increase their UCT score, and be selected in the future.<br />
<br />
[[File:mcts_uct_equation.png | 500px | centered]]<br />
<br />
Sources: 2,3,4<br />
<br />
=== Kernel Regression ===<br />
<br />
Kernel regression is a form of weighted averaging. Given two items of data, '''x''', each of which has a value '''y''' associated with them, the kernel functions outputs a weighting factor. An estimate of the value of a new, unseen point, is then calculated as the weighted average of values of surrounding points.<br />
<br />
A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).<br />
<br />
[[File:gaussian_kernel.png | 400 px]]<br />
<br />
[[File:kernel_regression.png | 350 px]]<br />
<br />
In this case, the combination of the two-act to weigh scores of samples closest to '''x''' more strongly.<br />
<br />
= Methods =<br />
<br />
== Variable Definitions ==<br />
<br />
The following variables are used often in the paper:<br />
<br />
* <math>s</math>: A state in the game, as described below as the input to the network.<br />
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game<br />
* <math>a_t</math>: The action taken in state <math>s_t</math><br />
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS<br />
* <math>n_{a_t}</math>: The number of visits to node a in MCTS<br />
* <math>v_{a_t}</math>: The MCTS value estimate of a node<br />
<br />
== Network Design ==<br />
<br />
The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.<br />
<br />
=== Shared Structure ===<br />
<br />
The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:<br />
<br />
[[File:curling_network_layers.png]]<br />
<br />
<br />
the input to this network is the following:<br />
* Location of stones<br />
* Order to tee (the center of the sheet)<br />
* A 32x32 grid of representation of the ice sheet, representing which stones are present in each grid cell.<br />
<br />
The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.<br />
<br />
=== Policy Network ===<br />
<br />
The policy head is created by adding 2 convolutional layers with 2 3x3 filters to the main body of the network. The output of the policy head is a distribution of probabilities of the actions to select the best shot out of a 32x32x2 set of actions. The actions represent target locations in the grid and spin direction of the stone.<br />
<br />
=== Value Network ===<br />
<br />
The valve head is created by adding a convolution layer with 1 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability of scores in the range of [-8,8], which are the possible scores at each end of a curling game.<br />
<br />
== Continuous Action Search ==<br />
<br />
The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.<br />
<br />
Actions in the continuous space are generated using an MCTS algorithm, with the following steps:<br />
<br />
=== Selection ===<br />
<br />
From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.<br />
<br />
[[File:curling_kernel_equations.png | 500px]]<br />
<br />
The UCB formula is then used to select an action to expand.<br />
<br />
The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.<br />
<br />
=== Expansion ===<br />
<br />
The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times.<br />
<br />
=== Simulation ===<br />
<br />
Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.<br />
<br />
=== Backpropogation ===<br />
<br />
Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.<br />
<br />
<br />
== Supervised Learning ==<br />
<br />
During supervised training, data is gathered from the program AyumuGAT'16 ([8]). This program is also based on both an MCTS algorithm, and a high-performance AI curling program. 400 000 state-action pairs were generated during this training.<br />
<br />
=== Policy Network ===<br />
<br />
The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.<br />
<br />
=== Value Network ===<br />
<br />
The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state d' steps ahead was generated, <math>s_{t+d}</math>. This process dealt with uncertainty by considering all actions in this rollout to have no uncertainty, and allowing uncertainty in the last action, ''a<sub>t+d-1</sub>''. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.<br />
<br />
=== Policy-Value Network ===<br />
<br />
The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:<br />
<br />
* Algorithm: stochastic gradient descent<br />
* Batch size: 256<br />
* Momentum: 0.9<br />
* L2 regularization: 0.0001<br />
* Training time: ~100 epochs<br />
* Learning rate: initialized at 0.01, reduced twice<br />
<br />
A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:<br />
<br />
[[File:curling_loss_function.png | 300px]]<br />
<br />
== Self-Play Reinforcement Learning ==<br />
<br />
After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.<br />
<br />
At a game state ''s<sub>t</sub>'':<br />
<br />
1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is en estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.<br />
<br />
2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.<br />
<br />
It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.<br />
<br />
The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.<br />
<br />
It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.<br />
<br />
== Long-Term Strategy Learning ==<br />
<br />
Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.<br />
<br />
The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.<br />
<br />
== Final Algorithms ==<br />
<br />
The authors make use of the following versions of their algorithm:<br />
<br />
=== KR-DL ===<br />
<br />
''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.<br />
<br />
=== KR-DRL ===<br />
<br />
''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (ie: initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUS and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.<br />
<br />
=== KR-DRL-MES ===<br />
<br />
''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.<br />
<br />
= Testing and Results =<br />
<br />
Testing is done with a simulated curling program [9]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.<br />
<br />
== Comparison of KR-DL-UCT and DL-UCT ==<br />
<br />
The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.<br />
<br />
As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first, and 100 taking the 2nd, shots. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only-DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.<br />
<br />
[[File:curling_KR_test.png | 400px]]<br />
<br />
== Matches ==<br />
<br />
Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'' also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform this. The figure below shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors.<br />
<br />
[[File:curling_ratings.png | 400px]]<br />
<br />
= Critique =<br />
<br />
== Strengths ==<br />
<br />
This algorithm out-performs other high-performance algorithms (including past competition champions).<br />
<br />
I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.<br />
<br />
The authors do seem to be able to adopt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.<br />
<br />
== Weaknesses ==<br />
<br />
I found this paper difficult to follow at times. One problem was that the algorithms were introduced first, and then how they were used as described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run, and what stage of the algorithm (ex: MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.<br />
<br />
While I think the comparing of different algorithms was done well, I believe it still lacked some good detail. There were one-off mentions in the paper which would have been nice to see as results. These include the statement that having a policy-value network in place of two networks lead to better performance.<br />
<br />
At this point, the algorithms used still rely on initialization by a pre-made program.<br />
<br />
There was little theoretical development or justification done in this paper.<br />
<br />
While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key points of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges but offered more high-fidelity simulations/training.<br />
<br />
While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.<br />
<br />
=References=<br />
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)<br />
# https://www.baeldung.com/java-monte-carlo-tree-search<br />
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/<br />
# https://int8.io/monte-carlo-tree-search-beginners-guide/<br />
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L.,Van Den Driessche, G., Schrittwieser, J., Antonoglou, I.,Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe,D., Nham, J., Kalchbrenner, N.,Sutskever, I., Lillicrap, T.,Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis,D. Mastering the game of go with deep neural networksand tree search. Nature, pp. 484–489, 2016.<br />
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou,I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L.,van den Driessche, G., Graepel, T., and Hassabis, D.Mastering the game of go without human knowledge.Nature, pp. 354–359, 2017.<br />
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.<br />
# Ohto, K. and Tanaka, T. A curling agent based on the montecarlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.<br />
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings&diff=40965a neural representation of sketch drawings2018-11-22T19:08:33Z<p>X832zhan: /* Predicting Different Endings of Incomplete Sketches */</p>
<hr />
<div><br />
== Introduction ==<br />
In this paper, The authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.<br />
<br />
Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do. <br />
<br />
The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).<br />
<br />
=== Terminology ===<br />
Pixel images, also referred to as raster or bitmap images are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, .bmp. <br />
<br />
Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images. <br />
<br />
For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images. <br />
<br />
For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation. <br />
<br />
== Related Work ==<br />
There are some works in the history that used a similar approach to generate images such as Portrait Drawing by Paul the Robot and some reinforcement learning approaches. They work more like a mimic of digitized photographs. There are some Neural network based approaches too, but those are mostly dealing with pixel images. Little work is done on vector images generation. There are models that use Hidden Markov Models or Mixture Density Networks to generate human sketches, continuous data points or vectorized Kanji characters.<br />
<br />
The model also allows us to explore the latent space representation of vector images. There are previous works that achieved similar functions as well, such as combining Sequence-to-Sequence models with Variational Autoencoder to model sentences into latent space and using probabilistic program induction to model Omniglot dataset.<br />
<br />
The dataset they use contains 50 million vector sketches. Before this paper, there is a Sketch data with 20k vector sketches, a Sketchy dataset with 70k vector sketches along with pixel images, and a ShadowDraw system that used 30k raster images along with extracted vectorized features. They are all comparatively small.<br />
<br />
<br />
== Methodology ==<br />
=== Dataset ===<br />
QuickDraw is a dataset with 50 million vector drawings collected by a online game Quick Draw! where the players are required to draw objects belonging to a particular object class in less than 20 seconds.. It contains hundreds of classes, each class has 70k training samples, 2.5k validation samples and 2.5k test samples.<br />
<br />
The data format of each sample is a representation of a pen stroke action event. The Origin is the initial coordinate of the drawing. The sketches are points in a list. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math> where x and y are the offset distance in x and y directions from the previous point. <math>p_{1}, p_{2}, p_{3})</math> are three possible states in binary one-hot representation where <math>p_{1}</math> indicates the pen is touching the paper, <math>p_{2}</math> indicates the pen will be lifted from here, and <math>p_{3}</math> represents the drawing has ended.<br />
<br />
=== Sketch-RNN ===<br />
[[File:sketchfig2.png|700px]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder(VAE). The encoder is a bidirectional RNN, the input is a sketch sequence and a reversed sketch sequence, so there will be two final hidden states. The output is a size <math>N_{z}</math> latent vector.<br />
<br />
\begin{align*}<br />
h_{ \rightarrow} = encode_{ \rightarrow }(S), <br />
h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), <br />
h = [h_{\rightarrow}; h_{\leftarrow}].<br />
\end{align*}<br />
<br />
Then the authors project <math>h</math> into to <math>\mu</math> and <math>\hat{\sigma}</math>, convert <math>\mu</math> into non-negative and use them with <math>\mathcal{N}(0,I)</math> to construct a random vector <math>z\in\mathbb{R}^{N_{z}}</math>.<br />
<br />
\begin{align*}<br />
\mu = W_\mu h + b_\mu, <br />
\hat \sigma = W_\sigma h + b_\sigma, <br />
\sigma = exp( \frac{\hat \sigma}{2}), <br />
z = \mu + \sigma \odot \mathcal{N}(0,I).<br />
\end{align*}<br />
<br />
<br />
Note that <math>z</math> is not deterministic but a conditioned random vector.<br />
<br />
The decoder is an autoregressive RNN. The initial hidden states are generated using <math>[h_0;c_0] = \tanh(W_z z+b_z)</math>. <math>S_0</math> is <math>(0,0,1,0,0)</math> For each step i in the decoder, the input <math>x_i</math> is the concatenation of previous point <math>S_{i-1}</math> and latent vector <math>z</math>. The output are probability distribution parameters for the next data point <math>S_i</math>. The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model <math>(p_1, p_2, p_3)</math> as categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2 and q_3</math> sum up to 1. The generated sequence is conditioned from the latent vector <math>z</math> that sampled from the encoder, which is end-to-end trained together with the decoder.<br />
<br />
\begin{align*}<br />
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi _j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), where \sum_{j=1}^{M} = 1<br />
\end{align*}<br />
<br />
Here the <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is the probability distribution function for <math>x,y</math>, <math>\rho_{xy}</math>is the correlation parameter for this bivariate normal distribution. The <math>\Pi</math> is a lenth M categorical distribution vector are the mixture weights of the Gaussian mixture model.<br />
<br />
The output vector <math>y_i</math> is generated using a fully-connected forward propagation in the hidden state of the RNN.<br />
<br />
\begin{align*}<br />
x_i = [S_{i-1}; z], <br />
[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), <br />
y_i = W_y h_i + b_y,<br />
y_i \in \mathbb{R}^{6M+3}.<br />
\end{align*}<br />
<br />
The output consists the probability distribution of the next data point.<br />
<br />
\begin{align*}<br />
[(\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_1 (\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_2 ... (\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_M (\hat q_1 \hat q_2 \hat q_3)] = y_i<br />
\end{align*}<br />
<br />
<math>\exp</math> and <math>\tanh</math> operations will be applied to standard deviations to ensure they are non-negative and between -1 and 1.<br />
<br />
\begin{align*}<br />
\sigma_x = \exp (\hat \sigma_x), <br />
\sigma_y = \exp (\hat \sigma_y), <br />
\rho_{xy} = \tanh(\hat \rho_{xy}). <br />
\end{align*}<br />
<br />
Categorical distribution probabilities for <math>(p_1, p_2, p_3)</math> using <math>(q_1, q_2, q_3)</math> can be obtained as :<br />
<br />
\begin{align*}<br />
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},<br />
k \in \left\{1,2,3\right\}, <br />
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},<br />
k \in \left\{1,...,M\right\}.<br />
\end{align*}<br />
<br />
It is hard to do decide when to stop drawing because <math>(p_1, p_2, p_3)</math> is very unbalanced. scholars in the past used different weights for each pen event probability, but the authors have a better idea. They define a hyperparameter representing the max length of the longest sketch in the training set <math>N_{max}</math>, and set the <math>S_i</math> to be <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>.<br />
<br />
The outcome sample <math>S_i^{'}</math> can be generated in each time step during sample process and fed as input for the next time step. The process will stop when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but conditioned random sequences. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.<br />
<br />
\begin{align*}<br />
\hat q_k \rightarrow \frac{\hat q_k}{\tau}, <br />
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau}, <br />
\sigma_x^2 \rightarrow \sigma_x^2\tau, <br />
\sigma_y^2 \rightarrow \sigma_y^2\tau. <br />
\end{align*}<br />
<br />
The <math>\tau</math> ranges from 0 to 1. When <math>\tau = 0</math> the output will be deterministic as the sample will consist on the on the peak of the probability density function.<br />
<br />
[[File:sketchfig3.png|700px]]<br />
<br />
=== Unconditional Generation ===<br />
There is a special case that only the decoder RNN module is trained. The decoder RNN could work as a standalone autoregressive model without latent variables. In this case, initial states are 0, the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. In the Figure 3, generating sketches unconditionally from the temperature parameter τ = 0.2 at the top in blue, to τ = 0.9 at the bottom in red.<br />
<br />
=== Training ===<br />
The training process is the same as a Variational Autoencoder. The loss function is the sum of Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> can be obtained with generated parameters of pdf and training data <math>S</math>. It is the sum of the <math>L_s</math> and <math>L_p</math>, which are the log loss of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>.<br />
<br />
\begin{align*}<br />
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log(\sum_{i = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x,\Delta y | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})), <br />
\end{align*}<br />
\begin{align*}<br />
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), <br />
L_R = L_s + L_p.<br />
\end{align*}<br />
<br />
<br />
Both terms are normalized by <math>N_{max}</math>.<br />
<br />
<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an IID Gaussian vector with zero mean and unit variance.<br />
<br />
\begin{align*}<br />
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))<br />
\end{align*}<br />
<br />
The overall loss is weighted as:<br />
<br />
\begin{align*}<br />
Loss = L_R + w_{KL} L_{KL}<br />
\end{align*}<br />
<br />
When <math>w_{KL} = 0</math>, the model becomes a standalone unconditional generator. Specially, there will be no <math>L_{KL} </math> term as we only optimize for <math>L_{R} </math>.<br />
<br />
== Experiments ==<br />
The authors experiment with the sketch-rnn model using different settings and recorded both losses. They used a Long Short-Term Memory(LSTM) model as an encoder and a HyperLSTM as a decoder. They also conduct multi-class datasets. The result is as follows.<br />
<br />
[[File:sketchtable1.png|700px]]<br />
<br />
We could see the trade-off between <math>L_R</math> and <math>L_{KL}</math> in this table clearly.<br />
<br />
=== Conditional Reconstruction ===<br />
The authors assess the reconstructed sketch with a given sketch with different <math>\tau</math> values. We could see that with high <math>\tau</math> value on the right, the reconstructed sketches are more random.<br />
<br />
[[File:sketchfig5.png|700px]]<br />
<br />
They also experiment on inputting a sketch from a different class. The output will still keep some features from the class that the model is trained on.<br />
<br />
=== Latent Space Interpolation ===<br />
The authors visualize the reconstruction sketches while interpolating between latent vectors using different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are more coherently interpolated.<br />
<br />
[[File:sketchfig6.png|700px]]<br />
<br />
=== Sketch Drawing Analogies ===<br />
Since the latent vector <math>z</math> encode conceptual features of a sketch, those features can also be used to augment other sketches that do not have these features. This is possible when models are trained with low <math>L_{KL}</math> values. The authors are able to perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches base on these latent spaces.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches === <br />
This model is able to predict an incomplete sketch by encoding the sketch into hidden state <math>h</math> using the decoder and then using <math>h</math> as an initial hidden state to generate the remaining sketch. The authors train on individual classes by using decoder-only models and set τ = 0.8 to complete samples. Figure 7 shows the results.<br />
<br />
[[File:sketchfig7.png|700px]]<br />
<br />
== Applications and Future Work ==<br />
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them to find interesting intersections between different drawings or objects, or generating a lot of similar but different designs.<br />
<br />
This model may also find its place on teaching students how to draw. When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetical sketch. Latent vector augmentation could also help to create a better drawing by inputting user-rating data during training processes.<br />
<br />
It exciting that they manage to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches. <br />
<br />
== Conclusion ==<br />
This paper introduced an interesting model sketch-rnn that can encode and decode sketches, generate and complete unfinished sketches. The authors demonstrated how to interpolate between latent spaces from a different class and how to use it to augment sketches or generate similar looking sketches. They also showed that it's important to enforce a prior distribution on latent vector while interpolating coherent sketch generations. Finally, they created a large sketch drawings dataset to be used in future research.<br />
<br />
== Critique ==<br />
* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing the generated sketches, it is clear and straightforward, however, not very efficient. It would be great if the authors could present a way, or a metric to evaluate how well the sketches are generated rather than printing them out and evaluate with human judgment.<br />
<br />
* Same problem as the output, the authors didn't present an evaluation for the algorithms either. They provided <math>L_R</math> and <math>L_{KL}</math> for reference, however, a lower loss doesn't represent a better performance.<br />
<br />
* I understand that using strokes as inputs is a novel and innovative move, however, the paper does not provide a baseline or any comparison with other methods or algorithms. Some other researches were mentioned in the paper, using similar and smaller datasets. It would be great if the authors could use some basic or existing methods a baseline and compare with the new algorithm.<br />
<br />
* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.<br />
<br />
* The authors proposed a few future applications for the model, however, the current output seems somehow not very close to their descriptions. But I do believe that this is a very good beginning, with the release of the sketch dataset, it must attract more scholars to research and improve with it!<br />
<br />
== References == <br />
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.<br />
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.<br />
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.<br />
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.<br />
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.<br />
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.<br />
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.<br />
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.<br />
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.<br />
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.<br />
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.<br />
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.<br />
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. https://quickdraw.withgoogle.com/, 2016. URL https: //quickdraw.withgoogle.com/.<br />
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.<br />
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.<br />
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.<br />
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.<br />
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.<br />
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.<br />
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.<br />
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.<br />
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.<br />
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.<br />
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.<br />
# T. White. Sampling Generative Networks. ArXiv e-prints, September 2016.<br />
#Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.<br />
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings&diff=40963a neural representation of sketch drawings2018-11-22T19:04:53Z<p>X832zhan: /* Training */</p>
<hr />
<div><br />
== Introduction ==<br />
In this paper, The authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.<br />
<br />
Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do. <br />
<br />
The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).<br />
<br />
=== Terminology ===<br />
Pixel images, also referred to as raster or bitmap images are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, .bmp. <br />
<br />
Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images. <br />
<br />
For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images. <br />
<br />
For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation. <br />
<br />
== Related Work ==<br />
There are some works in the history that used a similar approach to generate images such as Portrait Drawing by Paul the Robot and some reinforcement learning approaches. They work more like a mimic of digitized photographs. There are some Neural network based approaches too, but those are mostly dealing with pixel images. Little work is done on vector images generation. There are models that use Hidden Markov Models or Mixture Density Networks to generate human sketches, continuous data points or vectorized Kanji characters.<br />
<br />
The model also allows us to explore the latent space representation of vector images. There are previous works that achieved similar functions as well, such as combining Sequence-to-Sequence models with Variational Autoencoder to model sentences into latent space and using probabilistic program induction to model Omniglot dataset.<br />
<br />
The dataset they use contains 50 million vector sketches. Before this paper, there is a Sketch data with 20k vector sketches, a Sketchy dataset with 70k vector sketches along with pixel images, and a ShadowDraw system that used 30k raster images along with extracted vectorized features. They are all comparatively small.<br />
<br />
<br />
== Methodology ==<br />
=== Dataset ===<br />
QuickDraw is a dataset with 50 million vector drawings collected by a online game Quick Draw! where the players are required to draw objects belonging to a particular object class in less than 20 seconds.. It contains hundreds of classes, each class has 70k training samples, 2.5k validation samples and 2.5k test samples.<br />
<br />
The data format of each sample is a representation of a pen stroke action event. The Origin is the initial coordinate of the drawing. The sketches are points in a list. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math> where x and y are the offset distance in x and y directions from the previous point. <math>p_{1}, p_{2}, p_{3})</math> are three possible states in binary one-hot representation where <math>p_{1}</math> indicates the pen is touching the paper, <math>p_{2}</math> indicates the pen will be lifted from here, and <math>p_{3}</math> represents the drawing has ended.<br />
<br />
=== Sketch-RNN ===<br />
[[File:sketchfig2.png|700px]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder(VAE). The encoder is a bidirectional RNN, the input is a sketch sequence and a reversed sketch sequence, so there will be two final hidden states. The output is a size <math>N_{z}</math> latent vector.<br />
<br />
\begin{align*}<br />
h_{ \rightarrow} = encode_{ \rightarrow }(S), <br />
h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), <br />
h = [h_{\rightarrow}; h_{\leftarrow}].<br />
\end{align*}<br />
<br />
Then the authors project <math>h</math> into to <math>\mu</math> and <math>\hat{\sigma}</math>, convert <math>\mu</math> into non-negative and use them with <math>\mathcal{N}(0,I)</math> to construct a random vector <math>z\in\mathbb{R}^{N_{z}}</math>.<br />
<br />
\begin{align*}<br />
\mu = W_\mu h + b_\mu, <br />
\hat \sigma = W_\sigma h + b_\sigma, <br />
\sigma = exp( \frac{\hat \sigma}{2}), <br />
z = \mu + \sigma \odot \mathcal{N}(0,I).<br />
\end{align*}<br />
<br />
<br />
Note that <math>z</math> is not deterministic but a conditioned random vector.<br />
<br />
The decoder is an autoregressive RNN. The initial hidden states are generated using <math>[h_0;c_0] = \tanh(W_z z+b_z)</math>. <math>S_0</math> is <math>(0,0,1,0,0)</math> For each step i in the decoder, the input <math>x_i</math> is the concatenation of previous point <math>S_{i-1}</math> and latent vector <math>z</math>. The output are probability distribution parameters for the next data point <math>S_i</math>. The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model <math>(p_1, p_2, p_3)</math> as categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2 and q_3</math> sum up to 1. The generated sequence is conditioned from the latent vector <math>z</math> that sampled from the encoder, which is end-to-end trained together with the decoder.<br />
<br />
\begin{align*}<br />
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi _j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), where \sum_{j=1}^{M} = 1<br />
\end{align*}<br />
<br />
Here the <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is the probability distribution function for <math>x,y</math>, <math>\rho_{xy}</math>is the correlation parameter for this bivariate normal distribution. The <math>\Pi</math> is a lenth M categorical distribution vector are the mixture weights of the Gaussian mixture model.<br />
<br />
The output vector <math>y_i</math> is generated using a fully-connected forward propagation in the hidden state of the RNN.<br />
<br />
\begin{align*}<br />
x_i = [S_{i-1}; z], <br />
[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), <br />
y_i = W_y h_i + b_y,<br />
y_i \in \mathbb{R}^{6M+3}.<br />
\end{align*}<br />
<br />
The output consists the probability distribution of the next data point.<br />
<br />
\begin{align*}<br />
[(\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_1 (\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_2 ... (\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_M (\hat q_1 \hat q_2 \hat q_3)] = y_i<br />
\end{align*}<br />
<br />
<math>\exp</math> and <math>\tanh</math> operations will be applied to standard deviations to ensure they are non-negative and between -1 and 1.<br />
<br />
\begin{align*}<br />
\sigma_x = \exp (\hat \sigma_x), <br />
\sigma_y = \exp (\hat \sigma_y), <br />
\rho_{xy} = \tanh(\hat \rho_{xy}). <br />
\end{align*}<br />
<br />
Categorical distribution probabilities for <math>(p_1, p_2, p_3)</math> using <math>(q_1, q_2, q_3)</math> can be obtained as :<br />
<br />
\begin{align*}<br />
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},<br />
k \in \left\{1,2,3\right\}, <br />
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},<br />
k \in \left\{1,...,M\right\}.<br />
\end{align*}<br />
<br />
It is hard to do decide when to stop drawing because <math>(p_1, p_2, p_3)</math> is very unbalanced. scholars in the past used different weights for each pen event probability, but the authors have a better idea. They define a hyperparameter representing the max length of the longest sketch in the training set <math>N_{max}</math>, and set the <math>S_i</math> to be <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>.<br />
<br />
The outcome sample <math>S_i^{'}</math> can be generated in each time step during sample process and fed as input for the next time step. The process will stop when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but conditioned random sequences. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.<br />
<br />
\begin{align*}<br />
\hat q_k \rightarrow \frac{\hat q_k}{\tau}, <br />
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau}, <br />
\sigma_x^2 \rightarrow \sigma_x^2\tau, <br />
\sigma_y^2 \rightarrow \sigma_y^2\tau. <br />
\end{align*}<br />
<br />
The <math>\tau</math> ranges from 0 to 1. When <math>\tau = 0</math> the output will be deterministic as the sample will consist on the on the peak of the probability density function.<br />
<br />
[[File:sketchfig3.png|700px]]<br />
<br />
=== Unconditional Generation ===<br />
There is a special case that only the decoder RNN module is trained. The decoder RNN could work as a standalone autoregressive model without latent variables. In this case, initial states are 0, the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. In the Figure 3, generating sketches unconditionally from the temperature parameter τ = 0.2 at the top in blue, to τ = 0.9 at the bottom in red.<br />
<br />
=== Training ===<br />
The training process is the same as a Variational Autoencoder. The loss function is the sum of Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> can be obtained with generated parameters of pdf and training data <math>S</math>. It is the sum of the <math>L_s</math> and <math>L_p</math>, which are the log loss of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>.<br />
<br />
\begin{align*}<br />
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log(\sum_{i = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x,\Delta y | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})), <br />
\end{align*}<br />
\begin{align*}<br />
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), <br />
L_R = L_s + L_p.<br />
\end{align*}<br />
<br />
<br />
Both terms are normalized by <math>N_{max}</math>.<br />
<br />
<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an IID Gaussian vector with zero mean and unit variance.<br />
<br />
\begin{align*}<br />
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))<br />
\end{align*}<br />
<br />
The overall loss is weighted as:<br />
<br />
\begin{align*}<br />
Loss = L_R + w_{KL} L_{KL}<br />
\end{align*}<br />
<br />
When <math>w_{KL} = 0</math>, the model becomes a standalone unconditional generator. Specially, there will be no <math>L_{KL} </math> term as we only optimize for <math>L_{R} </math>.<br />
<br />
== Experiments ==<br />
The authors experiment with the sketch-rnn model using different settings and recorded both losses. They used a Long Short-Term Memory(LSTM) model as an encoder and a HyperLSTM as a decoder. They also conduct multi-class datasets. The result is as follows.<br />
<br />
[[File:sketchtable1.png|700px]]<br />
<br />
We could see the trade-off between <math>L_R</math> and <math>L_{KL}</math> in this table clearly.<br />
<br />
=== Conditional Reconstruction ===<br />
The authors assess the reconstructed sketch with a given sketch with different <math>\tau</math> values. We could see that with high <math>\tau</math> value on the right, the reconstructed sketches are more random.<br />
<br />
[[File:sketchfig5.png|700px]]<br />
<br />
They also experiment on inputting a sketch from a different class. The output will still keep some features from the class that the model is trained on.<br />
<br />
=== Latent Space Interpolation ===<br />
The authors visualize the reconstruction sketches while interpolating between latent vectors using different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are more coherently interpolated.<br />
<br />
[[File:sketchfig6.png|700px]]<br />
<br />
=== Sketch Drawing Analogies ===<br />
Since the latent vector <math>z</math> encode conceptual features of a sketch, those features can also be used to augment other sketches that do not have these features. This is possible when models are trained with low <math>L_{KL}</math> values. The authors are able to perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches base on these latent spaces.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches === <br />
This model is able to predict an incomplete sketch by encoding the sketch into hidden state <math>h</math> using the decoder and then using <math>h</math> as an initial hidden state to generate the remaining sketch.<br />
<br />
[[File:sketchfig7.png|700px]]<br />
<br />
== Applications and Future Work ==<br />
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them to find interesting intersections between different drawings or objects, or generating a lot of similar but different designs.<br />
<br />
This model may also find its place on teaching students how to draw. When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetical sketch. Latent vector augmentation could also help to create a better drawing by inputting user-rating data during training processes.<br />
<br />
It exciting that they manage to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches. <br />
<br />
== Conclusion ==<br />
This paper introduced an interesting model sketch-rnn that can encode and decode sketches, generate and complete unfinished sketches. The authors demonstrated how to interpolate between latent spaces from a different class and how to use it to augment sketches or generate similar looking sketches. They also showed that it's important to enforce a prior distribution on latent vector while interpolating coherent sketch generations. Finally, they created a large sketch drawings dataset to be used in future research.<br />
<br />
== Critique ==<br />
* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing the generated sketches, it is clear and straightforward, however, not very efficient. It would be great if the authors could present a way, or a metric to evaluate how well the sketches are generated rather than printing them out and evaluate with human judgment.<br />
<br />
* Same problem as the output, the authors didn't present an evaluation for the algorithms either. They provided <math>L_R</math> and <math>L_{KL}</math> for reference, however, a lower loss doesn't represent a better performance.<br />
<br />
* I understand that using strokes as inputs is a novel and innovative move, however, the paper does not provide a baseline or any comparison with other methods or algorithms. Some other researches were mentioned in the paper, using similar and smaller datasets. It would be great if the authors could use some basic or existing methods a baseline and compare with the new algorithm.<br />
<br />
* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.<br />
<br />
* The authors proposed a few future applications for the model, however, the current output seems somehow not very close to their descriptions. But I do believe that this is a very good beginning, with the release of the sketch dataset, it must attract more scholars to research and improve with it!<br />
<br />
== References == <br />
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.<br />
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.<br />
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.<br />
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.<br />
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.<br />
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.<br />
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.<br />
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.<br />
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.<br />
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.<br />
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.<br />
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.<br />
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. https://quickdraw.withgoogle.com/, 2016. URL https: //quickdraw.withgoogle.com/.<br />
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.<br />
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.<br />
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.<br />
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.<br />
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.<br />
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.<br />
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.<br />
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.<br />
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.<br />
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.<br />
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.<br />
# T. White. Sampling Generative Networks. ArXiv e-prints, September 2016.<br />
#Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.<br />
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings&diff=40962a neural representation of sketch drawings2018-11-22T19:01:57Z<p>X832zhan: /* Unconditional Generation */</p>
<hr />
<div><br />
== Introduction ==<br />
In this paper, The authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.<br />
<br />
Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do. <br />
<br />
The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).<br />
<br />
=== Terminology ===<br />
Pixel images, also referred to as raster or bitmap images are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, .bmp. <br />
<br />
Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images. <br />
<br />
For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images. <br />
<br />
For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation. <br />
<br />
== Related Work ==<br />
There are some works in the history that used a similar approach to generate images such as Portrait Drawing by Paul the Robot and some reinforcement learning approaches. They work more like a mimic of digitized photographs. There are some Neural network based approaches too, but those are mostly dealing with pixel images. Little work is done on vector images generation. There are models that use Hidden Markov Models or Mixture Density Networks to generate human sketches, continuous data points or vectorized Kanji characters.<br />
<br />
The model also allows us to explore the latent space representation of vector images. There are previous works that achieved similar functions as well, such as combining Sequence-to-Sequence models with Variational Autoencoder to model sentences into latent space and using probabilistic program induction to model Omniglot dataset.<br />
<br />
The dataset they use contains 50 million vector sketches. Before this paper, there is a Sketch data with 20k vector sketches, a Sketchy dataset with 70k vector sketches along with pixel images, and a ShadowDraw system that used 30k raster images along with extracted vectorized features. They are all comparatively small.<br />
<br />
<br />
== Methodology ==<br />
=== Dataset ===<br />
QuickDraw is a dataset with 50 million vector drawings collected by a online game Quick Draw! where the players are required to draw objects belonging to a particular object class in less than 20 seconds.. It contains hundreds of classes, each class has 70k training samples, 2.5k validation samples and 2.5k test samples.<br />
<br />
The data format of each sample is a representation of a pen stroke action event. The Origin is the initial coordinate of the drawing. The sketches are points in a list. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math> where x and y are the offset distance in x and y directions from the previous point. <math>p_{1}, p_{2}, p_{3})</math> are three possible states in binary one-hot representation where <math>p_{1}</math> indicates the pen is touching the paper, <math>p_{2}</math> indicates the pen will be lifted from here, and <math>p_{3}</math> represents the drawing has ended.<br />
<br />
=== Sketch-RNN ===<br />
[[File:sketchfig2.png|700px]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder(VAE). The encoder is a bidirectional RNN, the input is a sketch sequence and a reversed sketch sequence, so there will be two final hidden states. The output is a size <math>N_{z}</math> latent vector.<br />
<br />
\begin{align*}<br />
h_{ \rightarrow} = encode_{ \rightarrow }(S), <br />
h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), <br />
h = [h_{\rightarrow}; h_{\leftarrow}].<br />
\end{align*}<br />
<br />
Then the authors project <math>h</math> into to <math>\mu</math> and <math>\hat{\sigma}</math>, convert <math>\mu</math> into non-negative and use them with <math>\mathcal{N}(0,I)</math> to construct a random vector <math>z\in\mathbb{R}^{N_{z}}</math>.<br />
<br />
\begin{align*}<br />
\mu = W_\mu h + b_\mu, <br />
\hat \sigma = W_\sigma h + b_\sigma, <br />
\sigma = exp( \frac{\hat \sigma}{2}), <br />
z = \mu + \sigma \odot \mathcal{N}(0,I).<br />
\end{align*}<br />
<br />
<br />
Note that <math>z</math> is not deterministic but a conditioned random vector.<br />
<br />
The decoder is an autoregressive RNN. The initial hidden states are generated using <math>[h_0;c_0] = \tanh(W_z z+b_z)</math>. <math>S_0</math> is <math>(0,0,1,0,0)</math> For each step i in the decoder, the input <math>x_i</math> is the concatenation of previous point <math>S_{i-1}</math> and latent vector <math>z</math>. The output are probability distribution parameters for the next data point <math>S_i</math>. The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model <math>(p_1, p_2, p_3)</math> as categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2 and q_3</math> sum up to 1. The generated sequence is conditioned from the latent vector <math>z</math> that sampled from the encoder, which is end-to-end trained together with the decoder.<br />
<br />
\begin{align*}<br />
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi _j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), where \sum_{j=1}^{M} = 1<br />
\end{align*}<br />
<br />
Here the <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is the probability distribution function for <math>x,y</math>, <math>\rho_{xy}</math>is the correlation parameter for this bivariate normal distribution. The <math>\Pi</math> is a lenth M categorical distribution vector are the mixture weights of the Gaussian mixture model.<br />
<br />
The output vector <math>y_i</math> is generated using a fully-connected forward propagation in the hidden state of the RNN.<br />
<br />
\begin{align*}<br />
x_i = [S_{i-1}; z], <br />
[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), <br />
y_i = W_y h_i + b_y,<br />
y_i \in \mathbb{R}^{6M+3}.<br />
\end{align*}<br />
<br />
The output consists the probability distribution of the next data point.<br />
<br />
\begin{align*}<br />
[(\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_1 (\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_2 ... (\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_M (\hat q_1 \hat q_2 \hat q_3)] = y_i<br />
\end{align*}<br />
<br />
<math>\exp</math> and <math>\tanh</math> operations will be applied to standard deviations to ensure they are non-negative and between -1 and 1.<br />
<br />
\begin{align*}<br />
\sigma_x = \exp (\hat \sigma_x), <br />
\sigma_y = \exp (\hat \sigma_y), <br />
\rho_{xy} = \tanh(\hat \rho_{xy}). <br />
\end{align*}<br />
<br />
Categorical distribution probabilities for <math>(p_1, p_2, p_3)</math> using <math>(q_1, q_2, q_3)</math> can be obtained as :<br />
<br />
\begin{align*}<br />
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},<br />
k \in \left\{1,2,3\right\}, <br />
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},<br />
k \in \left\{1,...,M\right\}.<br />
\end{align*}<br />
<br />
It is hard to do decide when to stop drawing because <math>(p_1, p_2, p_3)</math> is very unbalanced. scholars in the past used different weights for each pen event probability, but the authors have a better idea. They define a hyperparameter representing the max length of the longest sketch in the training set <math>N_{max}</math>, and set the <math>S_i</math> to be <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>.<br />
<br />
The outcome sample <math>S_i^{'}</math> can be generated in each time step during sample process and fed as input for the next time step. The process will stop when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but conditioned random sequences. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.<br />
<br />
\begin{align*}<br />
\hat q_k \rightarrow \frac{\hat q_k}{\tau}, <br />
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau}, <br />
\sigma_x^2 \rightarrow \sigma_x^2\tau, <br />
\sigma_y^2 \rightarrow \sigma_y^2\tau. <br />
\end{align*}<br />
<br />
The <math>\tau</math> ranges from 0 to 1. When <math>\tau = 0</math> the output will be deterministic as the sample will consist on the on the peak of the probability density function.<br />
<br />
[[File:sketchfig3.png|700px]]<br />
<br />
=== Unconditional Generation ===<br />
There is a special case that only the decoder RNN module is trained. The decoder RNN could work as a standalone autoregressive model without latent variables. In this case, initial states are 0, the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. In the Figure 3, generating sketches unconditionally from the temperature parameter τ = 0.2 at the top in blue, to τ = 0.9 at the bottom in red.<br />
<br />
=== Training ===<br />
The training process is the same as a Variational Autoencoder. The loss function is the sum of Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> can be obtained with generated parameters of pdf and training data <math>S</math>. It is the sum of the <math>L_s</math> and <math>L_p</math>, which are the log loss of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>.<br />
<br />
\begin{align*}<br />
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log(\sum_{i = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x,\Delta y | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})), <br />
\end{align*}<br />
\begin{align*}<br />
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), <br />
L_R = L_s + L_p.<br />
\end{align*}<br />
<br />
<br />
Both terms are normalized by <math>N_{max}</math>.<br />
<br />
<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an IID Gaussian vector with zero mean and unit variance.<br />
<br />
\begin{align*}<br />
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))<br />
\end{align*}<br />
<br />
The overall loss is weighted as:<br />
<br />
\begin{align*}<br />
Loss = L_R + w_{KL} L_{KL}<br />
\end{align*}<br />
<br />
When <math>w_{KL} = 0</math>, the model becomes a standalone unconditional generator.<br />
<br />
== Experiments ==<br />
The authors experiment with the sketch-rnn model using different settings and recorded both losses. They used a Long Short-Term Memory(LSTM) model as an encoder and a HyperLSTM as a decoder. They also conduct multi-class datasets. The result is as follows.<br />
<br />
[[File:sketchtable1.png|700px]]<br />
<br />
We could see the trade-off between <math>L_R</math> and <math>L_{KL}</math> in this table clearly.<br />
<br />
=== Conditional Reconstruction ===<br />
The authors assess the reconstructed sketch with a given sketch with different <math>\tau</math> values. We could see that with high <math>\tau</math> value on the right, the reconstructed sketches are more random.<br />
<br />
[[File:sketchfig5.png|700px]]<br />
<br />
They also experiment on inputting a sketch from a different class. The output will still keep some features from the class that the model is trained on.<br />
<br />
=== Latent Space Interpolation ===<br />
The authors visualize the reconstruction sketches while interpolating between latent vectors using different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are more coherently interpolated.<br />
<br />
[[File:sketchfig6.png|700px]]<br />
<br />
=== Sketch Drawing Analogies ===<br />
Since the latent vector <math>z</math> encode conceptual features of a sketch, those features can also be used to augment other sketches that do not have these features. This is possible when models are trained with low <math>L_{KL}</math> values. The authors are able to perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches base on these latent spaces.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches === <br />
This model is able to predict an incomplete sketch by encoding the sketch into hidden state <math>h</math> using the decoder and then using <math>h</math> as an initial hidden state to generate the remaining sketch.<br />
<br />
[[File:sketchfig7.png|700px]]<br />
<br />
== Applications and Future Work ==<br />
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them to find interesting intersections between different drawings or objects, or generating a lot of similar but different designs.<br />
<br />
This model may also find its place on teaching students how to draw. When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetical sketch. Latent vector augmentation could also help to create a better drawing by inputting user-rating data during training processes.<br />
<br />
It exciting that they manage to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches. <br />
<br />
== Conclusion ==<br />
This paper introduced an interesting model sketch-rnn that can encode and decode sketches, generate and complete unfinished sketches. The authors demonstrated how to interpolate between latent spaces from a different class and how to use it to augment sketches or generate similar looking sketches. They also showed that it's important to enforce a prior distribution on latent vector while interpolating coherent sketch generations. Finally, they created a large sketch drawings dataset to be used in future research.<br />
<br />
== Critique ==<br />
* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing the generated sketches, it is clear and straightforward, however, not very efficient. It would be great if the authors could present a way, or a metric to evaluate how well the sketches are generated rather than printing them out and evaluate with human judgment.<br />
<br />
* Same problem as the output, the authors didn't present an evaluation for the algorithms either. They provided <math>L_R</math> and <math>L_{KL}</math> for reference, however, a lower loss doesn't represent a better performance.<br />
<br />
* I understand that using strokes as inputs is a novel and innovative move, however, the paper does not provide a baseline or any comparison with other methods or algorithms. Some other researches were mentioned in the paper, using similar and smaller datasets. It would be great if the authors could use some basic or existing methods a baseline and compare with the new algorithm.<br />
<br />
* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.<br />
<br />
* The authors proposed a few future applications for the model, however, the current output seems somehow not very close to their descriptions. But I do believe that this is a very good beginning, with the release of the sketch dataset, it must attract more scholars to research and improve with it!<br />
<br />
== References == <br />
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.<br />
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.<br />
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.<br />
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.<br />
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.<br />
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.<br />
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.<br />
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.<br />
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.<br />
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.<br />
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.<br />
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.<br />
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. https://quickdraw.withgoogle.com/, 2016. URL https: //quickdraw.withgoogle.com/.<br />
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.<br />
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.<br />
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.<br />
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.<br />
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.<br />
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.<br />
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.<br />
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.<br />
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.<br />
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.<br />
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.<br />
# T. White. Sampling Generative Networks. ArXiv e-prints, September 2016.<br />
#Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.<br />
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings&diff=40960a neural representation of sketch drawings2018-11-22T18:53:32Z<p>X832zhan: /* Dataset */</p>
<hr />
<div><br />
== Introduction ==<br />
In this paper, The authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.<br />
<br />
Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. However, people learn to draw using sequences of strokes, beginning when they are young. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do. <br />
<br />
The model is trained with hand-drawn sketches as input sequences. The model is able to produce sketches in vector format. In the conditional generation model, they also explore the latent space representation for vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).<br />
<br />
=== Terminology ===<br />
Pixel images, also referred to as raster or bitmap images are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, .bmp. <br />
<br />
Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images. <br />
<br />
For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images generally are used to store detailed images. <br />
<br />
For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation. <br />
<br />
== Related Work ==<br />
There are some works in the history that used a similar approach to generate images such as Portrait Drawing by Paul the Robot and some reinforcement learning approaches. They work more like a mimic of digitized photographs. There are some Neural network based approaches too, but those are mostly dealing with pixel images. Little work is done on vector images generation. There are models that use Hidden Markov Models or Mixture Density Networks to generate human sketches, continuous data points or vectorized Kanji characters.<br />
<br />
The model also allows us to explore the latent space representation of vector images. There are previous works that achieved similar functions as well, such as combining Sequence-to-Sequence models with Variational Autoencoder to model sentences into latent space and using probabilistic program induction to model Omniglot dataset.<br />
<br />
The dataset they use contains 50 million vector sketches. Before this paper, there is a Sketch data with 20k vector sketches, a Sketchy dataset with 70k vector sketches along with pixel images, and a ShadowDraw system that used 30k raster images along with extracted vectorized features. They are all comparatively small.<br />
<br />
<br />
== Methodology ==<br />
=== Dataset ===<br />
QuickDraw is a dataset with 50 million vector drawings collected by a online game Quick Draw! where the players are required to draw objects belonging to a particular object class in less than 20 seconds.. It contains hundreds of classes, each class has 70k training samples, 2.5k validation samples and 2.5k test samples.<br />
<br />
The data format of each sample is a representation of a pen stroke action event. The Origin is the initial coordinate of the drawing. The sketches are points in a list. Each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math> where x and y are the offset distance in x and y directions from the previous point. <math>p_{1}, p_{2}, p_{3})</math> are three possible states in binary one-hot representation where <math>p_{1}</math> indicates the pen is touching the paper, <math>p_{2}</math> indicates the pen will be lifted from here, and <math>p_{3}</math> represents the drawing has ended.<br />
<br />
=== Sketch-RNN ===<br />
[[File:sketchfig2.png|700px]]<br />
<br />
The model is a Sequence-to-Sequence Variational Autoencoder(VAE). The encoder is a bidirectional RNN, the input is a sketch sequence and a reversed sketch sequence, so there will be two final hidden states. The output is a size <math>N_{z}</math> latent vector.<br />
<br />
\begin{align*}<br />
h_{ \rightarrow} = encode_{ \rightarrow }(S), <br />
h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), <br />
h = [h_{\rightarrow}; h_{\leftarrow}].<br />
\end{align*}<br />
<br />
Then the authors project <math>h</math> into to <math>\mu</math> and <math>\hat{\sigma}</math>, convert <math>\mu</math> into non-negative and use them with <math>\mathcal{N}(0,I)</math> to construct a random vector <math>z\in\mathbb{R}^{N_{z}}</math>.<br />
<br />
\begin{align*}<br />
\mu = W_\mu h + b_\mu, <br />
\hat \sigma = W_\sigma h + b_\sigma, <br />
\sigma = exp( \frac{\hat \sigma}{2}), <br />
z = \mu + \sigma \odot \mathcal{N}(0,I).<br />
\end{align*}<br />
<br />
<br />
Note that <math>z</math> is not deterministic but a conditioned random vector.<br />
<br />
The decoder is an autoregressive RNN. The initial hidden states are generated using <math>[h_0;c_0] = \tanh(W_z z+b_z)</math>. <math>S_0</math> is <math>(0,0,1,0,0)</math> For each step i in the decoder, the input <math>x_i</math> is the concatenation of previous point <math>S_{i-1}</math> and latent vector <math>z</math>. The output are probability distribution parameters for the next data point <math>S_i</math>. The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> normal distributions and model <math>(p_1, p_2, p_3)</math> as categorical distribution <math>(q_1, q_2, q_3)</math> where <math>q_1, q_2 and q_3</math> sum up to 1. The generated sequence is conditioned from the latent vector <math>z</math> that sampled from the encoder, which is end-to-end trained together with the decoder.<br />
<br />
\begin{align*}<br />
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi _j \mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), where \sum_{j=1}^{M} = 1<br />
\end{align*}<br />
<br />
Here the <math>\mathcal{N}(\Delta x,\Delta y | \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j})</math> is the probability distribution function for <math>x,y</math>, <math>\rho_{xy}</math>is the correlation parameter for this bivariate normal distribution. The <math>\Pi</math> is a lenth M categorical distribution vector are the mixture weights of the Gaussian mixture model.<br />
<br />
The output vector <math>y_i</math> is generated using a fully-connected forward propagation in the hidden state of the RNN.<br />
<br />
\begin{align*}<br />
x_i = [S_{i-1}; z], <br />
[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), <br />
y_i = W_y h_i + b_y,<br />
y_i \in \mathbb{R}^{6M+3}.<br />
\end{align*}<br />
<br />
The output consists the probability distribution of the next data point.<br />
<br />
\begin{align*}<br />
[(\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_1 (\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_2 ... (\hat \Pi \mu_x \mu_y \hat\sigma_x \hat \sigma_y \hat \rho_{xy})_M (\hat q_1 \hat q_2 \hat q_3)] = y_i<br />
\end{align*}<br />
<br />
<math>\exp</math> and <math>\tanh</math> operations will be applied to standard deviations to ensure they are non-negative and between -1 and 1.<br />
<br />
\begin{align*}<br />
\sigma_x = \exp (\hat \sigma_x), <br />
\sigma_y = \exp (\hat \sigma_y), <br />
\rho_{xy} = \tanh(\hat \rho_{xy}). <br />
\end{align*}<br />
<br />
Categorical distribution probabilities for <math>(p_1, p_2, p_3)</math> using <math>(q_1, q_2, q_3)</math> can be obtained as :<br />
<br />
\begin{align*}<br />
q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},<br />
k \in \left\{1,2,3\right\}, <br />
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},<br />
k \in \left\{1,...,M\right\}.<br />
\end{align*}<br />
<br />
It is hard to do decide when to stop drawing because <math>(p_1, p_2, p_3)</math> is very unbalanced. scholars in the past used different weights for each pen event probability, but the authors have a better idea. They define a hyperparameter representing the max length of the longest sketch in the training set <math>N_{max}</math>, and set the <math>S_i</math> to be <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>.<br />
<br />
The outcome sample <math>S_i^{'}</math> can be generated in each time step during sample process and fed as input for the next time step. The process will stop when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but conditioned random sequences. The level of randomness can be controlled using a temperature parameter <math>\tau</math>.<br />
<br />
\begin{align*}<br />
\hat q_k \rightarrow \frac{\hat q_k}{\tau}, <br />
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau}, <br />
\sigma_x^2 \rightarrow \sigma_x^2\tau, <br />
\sigma_y^2 \rightarrow \sigma_y^2\tau. <br />
\end{align*}<br />
<br />
The <math>\tau</math> ranges from 0 to 1. When <math>\tau = 0</math> the output will be deterministic as the sample will consist on the on the peak of the probability density function.<br />
<br />
[[File:sketchfig3.png|700px]]<br />
<br />
=== Unconditional Generation ===<br />
The decoder RNN could work as a standalone autoregressive model. In this case, initial states are 0, the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>.<br />
<br />
=== Training ===<br />
The training process is the same as a Variational Autoencoder. The loss function is the sum of Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> can be obtained with generated parameters of pdf and training data <math>S</math>. It is the sum of the <math>L_s</math> and <math>L_p</math>, which are the log loss of the offset <math>(\Delta x, \Delta y)</math> and the pen state <math>(p_1, p_2, p_3)</math>.<br />
<br />
\begin{align*}<br />
L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log(\sum_{i = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x,\Delta y | \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})), <br />
\end{align*}<br />
\begin{align*}<br />
L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), <br />
L_R = L_s + L_p.<br />
\end{align*}<br />
<br />
<br />
Both terms are normalized by <math>N_{max}</math>.<br />
<br />
<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an IID Gaussian vector with zero mean and unit variance.<br />
<br />
\begin{align*}<br />
L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma))<br />
\end{align*}<br />
<br />
The overall loss is weighted as:<br />
<br />
\begin{align*}<br />
Loss = L_R + w_{KL} L_{KL}<br />
\end{align*}<br />
<br />
When <math>w_{KL} = 0</math>, the model becomes a standalone unconditional generator.<br />
<br />
== Experiments ==<br />
The authors experiment with the sketch-rnn model using different settings and recorded both losses. They used a Long Short-Term Memory(LSTM) model as an encoder and a HyperLSTM as a decoder. They also conduct multi-class datasets. The result is as follows.<br />
<br />
[[File:sketchtable1.png|700px]]<br />
<br />
We could see the trade-off between <math>L_R</math> and <math>L_{KL}</math> in this table clearly.<br />
<br />
=== Conditional Reconstruction ===<br />
The authors assess the reconstructed sketch with a given sketch with different <math>\tau</math> values. We could see that with high <math>\tau</math> value on the right, the reconstructed sketches are more random.<br />
<br />
[[File:sketchfig5.png|700px]]<br />
<br />
They also experiment on inputting a sketch from a different class. The output will still keep some features from the class that the model is trained on.<br />
<br />
=== Latent Space Interpolation ===<br />
The authors visualize the reconstruction sketches while interpolating between latent vectors using different <math>w_{KL}</math> values. With high <math>w_{KL}</math> values, the generated images are more coherently interpolated.<br />
<br />
[[File:sketchfig6.png|700px]]<br />
<br />
=== Sketch Drawing Analogies ===<br />
Since the latent vector <math>z</math> encode conceptual features of a sketch, those features can also be used to augment other sketches that do not have these features. This is possible when models are trained with low <math>L_{KL}</math> values. The authors are able to perform vector arithmetic on latent vectors from different sketches and explore how the model generates sketches base on these latent spaces.<br />
<br />
=== Predicting Different Endings of Incomplete Sketches === <br />
This model is able to predict an incomplete sketch by encoding the sketch into hidden state <math>h</math> using the decoder and then using <math>h</math> as an initial hidden state to generate the remaining sketch.<br />
<br />
[[File:sketchfig7.png|700px]]<br />
<br />
== Applications and Future Work ==<br />
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them to find interesting intersections between different drawings or objects, or generating a lot of similar but different designs.<br />
<br />
This model may also find its place on teaching students how to draw. When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help to turn a poor sketch into a more aesthetical sketch. Latent vector augmentation could also help to create a better drawing by inputting user-rating data during training processes.<br />
<br />
It exciting that they manage to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches. <br />
<br />
== Conclusion ==<br />
This paper introduced an interesting model sketch-rnn that can encode and decode sketches, generate and complete unfinished sketches. The authors demonstrated how to interpolate between latent spaces from a different class and how to use it to augment sketches or generate similar looking sketches. They also showed that it's important to enforce a prior distribution on latent vector while interpolating coherent sketch generations. Finally, they created a large sketch drawings dataset to be used in future research.<br />
<br />
== Critique ==<br />
* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing the generated sketches, it is clear and straightforward, however, not very efficient. It would be great if the authors could present a way, or a metric to evaluate how well the sketches are generated rather than printing them out and evaluate with human judgment.<br />
<br />
* Same problem as the output, the authors didn't present an evaluation for the algorithms either. They provided <math>L_R</math> and <math>L_{KL}</math> for reference, however, a lower loss doesn't represent a better performance.<br />
<br />
* I understand that using strokes as inputs is a novel and innovative move, however, the paper does not provide a baseline or any comparison with other methods or algorithms. Some other researches were mentioned in the paper, using similar and smaller datasets. It would be great if the authors could use some basic or existing methods a baseline and compare with the new algorithm.<br />
<br />
* Besides the comparison with other algorithms, it would also be great if the authors could remove or replace some component of the algorithm in the model to show if one part is necessary, or what made them decide to include a specific component in the algorithm.<br />
<br />
* The authors proposed a few future applications for the model, however, the current output seems somehow not very close to their descriptions. But I do believe that this is a very good beginning, with the release of the sketch dataset, it must attract more scholars to research and improve with it!<br />
<br />
== References == <br />
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.<br />
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.<br />
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.<br />
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.<br />
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.<br />
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.<br />
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.<br />
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.<br />
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.<br />
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.<br />
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.<br />
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.<br />
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. https://quickdraw.withgoogle.com/, 2016. URL https: //quickdraw.withgoogle.com/.<br />
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.<br />
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.<br />
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.<br />
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.<br />
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.<br />
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.<br />
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.<br />
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.<br />
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.<br />
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.<br />
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.<br />
# T. White. Sampling Generative Networks. ArXiv e-prints, September 2016.<br />
#Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.<br />
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Navigate_in_Cities_Without_a_Map&diff=40849Learning to Navigate in Cities Without a Map2018-11-22T07:48:07Z<p>X832zhan: /* Results */</p>
<hr />
<div>Paper: <br />
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]<br />
A video of the paper is available here[https://sites.google.com/view/streetlearn].<br />
<br />
== Introduction ==<br />
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.<br />
<br />
1. Building an explicit map<br />
<br />
2. Planning and acting using that map. <br />
<br />
In this article, based on this fact that human can learn to navigate through cities without using any special tool such as maps or GPS, authors propose new methods to show that a neural network agent can do the same thing by using visual observations. To do so, an interactive environment using Google StreetView Images and a dual pathway agent architecture is designed. As shown in figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate using visual aids is shown to be successful in some domains such as games and simulated environments using deep reinforcement learning (RL), it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation. That’s why it became the subject of investigation in this paper.<br />
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (neither map nor current position have not been used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]<br />
<br />
==Contribution==<br />
This paper has made the following contributions:<br />
<br />
1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.<br />
<br />
2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.<br />
<br />
3. Leveraging a recurrent neural architecture. Using that, not only could navigation through a city be possible, but also the model is scalable for navigation in new cities. This architecture supports both locale-specific learnings and general transferable navigations. The authors achieved these by separating a recurrent neural pathway. This pathway receives and interprets the current goal as well as encapsulates and memorizes features of a single region.<br />
<br />
4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.<br />
<br />
==Related Work==<br />
<br />
1. Localization from real-world imagery. For example, (Weyand et al., 2016), a CNN was able to achieve excellent results on geolocation task. This paper provides novel work by not including supervised training with ground-truth labels, and by including planning as a goal. Some other works also improve by exploiting spatiotemporal continuity or estimating camera pose or depth estimation from pixels. These methods rely on supervised training with ground truth labels, which is not possible in every environment. <br />
<br />
2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Some other researches used text descriptions to incorporate goal instructions. Researchers developed realistic, higher-fidelity environment simulations to make the experiment more realistic, but that still came with lack of diversities. This paper makes use of real-world data, in contrast to many related papers in this area. It's diverse and visually realistic but still, it does not contain dynamic elements, and the street topology cannot be regenerated or altered.<br />
<br />
3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map via an RL agent with external memory; some other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. Explicit neural mapper and navigation planner with joint training was also used. Among all these works, the target-driven visual navigation with a goal-conditional policy approach was most related to our method.<br />
<br />
==Environment==<br />
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes<br />
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m and cover a range of up to<br />
5km chosen (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.<br />
<br />
[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]<br />
<br />
==Agent Interface and the Courier Task==<br />
In an RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. Also, a first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84*84 pixels. Furthermore, the action space consists of 5 movements: “slow” rotate left or right (±22:5), “fast” rotate left or right (±67.5), or move forward (implemented as a ''noop'' in the case where this is not a viable action). The most central edge is chosen if there are multiple edges in the agents viewing cone.<br />
<br />
There are lots of ways to specify the goal to the agent. In this paper, the current goal is chosen to be represented in terms of its proximity to a set L of fixed landmarks <math> L={(Lat_k, Long_k)}</math> which are specified using Latitude and Longitude coordinate system. For distance to the <math> k_{th}</math> landmark <math>{(d_{(t,k)}^g})_k</math> the goal vector contains <math> g_{(t,i)}=\tfrac{exp(-αd_{(t,i)}^g)}{∑_k exp(-αd_{(t,k)}^g)} </math>for <math>i_{th}</math> landmark with <math>α=0.002</math> (Figure 3).<br />
<br />
[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]<br />
<br />
This form of representation has several advantages: <br />
<br />
1. It could easily be extended to new environments.<br />
<br />
2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.<br />
<br />
3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.<br />
<br />
In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. In each episode, which consists of 1000 steps, the agent starts from a random place with random orientation. when an agent gets within 100 meters of goal, the next goal is randomly chosen. An episode ends after 1000 agent steps. Finally, the reward is proportional to the shortest path between agent and goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.<br />
<br />
==Methods==<br />
<br />
===Goal-dependent Actor-Critic Reinforcement Learning===<br />
In this paper, the learning problem is based on Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} &rarr; \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. Expected return is defined as the sum of<br />
discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. Also, the expected return from a state <math>s_t</math> depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>: <br />
<br />
\begin{align}<br />
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)<br />
\end{align}<br />
<br />
Value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:<br />
<br />
\begin{align}<br />
V^{\pi}(s,g)=E[R_t]=E[Σ_{k=0}^{\infty}\gamma^kr_{t+k}|s_t=s, g_t=g]<br />
\end{align}<br />
<br />
Also, an architecture with multiple pathways is designed to support two types of learning that is required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene the agent needs to remember unique features of the scene which then help the agent to organize and remember the scenes.<br />
<br />
===Architectures===<br />
<br />
[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]<br />
<br />
The authors use neural networks to parameterize policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the Convolution network is fed to a Long Short-Term Memory (LSTM) as well as the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>.<br />
<br />
Three different architectures are described below.<br />
<br />
The '''GoalNav''' architecture (Fig. 4a) which consists of a convolutional architecture and policy LSTM. Goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.<br />
<br />
The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information. <br />
<br />
The '''MultiCityNav''' architecture (Fig. 4c) is an extension of City-Nav for learning in different cities. This is done using the parallel connection of goal LSTMs for encapsulating locale-specific features, for each city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities. So, new goal LSTMs are required to be trained in new cities.<br />
<br />
In this paper, the authors use IMPALA to train the agents because IMPALA can get similar performance to A3C. The authors use 256 actors for CityNav and 512 actors for MultiCityNav, with batch sizes of 256 or 512 respectively, and sequences are unrolled to length 50.<br />
<br />
===Curriculum Learning===<br />
In curriculum learning, the model is trained using simple examples in first steps. As soon as the model learns those examples, more complex and difficult examples would be fed to the model. In this paper, this approach is used to teach agent to navigate to further destinations. This courier task suffers from a common problem of RL tasks which is sparse rewards (similar to Montezuma’s Revenge) . To overcome this problem, a natural curriculum scheme is defined, in which sampling each new goal would be within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan)<br />
<br />
==Results==<br />
In this section, the performance of the proposed architectures on the courier task is shown.<br />
<br />
[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum– we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way<br />
as in London) are comparable and the agent achieved mean goal reward 426.]]<br />
<br />
It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:<br />
<br />
1. Goal Navigation agent.<br />
<br />
2. City Navigation Agent.<br />
<br />
3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.<br />
<br />
Also, a lower bound (Heuristic) and an upper bound(Oracle) on the performance is considered. As it is said in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search.". As it is clear in Figure 5, CityNav architecture with the previously mentioned architecture attains a higher performance and is more stable than the simpler GoalNav agent.<br />
<br />
The trajectories of the trained agent over two 1000 step episodes and the value function of the agent during navigation to a destination is shown in Figure 6.<br />
<br />
[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]<br />
<br />
Figure 7 shows that navigation policy is learned by agent successfully in St Paul’s Cathedral in London and Washington Square in New York.<br />
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach<br />
a goal (Washington Square in New York or St Paul’s Cathedral in<br />
London) from 100 start locations vs. the straight-line distance to<br />
the goal in meters. One agent step corresponds to a forward movement<br />
of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]<br />
<br />
The authors mask 25% of the possible goals and train on the remaining ones in order to investigate the generalisation capability of a trained agent. Figure 8 Showa that the agent is still able to traverse through these areas, it just never samples a goal there. <br />
[[File:fff8.png|600px|center]]<br />
<br />
A critical test for this article is to transfer model to new cities by learning a new set of landmarks, but without re-learning visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities besides freezing both the policy LSTM and the convolutional encoder. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: Training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities would increase its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Training the model in a single city using skip connection is useful. However, it is not useful in multi-city transferring.<br />
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]<br />
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 10). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]<br />
<br />
Giving early rewards before agent reaches the goal or adding random rewards (coins) to encourage exploration is investigated in this article. Figure 11a suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. Also, as it is clear from Figure 11b, reducing the density of the landmarks does not seem to reduce the performance. Based on the results, authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to asses the importance of the goal-conditioned agents, a Goal-less CityNav agent is trained by removing inputs gt. The poor performance of this agent is clear in Figure 11b. Furthermore, reducing the density of the landmarks by the ratio of 50%, 25%, and 12:5% does not reduce the performance that much. Finally, some alternative for goal representation is investigated:<br />
<br />
a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.<br />
<br />
b) Binned representation. <br />
<br />
The latitude and longitude scalar goal representations perform the best. However, since the all landmarks representation performs well while remaining independent of the coordinate system, we use this representation as the canonical one.<br />
<br />
[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]<br />
<br />
==Conclusion==<br />
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities is discussed.<br />
<br />
==Critique==<br />
1. It is not clear how this model is applicable in the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modelling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.<br />
<br />
2. This paper is only using static Google Street View images as its primary source of data. But the authors must at least complement this with other dynamic data like traffic and road blockage information for a realistic model of navigation in the world.<br />
<br />
3. The 'Transfer in Multi-City Experiments' results could be strengthened significantly via cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.<br />
<br />
==Reference==<br />
[1] Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih, Volodymir, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain, Legg, Shane, and Kavukcuoglu, Koray. Impala: Scalable distributed deep-rl with importance weighted actor-learner architec- tures. arXiv preprint arXiv:1802.01561, 2018.<br />
<br />
[2] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In Interna- tional Conference on Machine Learning, pp. 1928–1937, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:fff8.png&diff=40848File:fff8.png2018-11-22T07:47:19Z<p>X832zhan: </p>
<hr />
<div></div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Navigate_in_Cities_Without_a_Map&diff=40847Learning to Navigate in Cities Without a Map2018-11-22T07:46:38Z<p>X832zhan: /* Results */</p>
<hr />
<div>Paper: <br />
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]<br />
A video of the paper is available here[https://sites.google.com/view/streetlearn].<br />
<br />
== Introduction ==<br />
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.<br />
<br />
1. Building an explicit map<br />
<br />
2. Planning and acting using that map. <br />
<br />
In this article, based on this fact that human can learn to navigate through cities without using any special tool such as maps or GPS, authors propose new methods to show that a neural network agent can do the same thing by using visual observations. To do so, an interactive environment using Google StreetView Images and a dual pathway agent architecture is designed. As shown in figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate using visual aids is shown to be successful in some domains such as games and simulated environments using deep reinforcement learning (RL), it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation. That’s why it became the subject of investigation in this paper.<br />
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (neither map nor current position have not been used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]<br />
<br />
==Contribution==<br />
This paper has made the following contributions:<br />
<br />
1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.<br />
<br />
2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.<br />
<br />
3. Leveraging a recurrent neural architecture. Using that, not only could navigation through a city be possible, but also the model is scalable for navigation in new cities. This architecture supports both locale-specific learnings and general transferable navigations. The authors achieved these by separating a recurrent neural pathway. This pathway receives and interprets the current goal as well as encapsulates and memorizes features of a single region.<br />
<br />
4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.<br />
<br />
==Related Work==<br />
<br />
1. Localization from real-world imagery. For example, (Weyand et al., 2016), a CNN was able to achieve excellent results on geolocation task. This paper provides novel work by not including supervised training with ground-truth labels, and by including planning as a goal. Some other works also improve by exploiting spatiotemporal continuity or estimating camera pose or depth estimation from pixels. These methods rely on supervised training with ground truth labels, which is not possible in every environment. <br />
<br />
2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Some other researches used text descriptions to incorporate goal instructions. Researchers developed realistic, higher-fidelity environment simulations to make the experiment more realistic, but that still came with lack of diversities. This paper makes use of real-world data, in contrast to many related papers in this area. It's diverse and visually realistic but still, it does not contain dynamic elements, and the street topology cannot be regenerated or altered.<br />
<br />
3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map via an RL agent with external memory; some other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. Explicit neural mapper and navigation planner with joint training was also used. Among all these works, the target-driven visual navigation with a goal-conditional policy approach was most related to our method.<br />
<br />
==Environment==<br />
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes<br />
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m and cover a range of up to<br />
5km chosen (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.<br />
<br />
[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]<br />
<br />
==Agent Interface and the Courier Task==<br />
In an RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. Also, a first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84*84 pixels. Furthermore, the action space consists of 5 movements: “slow” rotate left or right (±22:5), “fast” rotate left or right (±67.5), or move forward (implemented as a ''noop'' in the case where this is not a viable action). The most central edge is chosen if there are multiple edges in the agents viewing cone.<br />
<br />
There are lots of ways to specify the goal to the agent. In this paper, the current goal is chosen to be represented in terms of its proximity to a set L of fixed landmarks <math> L={(Lat_k, Long_k)}</math> which are specified using Latitude and Longitude coordinate system. For distance to the <math> k_{th}</math> landmark <math>{(d_{(t,k)}^g})_k</math> the goal vector contains <math> g_{(t,i)}=\tfrac{exp(-αd_{(t,i)}^g)}{∑_k exp(-αd_{(t,k)}^g)} </math>for <math>i_{th}</math> landmark with <math>α=0.002</math> (Figure 3).<br />
<br />
[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]<br />
<br />
This form of representation has several advantages: <br />
<br />
1. It could easily be extended to new environments.<br />
<br />
2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.<br />
<br />
3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.<br />
<br />
In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. In each episode, which consists of 1000 steps, the agent starts from a random place with random orientation. when an agent gets within 100 meters of goal, the next goal is randomly chosen. An episode ends after 1000 agent steps. Finally, the reward is proportional to the shortest path between agent and goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.<br />
<br />
==Methods==<br />
<br />
===Goal-dependent Actor-Critic Reinforcement Learning===<br />
In this paper, the learning problem is based on Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} &rarr; \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. Expected return is defined as the sum of<br />
discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. Also, the expected return from a state <math>s_t</math> depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>: <br />
<br />
\begin{align}<br />
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)<br />
\end{align}<br />
<br />
Value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:<br />
<br />
\begin{align}<br />
V^{\pi}(s,g)=E[R_t]=E[Σ_{k=0}^{\infty}\gamma^kr_{t+k}|s_t=s, g_t=g]<br />
\end{align}<br />
<br />
Also, an architecture with multiple pathways is designed to support two types of learning that is required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene the agent needs to remember unique features of the scene which then help the agent to organize and remember the scenes.<br />
<br />
===Architectures===<br />
<br />
[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]<br />
<br />
The authors use neural networks to parameterize policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the Convolution network is fed to a Long Short-Term Memory (LSTM) as well as the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>.<br />
<br />
Three different architectures are described below.<br />
<br />
The '''GoalNav''' architecture (Fig. 4a) which consists of a convolutional architecture and policy LSTM. Goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.<br />
<br />
The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information. <br />
<br />
The '''MultiCityNav''' architecture (Fig. 4c) is an extension of City-Nav for learning in different cities. This is done using the parallel connection of goal LSTMs for encapsulating locale-specific features, for each city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities. So, new goal LSTMs are required to be trained in new cities.<br />
<br />
In this paper, the authors use IMPALA to train the agents because IMPALA can get similar performance to A3C. The authors use 256 actors for CityNav and 512 actors for MultiCityNav, with batch sizes of 256 or 512 respectively, and sequences are unrolled to length 50.<br />
<br />
===Curriculum Learning===<br />
In curriculum learning, the model is trained using simple examples in first steps. As soon as the model learns those examples, more complex and difficult examples would be fed to the model. In this paper, this approach is used to teach agent to navigate to further destinations. This courier task suffers from a common problem of RL tasks which is sparse rewards (similar to Montezuma’s Revenge) . To overcome this problem, a natural curriculum scheme is defined, in which sampling each new goal would be within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan)<br />
<br />
==Results==<br />
In this section, the performance of the proposed architectures on the courier task is shown.<br />
<br />
[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum– we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way<br />
as in London) are comparable and the agent achieved mean goal reward 426.]]<br />
<br />
It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:<br />
<br />
1. Goal Navigation agent.<br />
<br />
2. City Navigation Agent.<br />
<br />
3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.<br />
<br />
Also, a lower bound (Heuristic) and an upper bound(Oracle) on the performance is considered. As it is said in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search.". As it is clear in Figure 5, CityNav architecture with the previously mentioned architecture attains a higher performance and is more stable than the simpler GoalNav agent.<br />
<br />
The trajectories of the trained agent over two 1000 step episodes and the value function of the agent during navigation to a destination is shown in Figure 6.<br />
<br />
[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]<br />
<br />
Figure 7 shows that navigation policy is learned by agent successfully in St Paul’s Cathedral in London and Washington Square in New York.<br />
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach<br />
a goal (Washington Square in New York or St Paul’s Cathedral in<br />
London) from 100 start locations vs. the straight-line distance to<br />
the goal in meters. One agent step corresponds to a forward movement<br />
of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]<br />
<br />
The authors mask 25% of the possible goals and train on the remaining ones in order to investigate the generalisation capability of a trained agent. Figure 8 Showa that the agent is still able to traverse through these areas, it just never samples a goal there. <br />
[[File:fff8png|600px|center]]<br />
<br />
A critical test for this article is to transfer model to new cities by learning a new set of landmarks, but without re-learning visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities besides freezing both the policy LSTM and the convolutional encoder. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: Training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities would increase its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Training the model in a single city using skip connection is useful. However, it is not useful in multi-city transferring.<br />
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]<br />
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 10). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]<br />
<br />
Giving early rewards before agent reaches the goal or adding random rewards (coins) to encourage exploration is investigated in this article. Figure 11a suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. Also, as it is clear from Figure 11b, reducing the density of the landmarks does not seem to reduce the performance. Based on the results, authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to asses the importance of the goal-conditioned agents, a Goal-less CityNav agent is trained by removing inputs gt. The poor performance of this agent is clear in Figure 11b. Furthermore, reducing the density of the landmarks by the ratio of 50%, 25%, and 12:5% does not reduce the performance that much. Finally, some alternative for goal representation is investigated:<br />
<br />
a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.<br />
<br />
b) Binned representation. <br />
<br />
The latitude and longitude scalar goal representations perform the best. However, since the all landmarks representation performs well while remaining independent of the coordinate system, we use this representation as the canonical one.<br />
<br />
[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]<br />
<br />
==Conclusion==<br />
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities is discussed.<br />
<br />
==Critique==<br />
1. It is not clear how this model is applicable in the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modelling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.<br />
<br />
2. This paper is only using static Google Street View images as its primary source of data. But the authors must at least complement this with other dynamic data like traffic and road blockage information for a realistic model of navigation in the world.<br />
<br />
3. The 'Transfer in Multi-City Experiments' results could be strengthened significantly via cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.<br />
<br />
==Reference==<br />
[1] Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih, Volodymir, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain, Legg, Shane, and Kavukcuoglu, Koray. Impala: Scalable distributed deep-rl with importance weighted actor-learner architec- tures. arXiv preprint arXiv:1802.01561, 2018.<br />
<br />
[2] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In Interna- tional Conference on Machine Learning, pp. 1928–1937, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Navigate_in_Cities_Without_a_Map&diff=40845Learning to Navigate in Cities Without a Map2018-11-22T07:35:55Z<p>X832zhan: /* Architectures */</p>
<hr />
<div>Paper: <br />
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]<br />
A video of the paper is available here[https://sites.google.com/view/streetlearn].<br />
<br />
== Introduction ==<br />
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.<br />
<br />
1. Building an explicit map<br />
<br />
2. Planning and acting using that map. <br />
<br />
In this article, based on this fact that human can learn to navigate through cities without using any special tool such as maps or GPS, authors propose new methods to show that a neural network agent can do the same thing by using visual observations. To do so, an interactive environment using Google StreetView Images and a dual pathway agent architecture is designed. As shown in figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate using visual aids is shown to be successful in some domains such as games and simulated environments using deep reinforcement learning (RL), it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation. That’s why it became the subject of investigation in this paper.<br />
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (neither map nor current position have not been used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]<br />
<br />
==Contribution==<br />
This paper has made the following contributions:<br />
<br />
1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.<br />
<br />
2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.<br />
<br />
3. Leveraging a recurrent neural architecture. Using that, not only could navigation through a city be possible, but also the model is scalable for navigation in new cities. This architecture supports both locale-specific learnings and general transferable navigations. The authors achieved these by separating a recurrent neural pathway. This pathway receives and interprets the current goal as well as encapsulates and memorizes features of a single region.<br />
<br />
4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.<br />
<br />
==Related Work==<br />
<br />
1. Localization from real-world imagery. For example, (Weyand et al., 2016), a CNN was able to achieve excellent results on geolocation task. This paper provides novel work by not including supervised training with ground-truth labels, and by including planning as a goal. Some other works also improve by exploiting spatiotemporal continuity or estimating camera pose or depth estimation from pixels. These methods rely on supervised training with ground truth labels, which is not possible in every environment. <br />
<br />
2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Some other researches used text descriptions to incorporate goal instructions. Researchers developed realistic, higher-fidelity environment simulations to make the experiment more realistic, but that still came with lack of diversities. This paper makes use of real-world data, in contrast to many related papers in this area. It's diverse and visually realistic but still, it does not contain dynamic elements, and the street topology cannot be regenerated or altered.<br />
<br />
3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map via an RL agent with external memory; some other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. Explicit neural mapper and navigation planner with joint training was also used. Among all these works, the target-driven visual navigation with a goal-conditional policy approach was most related to our method.<br />
<br />
==Environment==<br />
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes<br />
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m and cover a range of up to<br />
5km chosen (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.<br />
<br />
[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]<br />
<br />
==Agent Interface and the Courier Task==<br />
In an RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. Also, a first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84*84 pixels. Furthermore, the action space consists of 5 movements: “slow” rotate left or right (±22:5), “fast” rotate left or right (±67.5), or move forward (implemented as a ''noop'' in the case where this is not a viable action). The most central edge is chosen if there are multiple edges in the agents viewing cone.<br />
<br />
There are lots of ways to specify the goal to the agent. In this paper, the current goal is chosen to be represented in terms of its proximity to a set L of fixed landmarks <math> L={(Lat_k, Long_k)}</math> which are specified using Latitude and Longitude coordinate system. For distance to the <math> k_{th}</math> landmark <math>{(d_{(t,k)}^g})_k</math> the goal vector contains <math> g_{(t,i)}=\tfrac{exp(-αd_{(t,i)}^g)}{∑_k exp(-αd_{(t,k)}^g)} </math>for <math>i_{th}</math> landmark with <math>α=0.002</math> (Figure 3).<br />
<br />
[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]<br />
<br />
This form of representation has several advantages: <br />
<br />
1. It could easily be extended to new environments.<br />
<br />
2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.<br />
<br />
3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.<br />
<br />
In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. In each episode, which consists of 1000 steps, the agent starts from a random place with random orientation. when an agent gets within 100 meters of goal, the next goal is randomly chosen. An episode ends after 1000 agent steps. Finally, the reward is proportional to the shortest path between agent and goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.<br />
<br />
==Methods==<br />
<br />
===Goal-dependent Actor-Critic Reinforcement Learning===<br />
In this paper, the learning problem is based on Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} &rarr; \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. Expected return is defined as the sum of<br />
discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. Also, the expected return from a state <math>s_t</math> depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>: <br />
<br />
\begin{align}<br />
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)<br />
\end{align}<br />
<br />
Value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:<br />
<br />
\begin{align}<br />
V^{\pi}(s,g)=E[R_t]=E[Σ_{k=0}^{\infty}\gamma^kr_{t+k}|s_t=s, g_t=g]<br />
\end{align}<br />
<br />
Also, an architecture with multiple pathways is designed to support two types of learning that is required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene the agent needs to remember unique features of the scene which then help the agent to organize and remember the scenes.<br />
<br />
===Architectures===<br />
<br />
[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]<br />
<br />
The authors use neural networks to parameterize policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the Convolution network is fed to a Long Short-Term Memory (LSTM) as well as the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>.<br />
<br />
Three different architectures are described below.<br />
<br />
The '''GoalNav''' architecture (Fig. 4a) which consists of a convolutional architecture and policy LSTM. Goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.<br />
<br />
The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information. <br />
<br />
The '''MultiCityNav''' architecture (Fig. 4c) is an extension of City-Nav for learning in different cities. This is done using the parallel connection of goal LSTMs for encapsulating locale-specific features, for each city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities. So, new goal LSTMs are required to be trained in new cities.<br />
<br />
In this paper, the authors use IMPALA to train the agents because IMPALA can get similar performance to A3C. The authors use 256 actors for CityNav and 512 actors for MultiCityNav, with batch sizes of 256 or 512 respectively, and sequences are unrolled to length 50.<br />
<br />
===Curriculum Learning===<br />
In curriculum learning, the model is trained using simple examples in first steps. As soon as the model learns those examples, more complex and difficult examples would be fed to the model. In this paper, this approach is used to teach agent to navigate to further destinations. This courier task suffers from a common problem of RL tasks which is sparse rewards (similar to Montezuma’s Revenge) . To overcome this problem, a natural curriculum scheme is defined, in which sampling each new goal would be within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan)<br />
<br />
==Results==<br />
In this section, the performance of the proposed architectures on the courier task is shown.<br />
<br />
[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum– we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way<br />
as in London) are comparable and the agent achieved mean goal reward 426.]]<br />
<br />
It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:<br />
<br />
1. Goal Navigation agent.<br />
<br />
2. City Navigation Agent.<br />
<br />
3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.<br />
<br />
Also, a lower bound (Heuristic) and an upper bound(Oracle) on the performance is considered. As it is said in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search.". As it is clear in Figure 5, CityNav architecture with the previously mentioned architecture attains a higher performance and is more stable than the simpler GoalNav agent.<br />
<br />
The trajectories of the trained agent over two 1000 step episodes and the value function of the agent during navigation to a destination is shown in Figure 6.<br />
<br />
[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]<br />
<br />
Figure 7 shows that navigation policy is learned by agent successfully in St Paul’s Cathedral in London and Washington Square in New York.<br />
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach<br />
a goal (Washington Square in New York or St Paul’s Cathedral in<br />
London) from 100 start locations vs. the straight-line distance to<br />
the goal in meters. One agent step corresponds to a forward movement<br />
of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]<br />
<br />
A critical test for this article is to transfer model to new cities by learning a new set of landmarks, but without re-learning visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities besides freezing both the policy LSTM and the convolutional encoder. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: Training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities would increase its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Training the model in a single city using skip connection is useful. However, it is not useful in multi-city transferring.<br />
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]<br />
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 10). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]<br />
<br />
Giving early rewards before agent reaches the goal or adding random rewards (coins) to encourage exploration is investigated in this article. Figure 11a suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. Also, as it is clear from Figure 11b, reducing the density of the landmarks does not seem to reduce the performance. Based on the results, authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to asses the importance of the goal-conditioned agents, a Goal-less CityNav agent is trained by removing inputs gt. The poor performance of this agent is clear in Figure 11b. Furthermore, reducing the density of the landmarks by the ratio of 50%, 25%, and 12:5% does not reduce the performance that much. Finally, some alternative for goal representation is investigated:<br />
<br />
a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.<br />
<br />
b) Binned representation. <br />
<br />
The latitude and longitude scalar goal representations perform the best. However, since the all landmarks representation performs well while remaining independent of the coordinate system, we use this representation as the canonical one.<br />
<br />
[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]<br />
<br />
==Conclusion==<br />
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities is discussed.<br />
<br />
==Critique==<br />
1. It is not clear how this model is applicable in the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modelling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.<br />
<br />
2. This paper is only using static Google Street View images as its primary source of data. But the authors must at least complement this with other dynamic data like traffic and road blockage information for a realistic model of navigation in the world.<br />
<br />
3. The 'Transfer in Multi-City Experiments' results could be strengthened significantly via cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.<br />
<br />
==Reference==<br />
[1] Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih, Volodymir, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain, Legg, Shane, and Kavukcuoglu, Koray. Impala: Scalable distributed deep-rl with importance weighted actor-learner architec- tures. arXiv preprint arXiv:1802.01561, 2018.<br />
<br />
[2] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In Interna- tional Conference on Machine Learning, pp. 1928–1937, 2016.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=conditional_neural_process&diff=40836conditional neural process2018-11-22T07:03:02Z<p>X832zhan: /* Conditional Neural Process */</p>
<hr />
<div>== Introduction ==<br />
<br />
To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however. <br />
<br />
The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets. <br />
<br />
== Model ==<br />
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of f. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.<br />
<br />
<br />
Let training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0} ^{n-1}</math>, and test set be <math display="inline"> T = \{x_i, y_i\}_{i = n} ^ {n + m - 1} \subset X</math> of unlabelled points.<br />
<br />
P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, P defines a joint distribution over the random variables <math display="inline"> {f(x_i)}_{i = 0} ^{n + m - 1}</math>. Therefore, for <math display="inline"> P(f(x)|O, T)</math>, our task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>. <br />
<br />
A common assumption made on P is that all function evaluations of <math display="inline"> f </math> is Gaussian distributed. The random functions class is called Gaussian Processes (GPs). This framework of stochastic process allows a model to be data efficient, however, it's hard to get appropriate priors and stochastic processes are expensive in computation, scaling poorly with <math>n</math> and <math>m</math>. One of examples is GPs, which has running time <math>O(n+3)^3</math>.<br />
<br />
[[File:001.jpg|300px|center]]<br />
<br />
== Conditional Neural Process ==<br />
<br />
Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNP parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes is traded off for functional flexibility and scalability.<br />
<br />
CNP is a conditional stochastic process <math display="inline">Q_\theta</math> defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>, given a set of observations <math display="inline">O</math>. For stochastic processs, we assume <math display="inline">Q_{\theta}</math> is invariant to permutations, and <math display="inline">Q_\theta(f(T) | O, T)= Q_\theta(f(T') | O, T')=Q_\theta(f(T) | O', T) </math> when <math> O', T'</math> are permutations of <math display="inline">O</math> and <math display="inline">T </math>. In this work, we generally enforce permutation invariance with respect to <math display="inline">T</math> be assuming a factored structure, which is the easiest way to ensure a valid stochastic process. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>. Moreover, this framework can be extended to non-factored distributions.<br />
<br />
In detail, we use the following archiecture<br />
<br />
<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math><br />
<br />
<math display="inline">r = r_i * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math><br />
<br />
<math display="inline">\Phi_i = g_\theta</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math><br />
<br />
Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, <math display="inline">r = r_i * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observation with minimal overhead.<br />
<br />
We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly<br />
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution<br />
P given a set of observations. The authors let <math display="inline"> f \sim P</math>, <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{n-1}</math>, and N ~ uniform[0, 1, ..... ,n-1]. Subset <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^{N}</math> that is first N elements of <math display="inline">O</math> is regarded as condition. The negative conditional log probability is given by<br />
\[\mathcal{L}(\theta)=-\mathbb{E}_{f \sim p}[\mathbb{E}_{N}[\log Q_\theta(\{y_i\}_{i = 0} ^{n-1}|O_{N}, \{x_i\}_{i = 0} ^{n-1})]]\]<br />
Thus, the targets it scores <math display="inline">Q_\theta</math> on include both the observed <br />
and unobserved values. In practice, we take Monte Carlo<br />
estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>. <br />
<br />
This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately<br />
intended to summarize their empirical experience. Still, we emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.<br />
<br />
In summary,<br />
<br />
1. A CNP is a conditional distribution over functions<br />
trained to model the empirical conditional distributions<br />
of functions <math display="inline">f \sim P</math>.<br />
<br />
2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.<br />
<br />
3. A CNP is scalable, achieving a running time complexity<br />
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math><br />
observations.<br />
<br />
== Related Work ==<br />
<br />
===Gaussian Process Framework===<br />
<br />
Several papers attempt to address various issues with GPs. These include:<br />
* Using sparse GPs to aid in scaling (Snelson & Ghahramani, 2006)<br />
* Using Deep GPs to achieve more expressivity (Damianou & Lawrence,<br />
2013; Salimbeni & Deisenroth, 2017)<br />
* Using neural networks to learn more expressive kernels (Wilson et al., 2016)<br />
<br />
===Meta Learning===<br />
<br />
Meta-Learning attempts to allow neural networks to learn more generalizable functions, as opposed to only approximating one function. This can be done by learning deep generative models which can do few-shot estimations of data. This can be implemented with attention mechanisms or additional memory.<br />
<br />
Classification is another common task in meta learning, few-shot classification algorithms usually rely on some distance metric in feature space to compare target images and the observations. Matching networks(Vinyals et al., 2016; Bartunov & Vetrov, 2016) are closely related to CNPs.<br />
<br />
Finally, the latent variant of Conditional Neural Process can also be seen as an approximated amortized version of Bayesian DL(Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017).<br />
<br />
== Experimental Result I: Function Regression ==<br />
<br />
Classical 1D regression task that used as a common baseline for GP is our first example. <br />
They generated two different datasets that consisted of functions<br />
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point. on the real line between two functions each sampled with<br />
different kernel parameters. At every training step they sampled a curve from the GP, select<br />
a subset of n points as observations, and a subset of t points as target points. Using the model, the observed points are encoded using a three layer MLP encoder h with a 128 dimensional output representation. The representations are aggregated into a single representation<br />
<math display="inline">r = \frac{1}{n} \sum r_i</math><br />
, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five layer<br />
MLP. The function outputs a Gaussian mean and variance for the target outputs. The model is trained to maximize the log-likelihood of the target points using the Adam optimizer. <br />
<br />
Two examples of the regression results obtained for each<br />
of the datasets are shown in the following figure.<br />
<br />
[[File:007.jpg|300px|center]]<br />
<br />
They compared the model to the predictions generated by a GP with the correct<br />
hyperparameters, which constitutes an upper bound on our<br />
performance. Although the prediction generated by the GP<br />
is smoother than the CNP's prediction both for the mean<br />
and variance, the model is able to learn to regress from a few<br />
context points for both the fixed kernels and switching kernels.<br />
As the number of context points grows, the accuracy<br />
of the model improves and the approximated uncertainty<br />
of the model decreases. Crucially, we see the model learns<br />
to estimate its own uncertainty given the observations very<br />
accurately. Nonetheless it provides a good approximation<br />
that increases in accuracy as the number of context points<br />
increases.<br />
Furthermore the model achieves similarly good performance<br />
on the switching kernel task. This type of regression task<br />
is not trivial for GPs whereas in our case we only have to<br />
change the dataset used for training<br />
<br />
== Experimental Result II: Image Completion for Digits ==<br />
<br />
[[File:002.jpg|600px|center]]<br />
<br />
They also tested CNP on the MNIST dataset and use the test<br />
set to evaluate its performance. As shown in the above figure the<br />
model learns to make good predictions of the underlying<br />
digit even for a small number of context points. Crucially,<br />
when conditioned only on one non-informative context point the model’s prediction corresponds<br />
to the average over all MNIST digits. As the number<br />
of context points increases the predictions become more<br />
similar to the underlying ground truth. This demonstrates<br />
the model’s capacity to extract dataset specific prior knowledge.<br />
It is worth mentioning that even with a complete set<br />
of observations the model does not achieve pixel-perfect<br />
reconstruction, as we have a bottleneck at the representation<br />
level.<br />
Since this implementation of CNP returns factored outputs,<br />
the best prediction it can produce given limited context<br />
information is to average over all possible predictions that<br />
agree with the context. An alternative to this is to add<br />
latent variables in the model such that they can be sampled<br />
conditioned on the context to produce predictions with high<br />
probability in the data distribution. <br />
<br />
<br />
An important aspect of the model is its ability to estimate<br />
the uncertainty of the prediction. As shown in the bottom<br />
row of the above figure, as they added more observations, the variance<br />
shifts from being almost uniformly spread over the digit<br />
positions to being localized around areas that are specific<br />
to the underlying digit, specifically its edges. Being able to<br />
model the uncertainty given some context can be helpful for<br />
many tasks. One example is active exploration, where the<br />
model has a choice over where to observe.<br />
They tested this by<br />
comparing the predictions of CNP when the observations<br />
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active<br />
exploration, but it already produces better prediction results<br />
than selecting the conditioning points at random.<br />
<br />
== Experimental Result III: Image Completion for Faces ==<br />
<br />
<br />
[[File:003.jpg|400px|center]]<br />
<br />
<br />
They also applied CNP to CelebA, a dataset of images of<br />
celebrity faces, and reported performance obtained on the<br />
test set.<br />
<br />
As shown in the above figure our model is able to capture<br />
the complex shapes and colours of this dataset with predictions<br />
conditioned on less than 10% of the pixels being<br />
already close to ground truth. As before, given few context<br />
points the model averages over all possible faces, but as<br />
the number of context pairs increases the predictions capture<br />
image-specific details like face orientation and facial<br />
expression. Furthermore, as the number of context points<br />
increases the variance is shifted towards the edges in the<br />
image.<br />
<br />
[[File:004.jpg|400px|center]]<br />
<br />
An important aspect of CNPs demonstrated in the above figure is<br />
its flexibility not only in the number of observations and<br />
targets it receives but also with regards to their input values.<br />
It is interesting to compare this property to GPs on one hand,<br />
and to trained generative models (van den Oord et al., 2016;<br />
Gregor et al., 2015) on the other hand.<br />
The first type of flexibility can be seen when conditioning on<br />
subsets that the model has not encountered during training.<br />
Consider conditioning the model on one half of the image,<br />
fox example. This forces the model to not only predict pixel<br />
values according to some stationary smoothness property of<br />
the images, but also according to global spatial properties,<br />
e.g. symmetry and the relative location of different parts of<br />
faces. As seen in the first row of the figure, CNPs are able to<br />
capture those properties. A GP with a stationary kernel cannot<br />
capture this, and in the absence of observations would<br />
revert to its mean (the mean itself can be non-stationary but<br />
usually this would not be enough to capture the interesting<br />
properties).<br />
<br />
In addition, the model is flexible with regards to the target<br />
input values. This means, e.g., we can query the model<br />
at resolutions it has not seen during training. We take a<br />
model that has only been trained using pixel coordinates of<br />
a specific resolution, and predict at test time subpixel values<br />
for targets between the original coordinates. As shown in<br />
Figure 5, with one forward pass we can query the model at<br />
different resolutions. While GPs also exhibit this type of<br />
flexibility, it is not the case for trained generative models,<br />
which can only predict values for the pixel coordinates on<br />
which they were trained. In this sense, CNPs capture the best<br />
of both worlds – it is flexible in regards to the conditioning<br />
and prediction task, and has the capacity to extract domain<br />
knowledge from a training set.<br />
<br />
[[File:010.jpg|400px|center]]<br />
<br />
<br />
They compared CNPs quantitatively to two related models:<br />
kNNs and GPs. As shown in the above table CNPs outperform<br />
the latter when number of context points is small (empirically<br />
when half of the image or less is provided as context).<br />
When the majority of the image is given as context exact<br />
methods like GPs and kNN will perform better. From the table<br />
we can also see that the order in which the context points<br />
are provided is less important for CNPs, since providing the<br />
context points in order from top to bottom still results in<br />
good performance. Both insights point to the fact that CNPs<br />
learn a data-specific ‘prior’ that will generate good samples<br />
even when the number of context points is very small.<br />
<br />
== Experimental Result IV: Classification ==<br />
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes<br />
of characters from 50 different alphabets. Each class has<br />
only 20 examples and as such this dataset is particularly<br />
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as<br />
their training set and the remainder as our testing data set.<br />
This includes cropping<br />
the image from 32 × 32 to 28 × 28, applying small random<br />
translations and rotations to the inputs, and also increasing<br />
the number of classes by rotating every character by 90<br />
degrees and defining that to be a new class. They generated<br />
the labels for an N-way classification task by choosing N<br />
random classes at each training step and arbitrarily assigning<br />
the labels 0, ..., N − 1 to each.<br />
<br />
<br />
[[File:008.jpg|400px|center]]<br />
<br />
Given that the input points are images, they modified the architecture<br />
of the encoder h to include convolution layers as<br />
mentioned in section 2. In addition they only aggregated over<br />
inputs of the same class by using the information provided<br />
by the input label. The aggregated class-specific representations<br />
are then concatenated to form the final representation.<br />
Given that both the size of the class-specific representations<br />
and the number of classes are constant, the size of the final<br />
representation is still constant and thus the O(n + m)<br />
runtime still holds.<br />
The results of the classification are summarized in the following table<br />
CNPs achieve higher accuracy than models that are significantly<br />
more complex (like MANN). While CNPs do not<br />
beat state of the art for one-shot classification our accuracy<br />
values are comparable. Crucially, they reached those values<br />
using a significantly simpler architecture (three convolutional<br />
layers for the encoder and a three-layer MLP for the<br />
decoder) and with a lower runtime of O(n + m) at test time<br />
as opposed to O(nm)<br />
<br />
== Conclusion ==<br />
<br />
In this paper they had introduced Conditional Neural Processes,<br />
a model that is both flexible at test time and has the<br />
capacity to extract prior knowledge from training data.<br />
<br />
We had demonstrated its ability to perform a variety of tasks<br />
including regression, classification and image completion.<br />
We compared CNPs to Gaussian Processes on one hand, and<br />
deep learning methods on the other, and also discussed the<br />
relation to meta-learning and few-shot learning.<br />
It is important to note that the specific CNP implementations<br />
described here are just simple proofs-of-concept and can<br />
be substantially extended, e.g. by including more elaborate<br />
architectures in line with modern deep learning advances.<br />
To summarize, this work can be seen as a step towards learning<br />
high-level abstractions, one of the grand challenges of<br />
contemporary machine learning. Functions learned by most<br />
Conditional Neural Processes<br />
conventional deep learning models are tied to a specific, constrained<br />
statistical context at any stage of training. A trained<br />
CNP is more general, in that it encapsulates the high-level<br />
statistics of a family of functions. As such it constitutes a<br />
high-level abstraction that can be reused for multiple tasks.<br />
In future work they are going to explore how far these models can<br />
help in tackling the many key machine learning problems<br />
that seem to hinge on abstraction, such as transfer learning,<br />
meta-learning, and data efficiency.<br />
<br />
== Other Sources ==<br />
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]<br />
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]<br />
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]<br />
<br />
== Reference ==<br />
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative<br />
models with generative matching networks. arXiv<br />
preprint arXiv:1612.02192, 2016.<br />
<br />
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,<br />
D. Weight uncertainty in neural networks. arXiv preprint<br />
arXiv:1505.05424, 2015.<br />
<br />
Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.<br />
Variational memory addressing in generative models. In<br />
Advances in Neural Information Processing Systems, pp.<br />
3923–3932, 2017.<br />
<br />
Damianou, A. and Lawrence, N. Deep gaussian processes.<br />
In Artificial Intelligence and Statistics, pp. 207–215,<br />
2013.<br />
<br />
Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and<br />
Kohli, P. Neural program meta-induction. In Advances in<br />
Neural Information Processing Systems, pp. 2077–2085,<br />
2017.<br />
<br />
Edwards, H. and Storkey, A. Towards a neural statistician.<br />
2016.<br />
<br />
Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning<br />
for fast adaptation of deep networks. arXiv<br />
preprint arXiv:1703.03400, 2017.<br />
<br />
Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:<br />
Representing model uncertainty in deep learning.<br />
In international conference on machine learning, pp.<br />
1050–1059, 2016.<br />
<br />
Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards<br />
deep symbolic reinforcement learning. arXiv preprint<br />
arXiv:1609.05518, 2016.<br />
<br />
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and<br />
Wierstra, D. Draw: A recurrent neural network for image<br />
generation. arXiv preprint arXiv:1502.04623, 2015.<br />
<br />
Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The<br />
variational homoencoder: Learning to infer high-capacity<br />
generative models from few examples. 2018.<br />
<br />
J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,<br />
et al. One-shot generalization in deep generative models.<br />
In International Conference on Machine Learning, pp.<br />
1521–1529, 2016.<br />
<br />
Kingma, D. P. and Ba, J. Adam: A method for stochastic<br />
optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
Kingma, D. P. and Welling, M. Auto-encoding variational<br />
bayes. arXiv preprint arXiv:1312.6114, 2013.<br />
<br />
Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural<br />
networks for one-shot image recognition. In ICML Deep<br />
Learning Workshop, volume 2, 2015.<br />
<br />
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.<br />
Human-level concept learning through probabilistic program<br />
induction. Science, 350(6266):1332–1338, 2015.<br />
<br />
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,<br />
S. J. Building machines that learn and think like<br />
people. Behavioral and Brain Sciences, 40, 2017.<br />
<br />
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased<br />
learning applied to document recognition. Proceedings<br />
of the IEEE, 86(11):2278–2324, 1998.<br />
<br />
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face<br />
attributes in the wild. In Proceedings of International<br />
Conference on Computer Vision (ICCV), December 2015.<br />
<br />
Louizos, C. and Welling, M. Multiplicative normalizing<br />
flows for variational bayesian neural networks. arXiv<br />
preprint arXiv:1703.01961, 2017.<br />
<br />
Louizos, C., Ullrich, K., and Welling, M. Bayesian compression<br />
for deep learning. In Advances in Neural Information<br />
Processing Systems, pp. 3290–3300, 2017.<br />
<br />
Rasmussen, C. E. and Williams, C. K. Gaussian processes<br />
in machine learning. In Advanced lectures on machine<br />
learning, pp. 63–71. Springer, 2004.<br />
<br />
Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,<br />
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot<br />
autoregressive density estimation: Towards learning to<br />
learn distributions. 2017.<br />
<br />
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic<br />
backpropagation and approximate inference in deep generative<br />
models. arXiv preprint arXiv:1401.4082, 2014.<br />
<br />
Salimbeni, H. and Deisenroth, M. Doubly stochastic variational<br />
inference for deep gaussian processes. In Advances<br />
in Neural Information Processing Systems, pp.<br />
4591–4602, 2017.<br />
<br />
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and<br />
Lillicrap, T. One-shot learning with memory-augmented<br />
neural networks. arXiv preprint arXiv:1605.06065, 2016.<br />
<br />
Snell, J., Swersky, K., and Zemel, R. Prototypical networks<br />
for few-shot learning. In Advances in Neural Information<br />
Processing Systems, pp. 4080–4090, 2017.<br />
<br />
Snelson, E. and Ghahramani, Z. Sparse gaussian processes<br />
using pseudo-inputs. In Advances in neural information<br />
processing systems, pp. 1257–1264, 2006.<br />
<br />
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,<br />
O., Graves, A., et al. Conditional image generation with<br />
pixelcnn decoders. In Advances in Neural Information<br />
Processing Systems, pp. 4790–4798, 2016.<br />
<br />
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.<br />
Matching networks for one shot learning. In Advances in<br />
Neural Information Processing Systems, pp. 3630–3638,<br />
2016.<br />
<br />
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,<br />
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and<br />
Botvinick, M. Learning to reinforcement learn. arXiv<br />
preprint arXiv:1611.05763, 2016.<br />
<br />
Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.<br />
Deep kernel learning. In Artificial Intelligence and Statistics,<br />
pp. 370–378, 2016.<br />
<br />
Damianou, A. and Lawrence, N. Deep gaussian processes.<br />
In Artificial Intelligence and Statistics, pp. 207–215,<br />
2013.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mapping_Images_to_Scene_Graphs_with_Permutation-Invariant_Structured_Prediction&diff=40830Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction2018-11-22T06:34:03Z<p>X832zhan: /* Proof Sketch for Theorem 1 */</p>
<hr />
<div>The paper ''Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction'' was written by Roei Herzig* from Tel Aviv University, Moshiko Raboh* from Tel Aviv University, Gal Chechik from Google Brain, Bar-Ilan University, Jonathan Berant from Tel Aviv University, and Amir Globerson from Tel Aviv University. This paper is part of the NIPS 2018 conference to be hosted in December 2018 at Montréal, Canada. This paper summary is based on version 3 of the pre-print (as of May 2018) obtained from [https://arxiv.org/pdf/1802.05451v3.pdf arXiv] <br />
<br />
(*) Equal contribution<br />
<br />
=Motivation=<br />
In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there is a disconnect for what guidelines should be used when leveraging deep learning. This paper introduces a design principle for such models that stem from the concept of permutation invariance and proves state of the art performance on models that follow this principle.<br />
<br />
The primary contributions that this paper makes include:<br />
# Deriving sufficient and necessary conditions for respecting graph-permutation invariance in deep structured prediction architectures<br />
# Empirically proving the benefit of graph-permutation invariance<br />
# Developing a state-of-the-art model for scene graph predictions over a large set of complex visual scenes<br />
<br />
=Introduction=<br />
In order to make a machine to interpret complex visual scenes, it must recognize and understand both objects and relationships between the objects in the scene. A '''scene graph''' is a representation of the set of objects and relations that exist in the scene, where objects are represented as nodes, relations are represented as edges connecting the different nodes. Hence, the prediction of the scene graph is analogous to inferring the joint set of objects and relations of a visual scene.<br />
<br />
[[File:scene_graph_example.png|600px|center]]<br />
<br />
Given that objects in scenes are interdependent on each other, joint prediction of the objects and relations is necessary. The field of structured prediction, which involves the general problem of inferring multiple inter-dependent labels, is of interest for this problem.<br />
<br />
In structured prediction models, a score function <math>s(x, y)</math> is defined to evaluate the compatibility between label <math>y</math> and input <math>x</math>. For instance, when interpreting the scene of an image, <math>x</math> refers to the image itself, and <math>y</math> refers to a complex label, which contains both the objects and the relations between objects. As with most other inference methods, the goal is to find the label <math>y^*</math> such that <math>s(x,y)</math> is maximized, <math> y^*=argmax_y s(x,y)</math>. However, the major concern is that the space for possible label assignments grows exponentially with respect to input size. For example, although an image may seem very simple, the corpus containing possible labels for objects may be very large, rendering it difficult to optimize the scoring function. <br />
<br />
The paper presents an alternative approach, for which input <math>x</math> is mapped to structured output <math>y</math> using a "black box" neural network, omitting the definition of a score function. The main concern for this approach is the determination of the network architecture.<br />
<br />
The model is evaluated by firstly demonstrating the importance of permutation invariance on a synthetic data set. The approach laid out by the authors is then shown to respect permutation invariance, and results are compared to a competitive benchmark. This method achieves state-of-the-art results.<br />
<br />
=Structured prediction=<br />
This paper further considers structured predictions using score-based methods. For structured predictions that follow a score-based approach, a score function <math>s(x, y)</math> is used to measure how compatible label <math>y</math> is for input <math>x</math> and is also used to infer a label by maximizing <math>s(x, y)</math>. To optimize the score function, previous works have decomposed <math>s(x,y) = \sum_i f_i(x,y)</math> in order to facilitate efficient optimization which is done by optimizing the local score function, <math>\max_y f_i(x,y)</math>, with a small subset of the <math>y</math> variables.<br />
<br />
Recently, modeling the <math>f_i </math> functions as deep networks is a new interest. In such area of structured predictions, the most commonly-used score functions include the singleton score function <math>f_i(y_i, x)</math> and pairwise score function <math>f_{ij} (y_i, y_j, x)</math>. Previous works explored a two-stage architectures (learn local scores independently of the structured prediction goal), end-to-end architectures (to include the inference algorithm within the computation graph), and modelling global factors. <br />
<br />
==Advantages of using score-based methods==<br />
# Allow for intuitive specification of local dependencies between labels, and how they map to global dependencies<br />
# Linear score functions offer natural convex surrogates<br />
# Inference in large label space is sometimes possible via exact algorithms or empirically accurate approximations<br />
<br />
The concern for modelling score functions using deep networks is that learning may no longer be convex. Hence, the paper presents properties for how deep networks can be used for structured predictions by considering architectures that do not require explicit maximization of a score function.<br />
<br />
=Background, Notations, and Definitions=<br />
We denote <math>y</math> as a structured label where <math>y = [y_1, \dots, y_n]</math><br />
<br />
'''Score functions:''' for score-based methods, the score is defined as either the sum of a set of singleton scores <math>f_i = f_i(y_i, x)</math> or the sum of pairwise scores <math>f_{ij} = f_{ij}(y_i, y_j, x)</math>.<br />
<br />
Let <math>s(x,y)</math> be the score of a score-based method. Then:<br />
<br />
<div align="center"><br />
<math>s(x,y) = \begin{cases}<br />
\sum_i f_i ~ \text{if we have a set of singleton scores}\\<br />
\sum_{ij} f_{ij} ~ \text{if we have a set of pairwise scores } \\<br />
\end{cases}</math><br />
</div><br />
<br />
'''Inference algorithm:''' an inference algorithm takes input set of local scores (either <math>f_i</math> or <math>f_{ij}</math>) and outputs an assignment of labels <math>y_1, \dots, y_n</math> that maximizes score function <math>s(x,y)</math><br />
<br />
'''Graph labeling function:''' a graph labeling function <math>\mathcal{F} : (V,E) \rightarrow Y</math> is a function that takes input of: an ordered set of node features <math>V = [z_1, \dots, z_n]</math> and an ordered set of edge features <math>E = [z_{1,2},\dots,z_{i,j},\dots,z_{n,n-1}]</math> to output set of node labels <math>\mathbf{y} = [y_1, \dots, y_n]</math>. For instance, <math>z_i</math> can be set equal to <math>f_i</math> and <math>z_{ij}</math> can be set equal to <math>f_{ij}</math>.<br />
<br />
For convenience, the joint set of nodes and edges will be denoted as <math>\mathbf{z}</math> to be a size <math>n^2</math> vector (<math>n</math> nodes and <math>n(n-1)</math> edges).<br />
<br />
'''Permutation:''' Let <math>z</math> be a set of node and edge features. Given a permutation <math>\sigma</math> of <math>\{1,\dots,n\}</math>, let <math>\sigma(z)</math> be a new set of node and edge features given by [<math>\sigma(z)]_i = z_{\sigma(i)}</math> and <math>[\sigma(z)]_{i,j} = z_{\sigma(i), \sigma(j)}</math><br />
<br />
'''One-hot representation:''' <math>\mathbf{1}[j]</math> be a one-hot vector with 1 in the <math>j^{th}</math> coordinate<br />
<br />
=Permutation-Invariant Structured prediction=<br />
<br />
With permutation-invariant structured prediction, we would expect the algorithm to produce the same result given the same score function. For instance, consider the case where we have label space for 3 variables <math>y_1, y_2, y_3</math> with input <math>\mathbf{z} = (f_1, f_2, f_3, f_{12}, f_{13}, f_{23})</math> that outputs label <math>\mathbf{y} = (y_1^*, y_2^*, y_3^*)</math>. Then if the algorithm is run on a permuted version input <math>z' = (f_2, f_1, f_3, f_{21}, f_{23}, f_{13})</math>, we would expect <math>\mathbf{y} = (y_2^*, y_1^*, y_3^*)</math> given the same score function.<br />
<br />
'''Graph permutation invariance (GPI):''' a graph labeling function <math>\mathcal{F}</math> is graph-permutation invariant, if for all permutations <math>\sigma</math> of <math>\{1, \dots, n\}</math> and for all nodes <math>z</math>, <math>\mathcal{F}(\sigma(\mathbf{z})) = \sigma(\mathcal{F}(\mathbf{z}))</math><br />
<br />
The paper presents a theorem on the necessary and sufficient conditions for a function <math>\mathcal{F}</math> to be graph permutation invariant. Intuitively, because <math>\mathcal{F}</math> is a function that takes an ordered set <math>z</math> as input, the output on <math>\mathbf{z}</math> could very well be different from <math>\sigma(\mathbf{z})</math>, which means <math>\mathcal{F}</math> needs to have some sort of symmetry in order to sustain <math>[\mathcal{F}(\sigma(\mathbf{z}))]]_k = [\mathcal{F}(\mathbf{z})]_{\sigma(k)}</math>.<br />
<br />
[[File:graph_permutation_invariance.jpg|400px|center]]<br />
<br />
==Theorem 1==<br />
Let <math>\mathcal{F}</math> be a graph labeling function. Then <math>\mathcal{F}</math> is graph-permutation invariant if and only if there exist functions <math>\alpha, \rho, \phi</math> such that for all <math>k=1, .., n</math>:<br />
\begin{align}<br />
[\mathcal{F}(\mathbf{z})]_k = \rho(\mathbf{z}_k, \sum_{i=1}^n \alpha(\mathbf{z}_i, \sum_{i\neq j} \phi(\mathbf{z}_i, \mathbf{z}_{i,j}, \mathbf{z}_j)))<br />
\end{align}<br />
where <math>\phi: \mathbb{R}^{2d+e} \rightarrow \mathbb{R}^L, \alpha: \mathbb{R}^{d + L} \rightarrow \mathbb{R}^{W}, p: \mathbb{R}^{W+d} \rightarrow \mathbb{R}</math>.<br />
<br />
Notice that for the dimensions of inputs and outputs, <math>d</math> refers to the number of singleton features in <math>z</math> and <math>e</math> refers to the number of edges. <br />
<br />
[[File:GPI_architecture.jpg|thumb|A schematic representation of the GPI architecture. Singleton features <math>z_i</math> are omitted for simplicity. First, the features <math>z_{i,j}</math> are processed element-wise by <math>\phi</math>. Next, they are summed to create a vector <math>s_i</math>, which is concatenated with <math>z_i</math>. Third, a representation of the entire graph is created by applying <math>\alpha\ n</math> times and summing the created vector. The graph representation is then finally processed by <math>\rho</math> together with <math>z_k</math>.|600px|center]]<br />
<br />
==Proof Sketch for Theorem 1==<br />
The proof of this theorem can be found in the paper. A proof sketch is provided below:<br />
<br />
'''For the forward direction''' (function that follows the form set out in equation (1) is GPI):<br />
# Using definition of permutation <math>\sigma</math>, and rewriting <math>[F(z)]_{\sigma(k)}</math> in the form from equation (1)<br />
# Second argument of <math>\rho</math> is invariant under <math>\sigma</math>, since it takes the sum of all indices <math>i</math> and all other indices <math>j \neq i </math>.<br />
<br />
'''For the backward direction''' (any black-box GPI function can be expressed in the form of equation 1):<br />
# Construct <math>\phi, \alpha</math> such that second argument of <math>\rho</math> contains all information about graph features of <math>z</math>, including edges that the features originate from<br />
# Assume each <math>z_k</math> uniquely identifies the node and <math>\mathcal{F}</math> is a function only of pairwise features <math>z_{i,j}</math><br />
# Construct <math>H</math> be a perfect hash function with <math>L</math> buckets, and <math>\phi</math> which maps '''pairwise features''' to a vector of size <math>L</math><br />
# <math>*</math>Construct <math>\phi(z_i, z_{i,j}, z_j) = \mathbf{1}[H(z_j)] z_{i,j}</math>, which intuitively means that <math>\phi</math> stores <math>z_{i,j}</math> in the unique bucket for node <math>j</math><br />
# Construct function <math>\alpha</math> to output a matrix <math>\mathbb{R}^{L \times L}</math> that maps each pairwise feature into unique positions (<math>\alpha(z_i, s_i) = \mathbf{1}[H(z_i)]s_i^T</math>)<br />
# Construct matrix <math>M = \sum_i \alpha(z_i,s_i)</math> by discarding rows/columns in <math>M</math> that do not correspond to original nodes (which reduces dimension to <math>n\times n</math>; set <math>\rho</math> to have same outcome as <math>\mathcal{F}</math>, and set the output of <math>\mathcal{F}</math> on <math>M</math> to be the labels <math>\mathbf{y} = y_1, \dots, y_n</math><br />
<br />
<math>*</math>The paper presents the proof for the edge features <math>z_{ij}</math> being scalar (<math>e = 1</math>) for simplicity, which can be extended easily to vectors with additional indexing.<br />
<br />
Although the results discussed previously apply to complete graphs (edges apply to all feature pairs), it can be easily extended to incomplete graphs. For incomplete graphs, the input to F only contains the features corresponding to valid edges of the graph. The authors are only interested in invariances that preserve the graph structure. Thus, in place of permutation-invariance, it is now an automorphism-invariance.<br />
<br />
==Implications and Applications of Theorem 1==<br />
===Key Implications of Theorem 1===<br />
# Architecture "collects" information from the different edges of the graph, and does so in an invariant fashion using <math>\alpha</math> and <math>\phi</math><br />
# Architecture is parallelizable, since all <math>\phi</math> functions can be applied simultaneously<br />
<br />
===Some applications of Theorem 1===<br />
# '''Attention:''' the concept of attention can be implemented in the GPI characterization, with slight alterations to the functions <math>\alpha</math> and <math>\phi</math>. In attention each node aggregates features of neighbours through a function of neighbour's relevance. Which means the lable of an entity could depend strongly on its close entity. The complete details can be found in the supplementary materials of the paper.<br />
<br />
# '''RNN:''' recurrent architectures can maintain GPI property, since all GPI function <math>\mathcal{F}</math> are closed under composition. The output of one step after running <math>\mathcal{F}</math> will act as input for the next step, but maintain the GPI property throughout.<br />
<br />
=Related Work=<br />
# '''Architectural invariance:''' suggested recently in a 2017 paper called Deep Sets by Zaheer et al., which considers the case of invariance that is more restrictive.<br />
# '''Deep structured prediction:''' previous work applied deep learning to structured prediction, for instance, semantic segmentation. Some algorithms include message passing algorithms, gradient descent for maximizing score functions, greedy decoding (inference of labels based on time of previous labels). Apart from those algorithms, deep learning has been applied to other graph-based problems such as the Travelling Salesman Problem (Bello et al., 2016; Gilmer et al., 2017; Khalil et al., 2017). However, none of the previous work specifically address the notion of invariance in the general architecture, but rather focus on message passing architectures that can be generalized by this paper.<br />
# '''Scene graph prediction:''' scene graph extraction allows for reasoning, question answering, and image retrieval (Johnson et al., 2015; Lu et al., 2016; Raposo et al., 2017). Some other works in this area include object detection, action recognition, and even detection of human-object interactions (Liao et al., 2016; Plummer et al., 2017). Additional work has been done with the use of message passing algorithms (Xu et al., 2017), word embeddings (Lu et al., 2016), and end-to-end prediction directly from pixels (Newell & Deng, 2017). A notable mention is NeuralMotif (Zellers et al., 2017), which the authors describe as the current state-of-the-art model for scene graph predictions on Visual Genome dataset.<br />
# '''Burst Image Deblurring Using Permutation Invariant Convolutional Neural Networks:''' similar ideas were applied, where Permutation Invariant CNN, are used to restore sharp and noise-free images from bursts of photographs affected by hand tremor and noise. This presented good quality images with lots of details for challenging datasets.<br />
<br />
=Experimental Results=<br />
==Synthetic Graph Labeling==<br />
The authors created a synthetic problem to study GPI. This involved using an input graph <math>G = (V,E)</math> where each node <math>i</math> belongs to the set <math>\Gamma(i) \in \{1, \dots, K\}</math> where <math>K</math> is the number of samples. The task is to compute for each node, the number of neighbours that belong to the same set (i.e. finding the label of the node <math>i</math> if <math>y_i = \sum_{j \in N(i)} \mathbf{1}[\Gamma(i) = \Gamma(j)]</math>) . Then, random graphs (each with 10 nodes) were generated by sampling edges, and the set <math>\Gamma(i) \in \{1, \dots, K\}</math>for each node independently and uniformly.<br />
The node features of the graph <math>z_i \in \{0,1\}^K</math> are one-hot vectors of <math>\Gamma(i)</math>, and each pairwise edge feature <math>z_{ij} \in \{0, 1\}</math> denote whether the edge <math>ij</math> is in the edge set <math>E</math>. <br />
3 architectures were studied in this paper:<br />
# '''GPI-architecture for graph prediction''' (without attention and RNN)<br />
# '''LSTM''': replacing <math>\sum \phi(\cdot)</math> and <math>\sum \alpha(\cdot)</math> in the form of Theorem 1 using two LSTMs with state size 200, reading their input in random order<br />
# '''Fully connected feed-forward network''': with 2 hidden layers, each layer containing 1,000 nodes; the input is a concatenation of all nodes and pairwise features, and the output is all node predictions<br />
<br />
The results show that the GPI architecture requires far fewer samples to converge to the correct solution.<br />
[[File:GPI_synthetic_example.jpg|450px|center]]<br />
<br />
==Scene-Graph Classification==<br />
Applying the concept of GPI to Scene-Graph Prediction (SGP) is the main task of this paper. The input to this problem is an image, along with a set of annotated bounding boxes for the entities in the image. The goal is to correctly label each entity within the bounding boxes and the relationship between every pair of entities, resulting in a coherent scene graph.<br />
<br />
The authors describe two different types of variables to predict. The first type is entity variables <math>[y_1, \dots, y_n]</math> for all bounding boxes, where each <math>y_i</math> can take one of L values and refers to objects such as "dog" or "man". The second type is relation variables <math>[y_{n+1}, \cdots, y_{n^2}]</math>, where each <math>y_i</math> represents the relation (e.g. "on", "below") between a pair of bounding boxes (entities).<br />
<br />
The scene graph and contain two types of edges:<br />
# '''Entity-entity edge''': connecting two entities <math>y_i</math> and <math>y_j</math> for <math>1 \leq i \neq j \leq n</math><br />
# '''Entity-relation edges''': connecting every relation variable <math>y_k</math> for <math>k > n</math> to two entities<br />
<br />
The feature set <math>\mathbf{z}</math> is based on the baseline model from Zellers et al. (2017). For entity variables <math>y_i</math>, the vector <math>\mathbf{z}_i \in \mathbb{R}^L</math> models the probability of the entity appearing in <math>y_i</math>. <math>\mathbf{z}_i</math> is augmented by the coordinates of the bounding box. Similarly for relation variables <math>y_j</math>, the vector <math>\mathbf{z}_j \in \mathbb{R}^R</math>, models the probability of the relations between the two entities in <math>j</math>. For entity-entity pairwise features <math>\mathbf{z}_{i,j}</math>, there is a similar representation of the probabilities for the pair. The SGP outputs probability distributions over all entities and relations, which will then be used as input recurrently to maintain GPI. Finally, word embeddings are used and concatenated for the most probable entity-relation labels.<br />
<br />
'''Components of the GPI architecture''' (ent for entity, rel for relation)<br />
# <math>\phi_{ent}</math>: network that integrates two entity variables <math>y_i</math> and <math>y_j</math>, with input <math>z_i, z_j, z_{i,j}</math> and output vector of <math>\mathbb{R}^{n_1}</math> <br />
# <math>\alpha_{ent}</math>: network with inputs from <math>\phi_{ent}</math> for all neighbours of an entity, and uses attention mechanism to output vector <math>\mathbb{R}^{n_2}</math> <br />
# <math>\rho_{ent}</math>: network with inputs from the various <math>\mathbb{R}^{n_2}</math> vectors, and outputs <math>L</math> logits to predict entity value<br />
# <math>\rho_{rel}</math>: network with inputs <math>\alpha_{ent}</math> of two entities and <math>z_{i,j}</math>, and output into <math>R</math> logits<br />
<br />
==Set-up and Results==<br />
'''Dataset''': based on Visual Genome (VG) by (Krishna et al., 2017), which contains a total of 108,077 images annotated with bounding boxes, entities, and relations. An average of 12 entities and 7 relations exist per image. For a fair comparison with previous works, data from (Xu et al., 2017) for train and test splits were used. The authors used the same 150 entities and 50 relations as in (Xu et al., 2017; Newell & Deng, 2017; Zellers et al., 2017). Hyperparameters were tuned using a 70K/5K/32K split for training, validation, and testing respectively.<br />
<br />
'''Training''': all networks were trained using the Adam optimizer, with a batch size of 20. The loss function was the sum of cross-entropy losses over all of entities and relations. Penalties for misclassified entities were 4 times stronger than that of relations. Penalties for misclassified negative relations were 10 times weaker than that of positive relations.<br />
<br />
'''Evaluation''': there are three major tasks when inferring from the scene graph. The authors focus on the following:<br />
# '''SGCIs''': given ground-truth entity bounding boxes, predict all entity and relations categories<br />
# '''PredCIs''': given annotated bounding boxes with entity labels, predict all relations<br />
<br />
The evaluation metric Recall@K (shortened to R@K) is drawn from (Lu et al., 2016). This metric is the fraction of correct ground-truth triplets that appear within the <math>K</math> most confident triplets predicted by the model. Graph-constrained protocol requires the top-<math>K</math> triplets to assign one consistent class per entity and relation. The unconstrained protocol does not enforce such constraint.<br />
<br />
'''Models and baselines''': The authors compared variants of the GPI approach against four baselines, state-of-the-art models on completing scene graph sub-tasks. To maintain consistency, all models used the same training/testing data split, in addition to the preprocessing as per (Xu et al., 2017).<br />
<br />
'''Baselines from existing state-of-the-art models'''<br />
# (Lu et al., 2016): use of word embeddings to fine-tune the likelihood of predicted relations<br />
# (Xu et al., 2017): message passing algorithm between entities and relations to iteratively improve feature map for prediction<br />
# (Newell & Deng, 2017): Pixel2Graph, uses associative embeddings to produce a full graph from image<br />
# (Zellers et al., 2017): NeuralMotif method, encodes global context to capture higher-order motif in scene graphs; Baseline outputs entities and relations distributions without using global context<br />
<br />
'''GPI models'''<br />
# '''GPI with no attention mechanism''': simply following Theorem 1's functional form, with summation over features<br />
# '''GPI NeighborAttention''': same GPI model, but considers attention over neighbours features<br />
# '''GPI Linguistic''': similar to NeighborAttention model, but concatenates word embedding vectors<br />
<br />
'''Key Results''': The GPI Linguistic approach outperforms all baseline for SGCIs, and has similar performance to the state of the art NeuralMotifs method. The authors argue that PredCI is an easier task with less structure, yielding high performance for the existing state of the art models.<br />
<br />
[[File:GPI_table_results.png|700px|center]]<br />
<br />
=Conclusion=<br />
<br />
A deep learning approach was presented in this paper to structured prediction, which constrains the architecture to be invariant to structurally identical inputs. This approach relies on pairwise features which are capable of describing inter-label correlations and inherits the intuitive aspect of score-based approaches. The output produced is invariant to equivalent representation of the pairwise terms. <br />
<br />
As future work, the axiomatic approach can be extended; for example in image labeling, geometric variances such as shifts or rotations may be desired (or in other cases invariance to feature permutations may be desired). Additionally, exploring algorithms that discover symmetries for deep structured prediction when invariant structure is unknown and should be discovered from data is also an interesting extension of this work.<br />
<br />
=Critique=<br />
The paper's contribution comes from the novelty of the permutation invariance as a design guideline for structured prediction. Although not explicitly considered in many of the previous works, the idea of invariance in architecture has already been considered in Deep Sets by (Zaheer et al., 2017). This paper characterizes relaxes the condition on the invariance as compared to that of previous works. In the evaluation of the benefit of GPI models, the paper used a synthetic problem to illustrate the fact that far fewer samples are required for the GPI model to converge to 100% accuracy. However, when comparing the true task of scene graph prediction against the state-of-the-art baselines, the GPI variants had only marginal higher Recall@K scores. The true benefit of this paper's discovery is the avoidance of maximizing a score function (leading computationally difficult problem), and instead directly producing output invariant to how we represent the pairwise terms.<br />
<br />
=References=<br />
Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson, Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction, 2018.<br />
<br />
Additional resources from Moshiko Raboh's [https://github.com/shikorab/SceneGraph GitHub]</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mapping_Images_to_Scene_Graphs_with_Permutation-Invariant_Structured_Prediction&diff=40824Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction2018-11-22T06:25:02Z<p>X832zhan: /* Structured prediction */</p>
<hr />
<div>The paper ''Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction'' was written by Roei Herzig* from Tel Aviv University, Moshiko Raboh* from Tel Aviv University, Gal Chechik from Google Brain, Bar-Ilan University, Jonathan Berant from Tel Aviv University, and Amir Globerson from Tel Aviv University. This paper is part of the NIPS 2018 conference to be hosted in December 2018 at Montréal, Canada. This paper summary is based on version 3 of the pre-print (as of May 2018) obtained from [https://arxiv.org/pdf/1802.05451v3.pdf arXiv] <br />
<br />
(*) Equal contribution<br />
<br />
=Motivation=<br />
In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there is a disconnect for what guidelines should be used when leveraging deep learning. This paper introduces a design principle for such models that stem from the concept of permutation invariance and proves state of the art performance on models that follow this principle.<br />
<br />
The primary contributions that this paper makes include:<br />
# Deriving sufficient and necessary conditions for respecting graph-permutation invariance in deep structured prediction architectures<br />
# Empirically proving the benefit of graph-permutation invariance<br />
# Developing a state-of-the-art model for scene graph predictions over a large set of complex visual scenes<br />
<br />
=Introduction=<br />
In order to make a machine to interpret complex visual scenes, it must recognize and understand both objects and relationships between the objects in the scene. A '''scene graph''' is a representation of the set of objects and relations that exist in the scene, where objects are represented as nodes, relations are represented as edges connecting the different nodes. Hence, the prediction of the scene graph is analogous to inferring the joint set of objects and relations of a visual scene.<br />
<br />
[[File:scene_graph_example.png|600px|center]]<br />
<br />
Given that objects in scenes are interdependent on each other, joint prediction of the objects and relations is necessary. The field of structured prediction, which involves the general problem of inferring multiple inter-dependent labels, is of interest for this problem.<br />
<br />
In structured prediction models, a score function <math>s(x, y)</math> is defined to evaluate the compatibility between label <math>y</math> and input <math>x</math>. For instance, when interpreting the scene of an image, <math>x</math> refers to the image itself, and <math>y</math> refers to a complex label, which contains both the objects and the relations between objects. As with most other inference methods, the goal is to find the label <math>y^*</math> such that <math>s(x,y)</math> is maximized, <math> y^*=argmax_y s(x,y)</math>. However, the major concern is that the space for possible label assignments grows exponentially with respect to input size. For example, although an image may seem very simple, the corpus containing possible labels for objects may be very large, rendering it difficult to optimize the scoring function. <br />
<br />
The paper presents an alternative approach, for which input <math>x</math> is mapped to structured output <math>y</math> using a "black box" neural network, omitting the definition of a score function. The main concern for this approach is the determination of the network architecture.<br />
<br />
The model is evaluated by firstly demonstrating the importance of permutation invariance on a synthetic data set. The approach laid out by the authors is then shown to respect permutation invariance, and results are compared to a competitive benchmark. This method achieves state-of-the-art results.<br />
<br />
=Structured prediction=<br />
This paper further considers structured predictions using score-based methods. For structured predictions that follow a score-based approach, a score function <math>s(x, y)</math> is used to measure how compatible label <math>y</math> is for input <math>x</math> and is also used to infer a label by maximizing <math>s(x, y)</math>. To optimize the score function, previous works have decomposed <math>s(x,y) = \sum_i f_i(x,y)</math> in order to facilitate efficient optimization which is done by optimizing the local score function, <math>\max_y f_i(x,y)</math>, with a small subset of the <math>y</math> variables.<br />
<br />
Recently, modeling the <math>f_i </math> functions as deep networks is a new interest. In such area of structured predictions, the most commonly-used score functions include the singleton score function <math>f_i(y_i, x)</math> and pairwise score function <math>f_{ij} (y_i, y_j, x)</math>. Previous works explored a two-stage architectures (learn local scores independently of the structured prediction goal), end-to-end architectures (to include the inference algorithm within the computation graph), and modelling global factors. <br />
<br />
==Advantages of using score-based methods==<br />
# Allow for intuitive specification of local dependencies between labels, and how they map to global dependencies<br />
# Linear score functions offer natural convex surrogates<br />
# Inference in large label space is sometimes possible via exact algorithms or empirically accurate approximations<br />
<br />
The concern for modelling score functions using deep networks is that learning may no longer be convex. Hence, the paper presents properties for how deep networks can be used for structured predictions by considering architectures that do not require explicit maximization of a score function.<br />
<br />
=Background, Notations, and Definitions=<br />
We denote <math>y</math> as a structured label where <math>y = [y_1, \dots, y_n]</math><br />
<br />
'''Score functions:''' for score-based methods, the score is defined as either the sum of a set of singleton scores <math>f_i = f_i(y_i, x)</math> or the sum of pairwise scores <math>f_{ij} = f_{ij}(y_i, y_j, x)</math>.<br />
<br />
Let <math>s(x,y)</math> be the score of a score-based method. Then:<br />
<br />
<div align="center"><br />
<math>s(x,y) = \begin{cases}<br />
\sum_i f_i ~ \text{if we have a set of singleton scores}\\<br />
\sum_{ij} f_{ij} ~ \text{if we have a set of pairwise scores } \\<br />
\end{cases}</math><br />
</div><br />
<br />
'''Inference algorithm:''' an inference algorithm takes input set of local scores (either <math>f_i</math> or <math>f_{ij}</math>) and outputs an assignment of labels <math>y_1, \dots, y_n</math> that maximizes score function <math>s(x,y)</math><br />
<br />
'''Graph labeling function:''' a graph labeling function <math>\mathcal{F} : (V,E) \rightarrow Y</math> is a function that takes input of: an ordered set of node features <math>V = [z_1, \dots, z_n]</math> and an ordered set of edge features <math>E = [z_{1,2},\dots,z_{i,j},\dots,z_{n,n-1}]</math> to output set of node labels <math>\mathbf{y} = [y_1, \dots, y_n]</math>. For instance, <math>z_i</math> can be set equal to <math>f_i</math> and <math>z_{ij}</math> can be set equal to <math>f_{ij}</math>.<br />
<br />
For convenience, the joint set of nodes and edges will be denoted as <math>\mathbf{z}</math> to be a size <math>n^2</math> vector (<math>n</math> nodes and <math>n(n-1)</math> edges).<br />
<br />
'''Permutation:''' Let <math>z</math> be a set of node and edge features. Given a permutation <math>\sigma</math> of <math>\{1,\dots,n\}</math>, let <math>\sigma(z)</math> be a new set of node and edge features given by [<math>\sigma(z)]_i = z_{\sigma(i)}</math> and <math>[\sigma(z)]_{i,j} = z_{\sigma(i), \sigma(j)}</math><br />
<br />
'''One-hot representation:''' <math>\mathbf{1}[j]</math> be a one-hot vector with 1 in the <math>j^{th}</math> coordinate<br />
<br />
=Permutation-Invariant Structured prediction=<br />
<br />
With permutation-invariant structured prediction, we would expect the algorithm to produce the same result given the same score function. For instance, consider the case where we have label space for 3 variables <math>y_1, y_2, y_3</math> with input <math>\mathbf{z} = (f_1, f_2, f_3, f_{12}, f_{13}, f_{23})</math> that outputs label <math>\mathbf{y} = (y_1^*, y_2^*, y_3^*)</math>. Then if the algorithm is run on a permuted version input <math>z' = (f_2, f_1, f_3, f_{21}, f_{23}, f_{13})</math>, we would expect <math>\mathbf{y} = (y_2^*, y_1^*, y_3^*)</math> given the same score function.<br />
<br />
'''Graph permutation invariance (GPI):''' a graph labeling function <math>\mathcal{F}</math> is graph-permutation invariant, if for all permutations <math>\sigma</math> of <math>\{1, \dots, n\}</math> and for all nodes <math>z</math>, <math>\mathcal{F}(\sigma(\mathbf{z})) = \sigma(\mathcal{F}(\mathbf{z}))</math><br />
<br />
The paper presents a theorem on the necessary and sufficient conditions for a function <math>\mathcal{F}</math> to be graph permutation invariant. Intuitively, because <math>\mathcal{F}</math> is a function that takes an ordered set <math>z</math> as input, the output on <math>\mathbf{z}</math> could very well be different from <math>\sigma(\mathbf{z})</math>, which means <math>\mathcal{F}</math> needs to have some sort of symmetry in order to sustain <math>[\mathcal{F}(\sigma(\mathbf{z}))]]_k = [\mathcal{F}(\mathbf{z})]_{\sigma(k)}</math>.<br />
<br />
[[File:graph_permutation_invariance.jpg|400px|center]]<br />
<br />
==Theorem 1==<br />
Let <math>\mathcal{F}</math> be a graph labeling function. Then <math>\mathcal{F}</math> is graph-permutation invariant if and only if there exist functions <math>\alpha, \rho, \phi</math> such that for all <math>k=1, .., n</math>:<br />
\begin{align}<br />
[\mathcal{F}(\mathbf{z})]_k = \rho(\mathbf{z}_k, \sum_{i=1}^n \alpha(\mathbf{z}_i, \sum_{i\neq j} \phi(\mathbf{z}_i, \mathbf{z}_{i,j}, \mathbf{z}_j)))<br />
\end{align}<br />
where <math>\phi: \mathbb{R}^{2d+e} \rightarrow \mathbb{R}^L, \alpha: \mathbb{R}^{d + L} \rightarrow \mathbb{R}^{W}, p: \mathbb{R}^{W+d} \rightarrow \mathbb{R}</math>.<br />
<br />
Notice that for the dimensions of inputs and outputs, <math>d</math> refers to the number of singleton features in <math>z</math> and <math>e</math> refers to the number of edges. <br />
<br />
[[File:GPI_architecture.jpg|thumb|A schematic representation of the GPI architecture. Singleton features <math>z_i</math> are omitted for simplicity. First, the features <math>z_{i,j}</math> are processed element-wise by <math>\phi</math>. Next, they are summed to create a vector <math>s_i</math>, which is concatenated with <math>z_i</math>. Third, a representation of the entire graph is created by applying <math>\alpha\ n</math> times and summing the created vector. The graph representation is then finally processed by <math>\rho</math> together with <math>z_k</math>.|600px|center]]<br />
<br />
==Proof Sketch for Theorem 1==<br />
The proof of this theorem can be found in the paper. A proof sketch is provided below:<br />
<br />
'''For the forward direction''' (function that follows the form set out in equation (1) is GPI):<br />
# Using definition of permutation <math>\sigma</math>, and rewriting <math>[F(z)]_{\sigma(k)}</math> in the form from equation (1)<br />
# Second argument of <math>\rho</math> is invariant under <math>\sigma</math>, since it takes the sum of all indices <math>i</math> and all other indices <math>j \neq i </math>.<br />
<br />
'''For the backward direction''' (any black-box GPI function can be expressed in the form of equation 1):<br />
# Construct <math>\phi, \alpha</math> such that second argument of <math>\rho</math> contains all information about graph features of <math>z</math>, including edges that the features originate from<br />
# Assume each <math>z_k</math> uniquely identifies the node and <math>\mathcal{F}</math> is a function only of pairwise features <math>z_{i,j}</math><br />
# Construct <math>H</math> be a perfect hash function with <math>L</math> buckets, and <math>\phi</math> which maps '''pairwise features''' to a vector of size <math>L</math><br />
# <math>*</math>Construct <math>\phi(z_i, z_{i,j}, z_j) = \mathbf{1}[H(z_j)] z_{i,j}</math>, which intuitively means that <math>\phi</math> stores <math>z_{i,j}</math> in the unique bucket for node <math>j</math><br />
# Construct function <math>\alpha</math> to output a matrix <math>\mathbb{R}^{L \times L}</math> that maps each pairwise feature into unique positions (<math>\alpha(z_i, s_i) = \mathbf{1}[H(z_i)]s_i^T</math>)<br />
# Construct matrix <math>M = \sum_i \alpha(z_i,s_i)</math> by discarding rows/columns in <math>M</math> that do not correspond to original nodes (which reduces dimension to <math>n\times n</math>; set <math>\rho</math> to have same outcome as <math>\mathcal{F}</math>, and set the output of <math>\mathcal{F}</math> on <math>M</math> to be the labels <math>\mathbf{y} = y_1, \dots, y_n</math><br />
<br />
<math>*</math>The paper presents the proof for the edge features <math>z_{ij}</math> being scalar (<math>e = 1</math>) for simplicity, which can be extended easily to vectors with additional indexing.<br />
<br />
Although the results discussed previously apply to complete graphs (edges apply to all feature pairs), it can be easily extended to incomplete graphs. However, in place of permutation-invariance, it is now an automorphism-invariance.<br />
<br />
==Implications and Applications of Theorem 1==<br />
===Key Implications of Theorem 1===<br />
# Architecture "collects" information from the different edges of the graph, and does so in an invariant fashion using <math>\alpha</math> and <math>\phi</math><br />
# Architecture is parallelizable, since all <math>\phi</math> functions can be applied simultaneously<br />
<br />
===Some applications of Theorem 1===<br />
# '''Attention:''' the concept of attention can be implemented in the GPI characterization, with slight alterations to the functions <math>\alpha</math> and <math>\phi</math>. In attention each node aggregates features of neighbours through a function of neighbour's relevance. Which means the lable of an entity could depend strongly on its close entity. The complete details can be found in the supplementary materials of the paper.<br />
<br />
# '''RNN:''' recurrent architectures can maintain GPI property, since all GPI function <math>\mathcal{F}</math> are closed under composition. The output of one step after running <math>\mathcal{F}</math> will act as input for the next step, but maintain the GPI property throughout.<br />
<br />
=Related Work=<br />
# '''Architectural invariance:''' suggested recently in a 2017 paper called Deep Sets by Zaheer et al., which considers the case of invariance that is more restrictive.<br />
# '''Deep structured prediction:''' previous work applied deep learning to structured prediction, for instance, semantic segmentation. Some algorithms include message passing algorithms, gradient descent for maximizing score functions, greedy decoding (inference of labels based on time of previous labels). Apart from those algorithms, deep learning has been applied to other graph-based problems such as the Travelling Salesman Problem (Bello et al., 2016; Gilmer et al., 2017; Khalil et al., 2017). However, none of the previous work specifically address the notion of invariance in the general architecture, but rather focus on message passing architectures that can be generalized by this paper.<br />
# '''Scene graph prediction:''' scene graph extraction allows for reasoning, question answering, and image retrieval (Johnson et al., 2015; Lu et al., 2016; Raposo et al., 2017). Some other works in this area include object detection, action recognition, and even detection of human-object interactions (Liao et al., 2016; Plummer et al., 2017). Additional work has been done with the use of message passing algorithms (Xu et al., 2017), word embeddings (Lu et al., 2016), and end-to-end prediction directly from pixels (Newell & Deng, 2017). A notable mention is NeuralMotif (Zellers et al., 2017), which the authors describe as the current state-of-the-art model for scene graph predictions on Visual Genome dataset.<br />
# '''Burst Image Deblurring Using Permutation Invariant Convolutional Neural Networks:''' similar ideas were applied, where Permutation Invariant CNN, are used to restore sharp and noise-free images from bursts of photographs affected by hand tremor and noise. This presented good quality images with lots of details for challenging datasets.<br />
<br />
=Experimental Results=<br />
==Synthetic Graph Labeling==<br />
The authors created a synthetic problem to study GPI. This involved using an input graph <math>G = (V,E)</math> where each node <math>i</math> belongs to the set <math>\Gamma(i) \in \{1, \dots, K\}</math> where <math>K</math> is the number of samples. The task is to compute for each node, the number of neighbours that belong to the same set (i.e. finding the label of the node <math>i</math> if <math>y_i = \sum_{j \in N(i)} \mathbf{1}[\Gamma(i) = \Gamma(j)]</math>) . Then, random graphs (each with 10 nodes) were generated by sampling edges, and the set <math>\Gamma(i) \in \{1, \dots, K\}</math>for each node independently and uniformly.<br />
The node features of the graph <math>z_i \in \{0,1\}^K</math> are one-hot vectors of <math>\Gamma(i)</math>, and each pairwise edge feature <math>z_{ij} \in \{0, 1\}</math> denote whether the edge <math>ij</math> is in the edge set <math>E</math>. <br />
3 architectures were studied in this paper:<br />
# '''GPI-architecture for graph prediction''' (without attention and RNN)<br />
# '''LSTM''': replacing <math>\sum \phi(\cdot)</math> and <math>\sum \alpha(\cdot)</math> in the form of Theorem 1 using two LSTMs with state size 200, reading their input in random order<br />
# '''Fully connected feed-forward network''': with 2 hidden layers, each layer containing 1,000 nodes; the input is a concatenation of all nodes and pairwise features, and the output is all node predictions<br />
<br />
The results show that the GPI architecture requires far fewer samples to converge to the correct solution.<br />
[[File:GPI_synthetic_example.jpg|450px|center]]<br />
<br />
==Scene-Graph Classification==<br />
Applying the concept of GPI to Scene-Graph Prediction (SGP) is the main task of this paper. The input to this problem is an image, along with a set of annotated bounding boxes for the entities in the image. The goal is to correctly label each entity within the bounding boxes and the relationship between every pair of entities, resulting in a coherent scene graph.<br />
<br />
The authors describe two different types of variables to predict. The first type is entity variables <math>[y_1, \dots, y_n]</math> for all bounding boxes, where each <math>y_i</math> can take one of L values and refers to objects such as "dog" or "man". The second type is relation variables <math>[y_{n+1}, \cdots, y_{n^2}]</math>, where each <math>y_i</math> represents the relation (e.g. "on", "below") between a pair of bounding boxes (entities).<br />
<br />
The scene graph and contain two types of edges:<br />
# '''Entity-entity edge''': connecting two entities <math>y_i</math> and <math>y_j</math> for <math>1 \leq i \neq j \leq n</math><br />
# '''Entity-relation edges''': connecting every relation variable <math>y_k</math> for <math>k > n</math> to two entities<br />
<br />
The feature set <math>\mathbf{z}</math> is based on the baseline model from Zellers et al. (2017). For entity variables <math>y_i</math>, the vector <math>\mathbf{z}_i \in \mathbb{R}^L</math> models the probability of the entity appearing in <math>y_i</math>. <math>\mathbf{z}_i</math> is augmented by the coordinates of the bounding box. Similarly for relation variables <math>y_j</math>, the vector <math>\mathbf{z}_j \in \mathbb{R}^R</math>, models the probability of the relations between the two entities in <math>j</math>. For entity-entity pairwise features <math>\mathbf{z}_{i,j}</math>, there is a similar representation of the probabilities for the pair. The SGP outputs probability distributions over all entities and relations, which will then be used as input recurrently to maintain GPI. Finally, word embeddings are used and concatenated for the most probable entity-relation labels.<br />
<br />
'''Components of the GPI architecture''' (ent for entity, rel for relation)<br />
# <math>\phi_{ent}</math>: network that integrates two entity variables <math>y_i</math> and <math>y_j</math>, with input <math>z_i, z_j, z_{i,j}</math> and output vector of <math>\mathbb{R}^{n_1}</math> <br />
# <math>\alpha_{ent}</math>: network with inputs from <math>\phi_{ent}</math> for all neighbours of an entity, and uses attention mechanism to output vector <math>\mathbb{R}^{n_2}</math> <br />
# <math>\rho_{ent}</math>: network with inputs from the various <math>\mathbb{R}^{n_2}</math> vectors, and outputs <math>L</math> logits to predict entity value<br />
# <math>\rho_{rel}</math>: network with inputs <math>\alpha_{ent}</math> of two entities and <math>z_{i,j}</math>, and output into <math>R</math> logits<br />
<br />
==Set-up and Results==<br />
'''Dataset''': based on Visual Genome (VG) by (Krishna et al., 2017), which contains a total of 108,077 images annotated with bounding boxes, entities, and relations. An average of 12 entities and 7 relations exist per image. For a fair comparison with previous works, data from (Xu et al., 2017) for train and test splits were used. The authors used the same 150 entities and 50 relations as in (Xu et al., 2017; Newell & Deng, 2017; Zellers et al., 2017). Hyperparameters were tuned using a 70K/5K/32K split for training, validation, and testing respectively.<br />
<br />
'''Training''': all networks were trained using the Adam optimizer, with a batch size of 20. The loss function was the sum of cross-entropy losses over all of entities and relations. Penalties for misclassified entities were 4 times stronger than that of relations. Penalties for misclassified negative relations were 10 times weaker than that of positive relations.<br />
<br />
'''Evaluation''': there are three major tasks when inferring from the scene graph. The authors focus on the following:<br />
# '''SGCIs''': given ground-truth entity bounding boxes, predict all entity and relations categories<br />
# '''PredCIs''': given annotated bounding boxes with entity labels, predict all relations<br />
<br />
The evaluation metric Recall@K (shortened to R@K) is drawn from (Lu et al., 2016). This metric is the fraction of correct ground-truth triplets that appear within the <math>K</math> most confident triplets predicted by the model. Graph-constrained protocol requires the top-<math>K</math> triplets to assign one consistent class per entity and relation. The unconstrained protocol does not enforce such constraint.<br />
<br />
'''Models and baselines''': The authors compared variants of the GPI approach against four baselines, state-of-the-art models on completing scene graph sub-tasks. To maintain consistency, all models used the same training/testing data split, in addition to the preprocessing as per (Xu et al., 2017).<br />
<br />
'''Baselines from existing state-of-the-art models'''<br />
# (Lu et al., 2016): use of word embeddings to fine-tune the likelihood of predicted relations<br />
# (Xu et al., 2017): message passing algorithm between entities and relations to iteratively improve feature map for prediction<br />
# (Newell & Deng, 2017): Pixel2Graph, uses associative embeddings to produce a full graph from image<br />
# (Zellers et al., 2017): NeuralMotif method, encodes global context to capture higher-order motif in scene graphs; Baseline outputs entities and relations distributions without using global context<br />
<br />
'''GPI models'''<br />
# '''GPI with no attention mechanism''': simply following Theorem 1's functional form, with summation over features<br />
# '''GPI NeighborAttention''': same GPI model, but considers attention over neighbours features<br />
# '''GPI Linguistic''': similar to NeighborAttention model, but concatenates word embedding vectors<br />
<br />
'''Key Results''': The GPI Linguistic approach outperforms all baseline for SGCIs, and has similar performance to the state of the art NeuralMotifs method. The authors argue that PredCI is an easier task with less structure, yielding high performance for the existing state of the art models.<br />
<br />
[[File:GPI_table_results.png|700px|center]]<br />
<br />
=Conclusion=<br />
<br />
A deep learning approach was presented in this paper to structured prediction, which constrains the architecture to be invariant to structurally identical inputs. This approach relies on pairwise features which are capable of describing inter-label correlations and inherits the intuitive aspect of score-based approaches. The output produced is invariant to equivalent representation of the pairwise terms. <br />
<br />
As future work, the axiomatic approach can be extended; for example in image labeling, geometric variances such as shifts or rotations may be desired (or in other cases invariance to feature permutations may be desired). Additionally, exploring algorithms that discover symmetries for deep structured prediction when invariant structure is unknown and should be discovered from data is also an interesting extension of this work.<br />
<br />
=Critique=<br />
The paper's contribution comes from the novelty of the permutation invariance as a design guideline for structured prediction. Although not explicitly considered in many of the previous works, the idea of invariance in architecture has already been considered in Deep Sets by (Zaheer et al., 2017). This paper characterizes relaxes the condition on the invariance as compared to that of previous works. In the evaluation of the benefit of GPI models, the paper used a synthetic problem to illustrate the fact that far fewer samples are required for the GPI model to converge to 100% accuracy. However, when comparing the true task of scene graph prediction against the state-of-the-art baselines, the GPI variants had only marginal higher Recall@K scores. The true benefit of this paper's discovery is the avoidance of maximizing a score function (leading computationally difficult problem), and instead directly producing output invariant to how we represent the pairwise terms.<br />
<br />
=References=<br />
Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson, Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction, 2018.<br />
<br />
Additional resources from Moshiko Raboh's [https://github.com/shikorab/SceneGraph GitHub]</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18&diff=40626stat946F182018-11-21T02:23:13Z<p>X832zhan: /* Record your contributions here [https://docs.google.com/spreadsheets/d/1SxkjNfhOg_eXWpUnVHuIP93E6tEiXEdpm68dQGencgE/edit?usp=sharing] */</p>
<hr />
<div>== [[F18-STAT946-Proposal| Project Proposal ]] ==<br />
<br />
=Paper presentation=<br />
<br />
[https://goo.gl/forms/8NucSpF36K6IUZ0V2 Your feedback on presentations]<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1SxkjNfhOg_eXWpUnVHuIP93E6tEiXEdpm68dQGencgE/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
<br />
<br />
<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Unsupervised_Machine_Translation_Using_Monolingual_Corpora_Only Summary]]<br />
|-<br />
|Oct 25 || Dhruv Kumar || 1 || Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs || [https://openreview.net/pdf?id=rkRwGg-0Z Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Beyond_Word_Importance_Contextual_Decomposition_to_Extract_Interactions_from_LSTMs Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/images/e/ea/Beyond_Word_Importance.pdf Slides]<br />
|-<br />
|Oct 25 || Amirpasha Ghabussi || 2 || DCN+: Mixed Objective And Deep Residual Coattention for Question Answering || [https://openreview.net/pdf?id=H1meywxRW Paper] ||<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DCN_plus:_Mixed_Objective_And_Deep_Residual_Coattention_for_Question_Answering Summary]<br />
|-<br />
|Oct 25 || Juan Carrillo || 3 || Hierarchical Representations for Efficient Architecture Search || [https://arxiv.org/abs/1711.00436 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/Hierarchical_Representations_for_Efficient_Architecture_Search Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/images/1/15/HierarchicalRep-slides.pdf Slides]<br />
|-<br />
|Oct 30 || Manpreet Singh Minhas || 4 || End-to-end Active Object Tracking via Reinforcement Learning || [http://proceedings.mlr.press/v80/luo18a/luo18a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=End_to_end_Active_Object_Tracking_via_Reinforcement_Learning Summary]<br />
|-<br />
|Oct 30 || Marvin Pafla || 5 || Fairness Without Demographics in Repeated Loss Minimization || [http://proceedings.mlr.press/v80/hashimoto18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Fairness_Without_Demographics_in_Repeated_Loss_Minimization Summary]<br />
|-<br />
|Oct 30 || Glen Chalatov || 6 || Pixels to Graphs by Associative Embedding || [http://papers.nips.cc/paper/6812-pixels-to-graphs-by-associative-embedding Paper] ||<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pixels_to_Graphs_by_Associative_Embedding Summary]<br />
|-<br />
|Nov 1 || Sriram Ganapathi Subramanian || 7 ||Differentiable plasticity: training plastic neural networks with backpropagation || [http://proceedings.mlr.press/v80/miconi18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F18/differentiableplasticity Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/images/3/3c/Deep_learning_course_presentation.pdf Slides]<br />
|-<br />
|Nov 1 || Hadi Nekoei || 8 || Synthesizing Programs for Images using Reinforced Adversarial Learning || [http://proceedings.mlr.press/v80/ganin18a.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Synthesizing_Programs_for_Images_usingReinforced_Adversarial_Learning Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Synthesizing_Programs_for_Images_using_Reinforced_Adversarial_Learning.pdf Slides]<br />
|-<br />
|Nov 1 || Henry Chen || 9 || DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks || [https://ieeexplore.ieee.org/abstract/document/7989236 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DeepVO_Towards_end_to_end_visual_odometry_with_deep_RNN Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DeepVO_Presentation_Henry.pdf Slides] <br />
|-<br />
|Nov 6 || Nargess Heydari || 10 ||Wavelet Pooling For Convolutional Neural Networks Networks || [https://openreview.net/pdf?id=rkhlb8lCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Wavelet_Pooling_For_Convolutional_Neural_Networks Summary] [https://wiki.math.uwaterloo.ca/statwiki/images/1/1a/Wavelet_Pooling_for_Convolutional_Neural_Networks.pptx Slides]<br />
|-<br />
|Nov 6 || Aravind Ravi || 11 || Towards Image Understanding from Deep Compression Without Decoding || [https://openreview.net/forum?id=HkXWCMbRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Towards_Image_Understanding_From_Deep_Compression_Without_Decoding Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:DL_STAT946_PPT_AravindRavi.pdf Slides]<br />
|-<br />
|Nov 6 || Ronald Feng || 12 || Learning to Teach || [https://openreview.net/pdf?id=HJewuJWCZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Teach Summary]<br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:946_L2T_slides.pdf Slides]<br />
|-<br />
|Nov 8 || Neel Bhatt || 13 || Annotating Object Instances with a Polygon-RNN || [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN Summary] [https://wiki.math.uwaterloo.ca/statwiki/images/a/af/ANNOTATING_OBJECT_INSTANCES_NEEL_BHATT.pdf Slides]<br />
|-<br />
|Nov 8 || Jacob Manuel || 14 || Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels || [https://arxiv.org/pdf/1804.06872.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Co-Teaching Summary] [https://wiki.math.uwaterloo.ca/statwiki/images/3/33/Co-Teaching.pdf Slides]<br />
|-<br />
|Nov 8 || Charupriya Sharma|| 15 || A Bayesian Perspective on Generalization and Stochastic Gradient Descent|| [https://openreview.net/pdf?id=BJij4yg0Z Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_Bayesian_Perspective_on_Generalization_and_Stochastic_Gradient_Descent Summary]<br />
|-<br />
|NOv 13 || Sagar Rajendran || 16 || Zero-Shot Visual Imitation || [https://openreview.net/pdf?id=BkisuzWRW Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Zero-Shot_Visual_Imitation Summary]<br />
|-<br />
<br />
|Nov 13 || Ruijie Zhang || 17 || Searching for Efficient Multi-Scale Architectures for Dense Image Prediction || [https://arxiv.org/pdf/1809.04184.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Searching_For_Efficient_Multi_Scale_Architectures_For_Dense_Image_Prediction Summary]<br />
|-<br />
|Nov 13 || Neil Budnarain || 18 || Predicting Floor Level For 911 Calls with Neural Networks and Smartphone Sensor Data || [https://openreview.net/pdf?id=ryBnUWb0b Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Predicting_Floor_Level_For_911_Calls_with_Neural_Network_and_Smartphone_Sensor_Data Summary]<br />
|-<br />
|NOv 15 || Zheng Ma || 19 || Reinforcement Learning of Theorem Proving || [https://arxiv.org/abs/1805.07563 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Reinforcement_Learning_of_Theorem_Proving Summary] [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:zheng_946_presentation.pdf Slides]<br />
|-<br />
|Nov 15 || Abdul Khader Naik || 20 || Multi-View Data Generation Without View Supervision || [https://openreview.net/pdf?id=ryRh0bb0Z Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=MULTI-VIEW_DATA_GENERATION_WITHOUT_VIEW_SUPERVISION Summary]<br />
|-<br />
|Nov 15 || Johra Muhammad Moosa || 21 || Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin || [https://papers.nips.cc/paper/7255-attend-and-predict-understanding-gene-regulation-by-selective-attention-on-chromatin.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Attend_and_Predict:_Understanding_Gene_Regulation_by_Selective_Attention_on_Chromatin Summary] [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Attend_and_Predict.pdf Slides]<br />
|-<br />
|NOv 20 || Zahra Rezapour Siahgourabi || 22 ||Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias ||[https://arxiv.org/pdf/1807.07049 Paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Robot_Learning_in_Homes:_Improving_Generalization_and_Reducing_Dataset_Bias Summary]<br />
|-<br />
|Nov 20 || Shubham Koundinya || 23 || Countering Adversarial Images Using Input Transformations ||[https://openreview.net/pdf?id=SyJ7ClWCb paper] || <br />
[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Countering_Adversarial_Images_Using_Input_Transformations Summary]<br />
|-<br />
|Nov 20 || Salman Khan || 24 || Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples || [http://proceedings.mlr.press/v80/athalye18a.html paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Obfuscated_Gradients_Give_a_False_Sense_of_Security_Circumventing_Defenses_to_Adversarial_Examples Summary]<br />
|-<br />
|NOv 22 ||Soroush Ameli || 25 || Learning to Navigate in Cities Without a Map || [https://arxiv.org/abs/1804.00168 paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_to_Navigate_in_Cities_Without_a_Map Summary] <br />
|-<br />
|Nov 22 ||Ivan Li || 26 || Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction || [https://arxiv.org/pdf/1802.05451v3.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mapping_Images_to_Scene_Graphs_with_Permutation-Invariant_Structured_Prediction Summary]<br />
|-<br />
|Nov 22 ||Sigeng Chen || 27 ||Conditional Neural Processes || [http://proceedings.mlr.press/v80/garnelo18a/garnelo18a.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=conditional_neural_process Summary]<br />
|-<br />
|Nov 27 || Aileen Li || 28 || Unsupervised Neural Machine Translation ||[https://openreview.net/pdf?id=Sy2ogebAW Paper] || <br />
|-<br />
|Nov 27 ||Xudong Peng || 29 || Visual Reinforcement Learning with Imagined Goals || [https://arxiv.org/abs/1807.04742 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Visual_Reinforcement_Learning_with_Imagined_Goals Summary]<br />
|-<br />
|Nov 27 ||Xinyue Zhang || 30 || Policy Optimization with Demonstrations || [http://proceedings.mlr.press/v80/kang18a/kang18a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations Summary]<br />
|-<br />
|-<br />
|NOv 29 ||Junyi Zhang || 31 || Autoregressive Convolutional Neural Networks for Asynchronous Time Series || [http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Paper] ||<br />
|-<br />
|Nov 29 ||Travis Bender || 32 || Automatic Goal Generation for Reinforcement Learning Agents || [http://proceedings.mlr.press/v80/florensa18a/florensa18a.pdf Paper] ||<br />
|-<br />
|Nov 29 ||Patrick Li || 33 || Matrix Capsules with EM Routing || [https://openreview.net/pdf?id=HJWLfGWRb Paper] ||<br />
|-<br />
|Nov 30 || Jiazhen Chen || 34 || Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning || [https://arxiv.org/abs/1809.02121 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=learn_what_not_to_learn Summary]<br />
|-<br />
|Nov 30 || Gaurav Sahu || 35 || TBD || ||<br />
|-<br />
|Nov 23 || Kashif Khan || 36 || Wasserstein Auto-Encoders || [https://arxiv.org/pdf/1711.01558.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Wasserstein_Auto-encoders Summary]<br />
|-<br />
|Nov 23 || Shala Chen || 37 || A Neural Representation of Sketch Drawings || [https://arxiv.org/pdf/1704.03477.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_neural_representation_of_sketch_drawings Summary]<br />
|-<br />
|Nov 30 || Ki Beom Lee || 38 || Detecting Statistical Interactions from Neural Network Weights|| [https://openreview.net/forum?id=ByOfBggRZ Paper] ||<br />
|-<br />
|Nov 23 || Wesley Fisher || 39 || Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling || [http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Reinforcement_Learning_in_Continuous_Action_Spaces_a_Case_Study_in_the_Game_of_Simulated_Curling Summary]<br />
|-<br />
||Nov 30|| Ahmed Afify || 40 ||Don't Decay the Learning Rate, Increase the Batch Size || [https://openreview.net/pdf?id=B1Yy1BxCZ Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DON'T_DECAY_THE_LEARNING_RATE_,_INCREASE_THE_BATCH_SIZE Summary]<br />
|</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40621policy optimization with demonstrations2018-11-21T02:21:51Z<p>X832zhan: /* Results */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the occupancy measure matching objective can be written as:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math> as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards and surpasses other strong baselines.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40617policy optimization with demonstrations2018-11-21T02:18:12Z<p>X832zhan: /* Optimization */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the occupancy measure matching objective can be written as:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math> as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40616policy optimization with demonstrations2018-11-21T02:16:55Z<p>X832zhan: /* Optimization */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the occupancy measure matching objective can be written as:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math> as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40615policy optimization with demonstrations2018-11-21T02:16:36Z<p>X832zhan: /* Optimization */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the occupancy measure matching objective can be written as:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40612policy optimization with demonstrations2018-11-21T02:12:30Z<p>X832zhan: /* Benefits of Exploration with Demonstrations */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the lower bound of learning objective is:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40610policy optimization with demonstrations2018-11-21T02:04:14Z<p>X832zhan: /* Preliminaries */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the lower bound of learning objective is:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40609policy optimization with demonstrations2018-11-21T02:03:18Z<p>X832zhan: /* Preliminaries */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from states to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the lower bound of learning objective is:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40604policy optimization with demonstrations2018-11-21T01:53:43Z<p>X832zhan: /* Related Work */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math>is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from states to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the lower bound of learning objective is:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40603policy optimization with demonstrations2018-11-21T01:47:03Z<p>X832zhan: /* Policy Optimization with Demonstration (POfD) */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For Learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math>is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from states to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the lower bound of learning objective is:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40602policy optimization with demonstrations2018-11-21T01:44:17Z<p>X832zhan: </p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.<br />
<br />
For Learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math>is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from states to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations DE. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the lower bound of learning objective is:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO [17] in dense environments, which is called expert <br />
* training the policy with TRPO [17] in sparse environments<br />
* applying GAIL [14] to learn the policy from demonstrations<br />
* DQfD [2]<br />
* DDPGfD [3]<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.<br />
<br />
=References=<br />
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.<br />
<br />
[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.<br />
<br />
[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.<br />
<br />
[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.<br />
<br />
[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.<br />
<br />
[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.<br />
<br />
[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.<br />
<br />
[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.<br />
<br />
[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.<br />
<br />
[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.<br />
<br />
[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.<br />
<br />
[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.<br />
<br />
[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.<br />
<br />
[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.<br />
<br />
[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.<br />
<br />
[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.<br />
<br />
[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.<br />
<br />
[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.<br />
<br />
[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.<br />
<br />
[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.<br />
<br />
[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.<br />
<br />
[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40585policy optimization with demonstrations2018-11-21T01:14:07Z<p>X832zhan: /* Critique */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory or using the demonstration trajectory to pre-train the policy in a supervised manner. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration and imitation learning in RL.<br />
<br />
For Learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD that is applied into the discrete action spaces and DDPGfD that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration, which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning problems are solved by alternating between fitting the reward function and selecting the policy. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math>is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from states to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations DE. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods.<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the lower bound of learning objective is:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs). The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym and Mujoco. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO in dense environments, which is called expert <br />
* training the policy with TRPO in sparse environments<br />
* applying GAIL to learn the policy from demonstrations<br />
* DQfD<br />
* DDPGfD<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO and DQFD fail to explore and GAIL converes to the imperfect demonstration in MountainCar.<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.</div>X832zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=policy_optimization_with_demonstrations&diff=40583policy optimization with demonstrations2018-11-21T01:13:37Z<p>X832zhan: /* Conclusion */</p>
<hr />
<div>= Introduction =<br />
<br />
==Introduction==<br />
The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory or using the demonstration trajectory to pre-train the policy in a supervised manner. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.<br />
<br />
==Intuition==<br />
The agent should imitate the demonstrated behavior when rewards are sparse and then explores new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL.<br />
<br />
=Related Work =<br />
There are some relates works in overcoming exploration difficulties by learning from demonstration and imitation learning in RL.<br />
<br />
For Learning from demonstration (LfD),<br />
# Most LfD methods adopt value-based RL algorithms, such as DQfD that is applied into the discrete action spaces and DDPGfD that is extends to the continuous spaces. But both of them underutilize the demonstration data.<br />
#There are some methods based on policy iteration, which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.<br />
# A hybrid framework that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.<br />
# A reward reshaping mechanism that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.<br />
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.<br />
<br />
For imitation learning, <br />
# Inverse Reinforce Learning problems are solved by alternating between fitting the reward function and selecting the policy. But it cannot be extended to big-scale problems.<br />
# Generative Adversarial Imitation Learning (GAIL) uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.<br />
All of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.<br />
<br />
=Background=<br />
<br />
==Preliminaries==<br />
Markov Decision Process (MDP) is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math>is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from states to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>: <br />
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]<br />
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.<br />
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.<br />
[[File:def1.png|500px|center]]<br />
Then the performance of <math> \pi </math> can be rewritten to: <br />
[[File:equ2.png|500px|center]]<br />
At the same time, the authors propose a lemma: <br />
[[File:lemma1.png|500px|center]]<br />
<br />
==Problem Definition==<br />
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:<br />
[[File:asp1.png|500px|center]]<br />
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages. <br />
<br />
=Method=<br />
<br />
==Policy Optimization with Demonstration (POfD)==<br />
[[File:ff1.png|500px|center]]<br />
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations DE. Thus, a new learning objective is given:<br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]<br />
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is: <br />
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]<br />
<br />
==Benefits of Exploration with Demonstrations==<br />
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods.<br />
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]<br />
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by<br />
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:<br />
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]<br />
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,<br />
[[File:them1.png|500px|center]]<br />
<br />
==Optimization==<br />
<br />
For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:<br />
[[File:them2.png|500px|center]]<br />
Thus, the lower bound of learning objective is:<br />
[[File:eqnlm.png|500px|center]]<br />
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.<br />
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math>as the regularization term. Thus, the learning objective is: <br />
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]<br />
At this point, the problem has been like Generative Adversarial Networks (GANs). The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:<br />
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]<br />
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:<br />
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]<br />
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.<br />
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:<br />
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]<br />
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:<br />
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]<br />
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.<br />
At the end, Algorithm 1 gives the detailed process.<br />
[[File:pofd.png|500px|center]]<br />
<br />
=Discussion on Existing LfD Methods=<br />
<br />
==DQFD==<br />
DQFD puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:<br />
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]<br />
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.<br />
<br />
==DDPGfD==<br />
DDPGfD also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework. The objective for DDPGfD is the same as DQFD. Its policy gradient is:<br />
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]<br />
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.<br />
<br />
=Experiments=<br />
<br />
==Goal==<br />
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare. <br />
<br />
==Settings==<br />
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym and Mujoco. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment. <br />
[[File:pofdt1.png|900px|center]]<br />
<br />
==Baselines==<br />
The authors compare POfD against 5 strong baselines:<br />
* training the policy with TRPO in dense environments, which is called expert <br />
* training the policy with TRPO in sparse environments<br />
* applying GAIL to learn the policy from demonstrations<br />
* DQfD<br />
* DDPGfD<br />
<br />
==Results==<br />
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO and DQFD fail to explore and GAIL converes to the imperfect demonstration in MountainCar.<br />
[[File:pofdf2.png|500px|center]]<br />
Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards.<br />
[[File:pofdf3.png|900px|center]]<br />
<br />
=Conclusion=<br />
A method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback is proposed, that is POfD. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learning better policies.<br />
<br />
=Critique=<br />
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.<br />
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents that trained in the dense-reward environment.<br />
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.<br />
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?<br />
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.<br />
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.</div>X832zhan