statwiki - User contributions [US]

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-30T02:59:04Z

Vrajendr: /* RELATED WORK */

Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size '''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is a common practice not to have a singular steady learning rate for the learning phase of neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minima, it is beneficial for us to take large steps towards the minima, as it would require a lesser number of steps to converge, but as we approach the minima, our step size should decrease, otherwise we may just keep oscillating around the minima. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the parameter updates required to reach the minima, thus leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used in deep learning training process due to finding minima that generalizes well(Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow and takes lots of time. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to try to speed up this optimization process by taking bigger steps, and hence reduce the number of parameter updates in training a model by using large batch training, which can be divided across many machines.

However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) believed that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N number of training samples, and B batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and optimum fluctuation scale <math>g</math> for a maximum test set accuracy.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate with the same number of training epochs in decreasing the scale of random fluctuations, but with remarkably less number of parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math> or even more reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math> although the later decreases the test accuracy. This has been demonstrated by several experiments on the ImageNet and CIFAR-10 datasets using ResNet-50 and Inception-ResNet-V2 architectures respectively.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.

These equations indicate that the learning rate must decay during training, and second equation is only available when the batch size is constant. To change the batch size, Smith and Le (2017) proposed to interpret SGD as integrating this stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents cost function, w represents the parameters, and η represents the Gaussian random noise. Furthermore, they proved that noise scale g controls the magnitude of random fluctuations in the training dynamics by this formula: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can define <math> g \approx \epsilon \frac{N}{B} </math>. This explains why when the learning rate decreases, noise <math>g</math> decreases, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decays with constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, then the conventional way of decaying the learning rate is followed.

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.

Smith and Le (2017) found that there is an optimal batch size which corresponds to optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> that corresponds to maximum test set accuracy. This means that gradient noise is helpful as it makes SGD escape sharp minima, which does not generalize well.

Simulated Annealing is a famous technique in non-convex optimization. Starting with noise in the training process helps us to discover a wide range of parameters then once we are near the optimum value, noise is reduced to fine tune our final parameters. However, more and more researches like to use the sharper decay schedules like cosine decay or step-function drops. In physical sciences, slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps to converge to the global minimum, which is sharp. But decaying the temperature in discrete steps can make the system stuck in a local minimum, which lead to higher cost and lower curvature. The authors think that deep learning has the same intuition.
.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) included momentum to the equation of the vanilla SGD noise scale that was defined above to be: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which is the same as the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

We can see that the Accumulation variable A, which is initially set to 0, then increases exponentially to reach its steady state value during <math> \frac{B}{N(1-m)} </math> training epochs while <math> \Delta w </math> is suppressed that can reduce the rate of convergence. Moreover, at high momentum, we have three challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stage, large batch size will lead to the instabilities.

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Schedules used as in the below figure:'''

- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant

- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: At the beginning, the learning rate is constant and batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule that will be used if there is a hardware limit affecting a maximum batch size limit.

[[File:Paper_40_Fig_1.png | 800px|center]]

As shown in the below figure: in the left figure (2a), we can observe that for the training set, the three learning curves are exactly the same while in figure 2b, increasing the batch size has a huge advantage of reducing the number of parameter updates.
This concludes that noise scale is the one that needs to be decayed and not the learning rate itself
[[File:Paper_40_Fig_2.png | 800px|center]]

To make sure that these results are the same for the test set as well, in figure 3, we can see that the three learning curves are exactly the same for SGD with momentum, and Nesterov momentum
[[File:Paper_40_Fig_3.png | 800px|center]]

To check for other optimizers as well. the below figure shows the same experiment as in figure 3, which is the three learning curves for test set, but for vanilla SGD and Adam, and showing
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

The authors consider four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increase during training.

Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increase during training, and an increased momentum coefficient of 0.98.

The results of all training schedules, which are presented in the below figure, are documented in the following table:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where batch size is fixed and the learning rate is decayed

“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The below table shows the number of parameter updates and accuracy for different set of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while modifying the number of TPU devices, batch size, learning rate, and then calculating the time to complete 90 epochs, and measuring the accuracy, and performed the following experiments below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted Stochastic gradient descent as stochastic differential equation; the paper built on this idea to include decaying learning rate.

- Mandt et al. (2017) analyzed how to modify SGD for the task of Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once the training is started.

- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017) and successfully trained ResNet-50 on ImageNet in one hour after discovering the proportionality relationship between batch size and learning rate.

- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is applying different learning rates to train ImageNet in 14 minutes and 74.9% accuracy.

- Wilson et al. (2017) argued that adaptive optimization methods tend to generalize less well than SGD and SGD with momentum (although
they did not include K-FAC in their study), while the authors' work reduces the gap in convergence speed.

- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing batch size during training has the same benefits of decaying the learning rate in addition to reducing the number of parameter updates, which corresponds to faster training time. Experiments were performed on different image datasets and various optimizers with different training schedules to prove this result. The paper proposed to increase increase the learning rate and momentum parameter <math>m</math>, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates, but slightly less test set accuracy as mentioned in details in the experiments’ section. In summary, on ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the great findings of this paper is that literature parameters were used, and no hyper parameter tuning was needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- Had several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results does not seem to be too surprising in comparison with other recent papers in the subject.

- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.

- The paper presents interesting ideas. However, it lacks of mathematical and theoretical analysis beyond the idea. Since the experiment is primary on image dataset and it does not provide sufficient theories, the paper itself presents limited applicability to other types.

- Also, in experimental setting, only single training runs from one random initialization is used. It would be better to take the best of many runs or to show confidence intervals.

- It is proposed that we should compare learning rate decay with batch-size increase under the setting that total budget / number of training samples is fixed.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates.arXiv preprint arXiv:1612.05086, 2016.

- L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and J¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and DavidMBlei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-30T02:56:34Z

Vrajendr: /* RELATED WORK */

Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size '''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is a common practice not to have a singular steady learning rate for the learning phase of neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minima, it is beneficial for us to take large steps towards the minima, as it would require a lesser number of steps to converge, but as we approach the minima, our step size should decrease, otherwise we may just keep oscillating around the minima. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the parameter updates required to reach the minima, thus leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used in deep learning training process due to finding minima that generalizes well(Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow and takes lots of time. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to try to speed up this optimization process by taking bigger steps, and hence reduce the number of parameter updates in training a model by using large batch training, which can be divided across many machines.

However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) believed that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N number of training samples, and B batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and optimum fluctuation scale <math>g</math> for a maximum test set accuracy.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate with the same number of training epochs in decreasing the scale of random fluctuations, but with remarkably less number of parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math> or even more reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math> although the later decreases the test accuracy. This has been demonstrated by several experiments on the ImageNet and CIFAR-10 datasets using ResNet-50 and Inception-ResNet-V2 architectures respectively.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.

These equations indicate that the learning rate must decay during training, and second equation is only available when the batch size is constant. To change the batch size, Smith and Le (2017) proposed to interpret SGD as integrating this stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents cost function, w represents the parameters, and η represents the Gaussian random noise. Furthermore, they proved that noise scale g controls the magnitude of random fluctuations in the training dynamics by this formula: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can define <math> g \approx \epsilon \frac{N}{B} </math>. This explains why when the learning rate decreases, noise <math>g</math> decreases, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decays with constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, then the conventional way of decaying the learning rate is followed.

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.

Smith and Le (2017) found that there is an optimal batch size which corresponds to optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> that corresponds to maximum test set accuracy. This means that gradient noise is helpful as it makes SGD escape sharp minima, which does not generalize well.

Simulated Annealing is a famous technique in non-convex optimization. Starting with noise in the training process helps us to discover a wide range of parameters then once we are near the optimum value, noise is reduced to fine tune our final parameters. However, more and more researches like to use the sharper decay schedules like cosine decay or step-function drops. In physical sciences, slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps to converge to the global minimum, which is sharp. But decaying the temperature in discrete steps can make the system stuck in a local minimum, which lead to higher cost and lower curvature. The authors think that deep learning has the same intuition.
.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) included momentum to the equation of the vanilla SGD noise scale that was defined above to be: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which is the same as the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

We can see that the Accumulation variable A, which is initially set to 0, then increases exponentially to reach its steady state value during <math> \frac{B}{N(1-m)} </math> training epochs while <math> \Delta w </math> is suppressed that can reduce the rate of convergence. Moreover, at high momentum, we have three challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stage, large batch size will lead to the instabilities.

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Schedules used as in the below figure:'''

- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant

- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: At the beginning, the learning rate is constant and batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule that will be used if there is a hardware limit affecting a maximum batch size limit.

[[File:Paper_40_Fig_1.png | 800px|center]]

As shown in the below figure: in the left figure (2a), we can observe that for the training set, the three learning curves are exactly the same while in figure 2b, increasing the batch size has a huge advantage of reducing the number of parameter updates.
This concludes that noise scale is the one that needs to be decayed and not the learning rate itself
[[File:Paper_40_Fig_2.png | 800px|center]]

To make sure that these results are the same for the test set as well, in figure 3, we can see that the three learning curves are exactly the same for SGD with momentum, and Nesterov momentum
[[File:Paper_40_Fig_3.png | 800px|center]]

To check for other optimizers as well. the below figure shows the same experiment as in figure 3, which is the three learning curves for test set, but for vanilla SGD and Adam, and showing
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

The authors consider four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increase during training.

Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increase during training, and an increased momentum coefficient of 0.98.

The results of all training schedules, which are presented in the below figure, are documented in the following table:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where batch size is fixed and the learning rate is decayed

“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The below table shows the number of parameter updates and accuracy for different set of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while modifying the number of TPU devices, batch size, learning rate, and then calculating the time to complete 90 epochs, and measuring the accuracy, and performed the following experiments below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted Stochastic gradient descent as stochastic differential equation; the paper built on this idea to include decaying learning rate.

- Mandt et al. (2017) analyzed how to modify SGD for the task of Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once the training is started.

- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017) and successfully trained ResNet-50 on ImageNet in one hour after discovering the proportionality relationship between batch size and learning rate.

- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is appling different learning rates to train ImageNet in 14 minutes and 74.9% accuracy.

- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing batch size during training has the same benefits of decaying the learning rate in addition to reducing the number of parameter updates, which corresponds to faster training time. Experiments were performed on different image datasets and various optimizers with different training schedules to prove this result. The paper proposed to increase increase the learning rate and momentum parameter <math>m</math>, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates, but slightly less test set accuracy as mentioned in details in the experiments’ section. In summary, on ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the great findings of this paper is that literature parameters were used, and no hyper parameter tuning was needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- Had several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results does not seem to be too surprising in comparison with other recent papers in the subject.

- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.

- The paper presents interesting ideas. However, it lacks of mathematical and theoretical analysis beyond the idea. Since the experiment is primary on image dataset and it does not provide sufficient theories, the paper itself presents limited applicability to other types.

- Also, in experimental setting, only single training runs from one random initialization is used. It would be better to take the best of many runs or to show confidence intervals.

- It is proposed that we should compare learning rate decay with batch-size increase under the setting that total budget / number of training samples is fixed.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates.arXiv preprint arXiv:1612.05086, 2016.

- L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and J¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and DavidMBlei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

DON'T DECAY THE LEARNING RATE , INCREASE THE BATCH SIZE

2018-11-30T02:54:24Z

Vrajendr: /* INCREASING THE EFFECTIVE LEARNING RATE */

Summary of the ICLR 2018 paper: '''Don't Decay the learning Rate, Increase the Batch Size '''

Link: [https://arxiv.org/pdf/1711.00489.pdf]

Summarized by: Afify, Ahmed [ID: 20700841]

==INTUITION==
Nowadays, it is a common practice not to have a singular steady learning rate for the learning phase of neural network models. Instead, we use adaptive learning rates with the standard gradient descent method. The intuition behind this is that when we are far away from the minima, it is beneficial for us to take large steps towards the minima, as it would require a lesser number of steps to converge, but as we approach the minima, our step size should decrease, otherwise we may just keep oscillating around the minima. In practice, this is generally achieved by methods like SGD with momentum, Nesterov momentum, and Adam. However, the core claim of this paper is that the same effect can be achieved by increasing the batch size during the gradient descent process while keeping the learning rate constant throughout. In addition, the paper argues that such an approach also reduces the parameter updates required to reach the minima, thus leading to greater parallelism and shorter training times.

== INTRODUCTION ==
Although stochastic gradient descent (SGD) is widely used in deep learning training process due to finding minima that generalizes well(Zhang et al., 2016; Wilson et al., 2017), the optimization process is slow and takes lots of time. According to (Goyal et al., 2017; Hoffer et al., 2017; You et al., 2017a), this has motivated researchers to try to speed up this optimization process by taking bigger steps, and hence reduce the number of parameter updates in training a model by using large batch training, which can be divided across many machines.

However, increasing the batch size leads to decreasing the test set accuracy (Keskar et al., 2016; Goyal et al., 2017). Smith and Le (2017) believed that SGD has a scale of random fluctuations <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N number of training samples, and B batch size. They concluded that there is an optimal batch size proportional to the learning rate when <math> B \ll N </math>, and optimum fluctuation scale <math>g</math> for a maximum test set accuracy.

In this paper, the authors' main goal is to provide evidence that increasing the batch size is quantitatively equivalent to decreasing the learning rate with the same number of training epochs in decreasing the scale of random fluctuations, but with remarkably less number of parameter updates. Moreover, an additional reduction in the number of parameter updates can be attained by increasing the learning rate and scaling <math> B \propto \epsilon </math> or even more reduction by increasing the momentum coefficient and scaling <math> B \propto \frac{1}{1-m} </math> although the later decreases the test accuracy. This has been demonstrated by several experiments on the ImageNet and CIFAR-10 datasets using ResNet-50 and Inception-ResNet-V2 architectures respectively.

== STOCHASTIC GRADIENT DESCENT AND CONVEX OPTIMIZATION ==
As mentioned in the previous section, the drawback of SGD when compared to full-batch training is the noise that it introduces that hinders optimization. According to (Robbins & Monro, 1951), there are two equations that govern how to reach the minimum of a convex function: (<math> \epsilon_i </math> denotes the learning rate at the <math> i^{th} </math> gradient update)

<math> \sum_{i=1}^{\infty} \epsilon_i = \infty </math>. This equation guarantees that we will reach the minimum

<math> \sum_{i=1}^{\infty} \epsilon^2_i < \infty </math>. This equation, which is valid only for a fixed batch size, guarantees that learning rate decays fast enough allowing us to reach the minimum rather than bouncing due to noise.

These equations indicate that the learning rate must decay during training, and second equation is only available when the batch size is constant. To change the batch size, Smith and Le (2017) proposed to interpret SGD as integrating this stochastic differential equation <math> \frac{dw}{dt} = -\frac{dC}{dw} + \eta(t) </math>, where C represents cost function, w represents the parameters, and η represents the Gaussian random noise. Furthermore, they proved that noise scale g controls the magnitude of random fluctuations in the training dynamics by this formula: <math> g = \epsilon (\frac{N}{B}-1) </math>, where <math> \epsilon </math> is the learning rate, N is the training set size and <math>B</math> is the batch size. As we usually have <math> B \ll N </math>, we can define <math> g \approx \epsilon \frac{N}{B} </math>. This explains why when the learning rate decreases, noise <math>g</math> decreases, enabling us to converge to the minimum of the cost function. However, increasing the batch size has the same effect and makes <math>g</math> decays with constant learning rate. In this work, the batch size is increased until <math> B \approx \frac{N}{10} </math>, then the conventional way of decaying the learning rate is followed.

== SIMULATED ANNEALING AND THE GENERALIZATION GAP ==
'''Simulated Annealing:''' Introducing random noise or fluctuations whose scale falls during training.

'''Generalization Gap:''' Small batch data generalizes better to the test set than large batch data.

Smith and Le (2017) found that there is an optimal batch size which corresponds to optimal noise scale g <math> (g \approx \epsilon \frac{N}{B}) </math> and concluded that <math> B_{opt} \propto \epsilon N </math> that corresponds to maximum test set accuracy. This means that gradient noise is helpful as it makes SGD escape sharp minima, which does not generalize well.

Simulated Annealing is a famous technique in non-convex optimization. Starting with noise in the training process helps us to discover a wide range of parameters then once we are near the optimum value, noise is reduced to fine tune our final parameters. However, more and more researches like to use the sharper decay schedules like cosine decay or step-function drops. In physical sciences, slowly annealing (or decaying) the temperature (which is the noise scale in this situation) helps to converge to the global minimum, which is sharp. But decaying the temperature in discrete steps can make the system stuck in a local minimum, which lead to higher cost and lower curvature. The authors think that deep learning has the same intuition.
.

== THE EFFECTIVE LEARNING RATE AND THE ACCUMULATION VARIABLE ==
'''The Effective Learning Rate''' : <math> \epsilon_{eff} = \frac{\epsilon}{1-m} </math>

Smith and Le (2017) included momentum to the equation of the vanilla SGD noise scale that was defined above to be: <math> g = \frac{\epsilon}{1-m}(\frac{N}{B}-1)\approx \frac{\epsilon N}{B(1-m)} </math>, which is the same as the previous equation when m goes to 0. They found that increasing the learning rate and momentum coefficient and scaling <math> B \propto \frac{\epsilon }{1-m} </math> reduces the number of parameter updates, but the test accuracy decreases when the momentum coefficient is increased.

To understand the reasons behind this, we need to analyze momentum update equations below:

<center><math>
\Delta A = -(1-m)A + \frac{d\widehat{C}}{dw}
</math>

<math>
\Delta w = -A\epsilon
</math>
</center>

We can see that the Accumulation variable A, which is initially set to 0, then increases exponentially to reach its steady state value during <math> \frac{B}{N(1-m)} </math> training epochs while <math> \Delta w </math> is suppressed that can reduce the rate of convergence. Moreover, at high momentum, we have three challenges:

1- Additional epochs are needed to catch up with the accumulation.

2- Accumulation needs more time <math> \frac{B}{N(1-m)} </math> to forget old gradients.

3- After this time, however, the accumulation cannot adapt to changes in the loss landscape.

4- In the early stage, large batch size will lead to the instabilities.

== EXPERIMENTS ==
=== SIMULATED ANNEALING IN A WIDE RESNET ===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Schedules used as in the below figure:'''

- Decaying learning rate: learning rate decays by a factor of 5 at a sequence of “steps”, and the batch size is constant

- Increasing batch size: learning rate is constant, and the batch size is increased by a factor of 5 at every step.

- Hybrid: At the beginning, the learning rate is constant and batch size is increased by a factor of 5. Then, the learning rate decays by a factor of 5 at each subsequent step, and the batch size is constant. This is the schedule that will be used if there is a hardware limit affecting a maximum batch size limit.

[[File:Paper_40_Fig_1.png | 800px|center]]

As shown in the below figure: in the left figure (2a), we can observe that for the training set, the three learning curves are exactly the same while in figure 2b, increasing the batch size has a huge advantage of reducing the number of parameter updates.
This concludes that noise scale is the one that needs to be decayed and not the learning rate itself
[[File:Paper_40_Fig_2.png | 800px|center]]

To make sure that these results are the same for the test set as well, in figure 3, we can see that the three learning curves are exactly the same for SGD with momentum, and Nesterov momentum
[[File:Paper_40_Fig_3.png | 800px|center]]

To check for other optimizers as well. the below figure shows the same experiment as in figure 3, which is the three learning curves for test set, but for vanilla SGD and Adam, and showing
[[File:Paper_40_Fig_4.png | 800px|center]]

'''Conclusion:''' Decreasing the learning rate and increasing the batch size during training are equivalent

=== INCREASING THE EFFECTIVE LEARNING RATE===

'''Dataset:''' CIFAR-10 (50,000 training images)

'''Network Architecture:''' “16-4” wide ResNet

'''Training Parameters:''' Optimization Algorithm: SGD with momentum / Maximum batch size = 5120

'''Training Schedules:'''

The authors consider four training schedules, all of which decay the noise scale by a factor of five in a series of three steps with the same number of epochs.

Original training schedule: initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128.

Increasing batch size: learning rate of 0.1, momentum coefficient of 0.9, initial batch size of 128 that increases by a factor of 5 at each step.

Increased initial learning rate: initial learning rate of 0.5, initial batch size of 640 that increase during training.

Increased momentum coefficient: increased initial learning rate of 0.5, initial batch size of 3200 that increase during training, and an increased momentum coefficient of 0.98.

The results of all training schedules, which are presented in the below figure, are documented in the following table:

[[File:Paper_40_Table_1.png | 800px|center]]

[[File:Paper_40_Fig_5.png | 800px|center]]

'''Conclusion:''' Increasing the effective learning rate and scaling the batch size results in further reduction in the number of parameter updates

=== TRAINING IMAGENET IN 2500 PARAMETER UPDATES===

'''A) Experiment Goal:''' Control Batch Size

'''Dataset:''' ImageNet (1.28 million training images)

The paper modified the setup of Goyal et al. (2017), and used the following configuration:

'''Network Architecture:''' Inception-ResNet-V2

'''Training Parameters:'''

90 epochs / noise decayed at epoch 30, 60, and 80 by a factor of 10 / Initial ghost batch size = 32 / Learning rate = 3 / momentum coefficient = 0.9 / Initial batch size = 8192

Two training schedules were used:

“Decaying learning rate”, where batch size is fixed and the learning rate is decayed

“Increasing batch size”, where batch size is increased to 81920 then the learning rate is decayed at two steps.

[[File:Paper_40_Table_2.png | 800px|center]]

[[File:Paper_40_Fig_6.png | 800px|center]]

'''Conclusion:''' Increasing the batch size resulted in reducing the number of parameter updates from 14,000 to 6,000.

'''B) Experiment Goal:''' Control Batch Size and Momentum Coefficient

'''Training Parameters:''' Ghost batch size = 64 / noise decayed at epoch 30, 60, and 80 by a factor of 10.

The below table shows the number of parameter updates and accuracy for different set of training parameters:

[[File:Paper_40_Table_3.png | 800px|center]]

[[File:Paper_40_Fig_7.png | 800px|center]]

'''Conclusion:''' Increasing the momentum reduces the number of parameter updates, but leads to a drop in the test accuracy.

=== TRAINING IMAGENET IN 30 MINUTES===

'''Dataset:''' ImageNet (Already introduced in the previous section)

'''Network Architecture:''' ResNet-50

The paper replicated the setup of Goyal et al. (2017) while modifying the number of TPU devices, batch size, learning rate, and then calculating the time to complete 90 epochs, and measuring the accuracy, and performed the following experiments below:

[[File:Paper_40_Table_4.png | 800px|center]]

'''Conclusion:''' Model training times can be reduced by increasing the batch size during training.

== RELATED WORK ==
Main related work mentioned in the paper is as follows:

- Smith & Le (2017) interpreted Stochastic gradient descent as stochastic differential equation, which the paper built on this idea to include decaying learning rate.

- Mandt et al. (2017) analyzed how to modify SGD for the task of Bayesian posterior sampling.

- Keskar et al. (2016) focused on the analysis of noise once the training is started.

- Moreover, the proportional relationship between batch size and learning rate was first discovered by Goyal et al. (2017) and successfully trained ResNet-50 on ImageNet in one hour after discovering the proportionality relationship between batch size and learning rate.

- Furthermore, You et al. (2017a) presented Layer-wise Adaptive Rate Scaling (LARS), which is appling different learning rates to train ImageNet in 14 minutes and 74.9% accuracy.

- Finally, another strategy called Asynchronous-SGD that allowed (Recht et al., 2011; Dean et al., 2012) to use multiple GPUs even with small batch sizes.

== CONCLUSIONS ==
Increasing batch size during training has the same benefits of decaying the learning rate in addition to reducing the number of parameter updates, which corresponds to faster training time. Experiments were performed on different image datasets and various optimizers with different training schedules to prove this result. The paper proposed to increase increase the learning rate and momentum parameter <math>m</math>, while scaling <math> B \propto \frac{\epsilon}{1-m} </math>, which achieves fewer parameter updates, but slightly less test set accuracy as mentioned in details in the experiments’ section. In summary, on ImageNet dataset, Inception-ResNet-V2 achieved 77% validation accuracy in under 2500 parameter updates, and ResNet-50 achieved 76.1% validation set accuracy on TPU in less than 30 minutes. One of the great findings of this paper is that literature parameters were used, and no hyper parameter tuning was needed.

== CRITIQUE ==
'''Pros:'''

- The paper showed empirically that increasing batch size and decaying learning rate are equivalent.

- Several experiments were performed on different optimizers such as SGD and Adam.

- Had several comparisons with previous experimental setups.

'''Cons:'''

- All datasets used are image datasets. Other experiments should have been done on datasets from different domains to ensure generalization.

- The number of parameter updates was used as a comparison criterion, but wall-clock times could have provided additional measurable judgment although they depend on the hardware used.

- Special hardware is needed for large batch training, which is not always feasible. As batch-size increases, we generally need more RAM to train the same model. However, if learning rate is decreased, the RAM use remains constant. As a result, learning rate decay will allow us to train bigger models.

- In section 5.2 (Increasing the Effective Learning rate), the authors did not test a range of learning rate values and used only (0.1 and 0.5). Additional results from varying the initial learning rate values from 0.1 to 3.2 are provided in the appendix, which indicates that the test accuracy begins to fall for initial learning rates greater than ~0.4. The appended results do not show validation set accuracy curves like in Figure 6, however. It would be beneficial to see if they were similar to the original 0.1 and 0.5 initial learning rate baselines.

- Although the main idea of the paper is interesting, its results does not seem to be too surprising in comparison with other recent papers in the subject.

- The paper could benefit from using some other models to demonstrate its claim and generalize its idea by adding some comparisons with other models as well as other recent methods to increase batch size.

- The paper presents interesting ideas. However, it lacks of mathematical and theoretical analysis beyond the idea. Since the experiment is primary on image dataset and it does not provide sufficient theories, the paper itself presents limited applicability to other types.

- Also, in experimental setting, only single training runs from one random initialization is used. It would be better to take the best of many runs or to show confidence intervals.

- It is proposed that we should compare learning rate decay with batch-size increase under the setting that total budget / number of training samples is fixed.

== REFERENCES ==
- Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

- Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates.arXiv preprint arXiv:1612.05086, 2016.

- L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.arXiv preprint arXiv:1606.04838, 2016.

- Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.

- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, pp. 1504–1513, 2017.

- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

- Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

- Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

- Sepp Hochreiter and J¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

- Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.

- Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

- Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

- Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2017.

- Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. arXiv preprint arXiv:1608.03983, 2016.

- Stephan Mandt, Matthew D Hoffman, and DavidMBlei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.

- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

- Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

- Lutz Prechelt. Early stopping-but when? Neural Networks: Tricks of the trade, pp. 553–553, 1998.

- Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701, 2011.

- Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.

- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

- Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017a.

- Yang You, Zhao Zhang, C Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017b.

- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

learn what not to learn

2018-11-25T23:09:09Z

Vrajendr: /* Conclusion */

=Introduction=
In reinforcement learning, it is often difficult for agent to learn when the action space is large. For a specific case that many actions are irrelevant, it is sometimes easier for the algorithm to learn which action not to take. The paper propose a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it propose a system that learns the approximation of Q-function and concurrently learns to eliminate actions. The method need to utilize an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played. (e.g., Player: "Climb the tree." Parser: "There are no trees to climb") Then a machine learning model can be trained to generalize to unseen states.

The paper focus mainly on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a DQN network and an Action Elimination Network(AEN), both using the CNN for NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''

The text-based game called "Zork", which let player to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AE algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.

Below shows an example for the Zork interface:

[[File:AEF_zork_interface.png]]

All state and action are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.

=Related Work=
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordability extraction and common sense. It also often introduce stochastic dynamics to increase randomness.

Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embedding with neural networks.

DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.

RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordability extraction via inner products of pre-trained word embedding.

=Action Elimination=

After executing an action, the agent observes a binary elimination signal e(s, a) to determine which actions not to take. It equals 1
if action a may be eliminated in state s (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following
definitions:

'''Definition 1:'''

Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.

'''Definition 2:'''

Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.

'''Definition 3:'''

Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.

The approach in the paper builds on the standard RL formulation. At each time step t, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and next state <math display="inline">s_{t+1} </math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discount return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. </math>After executing an action, the agent observes a binary elimination signal e(s,a), which equals to 1 if action a can be eliminated for state s, 0 otherwise.

==Advantages of Action Elimination==
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity.

Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping leading to faster convergence.

Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.

==Action elimination with contextual bandits==

Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.

According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math>

<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math>

Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:

<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math>

with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.

==Concurrent Learning==
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space.

If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.

'''Proposition 1:'''

Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2
+1</math> times.

Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.

=Method=
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. So the paper proposes to use the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. So batch-updates framework(Levine et al., 2017;Riquelme, Tucker, and Snoek, 2018) is used, where a new contextual bandit model is learned for every few steps that uses the last layer activation of the AEN as features.

==Architecture of action elimination framework==

[[File:AEF_architecture.png]]

After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent use it to learn two function approximation deep neural networks: A DQN and an AEN. AEN provides an admissible actions set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN(100 convolutional filters for AEN and 500 for DQN, with three different 1D kernels of length (1,2,3)), based on(Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero padded to a length of 50 descriptor + 15 inventory words and each word is embedded into continuous vectors using word2vec in <math display="inline">R^{300}</math>. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]

==Psuedocode of the Algorithm==

[[File:AEF_pseudocode.png]]

AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm creates a linear contextual bandit model from it every L iterations with procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model.AENUpdate() then solved this model and plugin it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Target() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible actions set. For exploitation, it selects the action with highest Q-value by taking an argmax on Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The targets() procedure is estimating the value function by taking max over Q-values only among admissible actions, hence, reducing function approximation errors.

=Experiment=
==Zork domain==
The world of Zork presents a rich environment with a large state and action space.
Zork players describe their actions using natural language instructions. For example, "open the mailbox". Then their actions were processed by a sophisticated natural language parser. Based on the results, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points that are generated from the game's scoring system are given to the agent as the reward. For example, the player gets the points when solving the puzzles. Placing all treasures in the trophy will get 350 points. The elimination signal is given in two forms, "wrong parse" flag, and text feedback "you cannot take that". These two signals are grouped together into a single binary signal which then provided to the algorithm.

Experiments begin with the two subdomains of Zork domains: Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates is upon completing the quest or after T steps are taken. The discounted factor for training is <math display="inline">\gamma=0.8</math> and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments.

===Egg Quest===
The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. The agent will get 100 points upon completing this task. For action space, there are 9 fixed actions for navigation, and a second subset which consisting <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200 (set A_1), N_{Take}=300 (set A_2)</math> has been tested separately.
AE-DQN (blue) and a vanilla DQN agent (green) has been tested in this quest.

[[File:AEF_zork_comparison.png]]

Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents has performed well on sets a and c. However the AE-DQN agent has learned much faster than the DQN on set b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation to note is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.

===Troll Quest===
The goal of this quest is to find the troll. To do it the agent need to find the way to the house, use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> is 200 take actions and 15 necessary actions, 215 in total.

[[File:AEF_troll_comparison.png]]

The red line above is an "optimal elimination" baseline which consists of only 35 actions(15 essential, and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and its improvement over DQN is more significant in the Troll Quest than the Egg quest. Also, it achieves compatible performance to the "optimal elimination" baseline.

===Open Zork===
Lastly, the "Open Zork" domain has been tested which only the environment reward has been used. 1M steps has been trained. Each trajectory terminates after T=200 steps. Two action sets have been used:<math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) that is required to solve the game. <math display="inline">A_4</math>, the "Open Zork" action set (1227) which composed of {Verb, Object} tuples for all the verbs and objects in the game.

[[File:AEF_open_zork_comparison.png]]

The above Figure shows the learning curve for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperform the DQN (blue) in terms of speed and cumulative reward.

=Conclusion=
In this paper, the authors proposed a Deep Reinforcement Learning model for sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions, using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.

For future work the authors aim to investigate more sophisticated architectures and tackle learning shared representations for elimination and control which may boost performance on both tasks.

They also hope to to investigate other mechanisms for action elimination, such as eliminating actions that result from low Q-values as in Even-Dar, Mannor, and Mansour, 2003.

=Critique=
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the most famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO.

=Reference=

learn what not to learn

2018-11-25T23:08:23Z

Vrajendr: /* Conclusion */

=Introduction=
In reinforcement learning, it is often difficult for agent to learn when the action space is large. For a specific case that many actions are irrelevant, it is sometimes easier for the algorithm to learn which action not to take. The paper propose a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it propose a system that learns the approximation of Q-function and concurrently learns to eliminate actions. The method need to utilize an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played. (e.g., Player: "Climb the tree." Parser: "There are no trees to climb") Then a machine learning model can be trained to generalize to unseen states.

The paper focus mainly on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a DQN network and an Action Elimination Network(AEN), both using the CNN for NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''

The text-based game called "Zork", which let player to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AE algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.

Below shows an example for the Zork interface:

[[File:AEF_zork_interface.png]]

All state and action are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.

=Related Work=
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordability extraction and common sense. It also often introduce stochastic dynamics to increase randomness.

Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embedding with neural networks.

DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.

RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordability extraction via inner products of pre-trained word embedding.

=Action Elimination=

After executing an action, the agent observes a binary elimination signal e(s, a) to determine which actions not to take. It equals 1
if action a may be eliminated in state s (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following
definitions:

'''Definition 1:'''

Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.

'''Definition 2:'''

Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.

'''Definition 3:'''

Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.

The approach in the paper builds on the standard RL formulation. At each time step t, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and next state <math display="inline">s_{t+1} </math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discount return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. </math>After executing an action, the agent observes a binary elimination signal e(s,a), which equals to 1 if action a can be eliminated for state s, 0 otherwise.

==Advantages of Action Elimination==
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity.

Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping leading to faster convergence.

Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.

==Action elimination with contextual bandits==

Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.

According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math>

<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math>

Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:

<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math>

with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.

==Concurrent Learning==
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space.

If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.

'''Proposition 1:'''

Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2
+1</math> times.

Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.

=Method=
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. So the paper proposes to use the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. So batch-updates framework(Levine et al., 2017;Riquelme, Tucker, and Snoek, 2018) is used, where a new contextual bandit model is learned for every few steps that uses the last layer activation of the AEN as features.

==Architecture of action elimination framework==

[[File:AEF_architecture.png]]

After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent use it to learn two function approximation deep neural networks: A DQN and an AEN. AEN provides an admissible actions set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN(100 convolutional filters for AEN and 500 for DQN, with three different 1D kernels of length (1,2,3)), based on(Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero padded to a length of 50 descriptor + 15 inventory words and each word is embedded into continuous vectors using word2vec in <math display="inline">R^{300}</math>. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]

==Psuedocode of the Algorithm==

[[File:AEF_pseudocode.png]]

AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm creates a linear contextual bandit model from it every L iterations with procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model.AENUpdate() then solved this model and plugin it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Target() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible actions set. For exploitation, it selects the action with highest Q-value by taking an argmax on Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The targets() procedure is estimating the value function by taking max over Q-values only among admissible actions, hence, reducing function approximation errors.

=Experiment=
==Zork domain==
The world of Zork presents a rich environment with a large state and action space.
Zork players describe their actions using natural language instructions. For example, "open the mailbox". Then their actions were processed by a sophisticated natural language parser. Based on the results, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points that are generated from the game's scoring system are given to the agent as the reward. For example, the player gets the points when solving the puzzles. Placing all treasures in the trophy will get 350 points. The elimination signal is given in two forms, "wrong parse" flag, and text feedback "you cannot take that". These two signals are grouped together into a single binary signal which then provided to the algorithm.

Experiments begin with the two subdomains of Zork domains: Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates is upon completing the quest or after T steps are taken. The discounted factor for training is <math display="inline">\gamma=0.8</math> and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments.

===Egg Quest===
The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. The agent will get 100 points upon completing this task. For action space, there are 9 fixed actions for navigation, and a second subset which consisting <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200 (set A_1), N_{Take}=300 (set A_2)</math> has been tested separately.
AE-DQN (blue) and a vanilla DQN agent (green) has been tested in this quest.

[[File:AEF_zork_comparison.png]]

Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents has performed well on sets a and c. However the AE-DQN agent has learned much faster than the DQN on set b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation to note is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.

===Troll Quest===
The goal of this quest is to find the troll. To do it the agent need to find the way to the house, use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> is 200 take actions and 15 necessary actions, 215 in total.

[[File:AEF_troll_comparison.png]]

The red line above is an "optimal elimination" baseline which consists of only 35 actions(15 essential, and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and its improvement over DQN is more significant in the Troll Quest than the Egg quest. Also, it achieves compatible performance to the "optimal elimination" baseline.

===Open Zork===
Lastly, the "Open Zork" domain has been tested which only the environment reward has been used. 1M steps has been trained. Each trajectory terminates after T=200 steps. Two action sets have been used:<math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) that is required to solve the game. <math display="inline">A_4</math>, the "Open Zork" action set (1227) which composed of {Verb, Object} tuples for all the verbs and objects in the game.

[[File:AEF_open_zork_comparison.png]]

The above Figure shows the learning curve for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperform the DQN (blue) in terms of speed and cumulative reward.

=Conclusion=
In this paper, the authors proposed a Deep Reinforcement Learning model for sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions, using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.

For future work the authors aim to investigate more sophisticated architectures and tackle learning shared representations for elimination and control which may boost performance on both tasks.

=Critique=
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the most famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO.

=Reference=

learn what not to learn

2018-11-25T23:07:55Z

Vrajendr: /* Conclusion */

=Introduction=
In reinforcement learning, it is often difficult for agent to learn when the action space is large. For a specific case that many actions are irrelevant, it is sometimes easier for the algorithm to learn which action not to take. The paper propose a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it propose a system that learns the approximation of Q-function and concurrently learns to eliminate actions. The method need to utilize an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played. (e.g., Player: "Climb the tree." Parser: "There are no trees to climb") Then a machine learning model can be trained to generalize to unseen states.

The paper focus mainly on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a DQN network and an Action Elimination Network(AEN), both using the CNN for NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''

The text-based game called "Zork", which let player to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AE algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.

Below shows an example for the Zork interface:

[[File:AEF_zork_interface.png]]

All state and action are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.

=Related Work=
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordability extraction and common sense. It also often introduce stochastic dynamics to increase randomness.

Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embedding with neural networks.

DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.

RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordability extraction via inner products of pre-trained word embedding.

=Action Elimination=

After executing an action, the agent observes a binary elimination signal e(s, a) to determine which actions not to take. It equals 1
if action a may be eliminated in state s (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following
definitions:

'''Definition 1:'''

Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.

'''Definition 2:'''

Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.

'''Definition 3:'''

Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.

The approach in the paper builds on the standard RL formulation. At each time step t, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and next state <math display="inline">s_{t+1} </math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discount return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. </math>After executing an action, the agent observes a binary elimination signal e(s,a), which equals to 1 if action a can be eliminated for state s, 0 otherwise.

==Advantages of Action Elimination==
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity.

Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping leading to faster convergence.

Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.

==Action elimination with contextual bandits==

Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.

According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math>

<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math>

Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:

<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math>

with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.

==Concurrent Learning==
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space.

If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.

'''Proposition 1:'''

Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2
+1</math> times.

Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.

=Method=
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. So the paper proposes to use the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. So batch-updates framework(Levine et al., 2017;Riquelme, Tucker, and Snoek, 2018) is used, where a new contextual bandit model is learned for every few steps that uses the last layer activation of the AEN as features.

==Architecture of action elimination framework==

[[File:AEF_architecture.png]]

After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent use it to learn two function approximation deep neural networks: A DQN and an AEN. AEN provides an admissible actions set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN(100 convolutional filters for AEN and 500 for DQN, with three different 1D kernels of length (1,2,3)), based on(Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero padded to a length of 50 descriptor + 15 inventory words and each word is embedded into continuous vectors using word2vec in <math display="inline">R^{300}</math>. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]

==Psuedocode of the Algorithm==

[[File:AEF_pseudocode.png]]

AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm creates a linear contextual bandit model from it every L iterations with procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model.AENUpdate() then solved this model and plugin it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Target() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible actions set. For exploitation, it selects the action with highest Q-value by taking an argmax on Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The targets() procedure is estimating the value function by taking max over Q-values only among admissible actions, hence, reducing function approximation errors.

=Experiment=
==Zork domain==
The world of Zork presents a rich environment with a large state and action space.
Zork players describe their actions using natural language instructions. For example, "open the mailbox". Then their actions were processed by a sophisticated natural language parser. Based on the results, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points that are generated from the game's scoring system are given to the agent as the reward. For example, the player gets the points when solving the puzzles. Placing all treasures in the trophy will get 350 points. The elimination signal is given in two forms, "wrong parse" flag, and text feedback "you cannot take that". These two signals are grouped together into a single binary signal which then provided to the algorithm.

Experiments begin with the two subdomains of Zork domains: Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates is upon completing the quest or after T steps are taken. The discounted factor for training is <math display="inline">\gamma=0.8</math> and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments.

===Egg Quest===
The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. The agent will get 100 points upon completing this task. For action space, there are 9 fixed actions for navigation, and a second subset which consisting <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200 (set A_1), N_{Take}=300 (set A_2)</math> has been tested separately.
AE-DQN (blue) and a vanilla DQN agent (green) has been tested in this quest.

[[File:AEF_zork_comparison.png]]

Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents has performed well on sets a and c. However the AE-DQN agent has learned much faster than the DQN on set b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation to note is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.

===Troll Quest===
The goal of this quest is to find the troll. To do it the agent need to find the way to the house, use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> is 200 take actions and 15 necessary actions, 215 in total.

[[File:AEF_troll_comparison.png]]

The red line above is an "optimal elimination" baseline which consists of only 35 actions(15 essential, and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and its improvement over DQN is more significant in the Troll Quest than the Egg quest. Also, it achieves compatible performance to the "optimal elimination" baseline.

===Open Zork===
Lastly, the "Open Zork" domain has been tested which only the environment reward has been used. 1M steps has been trained. Each trajectory terminates after T=200 steps. Two action sets have been used:<math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) that is required to solve the game. <math display="inline">A_4</math>, the "Open Zork" action set (1227) which composed of {Verb, Object} tuples for all the verbs and objects in the game.

[[File:AEF_open_zork_comparison.png]]

The above Figure shows the learning curve for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperform the DQN (blue) in terms of speed and cumulative reward.

=Conclusion=
In this paper, the authors proposed a Deep Reinforcement Learning model for sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions, using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.

For future work the authors aim to tackle learning shared representations for elimination and control which may boost performance on both tasks.

=Critique=
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the most famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO.

=Reference=

learn what not to learn

2018-11-25T23:06:41Z

Vrajendr: /* Conclusion */

=Introduction=
In reinforcement learning, it is often difficult for agent to learn when the action space is large. For a specific case that many actions are irrelevant, it is sometimes easier for the algorithm to learn which action not to take. The paper propose a new reinforcement learning approach for dealing with large action spaces by restricting the available actions in each state to a subset of the most likely ones. More specifically, it propose a system that learns the approximation of Q-function and concurrently learns to eliminate actions. The method need to utilize an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played. (e.g., Player: "Climb the tree." Parser: "There are no trees to climb") Then a machine learning model can be trained to generalize to unseen states.

The paper focus mainly on tasks where both states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a DQN network and an Action Elimination Network(AEN), both using the CNN for NLP tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''

The text-based game called "Zork", which let player to interact with a virtual world through a text based interface, is tested by using the elimination framework. The AE algorithm has achieved faster learning rate than the baseline agents through eliminating irrelevant actions.

Below shows an example for the Zork interface:

[[File:AEF_zork_interface.png]]

All state and action are given in natural language. Input for the game contains more than a thousand possible actions in each state since player can type anything.

=Related Work=
Text-Based Games(TBG): The state of the environment in TBG is described by simple language. The player interacts with the environment with text command which respects a pre-defined grammar. A popular example is Zork which has been tested in the paper. TBG is a good research intersection of RL and NLP, it requires language understanding, long-term memory, planning, exploration, affordability extraction and common sense. It also often introduce stochastic dynamics to increase randomness.

Representations for TBG: Good word representation is necessary in order to learn control policies from texts. Previous work on TBG used pre-trained embeddings directly for control. other works combined pre-trained embedding with neural networks.

DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This is mainly because neural networks can learn rich domain representations for value function and policy. On the other hand, linear representation batch reinforcement learning methods are more stable and accurate, while feature engineering is necessary.

RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspace(Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003), other works proposed to embed the discrete actions into a continuous space, then choose the nearest discrete action according to the optimal actions in the continuous space(Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et. al. (2015)extended DQN to unbounded(natural language) action spaces.
Learning to eliminate actions was first mentioned by (Even-Dar, Mannor, and Mansour, 2003). They proposed to learn confidence intervals around the value function in each state. Lipton et al.(2016a) proposed to learn a classifier that detects hazardous state and then use it to shape the reward. Fulda et al.(2017) presented a method for affordability extraction via inner products of pre-trained word embedding.

=Action Elimination=

After executing an action, the agent observes a binary elimination signal e(s, a) to determine which actions not to take. It equals 1
if action a may be eliminated in state s (and 0 otherwise). The signal helps mitigating the problem of large discrete action spaces. We start with the following
definitions:

'''Definition 1:'''

Valid state-action pairs with respect to an elimination signal are state action pairs which the elimination process should not eliminate.

'''Definition 2:'''

Admissible state-action pairs with respect to an elimination algorithm are state action pairs which the elimination algorithm does not eliminate.

'''Definition 3:'''

Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.

The approach in the paper builds on the standard RL formulation. At each time step t, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and next state <math display="inline">s_{t+1} </math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discount return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]. </math>After executing an action, the agent observes a binary elimination signal e(s,a), which equals to 1 if action a can be eliminated for state s, 0 otherwise.

==Advantages of Action Elimination==
The main advantages of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity.

Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, this phenomenon becomes more noticeable when the action space is large. Action elimination mitigate this effect by taking the max operator only on valid actions, thus, reducing potential overestimation. Besides, by ignoring the invalid actions, the function approximation can also learn a simpler mapping leading to faster convergence.

Sample complexity: The sample complexity measures the number of steps during learning, in which the policy is not <math display="inline">\epsilon</math>-optimal. The invalid action often returns no reward and doesn't change the state, (Lattimore and Hutter, 2012)resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>, and this translates to <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}log(1/\delta)</math> wasted samples for learning each invalid state-action pair. Practically, elimination algorithm can eliminate these invalid actions and therefore speed up the learning process approximately by <math display="inline">A/A'</math>.

==Action elimination with contextual bandits==

Let <math display="inline">x(s_t)\in R^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in R_d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^Tx(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math>. <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies: <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action. And <math display="inline"> l\leq u</math>. Denote by <math display="inline">X_{t,a}</math> as the matrix whose rows are the observed state representation vectors in which action a was chosen, up to time t. <math display="inline">E_{t,a}</math> as the vector whose elements are the observed state representation elimination signals in which action a was chosen, up to time t. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.

According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)} \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2log(det(\bar{V}_{t,a}^{1/2})det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>, with probability of at least <math display="inline">1-\delta</math>. If <math display="inline">\forall s \Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{dlog(1+tL^2/\lambda/\delta)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math> and bound this probability for all the actions. i.e., <math display="inline">\forall a,t>0</math>

<math display="inline">Pr(|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}) \leq 1-\delta</math>

Recall that <math display="inline">E[e_t(s,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:

<math display="inline">\hat{\theta}_{t,a}^{T}x(s_t)-\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)})>l</math>

with probability <math display="inline">1-\delta</math> that we never eliminate any valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.

==Concurrent Learning==
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space.

If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.

'''Proposition 1:'''

Assume that all state action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s))}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2
+1</math> times.

Notice that when there is no noise in the elimination signal(R=0), we correctly eliminate actions with probability 1. so invalid actions will be sampled a finite number of times.

=Method=
The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> might not hold when using raw features like word2vec. So the paper proposes to use the neural network's last layer as features. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. So batch-updates framework(Levine et al., 2017;Riquelme, Tucker, and Snoek, 2018) is used, where a new contextual bandit model is learned for every few steps that uses the last layer activation of the AEN as features.

==Architecture of action elimination framework==

[[File:AEF_architecture.png]]

After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math>. The agent use it to learn two function approximation deep neural networks: A DQN and an AEN. AEN provides an admissible actions set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN(100 convolutional filters for AEN and 500 for DQN, with three different 1D kernels of length (1,2,3)), based on(Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero padded to a length of 50 descriptor + 15 inventory words and each word is embedded into continuous vectors using word2vec in <math display="inline">R^{300}</math>. The features of the last four states are then concatenated together such that the final state representations s are in <math display="inline">R^{78000}</math>. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]

==Psuedocode of the Algorithm==

[[File:AEF_pseudocode.png]]

AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm creates a linear contextual bandit model from it every L iterations with procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, which are then used to create a contextual linear bandit model.AENUpdate() then solved this model and plugin it into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Target() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible actions set. For exploitation, it selects the action with highest Q-value by taking an argmax on Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The targets() procedure is estimating the value function by taking max over Q-values only among admissible actions, hence, reducing function approximation errors.

=Experiment=
==Zork domain==
The world of Zork presents a rich environment with a large state and action space.
Zork players describe their actions using natural language instructions. For example, "open the mailbox". Then their actions were processed by a sophisticated natural language parser. Based on the results, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points that are generated from the game's scoring system are given to the agent as the reward. For example, the player gets the points when solving the puzzles. Placing all treasures in the trophy will get 350 points. The elimination signal is given in two forms, "wrong parse" flag, and text feedback "you cannot take that". These two signals are grouped together into a single binary signal which then provided to the algorithm.

Experiments begin with the two subdomains of Zork domains: Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates is upon completing the quest or after T steps are taken. The discounted factor for training is <math display="inline">\gamma=0.8</math> and <math display="inline">\gamma=1</math> during evaluation. Also <math display="inline">\beta=0.5, l=0.6</math> in all experiments.

===Egg Quest===
The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. The agent will get 100 points upon completing this task. For action space, there are 9 fixed actions for navigation, and a second subset which consisting <math display="inline">N_{Take}</math> actions for taking possible objects in the game. <math display="inline">N_{Take}=200 (set A_1), N_{Take}=300 (set A_2)</math> has been tested separately.
AE-DQN (blue) and a vanilla DQN agent (green) has been tested in this quest.

[[File:AEF_zork_comparison.png]]

Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents has performed well on sets a and c. However the AE-DQN agent has learned much faster than the DQN on set b, which implies that action elimination is more robust to hyperparameter optimization when the action space is large. One important observation to note is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.

===Troll Quest===
The goal of this quest is to find the troll. To do it the agent need to find the way to the house, use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than Egg Quest. The action set <math display="inline">A_1</math> is 200 take actions and 15 necessary actions, 215 in total.

[[File:AEF_troll_comparison.png]]

The red line above is an "optimal elimination" baseline which consists of only 35 actions(15 essential, and 20 relevant take actions). We can see that AE-DQN still outperforms DQN and its improvement over DQN is more significant in the Troll Quest than the Egg quest. Also, it achieves compatible performance to the "optimal elimination" baseline.

===Open Zork===
Lastly, the "Open Zork" domain has been tested which only the environment reward has been used. 1M steps has been trained. Each trajectory terminates after T=200 steps. Two action sets have been used:<math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) that is required to solve the game. <math display="inline">A_4</math>, the "Open Zork" action set (1227) which composed of {Verb, Object} tuples for all the verbs and objects in the game.

[[File:AEF_open_zork_comparison.png]]

The above Figure shows the learning curve for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperform the DQN (blue) in terms of speed and cumulative reward.

=Conclusion=
In this paper, the authors proposed a Deep Reinforcement Learning model for sub-optimal actions while performing Q-learning. Moreover, they showed that by eliminating actions, using linear contextual bandits with theoretical guarantees of convergence, the size of the action space is reduced, exploration is more effective, and learning is improved when tested on Zork, a text-based game.

=Critique=
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the most famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO.

=Reference=

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-25T22:59:17Z

Vrajendr: /* Conclusion and Discussion */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors proposed a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive(AR) models and gating systems used in recurrent neural networks and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series, and UCI household electricity consumption dataset. This paper focused on time series with multivariate and noisy signals, especially the financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, same signal (e.g. price of stock) is obtained from different sources (e.g. financial news, investment bank, financial analyst etc.) in asynchronous moment of time. Each source has different different bias and noise.(Figure 1) The investment bank with more clients can update their information more precisely than the investment bank with fewer clients, then the significance of each past observations may depend on other factors that changes in time. Therefore, the traditional econometric models such as AR, VAR, VARMA[1] might not be sufficient. However, their relatively good performance could allow us to combine such linear econometric models with deep neural networks that can learn highly nonlinear relationships.

The time series forecasting problem can be expressed as a conditional probability distribution below, we focused on modeling the predictors of future values of time series given their past:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
The predictability of financial dataset still remains an open problem and is discussed in various publications. ([2])

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same CDS2 throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

=Related Work=
===Time series forecasting===
From recent proceedings in main machine learning venues i.e. ICML, NIPS, AISTATS, UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. For this paper, the authors use coupling AR models and neural networks to achieve such combined models.

Although deep neural networks have been applied into many fields and produced satisfactory results, there still are little literature on deep learning for time series forecasting. More recently, the papers include Sirignano (2016)[7] that used 4-layer perceptrons in modeling price change distributions in Limit Order Books, and Borovykh et al. (2017)[8] who applied more recent WaveNet architecture to several short univariate and bivariate time-series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented augmentation of LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and phase LSTM) on AR-like artificial asynchronous and noisy time series, household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.

===Gating and weighting mechanisms===
Gating mechanisms for neural networks has ability to overcome the problem of vanishing gradient, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. This composition of functions may lead to popular recurrent architecture such as LSTM and GRU[11].

The idea of the gating system is aimed to weight outputs of the intermediate layers within neural networks, and is most closely related to softmax gating used in MuFuRu(Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x), p(x)=softmax(\widehat{p}(x)), </math>, where <math>(f^l)_{l=1}^L </math>are candidate outputs(composition operators in MuFuRu), <math>(\widehat{p}^l)_{l=1}^L </math>are linear functions of inputs.

This idea is also successfully used in attention networks[13] such as image captioning and machine translation. In this paper, the method is similar as this. The difference is that modelling the functions as multi-layer CNNs. Another difference is that not using recurrent layers, which can enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
There are mainly five motivations they stated in the paper:
#The forecasting problem in this paper has done almost independently by econometrics and machine learning communities. Unlike in machine learning, research in econometrics are more likely to explain variables rather than improving out-of-sample prediction power. These models tend to 'over-fit' on financial time series, their parameters are unstable and have poor performance on out-of-sample prediction.
#Although Gaussian processes provide useful theoretical framework that is able to handle asynchronous data, they often follow heavy-tailed distribution for financial datasets.
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series with higher order and have more past observations, the expectation of it <math display="inline">\mathbb{E}[X(t)|{X(t-m), m=1,...,M}]</math> may involve more complicated functions that in general may not allow closed-form expression.
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously, such series at fixed frequency may lead to lose information or enlarge the dataset, which is shown in Figure 2(a). Therefore, the core of proposed architecture SOCNN represents separate dimensions as a single one with dimension and duration indicators as additional features(Figure 2(b)).
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that LSTM may memorize the input values in each step and weight them at the output according to the durations, but this approach may lead to imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keep- ing all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there's a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>, we want to predict the conditional future values of a subset of elements of <math>x_n</math>
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | {x_{n-m}, m=1,2,...}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.
Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>. The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m ,</math></div>
This is summation of the columns of the matrix in bracket, where
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks. S is a fully convolutional network which is composed of convolutional layers only. <math>F</math> is in the form of
<math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [off(x_{n-m}) + x_{n-m}^I)]_{m=1}^M </math> where <math> W \in \mathbb{R}^{d_I \times M}</math> and <math> off: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.
#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T,...,a_{d_I}^T)^T)=(\sigma(a_1)^T,...\sigma(a_{d_I})^T)^T </math>
# <math>\otimes</math> is element-wise matrix multiplication.
#<math>A.,_m</math> denotes the m-th column of a matrix A, and <math>\sum_{m=1}^M A.,_m=A(1,1,...,1)^T</math>.
Since <math>\sum_{m=1}^M W.,_m=W(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S.,_m=S(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (off(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, Significance-Offset Convolutional Neural Network, <math>off</math> and <math>S</math> in the equation are corresponding to Offset and Significance in the name respectively.
Figure 3 shows the scheme of network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> forced to separate the temporal dependence (obtained in weights <math>W_m</math>). S is determined by its filters which capture local dependencies and are independent of the relative position in time, the predictors <math>off(x_{n-m})</math> are completely independent of position in time. An adjusted single regressor for the target variable is provided by each past observation through the offset network. Since in asynchronous sampling procedure, consecutive values of x come from different signals, and might be heterogenous, therefore adjustment of offset network is important.In addition, significance network provides data-dependent weight for each regressor and sums them up in an autoregressive manner.

===Relation to asynchronous data===
One common problem of time series is that durations are varying between consecutive observations, the paper states two ways to solve this problem
#Data preprocessing: aligning the observations at some fixed frequency e.g. duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach will tend to loss of information and enlarge the size of the dataset and model complexity.
#Add additional features: Treating duration or time of the observations as additional features, it is the core of SOCNN, which is shown in Figure 2(b).

===Loss function===
The output of the offset network is series of separate predictors of changes between corresponding observations <math>x_{n-m}^I</math> and the target value<math>y_n</math>, this is the reason why we use auxiliary loss function, which equals to mean squared error of such intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||off(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> \textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> was mentioned before, <math>\alpha \geq 0</math> is a constant.

=Experiments=
The paper evaluated SOCNN architecture on three datasets: artificial generated datasets, [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes sent by several market participants active in the credit derivatives market. Comparing its performance with simple CNN, single and multiplayer LSTM and 25-layer ResNet. Apart from evaluation of the SOCNN architecture the paper also discusses the impact of network components such as: such as auxiliary
loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here]

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of 7 features, while durations between consecutive observations are ranged from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
They applied grid search on some hyperparameters in order to get the significance of its components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For offset sub-network's depth, they use 1, 10,1 for artificial, electricity and quotes dataset respectively; and they compared the values of <math>\alpha</math> in {0,0.1,0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.

Table 1 presents the configuration of network hyperparameters used in comparison

[[File:Junyi4.png | 400px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio 3 to 1. The remaining 20% of data was used as a test set.

All models were trained using Adam optimizer, because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization in between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments on Tensorflow and Keras and used different GPU to optimize the model for different datasets.

==Results==
Table 2 shows all results performed from all datasets.
[[File:Junyi5.png | 600px|center|]]
We can see that SOCNN outperforms in all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN almost has the same results with LSTM. Phased LSTM and ResNet have performed really bad on artificial asynchronous dataset and quotes dataset respectively. Notice that having more than one layer of offset network would have negative impact on results. Also, the higher weights of auxiliary loss(<math>\alpha</math>considerably improved the test error on asynchronous dataset, see Table 3. However, for other datasets, its impact was negligible.
[[File:Junyi6.png | 400px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, we want to test the robustness of the proposed model SOCNN, adding noise terms to asynchronous 16 dataset and check how these networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
From Figure 6, the purple line and green line seems staying at the same position in training and testing process. SOCNN and single-layer LSTM are most robust compared to other networks, and least prone to overfitting.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines AR-like weighting mechanism and convolutional neural network. This new architecture is designed for high-noise asynchronous time series, and achieves outperformance in forecasting several asynchronous time series compared to popular convolutional and recurrent networks.

The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension but needs further empirical studies is that we consider not just <math>1 \times 1</math> convolutional kernels on the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.

=Critiques=
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.
#The quote data cannot be reached, only two datasets available.
#The 'Significance' network was described as critical to the model in paper, but they did not show how the performance of SOCNN with respect to the significance network.
#The transform of the original data to asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that train and test data were splitted is unclear. This could be important in the case of the financial data set.
#Although the auxiliary loss function was mentioned as an important part, the advantages of it was not too clear in the paper. Maybe it is better that the paper describes a little more about its effectiveness.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Sˇindela ́ˇr, J., Pˇrikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Condi- tional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learn- ing in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Acceler- ating recurrent network training for long or event-based sequences. In Advances In Neural Information Process- ing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Em- pirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rockta ̈schel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multi- media content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the dif- ficulty of training deep feedforward neural net- works. In In Proceedings of the International Con- ference on Artificial Intelligence and Statistics (AIS- TATSaˆ10). Society for Artificial Intelligence and Statistics, 2010.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-25T22:51:36Z

Vrajendr: /* Experiments */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors proposed a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive(AR) models and gating systems used in recurrent neural networks and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series, and UCI household electricity consumption dataset. This paper focused on time series with multivariate and noisy signals, especially the financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, same signal (e.g. price of stock) is obtained from different sources (e.g. financial news, investment bank, financial analyst etc.) in asynchronous moment of time. Each source has different different bias and noise.(Figure 1) The investment bank with more clients can update their information more precisely than the investment bank with fewer clients, then the significance of each past observations may depend on other factors that changes in time. Therefore, the traditional econometric models such as AR, VAR, VARMA[1] might not be sufficient. However, their relatively good performance could allow us to combine such linear econometric models with deep neural networks that can learn highly nonlinear relationships.

The time series forecasting problem can be expressed as a conditional probability distribution below, we focused on modeling the predictors of future values of time series given their past:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
The predictability of financial dataset still remains an open problem and is discussed in various publications. ([2])

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same CDS2 throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

=Related Work=
===Time series forecasting===
From recent proceedings in main machine learning venues i.e. ICML, NIPS, AISTATS, UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. For this paper, the authors use coupling AR models and neural networks to achieve such combined models.

Although deep neural networks have been applied into many fields and produced satisfactory results, there still are little literature on deep learning for time series forecasting. More recently, the papers include Sirignano (2016)[7] that used 4-layer perceptrons in modeling price change distributions in Limit Order Books, and Borovykh et al. (2017)[8] who applied more recent WaveNet architecture to several short univariate and bivariate time-series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented augmentation of LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and phase LSTM) on AR-like artificial asynchronous and noisy time series, household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.

===Gating and weighting mechanisms===
Gating mechanisms for neural networks has ability to overcome the problem of vanishing gradient, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. This composition of functions may lead to popular recurrent architecture such as LSTM and GRU[11].

The idea of the gating system is aimed to weight outputs of the intermediate layers within neural networks, and is most closely related to softmax gating used in MuFuRu(Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x), p(x)=softmax(\widehat{p}(x)), </math>, where <math>(f^l)_{l=1}^L </math>are candidate outputs(composition operators in MuFuRu), <math>(\widehat{p}^l)_{l=1}^L </math>are linear functions of inputs.

This idea is also successfully used in attention networks[13] such as image captioning and machine translation. In this paper, the method is similar as this. The difference is that modelling the functions as multi-layer CNNs. Another difference is that not using recurrent layers, which can enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
There are mainly five motivations they stated in the paper:
#The forecasting problem in this paper has done almost independently by econometrics and machine learning communities. Unlike in machine learning, research in econometrics are more likely to explain variables rather than improving out-of-sample prediction power. These models tend to 'over-fit' on financial time series, their parameters are unstable and have poor performance on out-of-sample prediction.
#Although Gaussian processes provide useful theoretical framework that is able to handle asynchronous data, they often follow heavy-tailed distribution for financial datasets.
#Predictions of autoregressive time series may involve highly nonlinear functions if sampled irregularly. For AR time series with higher order and have more past observations, the expectation of it <math display="inline">\mathbb{E}[X(t)|{X(t-m), m=1,...,M}]</math> may involve more complicated functions that in general may not allow closed-form expression.
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously, such series at fixed frequency may lead to lose information or enlarge the dataset, which is shown in Figure 2(a). Therefore, the core of proposed architecture SOCNN represents separate dimensions as a single one with dimension and duration indicators as additional features(Figure 2(b)).
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect that LSTM may memorize the input values in each step and weight them at the output according to the durations, but this approach may lead to imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keep- ing all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there's a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>, we want to predict the conditional future values of a subset of elements of <math>x_n</math>
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | {x_{n-m}, m=1,2,...}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.
Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>. The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m ,</math></div>
This is summation of the columns of the matrix in bracket, where
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks. S is a fully convolutional network which is composed of convolutional layers only. <math>F</math> is in the form of
<math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [off(x_{n-m}) + x_{n-m}^I)]_{m=1}^M </math> where <math> W \in \mathbb{R}^{d_I \times M}</math> and <math> off: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.
#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T,...,a_{d_I}^T)^T)=(\sigma(a_1)^T,...\sigma(a_{d_I})^T)^T </math>
# <math>\otimes</math> is element-wise matrix multiplication.
#<math>A.,_m</math> denotes the m-th column of a matrix A, and <math>\sum_{m=1}^M A.,_m=A(1,1,...,1)^T</math>.
Since <math>\sum_{m=1}^M W.,_m=W(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S.,_m=S(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (off(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, Significance-Offset Convolutional Neural Network, <math>off</math> and <math>S</math> in the equation are corresponding to Offset and Significance in the name respectively.
Figure 3 shows the scheme of network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> forced to separate the temporal dependence (obtained in weights <math>W_m</math>). S is determined by its filters which capture local dependencies and are independent of the relative position in time, the predictors <math>off(x_{n-m})</math> are completely independent of position in time. An adjusted single regressor for the target variable is provided by each past observation through the offset network. Since in asynchronous sampling procedure, consecutive values of x come from different signals, and might be heterogenous, therefore adjustment of offset network is important.In addition, significance network provides data-dependent weight for each regressor and sums them up in an autoregressive manner.

===Relation to asynchronous data===
One common problem of time series is that durations are varying between consecutive observations, the paper states two ways to solve this problem
#Data preprocessing: aligning the observations at some fixed frequency e.g. duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach will tend to loss of information and enlarge the size of the dataset and model complexity.
#Add additional features: Treating duration or time of the observations as additional features, it is the core of SOCNN, which is shown in Figure 2(b).

===Loss function===
The output of the offset network is series of separate predictors of changes between corresponding observations <math>x_{n-m}^I</math> and the target value<math>y_n</math>, this is the reason why we use auxiliary loss function, which equals to mean squared error of such intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||off(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> \textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> was mentioned before, <math>\alpha \geq 0</math> is a constant.

=Experiments=
The paper evaluated SOCNN architecture on three datasets: artificial generated datasets, [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes sent by several market participants active in the credit derivatives market. Comparing its performance with simple CNN, single and multiplayer LSTM and 25-layer ResNet. Apart from evaluation of the SOCNN architecture the paper also discusses the impact of network components such as: such as auxiliary
loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here]

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of 7 features, while durations between consecutive observations are ranged from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
They applied grid search on some hyperparameters in order to get the significance of its components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For offset sub-network's depth, they use 1, 10,1 for artificial, electricity and quotes dataset respectively; and they compared the values of <math>\alpha</math> in {0,0.1,0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.

Table 1 presents the configuration of network hyperparameters used in comparison

[[File:Junyi4.png | 400px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio 3 to 1. The remaining 20% of data was used as a test set.

All models were trained using Adam optimizer, because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization in between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments on Tensorflow and Keras and used different GPU to optimize the model for different datasets.

==Results==
Table 2 shows all results performed from all datasets.
[[File:Junyi5.png | 600px|center|]]
We can see that SOCNN outperforms in all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN almost has the same results with LSTM. Phased LSTM and ResNet have performed really bad on artificial asynchronous dataset and quotes dataset respectively. Notice that having more than one layer of offset network would have negative impact on results. Also, the higher weights of auxiliary loss(<math>\alpha</math>considerably improved the test error on asynchronous dataset, see Table 3. However, for other datasets, its impact was negligible.
[[File:Junyi6.png | 400px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, we want to test the robustness of the proposed model SOCNN, adding noise terms to asynchronous 16 dataset and check how these networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
From Figure 6, the purple line and green line seems staying at the same position in training and testing process. SOCNN and single-layer LSTM are most robust compared to other networks, and least prone to overfitting.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called Significance-Offset Convolutional Neural Network, which combines AR-like weighting mechanism and convolutional neural network. This new architecture is designed for high-noise asynchronous time series, and achieves outperformance in forecasting several asynchronous time series compared to popular convolutional and recurrent networks.

The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension but needs further empirical studies is that we consider not just <math>1 \times 1</math> convolutional kernels on the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets.

=Critiques=
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.
#The quote data cannot be reached, only two datasets available.
#The 'Significance' network was described as critical to the model in paper, but they did not show how the performance of SOCNN with respect to the significance network.
#The transform of the original data to asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that train and test data were splitted is unclear. This could be important in the case of the financial data set.
#Although the auxiliary loss function was mentioned as an important part, the advantages of it was not too clear in the paper. Maybe it is better that the paper describes a little more about its effectiveness.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Sˇindela ́ˇr, J., Pˇrikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Condi- tional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learn- ing in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Acceler- ating recurrent network training for long or event-based sequences. In Advances In Neural Information Process- ing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Em- pirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rockta ̈schel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multi- media content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the dif- ficulty of training deep feedforward neural net- works. In In Proceedings of the International Con- ference on Artificial Intelligence and Statistics (AIS- TATSaˆ10). Society for Artificial Intelligence and Statistics, 2010.

policy optimization with demonstrations

2018-11-25T22:48:00Z

Vrajendr: /* Problem Definition */

= Introduction =

The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.

==Intuition==
The agent should imitate the demonstrated behavior when rewards are sparse and then explore new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL. At present the state of the art exploration in Reinforcement learning is simply epsilon greedy which just makes random moves for a small percentage of times to explore unexplored moves. This is very naive and is one of the main reasons for the high sample complexity in RL. On the other hand, if there is an expert demonstrator who can guide exploration, the agent can make more guided and accurate exploratory moves.

=Related Work =
There are some related works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.

For learning from demonstration (LfD),
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.

For imitation learning,
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.

Both of the above methods are effective for imitation learning, but cannot leverage the valuable feedback given by the environments and usually suffer from bad performance when the expert data is imperfect. That is different from POfD in this paper.

There is also another idea in which an agent learns using hybrid imitation learning and reinforcement learning reward[23, 24]. However, unlike this paper, they did not provide some theoretical support for their method and only explained some intuitive explanations.

=Background=

==Preliminaries==
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>:
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.
[[File:def1.png|500px|center]]
Then the performance of <math> \pi </math> can be rewritten to:
[[File:equ2.png|500px|center]]
At the same time, the authors propose a lemma:
[[File:lemma1.png|500px|center]]

==Problem Definition==
Generally, RL tasks and environments do not provide a comprehensive reward and instead rely on sparse feedback indicating whether the goal is reached.

In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where the i-th trajectory <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:
[[File:asp1.png|500px|center]]
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages.

=Method=

==Policy Optimization with Demonstration (POfD)==
[[File:ff1.png|500px|center]]
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]

==Benefits of Exploration with Demonstrations==
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,
[[File:them1.png|500px|center]]

In fact, POfD brings another factor, <math>D_{J S}^{max}(\pi_{i}, \pi_{E})</math>, that would fully use the advantage <math>{\hat \delta}</math>and add improvements with a margin over pure policy gradient methods.

==Optimization==

For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>：
[[File:them2.png|500px|center]]
Thus, the occupancy measure matching objective can be written as:
[[File:eqnlm.png|500px|center]]
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math> as the regularization term. Thus, the learning objective is:
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.

At the end, Algorithm 1 gives the detailed process.
[[File:pofd.png|500px|center]]

=Discussion on Existing LfD Methods=

==DQFD==
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.

==DDPGfD==
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.

=Experiments=

==Goal==
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare.

==Settings==
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment.
[[File:pofdt1.png|900px|center]]

==Baselines==
The authors compare POfD against 5 strong baselines:
* training the policy with TRPO [17] in dense environments, which is called expert
* training the policy with TRPO [17] in sparse environments
* applying GAIL [14] to learn the policy from demonstrations
* DQfD [2]
* DDPGfD [3]

==Results==
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].

[[File:pofdf2.png|500px|center]]

Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards and surpasses other strong baselines training the policy with TRPO. By watching the learning process of different methods, we can see that TRPO consistently fails to explore the environments when the feedback is sparse, except for HalfCheetah. This may be because there is no terminal state in HalfCheetah, thus a random agent can perform reasonably well as long as the time horizon is sufficiently long. This is shown in Figure3 where the improvement of TRPO begins to show after 400 iterations. DDPGfD and GAIL have common drawback: during training process, they both converge to the imperfect demonstration data. For HalfCheetah, GAIL fails to converge and DDPGfD converges to an even worse point. This situation is expected because the policy and value networks tend to over-fit when having few data, so the training process of GAIL and DDPGfD is severely biased by the imperfect data. Finally, our proposed method can effectively explore the environment with the help of demonstration-based intrinsic reward reshaping, and succeeds consistently across different tasks both in terms of learning stability and convergence speed.
[[File:pofdf3.png|900px|center]]

The authors also implement a locomotion task <math>Humanoid</math>, which teaches a human-like robot to walk. The state space of dimension is 376, which is very hard to render. As a result, POfD still outperformed all three baselike methods, as they failed to learn policies in such a sparse reward environment.

The reacher environment is a task that the target is to control a robot arm to touch an object. the location of the object is random for each instantiation. The authors select 15 random trajectories as demonstration data, and the performance of POfD is much better than the expert, while all other baseline methods failed.

=Conclusion=
In this paper, a method, POfD, is proposed that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learn better policies. The key contribution is that POfD helps the agent work with few and imperfect demonstrations in an environment with sparse rewards.

=Critique=
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.

=References=
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.

[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.

[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.

[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.

[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.

[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.

[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.

[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.

[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.

[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.

[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.

[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.

[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.

[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.

[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.

[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.

[23] Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramar, J., Hadsell, R., de Freitas, N., et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564, 2018.

[24] Li, Y., Song, J., and Ermon, S. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3815–3825, 2017.

policy optimization with demonstrations

2018-11-25T22:46:17Z

Vrajendr: /* Related Work */

= Introduction =

The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.

==Intuition==
The agent should imitate the demonstrated behavior when rewards are sparse and then explore new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL. At present the state of the art exploration in Reinforcement learning is simply epsilon greedy which just makes random moves for a small percentage of times to explore unexplored moves. This is very naive and is one of the main reasons for the high sample complexity in RL. On the other hand, if there is an expert demonstrator who can guide exploration, the agent can make more guided and accurate exploratory moves.

=Related Work =
There are some related works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.

For learning from demonstration (LfD),
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.

For imitation learning,
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.

Both of the above methods are effective for imitation learning, but cannot leverage the valuable feedback given by the environments and usually suffer from bad performance when the expert data is imperfect. That is different from POfD in this paper.

There is also another idea in which an agent learns using hybrid imitation learning and reinforcement learning reward[23, 24]. However, unlike this paper, they did not provide some theoretical support for their method and only explained some intuitive explanations.

=Background=

==Preliminaries==
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>:
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.
[[File:def1.png|500px|center]]
Then the performance of <math> \pi </math> can be rewritten to:
[[File:equ2.png|500px|center]]
At the same time, the authors propose a lemma:
[[File:lemma1.png|500px|center]]

==Problem Definition==
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where the i-th trajectory <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:
[[File:asp1.png|500px|center]]
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages.

=Method=

==Policy Optimization with Demonstration (POfD)==
[[File:ff1.png|500px|center]]
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]

==Benefits of Exploration with Demonstrations==
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,
[[File:them1.png|500px|center]]

In fact, POfD brings another factor, <math>D_{J S}^{max}(\pi_{i}, \pi_{E})</math>, that would fully use the advantage <math>{\hat \delta}</math>and add improvements with a margin over pure policy gradient methods.

==Optimization==

For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>：
[[File:them2.png|500px|center]]
Thus, the occupancy measure matching objective can be written as:
[[File:eqnlm.png|500px|center]]
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math> as the regularization term. Thus, the learning objective is:
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.

At the end, Algorithm 1 gives the detailed process.
[[File:pofd.png|500px|center]]

=Discussion on Existing LfD Methods=

==DQFD==
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.

==DDPGfD==
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.

=Experiments=

==Goal==
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare.

==Settings==
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment.
[[File:pofdt1.png|900px|center]]

==Baselines==
The authors compare POfD against 5 strong baselines:
* training the policy with TRPO [17] in dense environments, which is called expert
* training the policy with TRPO [17] in sparse environments
* applying GAIL [14] to learn the policy from demonstrations
* DQfD [2]
* DDPGfD [3]

==Results==
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].

[[File:pofdf2.png|500px|center]]

Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards and surpasses other strong baselines training the policy with TRPO. By watching the learning process of different methods, we can see that TRPO consistently fails to explore the environments when the feedback is sparse, except for HalfCheetah. This may be because there is no terminal state in HalfCheetah, thus a random agent can perform reasonably well as long as the time horizon is sufficiently long. This is shown in Figure3 where the improvement of TRPO begins to show after 400 iterations. DDPGfD and GAIL have common drawback: during training process, they both converge to the imperfect demonstration data. For HalfCheetah, GAIL fails to converge and DDPGfD converges to an even worse point. This situation is expected because the policy and value networks tend to over-fit when having few data, so the training process of GAIL and DDPGfD is severely biased by the imperfect data. Finally, our proposed method can effectively explore the environment with the help of demonstration-based intrinsic reward reshaping, and succeeds consistently across different tasks both in terms of learning stability and convergence speed.
[[File:pofdf3.png|900px|center]]

The authors also implement a locomotion task <math>Humanoid</math>, which teaches a human-like robot to walk. The state space of dimension is 376, which is very hard to render. As a result, POfD still outperformed all three baselike methods, as they failed to learn policies in such a sparse reward environment.

The reacher environment is a task that the target is to control a robot arm to touch an object. the location of the object is random for each instantiation. The authors select 15 random trajectories as demonstration data, and the performance of POfD is much better than the expert, while all other baseline methods failed.

=Conclusion=
In this paper, a method, POfD, is proposed that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learn better policies. The key contribution is that POfD helps the agent work with few and imperfect demonstrations in an environment with sparse rewards.

=Critique=
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.

=References=
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.

[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.

[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.

[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.

[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.

[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.

[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.

[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.

[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.

[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.

[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.

[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.

[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.

[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.

[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.

[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.

[23] Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramar, J., Hadsell, R., de Freitas, N., et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564, 2018.

[24] Li, Y., Song, J., and Ermon, S. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3815–3825, 2017.

policy optimization with demonstrations

2018-11-25T22:40:01Z

Vrajendr: /* Introduction */

= Introduction =

The reinforcement learning (RL) method has made significant progress in a variety of applications, but the exploration problems regarding how to gain more experience from novel policy to improve long-term performance are still challenges, especially in environments where reward signals are sparse and rare. There are currently two ways to solve such exploration problems in RL: 1) Guide the agent to explore the state that has never been seen. 2) Guide the agent to imitate the demonstration trajectory sampled from an expert policy to learn. When guiding the agent to imitate the expert behavior for learning, there are also two methods: putting the demonstration directly into the replay memory [1] [2] [3] or using the demonstration trajectory to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method based on demonstration (POfD) is proposed, which takes full advantage of the demonstration and there is no need to ensure that the expert policy is the optimal policy. In this paper, the authors evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experiments results show that the performance of POfD is greatly improved compared with some strong baselines and even to the policy gradient method in dense-reward environments.

==Intuition==
The agent should imitate the demonstrated behavior when rewards are sparse and then explore new states on its own after acquiring sufficient skills, which is a dynamic intrinsic reward mechanism that can be reshape in terms of the native rewards in RL. At present the state of the art exploration in Reinforcement learning is simply epsilon greedy which just makes random moves for a small percentage of times to explore unexplored moves. This is very naive and is one of the main reasons for the high sample complexity in RL. On the other hand, if there is an expert demonstrator who can guide exploration, the agent can make more guided and accurate exploratory moves.

=Related Work =
There are some related works in overcoming exploration difficulties by learning from demonstration [6] and imitation learning in RL.

For learning from demonstration (LfD),
# Most LfD methods adopt value-based RL algorithms, such as DQfD [2] that is applied into the discrete action spaces and DDPGfD [3] that is extends to the continuous spaces. But both of them underutilize the demonstration data.
#There are some methods based on policy iteration [7] [8], which shapes the value function by using demonstration data. But they get the bad performance when demonstration data is imperfect.
# A hybrid framework [9] that learns the policy in which the probability of taking demonstrated actions is maximized is proposed, which considers less demonstration data.
# A reward reshaping mechanism [10] that encourages taking actions close to the demonstrated ones is proposed. It is similar to the method in this paper, but there exists some differences as it is defined as a potential function based on multi-variate Gaussian to model the distribution of state-actions.
All of the above methods require a lot of perfect demonstrations to get satisfactory performance, which is different from POfD in this paper.

For imitation learning,
# Inverse Reinforce Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. But it cannot be extended to big-scale problems.
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair is from the expert or the learned policy and it can be applied into the high-dimensional continuous control problems.
Both of the above methods are effective for imitation learning, but they usually suffer the bad performance when the expert data is imperfect. That is different from POfD in this paper.

There is also another idea in which an agent learns using hybrid imitation learning and reinforcement learning reward[23, 24]. However, unlike this paper, they did not provide some theoretical support for their method and only explained some intuitive explanations.

=Background=

==Preliminaries==
Markov Decision Process (MDP) [15] is defined by a tuple <math>⟨S, A, P, r, \gamma⟩ </math>, where <math>S</math> is the state, <math>A </math> is the action, <math>P(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math>is the reward function, and <math> \gamma </math> is discounted factor between 0 and 1. Policy <math> \pi(a|s) </math> is a mapping from state to action, the performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>:
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(·,·)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(·,·)|s_0=s,a_0=a] </math>, and the advantage function that reflects the expected additional reward after taking action a at state s is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.
Then the authors define Occupancy measure, which is used to estimate the probability that state <math>s</math> and state action pairs <math>(s,a)</math> when executing a certain policy.
[[File:def1.png|500px|center]]
Then the performance of <math> \pi </math> can be rewritten to:
[[File:equ2.png|500px|center]]
At the same time, the authors propose a lemma:
[[File:lemma1.png|500px|center]]

==Problem Definition==
In this paper, the authors aim to develop a method that can boost exploration by leveraging effectively the demonstrations <math>D^E </math>from the expert policy <math> \pi_E </math> and maximize <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where the i-th trajectory <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the expert policy. In addition, there is an assumption on the quality of the expert policy:
[[File:asp1.png|500px|center]]
Moreover, it is not necessary to ensure that the expert policy is advantageous over all the policies. It is because that POfD will learn a better policy than expert policy by exploring on its own in later learning stages.

=Method=

==Policy Optimization with Demonstration (POfD)==
[[File:ff1.png|500px|center]]
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is Jensen-Shannon divergence between current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math> , <math>\lambda_1</math> is a trading-off parameter, and <math>\theta</math> is policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> to instead of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]

==Benefits of Exploration with Demonstrations==
The authors introduce the benefits of POfD. Firstly, we consider the expression of expected return in policy gradient methods [16].
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]
<math>\eta(\pi)</math>is the advantage over the policy πold in the previous iteration, so the expression can be rewritten by
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
The local approximation to <math>\eta(\pi)</math> up to first order is usually as the surrogate learning objective to be optimized by policy gradient methods due to the difficulties brought by complex dependency of <math>\rho_{\pi}(s)</math> over <math> \pi </math>:
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
The policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. For POfD, it imposes a regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> in order to encourage explorations around regions demonstrated by the expert policy. Theorem 1 shows such benefits,
[[File:them1.png|500px|center]]

In fact, POfD brings another factor, <math>D_{J S}^{max}(\pi_{i}, \pi_{E})</math>, that would fully use the advantage <math>{\hat \delta}</math>and add improvements with a margin over pure policy gradient methods.

==Optimization==

For POfD, the authors choose to optimize the lower bound of learning objective rather than optimizing objective. This optimization method is compatible with any policy gradient methods. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>：
[[File:them2.png|500px|center]]
Thus, the occupancy measure matching objective can be written as:
[[File:eqnlm.png|500px|center]]
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: S\times A \rightarrow (0,1)</math>, and its supremum ranging is like a discriminator for distinguishing whether the state-action pair is a current policy or an expert policy.
To avoid overfitting, the authors add causal entropy <math>−H (\pi_{\theta}) </math> as the regularization term. Thus, the learning objective is:
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{{D\in(0,1)}^{S\times A}} \mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\]
At this point, the problem has been like Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Then suppose D is parameterized by w. If it is from an expert policy, <math>D_w</math>is toward 1, otherwise it is toward 0. Thus, the minimax learning objective is:
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]
where <math> r'(s,a)=r(a,b)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.
The above objective can be optimized efficiently by alternately updating policy parameters θ and discriminator parameters w, then the gradient is given by:
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.

At the end, Algorithm 1 gives the detailed process.
[[File:pofd.png|500px|center]]

=Discussion on Existing LfD Methods=

==DQFD==
DQFD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQFD is:
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.

==DDPGfD==
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as DQFD. Its policy gradient is:
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]
From this equation, policy is updated relying on learned Q-network <math>Q_w </math>rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, thus they have the same way of leveraging demonstrations, that is the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.

=Experiments=

==Goal==
The authors aim at investigating 1) whether POfD can aid exploration by leveraging a few demonstrations, even though the demonstrations are imperfect. 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare.

==Settings==
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional spaces to high-dimensional spaces and naturally sparse environments based on OpenAI Gym [20] and Mujoco [5]. Due to the uniqueness of the environments, the authors introduce 4 ways to sparsity their built-in dense rewards. TYPE1: a reward of +1 is given when the agent reaches the terminal state, and otherwisely 0. TYPE2: a reward of +1 is given when the agent survives for a while. TYPE3: a reward of +1 is given for every time the agent moves forward over a specific number of units in Mujoco environments. TYPE4: specially designed for InvertedDoublePendulum, a reward +1 is given when the second pole stays above a specific height of 0.89. The details are shown in Table 1. Moreover, only one single imperfect trajectory is used as the demonstrations in this paper. The authors collect the demonstrations by training an agent insufficiently by running TRPO in the corresponding dense environment.
[[File:pofdt1.png|900px|center]]

==Baselines==
The authors compare POfD against 5 strong baselines:
* training the policy with TRPO [17] in dense environments, which is called expert
* training the policy with TRPO [17] in sparse environments
* applying GAIL [14] to learn the policy from demonstrations
* DQfD [2]
* DDPGfD [3]

==Results==
Firstly, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD successes to explore sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQFD [2] fail to explore and GAIL [14] converes to the imperfect demonstration in MountainCar [22].

[[File:pofdf2.png|500px|center]]

Then, the authors test the performance of POfD under spares environments with continuous actions space. From Figure 3, POfD achieves expert-level performance in terms of cumulated rewards and surpasses other strong baselines training the policy with TRPO. By watching the learning process of different methods, we can see that TRPO consistently fails to explore the environments when the feedback is sparse, except for HalfCheetah. This may be because there is no terminal state in HalfCheetah, thus a random agent can perform reasonably well as long as the time horizon is sufficiently long. This is shown in Figure3 where the improvement of TRPO begins to show after 400 iterations. DDPGfD and GAIL have common drawback: during training process, they both converge to the imperfect demonstration data. For HalfCheetah, GAIL fails to converge and DDPGfD converges to an even worse point. This situation is expected because the policy and value networks tend to over-fit when having few data, so the training process of GAIL and DDPGfD is severely biased by the imperfect data. Finally, our proposed method can effectively explore the environment with the help of demonstration-based intrinsic reward reshaping, and succeeds consistently across different tasks both in terms of learning stability and convergence speed.
[[File:pofdf3.png|900px|center]]

The authors also implement a locomotion task <math>Humanoid</math>, which teaches a human-like robot to walk. The state space of dimension is 376, which is very hard to render. As a result, POfD still outperformed all three baselike methods, as they failed to learn policies in such a sparse reward environment.

The reacher environment is a task that the target is to control a robot arm to touch an object. the location of the object is random for each instantiation. The authors select 15 random trajectories as demonstration data, and the performance of POfD is much better than the expert, while all other baseline methods failed.

=Conclusion=
In this paper, a method, POfD, is proposed that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback. It is compatible with any policy gradient methods. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experiments results have shown the validity and effectivity of POfD in encouraging the agent to explore around the nearby region of the expert policy and learn better policies. The key contribution is that POfD helps the agent work with few and imperfect demonstrations in an environment with sparse rewards.

=Critique=
# A novel demonstration-based policy optimization method is proposed. In the process of policy optimization, POfD reshapes the reward function. This new reward function can guide the agent to imitate the expert behavior when the reward is sparse and explore on its own when the reward value can be obtained, which can take full advantage of the demonstration data and there is no need to ensure that the expert policy is the optimal policy.
# POfD can be combined with any policy gradient methods. Its performance surpasses five strong baselines and can be comparable to the agents trained in the dense-reward environment.
# The paper is structured and the flow of ideas is easy to follow. For related work, the authors clearly explain similarities and differences among these related works.
# This paper's scalability is demonstrated. The experiments environments are ranging from low-dimensional spaces to high-dimensional spaces and from discrete action spaces to continuous actions spaces. For future work, can it be realized in the real world?
# There is a doubt that whether it is a correct method to use the trajectory that was insufficiently learned in dense-reward environment as the imperfect demonstration.
# In this paper, the performance only is judged by the cumulative reward, can other evaluation terms be considered? For example, the convergence rate.

=References=
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.

[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.

[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rotho ̈rl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Con- ference on, pp. 5026–5033. IEEE, 2012.

[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.

[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.

[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman resid- ual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 549–564. Springer, 2014.

[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Rein- forcement learning with few expert demonstrations. In NIPS workshop, 2016.

[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Tay- lor, M. E., and Nowe ́, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.

[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.

[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural informa- tion processing systems, pp. 1449–1456, 2008.

[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.

[14] Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In Advances in Neural Information Processing Sys- tems, pp. 4565–4573, 2016.

[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.

[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.

[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.

[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.

[23] Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramar, J., Hadsell, R., de Freitas, N., et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564, 2018.

[24] Li, Y., Song, J., and Ermon, S. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3815–3825, 2017.

Visual Reinforcement Learning with Imagined Goals

2018-11-25T22:38:57Z

Vrajendr: /* Algorithm */

Video and details of this work is available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent can also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some scholars proposed time-varying models which require episodic setups. There are also scholars that propose an approach that uses goal images, but it requires instrumented training simulations. There is no example that uses model-free RL that learns policies to train on real-world robotic systems without having ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabeling, which improves sample efficiency. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations of RL. In these methods, the learned representation is used as a substitute for the state for the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy, that when given a state and goal, can dictate the optimal action. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q function Q(s,a,g) tells us how good an action a is, given the current state s and goal g. For example, a Q function tells us, “How good is it to move my hand up (action a), if I’m holding a plate (state s) and want to put the plate on the table (goal g)?” Once this Q function is trained, a goal-conditioned policy can be obtained by performing the following optimization

[[File:policy-extraction.png|center|600px]]

which effectively says, “choose the best action according to this Q function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.

The reason why Q learning is popular is that in can be train in an off-policy manner. Therefore, the only things Q function needs are samples of state, action, next state, goal, and reward: (s,a,s′,g,r). This data can be collected by any policy and can be reused across multiples tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

The main drawback in this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data are available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling are usually used to get state-action-next-state data, (s,a,s′). However, if the reward function r(s,g) can be accessed, one can retroactively relabeled goals and recompute rewards. In this way, more data can be artificially generated given a single (s,a,s′) tuple. So, the training procedure can be modified like so:

[[File:qlr.png|center|600px]]

This goal resampling makes it possible to simultaneously learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution p(g). When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. Images are noisy. A large amount of information in an image that may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution p(g) is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder (VAE)=
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations x, like images, into low-dimensional latent variables z, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image x and goal image xg can be converted into latent variables z and zg, respectively. These latent variables can then be used to represent ate the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (x) and goal image (xg) into a latent space and use distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal zg.

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from the generative model, and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE prior is trained by real images, meaningful latent goals can be sampled from the latent variable prior. This will help the agent set its own goals and practice towards them if no goal is provided at test time.

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]

The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process involves starts with collecting data through a simple exploration policy. Possible alternative explorations could be employed here including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>zg</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

The data is first collected via a simple exploration policy and then train a VAE latent variable model on state observations and then fine tune over the course of training. When the goal-conditioned value function is trained, the authors resample prior goals and compute rewards in the latent space.

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies are also varied while fixing other components of the algorithm to study the effect of relabeling strategy. In this experiment, the RIG, which is an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.

Finally, the authors tested the RIG in a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by settings its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of time of training, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is
pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true position during training, so they used test episode returns as the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The author suggests that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. a procedure that is not only goal-oriented, but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to create general-purpose agents that can continuously and efficiently acquire skill. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with variable number of objects. It is interesting to see whether the RIG can be scaled up to solve these tasks. [10] A new paper was published last week that built on the framework of goal conditioned Reinforcement Learning to extract state representations based on the actions required to reach them, which is abbreviated ARC for actionable representation for control.

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the output of robotic arm's position is not stable during training and test time. It is likely that the encoder reduces the image features too much so that the images in the latent space are too blury to be used goal images. It would be better if this can be investigated in future. It would be better, if a method is investigated with multiple data sources, and the agent is trained to choose that source which has more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in Visual Multi-Object Pusher experiment, the relative positions of two pucks do not correspond well with the relative positions of two pucks in goal images. The same situation is also observed in Variable-object experiment. We may guess that the more information contain in a image, the less likely the robot will perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize on the position of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robots spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric
Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by
Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems
(NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan
Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International
Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International
Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew
Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement
learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning
Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey
Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888,
2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted
Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

File:algorithm1.png

2018-11-25T22:34:10Z

Vrajendr: Vrajendr uploaded a new version of File:algorithm1.png

Visual Reinforcement Learning with Imagined Goals

2018-11-25T22:33:15Z

Vrajendr:

Video and details of this work is available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent can also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some scholars proposed time-varying models which require episodic setups. There are also scholars that propose an approach that uses goal images, but it requires instrumented training simulations. There is no example that uses model-free RL that learns policies to train on real-world robotic systems without having ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabeling, which improves sample efficiency. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations of RL. In these methods, the learned representation is used as a substitute for the state for the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy, that when given a state and goal, can dictate the optimal action. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q function Q(s,a,g) tells us how good an action a is, given the current state s and goal g. For example, a Q function tells us, “How good is it to move my hand up (action a), if I’m holding a plate (state s) and want to put the plate on the table (goal g)?” Once this Q function is trained, a goal-conditioned policy can be obtained by performing the following optimization

[[File:policy-extraction.png|center|600px]]

which effectively says, “choose the best action according to this Q function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.

The reason why Q learning is popular is that in can be train in an off-policy manner. Therefore, the only things Q function needs are samples of state, action, next state, goal, and reward: (s,a,s′,g,r). This data can be collected by any policy and can be reused across multiples tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

The main drawback in this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data are available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling are usually used to get state-action-next-state data, (s,a,s′). However, if the reward function r(s,g) can be accessed, one can retroactively relabeled goals and recompute rewards. In this way, more data can be artificially generated given a single (s,a,s′) tuple. So, the training procedure can be modified like so:

[[File:qlr.png|center|600px]]

This goal resampling makes it possible to simultaneously learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution p(g). When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. Images are noisy. A large amount of information in an image that may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution p(g) is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder (VAE)=
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations x, like images, into low-dimensional latent variables z, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image x and goal image xg can be converted into latent variables z and zg, respectively. These latent variables can then be used to represent ate the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (x) and goal image (xg) into a latent space and use distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal zg.

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from the generative model, and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE prior is trained by real images, meaningful latent goals can be sampled from the latent variable prior. This will help the agent set its own goals and practice towards them if no goal is provided at test time.

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]

The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process involves starts with collecting data through a simple exploration policy. Possible alternative explorations could be employed here including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>zg</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies are also varied while fixing other components of the algorithm to study the effect of relabeling strategy. In this experiment, the RIG, which is an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.

Finally, the authors tested the RIG in a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by settings its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of time of training, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is
pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true position during training, so they used test episode returns as the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The author suggests that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. a procedure that is not only goal-oriented, but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to create general-purpose agents that can continuously and efficiently acquire skill. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with variable number of objects. It is interesting to see whether the RIG can be scaled up to solve these tasks. [10] A new paper was published last week that built on the framework of goal conditioned Reinforcement Learning to extract state representations based on the actions required to reach them, which is abbreviated ARC for actionable representation for control.

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the output of robotic arm's position is not stable during training and test time. It is likely that the encoder reduces the image features too much so that the images in the latent space are too blury to be used goal images. It would be better if this can be investigated in future. It would be better, if a method is investigated with multiple data sources, and the agent is trained to choose that source which has more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in Visual Multi-Object Pusher experiment, the relative positions of two pucks do not correspond well with the relative positions of two pucks in goal images. The same situation is also observed in Variable-object experiment. We may guess that the more information contain in a image, the less likely the robot will perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize on the position of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robots spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric
Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by
Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems
(NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan
Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International
Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International
Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew
Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement
learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning
Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey
Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888,
2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted
Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

Visual Reinforcement Learning with Imagined Goals

2018-11-25T22:29:12Z

Vrajendr: /* Related Work */

Video and details of this work is available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent can also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some scholars proposed time-varying models which require episodic setups. There are also scholars that propose an approach that uses goal images, but it requires instrumented training simulations. There is no example that uses model-free RL that learns policies to train on real-world robotic systems without having ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabeling, which improves sample efficiency. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations of RL. In these methods, the learned representation is used as a substitute for the state for the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy, that when given a state and goal, can dictate the optimal action. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q function Q(s,a,g) tells us how good an action a is, given the current state s and goal g. For example, a Q function tells us, “How good is it to move my hand up (action a), if I’m holding a plate (state s) and want to put the plate on the table (goal g)?” Once this Q function is trained, a goal-conditioned policy can be obtained by performing the following optimization

[[File:policy-extraction.png|center|600px]]

which effectively says, “choose the best action according to this Q function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.

The reason why Q learning is popular is that in can be train in an off-policy manner. Therefore, the only things Q function needs are samples of state, action, next state, goal, and reward: (s,a,s′,g,r). This data can be collected by any policy and can be reused across multiples tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

The main drawback in this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data are available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling are usually used to get state-action-next-state data, (s,a,s′). However, if the reward function r(s,g) can be accessed, one can retroactively relabeled goals and recompute rewards. In this way, more data can be artificially generated given a single (s,a,s′) tuple. So, the training procedure can be modified like so:

[[File:qlr.png|center|600px]]

This goal resampling makes it possible to simultaneously learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution p(g). When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. Images are noisy. A large amount of information in an image that may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution p(g) is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder (VAE)=
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations x, like images, into low-dimensional latent variables z, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image x and goal image xg can be converted into latent variables z and zg, respectively. These latent variables can then be used to represent ate the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (x) and goal image (xg) into a latent space and use distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal zg.

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from the generative model, and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE prior is trained by real images, meaningful latent goals can be sampled from the latent variable prior. This will help the agent set its own goals and practice towards them if no goal is provided at test time.

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]

The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process involves starts with collecting data through a simple exploration policy. Possible alternative explorations could be employed here including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>zg</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies are also varied while fixing other components of the algorithm to study the effect of relabeling strategy. In this experiment, the RIG, which is an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.

Finally, the authors tested the RIG in a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by settings its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of time of training, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is
pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true position during training, so they used test episode returns as the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The author suggests that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. a procedure that is not only goal-oriented, but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to create general-purpose agents that can continuously and efficiently acquire skill. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with variable number of objects. It is interesting to see whether the RIG can be scaled up to solve these tasks. [10] A new paper was published last week that built on the framework of goal conditioned Reinforcement Learning to extract state representations based on the actions required to reach them, which is abbreviated ARC for actionable representation for control.

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the output of robotic arm's position is not stable during training and test time. It is likely that the encoder reduces the image features too much so that the images in the latent space are too blury to be used goal images. It would be better if this can be investigated in future. It would be better, if a method is investigated with multiple data sources, and the agent is trained to choose that source which has more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in Visual Multi-Object Pusher experiment, the relative positions of two pucks do not correspond well with the relative positions of two pucks in goal images. The same situation is also observed in Variable-object experiment. We may guess that the more information contain in a image, the less likely the robot will perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize on the position of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robots spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric
Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by
Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems
(NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan
Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International
Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International
Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew
Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement
learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning
Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey
Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888,
2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted
Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

Unsupervised Neural Machine Translation

2018-11-25T22:23:28Z

Vrajendr: /* Experiments and Results */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Open source implementation of this paper is available [https://github.com/artetxem/undreamt here]

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monolingual corpora without any alignment between sentences or documents. Monolingual corpora are text corpora that are made up of one language only. This contrasts with the usual Supervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however, in reality, there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data. The proposed method builds upon the work done recently on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpora to vector embeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering technique to induce a machine translation model from monolingual data, which is similar to the noisy-channel model used by SMT(Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext and model the generation process of the ciphertext as a two-stage process including the generation of the original English sequence and the probabilistic replacement of the words in it. This approach is able to take the advantage of the incorporation of syntactic knowledge of the languages. It shows that word embeddings implementation improves statistical decipherment in machine translation.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do neural machine translation(NMT). Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. For example, we want to translate German into Russian, but little direct-source for these two languages, we can use English as an intermediate language(German-English and English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs which have no direct data was given.

Other works use monolingual data in combination with scarce parallel corpora. Creating a synthetic parallel corpus by backtranslating a monolingual corpus in the target language is one of simple but effective approach.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units while the dimensionality of the embeddings is set to 300. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

The results show that backtranslation is essential for the proposed system to work properly. The denoising technique alone is below the baseline while big improvements appear when introducing backtranslation.

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise, back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detract in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs plus the UN and the Gigaword corpus for French- English. Moreover, the authors use the same subsets of News Commentary alone to run the separate experiments in order to compare with the semi-supervised scenario.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest to use larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations, which shows that the high-quality translations are produces by the proposed system, and this system adequately models non-trivial translation relations. Example 1 and 2 show that the model is able to not only go beyond a literal word-by-word substitution but also model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high-quality translations of long and more complex sentences. However, in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures, which means that the proposed system has limitations. Specially, the authors points that the proposed model has difficulties to preserve some concrete details from source sentences.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis just checks whether their proposed unsupervised NMT generates sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.

* (As pointed out by an annonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"

Unsupervised Neural Machine Translation

2018-11-25T22:20:45Z

Vrajendr: /* Experiments and Results */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Open source implementation of this paper is available [https://github.com/artetxem/undreamt here]

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monolingual corpora without any alignment between sentences or documents. Monolingual corpora are text corpora that are made up of one language only. This contrasts with the usual Supervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however, in reality, there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data. The proposed method builds upon the work done recently on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpora to vector embeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering technique to induce a machine translation model from monolingual data, which is similar to the noisy-channel model used by SMT(Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext and model the generation process of the ciphertext as a two-stage process including the generation of the original English sequence and the probabilistic replacement of the words in it. This approach is able to take the advantage of the incorporation of syntactic knowledge of the languages. It shows that word embeddings implementation improves statistical decipherment in machine translation.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do neural machine translation(NMT). Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. For example, we want to translate German into Russian, but little direct-source for these two languages, we can use English as an intermediate language(German-English and English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs which have no direct data was given.

Other works use monolingual data in combination with scarce parallel corpora. Creating a synthetic parallel corpus by backtranslating a monolingual corpus in the target language is one of simple but effective approach.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units while the dimensionality of the embeddings is set to 300. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise, back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detract in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs plus the UN and the Gigaword corpus for French- English. Moreover, the authors use the same subsets of News Commentary alone to run the separate experiments in order to compare with the semi-supervised scenario.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest to use larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations, which shows that the high-quality translations are produces by the proposed system, and this system adequately models non-trivial translation relations. Example 1 and 2 show that the model is able to not only go beyond a literal word-by-word substitution but also model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high-quality translations of long and more complex sentences. However, in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures, which means that the proposed system has limitations. Specially, the authors points that the proposed model has difficulties to preserve some concrete details from source sentences.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis just checks whether their proposed unsupervised NMT generates sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.

* (As pointed out by an annonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"

Unsupervised Neural Machine Translation

2018-11-25T22:16:16Z

Vrajendr: /* LOW-RESOURCE NEURAL MACHINE TRANSLATION */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Open source implementation of this paper is available [https://github.com/artetxem/undreamt here]

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monolingual corpora without any alignment between sentences or documents. Monolingual corpora are text corpora that are made up of one language only. This contrasts with the usual Supervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however, in reality, there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data. The proposed method builds upon the work done recently on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpora to vector embeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering technique to induce a machine translation model from monolingual data, which is similar to the noisy-channel model used by SMT(Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext and model the generation process of the ciphertext as a two-stage process including the generation of the original English sequence and the probabilistic replacement of the words in it. This approach is able to take the advantage of the incorporation of syntactic knowledge of the languages. It shows that word embeddings implementation improves statistical decipherment in machine translation.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do neural machine translation(NMT). Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. For example, we want to translate German into Russian, but little direct-source for these two languages, we can use English as an intermediate language(German-English and English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs which have no direct data was given.

Other works use monolingual data in combination with scarce parallel corpora. Creating a synthetic parallel corpus by backtranslating a monolingual corpus in the target language is one of simple but effective approach.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units while the dimensionality of the embeddings is set to 300. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise, back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detract in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs plus the UN and the Gigaword corpus for French- English. Moreover, the authors use the same subsets of News Commentary alone to run the separate experiments in order to compare with the semi-supervised scenario.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest to use larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations, which shows that the high-quality translations are produces by the proposed system, and this system adequately models non-trivial translation relations. Example 1 and 2 show that the model is able to not only go beyond a literal word-by-word substitution but also model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high-quality translations of long and more complex sentences. However, in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures, which means that the proposed system has limitations. Specially, the authors points that the proposed model has difficulties to preserve some concrete details from source sentences.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis just checks whether their proposed unsupervised NMT generates sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.

* (As pointed out by an annonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"

Unsupervised Neural Machine Translation

2018-11-25T22:15:46Z

Vrajendr: /* STATISTICAL DECIPHERMENT FOR MACHINE TRANSLATION */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Open source implementation of this paper is available [https://github.com/artetxem/undreamt here]

= Introduction =
The paper presents an unsupervised Neural Machine Translation(NMT) method to machine translation using only monolingual corpora without any alignment between sentences or documents. Monolingual corpora are text corpora that are made up of one language only. This contrasts with the usual Supervised NMT approach that uses parallel corpora, where two corpora are the direct translation of each other and the translations are aligned by words or sentences. This problem is important as NMT often requires large parallel corpora to achieve good results, however, in reality, there are a number of languages that lack parallel pairing, e.g. for German-Russian.

Other authors have recently tried to address this problem as well as semi-supervised approaches but these methods still require a strong cross-lingual signal. The proposed method eliminates the need for a cross-lingual information, relying solely on monolingual data. The proposed method builds upon the work done recently on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn source and target word embeddings.
# Align the 2 sets of word embeddings in the same latent space.
Then iteratively perform:
# Train an encoder-decoder to reconstruct noisy versions of sentence embeddings for both source and target language, where the encoder is shared and the decoder is different in each language.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpora to vector embeddings. These embeddings have been shown to contain the contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to do cross-lingual embedding aligning in an unsupervised manner and without parallel data. Without going into the details, the general approach of this paper is starting from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.), to iteratively learn the mapping between 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration.

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering technique to induce a machine translation model from monolingual data, which is similar to the noisy-channel model used by SMT(Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext and model the generation process of the ciphertext as a two-stage process including the generation of the original English sequence and the probabilistic replacement of the words in it. This approach is able to take the advantage of the incorporation of syntactic knowledge of the languages. It shows that word embeddings implementation improves statistical decipherment in machine translation.

====LOW-RESOURCE NEURAL MACHINE TRANSLATION====
There are also proposals that use techniques other than direct parallel corpora to do neural machine translation(NMT). Some use a third intermediate language that is well connected to 2 other languages that otherwise have little direct resources. For example, we want to translate German into Russian, but little direct-source for these two languages, we can use English as an intermediate language(German-English and English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs which have no direct data was given.

Other works use monolingual data in combination with scarce parallel corpora. Creating a synthetic parallel corpus by backtranslating a monolingual corpus in the target language is one of simple but effective approach.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start, while our paper does not use parallel data.

= Methodology =

The corpora data is first processed in a standard way to tokenize and case the words. The authors also experiment with an additional way of translation using Byte-Pair Encoding(BPE) [Sennrich, 2016], where the translation is done by sub-words instead of words. BPE is often used to improve rare-word translations. To test the effectiveness of BPE, they limited the vocabulary to the most frequent 50,000 BPE tokens.

The words or BPEs are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2 layer RNN. All RNNs use GRU cells with 600 hidden units while the dimensionality of the embeddings is set to 300. The encoder is shared by the source and target language, while the decoder is different by language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT usually are built for one direction translations English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This requires existing unsupervised methods to create embeddings using monolingual corpora as discussed in the background.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===

Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both
languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works to reconstruct a noisy version of the same language back to the original sentence. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>,
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function is to perform <math>N/2</math> random swaps of words that are near each other, where <math>N</math> is the number of words in the sentence.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding again using the source decoder to reconstruct a the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L1,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
During decoding, greedy decoding was used at training time for back-translation, but actual inference at test time was done using beam-search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy(BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.

The paper trains translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table1, Unsupervised results compared to the baseline of word-by-word results are strong, with improvement between 40% to 140%. Results also show that back-translation is essential. Denoising doesn't show a big improvement however it is required for back-translation, because otherwise, back-translation would translate nonsensical sentences.

For the BPE experiment, results show it helps in some language pairs but detract in some other language pairs. This is because while BPE helped to translate some rare words, it increased the error rates in other words.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that semi-supervised in row 6 outperforms supervised in row 7, one possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set is all domains of corpora.

===Supervised===

This setting provides an upper bound to the unsupervised proposed system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs plus the UN and the Gigaword corpus for French- English. Moreover, the authors use the same subsets of News Commentary alone to run the separate experiments in order to compare with the semi-supervised scenario.

The Comparable NMT was trained using the same proposed model except it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state of the NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest to use larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations, which shows that the high-quality translations are produces by the proposed system, and this system adequately models non-trivial translation relations. Example 1 and 2 show that the model is able to not only go beyond a literal word-by-word substitution but also model structural differences in the languages (ex.e, it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport", and it is capable of producing high-quality translations of long and more complex sentences. However, in Example 3 and 4, the system failed to translate the months and numbers correctly and having difficulty with comprehending odd sentence structures, which means that the proposed system has limitations. Specially, the authors points that the proposed model has difficulties to preserve some concrete details from source sentences.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system and training using denoise and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in our manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used creates a limitation and a ceiling for performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages(English and French), and does much worse for English - German, even though English and German are also closely related (but less so than English and French) which suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis just checks whether their proposed unsupervised NMT generates sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.

* (As pointed out by an annonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW])Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson,et al, "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"

Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction

2018-11-22T01:33:38Z

Vrajendr: /* Proof Sketch for Theorem 1 */

The paper ''Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction'' was written by Roei Herzig* from Tel Aviv University, Moshiko Raboh* from Tel Aviv University, Gal Chechik from Google Brain, Bar-Ilan University, Jonathan Berant from Tel Aviv University, and Amir Globerson from Tel Aviv University. This paper is part of the NIPS 2018 conference to be hosted in December 2018 at Montréal, Canada. This paper summary is based on version 3 of the pre-print (as of May 2018) obtained from [https://arxiv.org/pdf/1802.05451v3.pdf arXiv]

(*) Equal contribution

=Motivation=
In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there is a disconnect for what guidelines should be used when leveraging deep learning. This paper introduces a design principle for such models that stem from the concept of permutation invariance and proves state of the art performance on models that follow this principle.

The primary contributions that this paper makes include:
# Deriving sufficient and necessary conditions for respecting graph-permutation invariance in deep structured prediction architectures
# Empirically proving the benefit of graph-permutation invariance
# Developing a state-of-the-art model for scene graph predictions over a large set of complex visual scenes

=Introduction=
In order for a machine to interpret complex visual scenes, it must recognize and understand both objects and relationships between the objects in the scene. A '''scene graph''' is a representation of the set of objects and relations that exist in the scene, where objects are represented as nodes and relations are represented as edges connecting the different nodes. Hence, the prediction of the scene graph is analogous to inferring the joint set of objects and relations of a visual scene.

[[File:scene_graph_example.png|600px|center]]

Given that objects in scenes are interdependent on each other, joint prediction of the objects and relations is necessary. The field of structured prediction, which involves the general problem of inferring multiple inter-dependent labels, is of interest for this problem.

In structured prediction models, a score function <math>s(x, y)</math> is defined to evaluate the compatibility between label <math>y</math> and input <math>x</math>. For instance, when interpreting the scene of an image, <math>x</math> refers to the image itself, and <math>y</math> refers to a complex label, which contains both the objects and the relations between objects. As with most other inference methods, the goal is to find the label <math>y*</math> such that <math>s(x,y)</math> is maximized. However, the major concern is that the space for possible label assignments grows exponentially with respect to input size. For example, although an image may seem very simple, the corpus containing possible labels for objects may be very large, rendering it difficult to optimize the scoring function.

The paper presents an alternative approach, for which input <math>x</math> is mapped to structured output <math>y</math> using a "black box" neural network, omitting the definition of a score function. The main concern for this approach is the determination of the network architecture.

=Structured prediction=
This paper further considers structured predictions using score-based methods. For structured predictions that follow a score-based approach, a score function <math>s(x, y)</math> is used to measure how compatible label <math>y</math> is for input <math>x</math>. To optimize the score function, previous works have decomposed <math>s(x,y) = \sum_i f_i(x,y)</math> in order to facilitate efficient optimization for each local score function, <math>\max_y f_i(x,y)</math>.

In the area of structured predictions, the most commonly-used score functions include the singleton score function <math>f_i(y_i, x)</math> and pairwise score function <math>f_{ij} (y_i, y_j, x)</math>. Previous works explored a two-stage architectures (learn local scores independently of the structured prediction goal), and end-to-end architectures (to include the inference algorithm within the computation graph).

==Advantages of using score-based methods==
# Allow for intuitive specification of local dependencies between labels, and how they map to global dependencies
# Linear score functions offer natural convex surrogates
# Inference in large label space is sometimes possible via exact algorithms or empirically accurate approximations

The concern for modelling score functions using deep networks is that learning may no longer be convex. Hence, the paper presents properties for how deep networks can be used for structured predictions by considering architectures that do not require explicit maximization of a score function.

=Background, Notations, and Definitions=
We denote <math>y</math> as a structured label where <math>y = [y_1, \dots, y_n]</math>

'''Score functions:''' for score-based methods, the score is defined as either the sum of a set of singleton scores <math>f_i = f_i(y_i, x)</math> or the sum of pairwise scores <math>f_{ij} = f_{ij}(y_i, y_j, x)</math>.

Let <math>s(x,y)</math> be the score of a score-based method. Then:

<div align="center">
<math>s(x,y) = \begin{cases}
\sum_i f_i ~ \text{if we have a set of singleton scores}\\
\sum_{ij} f_{ij} ~ \text{if we have a set of pairwise scores } \\
\end{cases}</math>
</div>

'''Inference algorithm:''' an inference algorithm takes input set of local scores (either <math>f_i</math> or <math>f_{ij}</math>) and outputs an assignment of labels <math>y_1, \dots, y_n</math> that maximizes score function <math>s(x,y)</math>

'''Graph labeling function:''' a graph labeling function <math>\mathcal{F} : (V,E) \rightarrow Y</math> is a function that takes input of: an ordered set of node features <math>V = [z_1, \dots, z_n]</math> and an ordered set of edge features <math>E = [z_{1,2},\dots,z_{i,j},\dots,z_{n,n-1}]</math> to output set of node labels <math>\mathbf{y} = [y_1, \dots, y_n]</math>. For instance, <math>z_i</math> can be set equal to <math>f_i</math> and <math>z_{ij}</math> can be set equal to <math>f_{ij}</math>.

For convenience, the joint set of nodes and edges will be denoted as <math>\mathbf{z}</math> to be a size <math>n^2</math> vector (<math>n</math> nodes and <math>n(n-1)</math> edges).

'''Permutation:''' Let <math>z</math> be a set of node and edge features. Given a permutation <math>\sigma</math> of <math>\{1,\dots,n\}</math>, let <math>\sigma(z)</math> be a new set of node and edge features given by [<math>\sigma(z)]_i = z_{\sigma(i)}</math> and <math>[\sigma(z)]_{i,j} = z_{\sigma(i), \sigma(j)}</math>

'''One-hot representation:''' <math>\mathbf{1}[j]</math> be a one-hot vector with 1 in the <math>j^{th}</math> coordinate

=Permutation-Invariant Structured prediction=

With permutation-invariant structured prediction, we would expect the algorithm to produce the same result given the same score function. For instance, consider the case where we have label space for 3 variables <math>y_1, y_2, y_3</math> with input <math>\mathbf{z} = (f_1, f_2, f_3, f_{12}, f_{13}, f_{23})</math> that outputs label <math>\mathbf{y} = (y_1^*, y_2^*, y_3^*)</math>. Then if the algorithm is run on a permuted version input <math>z' = (f_2, f_1, f_3, f_{21}, f_{23}, f_{13})</math>, we would expect <math>\mathbf{y} = (y_2^*, y_1^*, y_3^*)</math> given the same score function.

'''Graph permutation invariance (GPI):''' a graph labeling function <math>\mathcal{F}</math> is graph-permutation invariant, if for all permutations <math>\sigma</math> of <math>\{1, \dots, n\}</math> and for all nodes <math>z</math>, <math>\mathcal{F}(\sigma(\mathbf{z})) = \sigma(\mathcal{F}(\mathbf{z}))</math>

The paper presents a theorem on the necessary and sufficient conditions for a function <math>\mathcal{F}</math> to be graph permutation invariant. Intuitively, because <math>\mathcal{F}</math> is a function that takes an ordered set <math>z</math> as input, the output on <math>\mathbf{z}</math> could very well be different from <math>\sigma(\mathbf{z})</math>, which means <math>\mathcal{F}</math> needs to have some sort of symmetry in order to sustain <math>[\mathcal{F}(\sigma(\mathbf{z}))]]_k = [\mathcal{F}(\mathbf{z})]_{\sigma(k)}</math>.

[[File:graph_permutation_invariance.jpg|400px|center]]

==Theorem 1==
Let <math>\mathcal{F}</math> be a graph labeling function. Then <math>\mathcal{F}</math> is graph-permutation invariant if and only if there exist functions <math>\alpha, \rho, \phi</math> such that for all <math>k=1, .., n</math>:
\begin{align}
[\mathcal{F}(\mathbf{z})]_k = \rho(\mathbf{z}_k, \sum_{i=1}^n \alpha(\mathbf{z}_i, \sum_{i\neq j} \phi(\mathbf{z}_i, \mathbf{z}_{i,j}, \mathbf{z}_j)))
\end{align}
where <math>\phi: \mathbb{R}^{2d+e} \rightarrow \mathbb{R}^L, \alpha: \mathbb{R}^{d + L} \rightarrow \mathbb{R}^{W}, p: \mathbb{R}^{W+d} \rightarrow \mathbb{R}</math>.

Notice that for the dimensions of inputs and outputs, <math>d</math> refers to the number of singleton features in <math>z</math> and <math>e</math> refers to the number of edges.

[[File:GPI_architecture.jpg|thumb|A schematic representation of the GPI architecture. Singleton features <math>z_i</math> are omitted for simplicity. First, the features <math>z_{i,j}</math> are processed element-wise by <math>\phi</math>. Next, they are summed to create a vector <math>s_i</math>, which is concatenated with <math>z_i</math>. Third, a representation of the entire graph is created by applying <math>\alpha\ n</math> times and summing the created vector. The graph representation is then finally processed by <math>\rho</math> together with <math>z_k</math>.|600px|center]]

==Proof Sketch for Theorem 1==
The proof of this theorem can be found in the paper. A proof sketch is provided below:

'''For the forward direction''' (function that follows the form set out in equation (1) is GPI):
# Using definition of permutation <math>\sigma</math>, and rewriting <math>[F(z)]_{\sigma(k)}</math> in the form from equation (1)
# Second argument of <math>\rho</math> is invariant under <math>\sigma</math>, since it takes the sum of all indices <math>i</math> and all other indices <math>j \neq i </math>.

'''For the backward direction''' (any black-box GPI function can be expressed in the form of equation 1):
# Construct <math>\phi, \alpha</math> such that second argument of <math>\rho</math> contains all information about graph features of <math>z</math>, including edges that the features originate from
# Assume each <math>z_k</math> uniquely identifies the node and <math>\mathcal{F}</math> is a function only of pairwise features <math>z_{i,j}</math>
# Construct <math>H</math> be a perfect hash function with <math>L</math> buckets, and <math>\phi</math> which maps '''pairwise features''' to a vector of size <math>L</math>
# <math>*</math>Construct <math>\phi(z_i, z_{i,j}, z_j) = \mathbf{1}[H(z_j)] z_{i,j}</math>, which intuitively means that <math>\phi</math> stores <math>z_{i,j}</math> in the unique bucket for node <math>j</math>
# Construct function <math>\alpha</math> to output a matrix <math>\mathbb{R}^{L \times L}</math> that maps each pairwise feature into unique positions (<math>\alpha(z_i, s_i) = \mathbf{1}[H(z_i)]s_i^T</math>)
# Construct matrix <math>M = \sum_i \alpha(z_i,s_i)</math> by discarding rows/columns in <math>M</math> that do not correspond to original nodes (which reduces dimension to <math>n\times n</math>; set <math>\rho</math> to have same outcome as <math>\mathcal{F}</math>, and set the output of <math>\mathcal{F}</math> on <math>M</math> to be the labels <math>\mathbf{y} = y_1, \dots, y_n</math>

<math>*</math>The paper presents the proof for the edge features <math>z_{ij}</math> being scalar (<math>e = 1</math>) for simplicity, which can be extended easily to vectors with additional indexing.

Although the results discussed previously apply to complete graphs (edges apply to all feature pairs), it can be easily extended to incomplete graphs. However, in place of permutation-invariance, it is now an automorphism-invariance.

==Implications and Applications of Theorem 1==
===Key Implications of Theorem 1===
# Architecture "collects" information from the different edges of the graph, and does so in an invariant fashion using <math>\alpha</math> and <math>\phi</math>
# Architecture is parallelizable, since all <math>\phi</math> functions can be applied simultaneously

===Some applications of Theorem 1===
# '''Attention:''' the concept of attention can be implemented in the GPI characterization, with slight alterations to the functions <math>\alpha</math> and <math>\phi</math>. In attention each node aggregates features of neighbours through a function of neighbour's relevance. Which means the lable of an entity could depend strongly on its close entity. The complete details can be found in the supplementary materials of the paper.

# '''RNN:''' recurrent architectures can maintain GPI property, since all GPI function <math>\mathcal{F}</math> are closed under composition. The output of one step after running <math>\mathcal{F}</math> will act as input for the next step, but maintain the GPI property throughout.

=Related Work=
# '''Architectural invariance:''' suggested recently in a 2017 paper called Deep Sets by Zaheer et al., which considers the case of invariance that is more restrictive.
# '''Deep structured prediction:''' previous work applied deep learning to structured prediction, for instance, semantic segmentation. Some algorithms include message passing algorithms, gradient descent for maximizing score functions, greedy decoding (inference of labels based on time of previous labels). Apart from those algorithms, deep learning has been applied to other graph-based problems such as the Travelling Salesman Problem (Bello et al., 2016; Gilmer et al., 2017; Khalil et al., 2017). However, none of the previous work specifically address the notion of invariance in the general architecture, but rather focus on message passing architectures that can be generalized by this paper.
# '''Scene graph prediction:''' scene graph extraction allows for reasoning, question answering, and image retrieval (Johnson et al., 2015; Lu et al., 2016; Raposo et al., 2017). Some other works in this area include object detection, action recognition, and even detection of human-object interactions (Liao et al., 2016; Plummer et al., 2017). Additional work has been done with the use of message passing algorithms (Xu et al., 2017), word embeddings (Lu et al., 2016), and end-to-end prediction directly from pixels (Newell & Deng, 2017). A notable mention is NeuralMotif (Zellers et al., 2017), which the authors describe as the current state-of-the-art model for scene graph predictions on Visual Genome dataset.
# '''Burst Image Deblurring Using Permutation Invariant Convolutional Neural Networks:''' similar ideas were applied, where Permutation Invariant CNN, are used to restore sharp and noise-free images from bursts of photographs affected by hand tremor and noise. This presented good quality images with lots of details for challenging datasets.

=Experimental Results=
==Synthetic Graph Labeling==
The authors created a synthetic problem to study GPI. This involved using an input graph <math>G = (V,E)</math> where each node <math>i</math> belongs to the set <math>\Gamma(i) \in \{1, \dots, K\}</math> where <math>K</math> is the number of samples. The task is to compute for each node, the number of neighbours that belong to the same set (i.e. finding the label of the node <math>i</math> if <math>y_i = \sum_{j \in N(i)} \mathbf{1}[\Gamma(i) = \Gamma(j)]</math>) . Then, random graphs (each with 10 nodes) were generated by sampling edges, and the set <math>\Gamma(i) \in \{1, \dots, K\}</math>for each node independently and uniformly.
The node features of the graph <math>z_i \in \{0,1\}^K</math> are one-hot vectors of <math>\Gamma(i)</math>, and each pairwise edge feature <math>z_{ij} \in \{0, 1\}</math> denote whether the edge <math>ij</math> is in the edge set <math>E</math>.
3 architectures were studied in this paper:
# '''GPI-architecture for graph prediction''' (without attention and RNN)
# '''LSTM''': replacing <math>\sum \phi(\cdot)</math> and <math>\sum \alpha(\cdot)</math> in the form of Theorem 1 using two LSTMs with state size 200, reading their input in random order
# '''Fully connected feed-forward network''': with 2 hidden layers, each layer containing 1,000 nodes; the input is a concatenation of all nodes and pairwise features, and the output is all node predictions

The results show that the GPI architecture requires far fewer samples to converge to the correct solution.
[[File:GPI_synthetic_example.jpg|450px|center]]

==Scene-Graph Classification==
Applying the concept of GPI to Scene-Graph Prediction (SGP) is the main task of this paper. The input to this problem is an image, along with a set of annotated bounding boxes for the entities in the image. The goal is to correctly label each entity within the bounding boxes and the relationship between every pair of entities, resulting in a coherent scene graph.

The authors describe two different types of variables to predict. The first type is entity variables <math>[y_1, \dots, y_n]</math> for all bounding boxes, where each <math>y_i</math> can take one of L values and refers to objects such as "dog" or "man". The second type is relation variables <math>[y_{n+1}, \cdots, y_{n^2}]</math>, where each <math>y_i</math> represents the relation (e.g. "on", "below") between a pair of bounding boxes (entities).

The scene graph and contain two types of edges:
# '''Entity-entity edge''': connecting two entities <math>y_i</math> and <math>y_j</math> for <math>1 \leq i \neq j \leq n</math>
# '''Entity-relation edges''': connecting every relation variable <math>y_k</math> for <math>k > n</math> to two entities

The feature set <math>\mathbf{z}</math> is based on the baseline model from Zellers et al. (2017). For entity variables <math>y_i</math>, the vector <math>\mathbf{z}_i \in \mathbb{R}^L</math> models the probability of the entity appearing in <math>y_i</math>. <math>\mathbf{z}_i</math> is augmented by the coordinates of the bounding box. Similarly for relation variables <math>y_j</math>, the vector <math>\mathbf{z}_j \in \mathbb{R}^R</math>, models the probability of the relations between the two entities in <math>j</math>. For entity-entity pairwise features <math>\mathbf{z}_{i,j}</math>, there is a similar representation of the probabilities for the pair. The SGP outputs probability distributions over all entities and relations, which will then be used as input recurrently to maintain GPI. Finally, word embeddings are used and concatenated for the most probable entity-relation labels.

'''Components of the GPI architecture''' (ent for entity, rel for relation)
# <math>\phi_{ent}</math>: network that integrates two entity variables <math>y_i</math> and <math>y_j</math>, with input <math>z_i, z_j, z_{i,j}</math> and output vector of <math>\mathbb{R}^{n_1}</math>
# <math>\alpha_{ent}</math>: network with inputs from <math>\phi_{ent}</math> for all neighbours of an entity, and uses attention mechanism to output vector <math>\mathbb{R}^{n_2}</math>
# <math>\rho_{ent}</math>: network with inputs from the various <math>\mathbb{R}^{n_2}</math> vectors, and outputs <math>L</math> logits to predict entity value
# <math>\rho_{rel}</math>: network with inputs <math>\alpha_{ent}</math> of two entities and <math>z_{i,j}</math>, and output into <math>R</math> logits

==Set-up and Results==
'''Dataset''': based on Visual Genome (VG) by (Krishna et al., 2017), which contains a total of 108,077 images annotated with bounding boxes, entities, and relations. An average of 12 entities and 7 relations exist per image. For a fair comparison with previous works, data from (Xu et al., 2017) for train and test splits were used. The authors used the same 150 entities and 50 relations as in (Xu et al., 2017; Newell & Deng, 2017; Zellers et al., 2017). Hyperparameters were tuned using a 70K/5K/32K split for training, validation, and testing respectively.

'''Training''': all networks were trained using the Adam optimizer, with a batch size of 20. The loss function was the sum of cross-entropy losses over all of entities and relations. Penalties for misclassified entities were 4 times stronger than that of relations. Penalties for misclassified negative relations were 10 times weaker than that of positive relations.

'''Evaluation''': there are three major tasks when inferring from the scene graph. The authors focus on the following:
# '''SGCIs''': given ground-truth entity bounding boxes, predict all entity and relations categories
# '''PredCIs''': given annotated bounding boxes with entity labels, predict all relations

The evaluation metric Recall@K (shortened to R@K) is drawn from (Lu et al., 2016). This metric is the fraction of correct ground-truth triplets that appear within the <math>K</math> most confident triplets predicted by the model. Graph-constrained protocol requires the top-<math>K</math> triplets to assign one consistent class per entity and relation. The unconstrained protocol does not enforce such constraint.

'''Models and baselines''': The authors compared variants of the GPI approach against four baselines, state-of-the-art models on completing scene graph sub-tasks. To maintain consistency, all models used the same training/testing data split, in addition to the preprocessing as per (Xu et al., 2017).

'''Baselines from existing state-of-the-art models'''
# (Lu et al., 2016): use of word embeddings to fine-tune the likelihood of predicted relations
# (Xu et al., 2017): message passing algorithm between entities and relations to iteratively improve feature map for prediction
# (Newell & Deng, 2017): Pixel2Graph, uses associative embeddings to produce a full graph from image
# (Zellers et al., 2017): NeuralMotif method, encodes global context to capture higher-order motif in scene graphs; Baseline outputs entities and relations distributions without using global context

'''GPI models'''
# '''GPI with no attention mechanism''': simply following Theorem 1's functional form, with summation over features
# '''GPI NeighborAttention''': same GPI model, but considers attention over neighbours features
# '''GPI Linguistic''': similar to NeighborAttention model, but concatenates word embedding vectors

'''Key Results''': The GPI Linguistic approach outperforms all baseline for SGCIs, and has similar performance to the state of the art NeuralMotifs method. The authors argue that PredCI is an easier task with less structure, yielding high performance for the existing state of the art models.

[[File:GPI_table_results.png|700px|center]]

=Conclusion=

A deep learning approach was presented in this paper to structured prediction, which constrains the architecture to be invariant to structurally identical inputs. This approach relies on pairwise features which are capable of describing inter-label correlations and inherits the intuitive aspect of score-based approaches. The output produced is invariant to equivalent representation of the pairwise terms.

As future work, the axiomatic approach can be extended; for example in image labeling, geometric variances such as shifts or rotations may be desired (or in other cases invariance to feature permutations may be desired). Additionally, exploring algorithms that discover symmetries for deep structured prediction when invariant structure is unknown and should be discovered from data is also an interesting extension of this work.

=Critique=
The paper's contribution comes from the novelty of the permutation invariance as a design guideline for structured prediction. Although not explicitly considered in many of the previous works, the idea of invariance in architecture has already been considered in Deep Sets by (Zaheer et al., 2017). This paper characterizes relaxes the condition on the invariance as compared to that of previous works. In the evaluation of the benefit of GPI models, the paper used a synthetic problem to illustrate the fact that far fewer samples are required for the GPI model to converge to 100% accuracy. However, when comparing the true task of scene graph prediction against the state-of-the-art baselines, the GPI variants had only marginal higher Recall@K scores. The true benefit of this paper's discovery is the avoidance of maximizing a score function (leading computationally difficult problem), and instead directly producing output invariant to how we represent the pairwise terms.

=References=
Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson, Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction, 2018.

Additional resources from Moshiko Raboh's [https://github.com/shikorab/SceneGraph GitHub]

Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction

2018-11-22T01:29:49Z

Vrajendr: /* Conclusion */

The paper ''Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction'' was written by Roei Herzig* from Tel Aviv University, Moshiko Raboh* from Tel Aviv University, Gal Chechik from Google Brain, Bar-Ilan University, Jonathan Berant from Tel Aviv University, and Amir Globerson from Tel Aviv University. This paper is part of the NIPS 2018 conference to be hosted in December 2018 at Montréal, Canada. This paper summary is based on version 3 of the pre-print (as of May 2018) obtained from [https://arxiv.org/pdf/1802.05451v3.pdf arXiv]

(*) Equal contribution

=Motivation=
In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there is a disconnect for what guidelines should be used when leveraging deep learning. This paper introduces a design principle for such models that stem from the concept of permutation invariance and proves state of the art performance on models that follow this principle.

The primary contributions that this paper makes include:
# Deriving sufficient and necessary conditions for respecting graph-permutation invariance in deep structured prediction architectures
# Empirically proving the benefit of graph-permutation invariance
# Developing a state-of-the-art model for scene graph predictions over a large set of complex visual scenes

=Introduction=
In order for a machine to interpret complex visual scenes, it must recognize and understand both objects and relationships between the objects in the scene. A '''scene graph''' is a representation of the set of objects and relations that exist in the scene, where objects are represented as nodes and relations are represented as edges connecting the different nodes. Hence, the prediction of the scene graph is analogous to inferring the joint set of objects and relations of a visual scene.

[[File:scene_graph_example.png|600px|center]]

Given that objects in scenes are interdependent on each other, joint prediction of the objects and relations is necessary. The field of structured prediction, which involves the general problem of inferring multiple inter-dependent labels, is of interest for this problem.

In structured prediction models, a score function <math>s(x, y)</math> is defined to evaluate the compatibility between label <math>y</math> and input <math>x</math>. For instance, when interpreting the scene of an image, <math>x</math> refers to the image itself, and <math>y</math> refers to a complex label, which contains both the objects and the relations between objects. As with most other inference methods, the goal is to find the label <math>y*</math> such that <math>s(x,y)</math> is maximized. However, the major concern is that the space for possible label assignments grows exponentially with respect to input size. For example, although an image may seem very simple, the corpus containing possible labels for objects may be very large, rendering it difficult to optimize the scoring function.

The paper presents an alternative approach, for which input <math>x</math> is mapped to structured output <math>y</math> using a "black box" neural network, omitting the definition of a score function. The main concern for this approach is the determination of the network architecture.

=Structured prediction=
This paper further considers structured predictions using score-based methods. For structured predictions that follow a score-based approach, a score function <math>s(x, y)</math> is used to measure how compatible label <math>y</math> is for input <math>x</math>. To optimize the score function, previous works have decomposed <math>s(x,y) = \sum_i f_i(x,y)</math> in order to facilitate efficient optimization for each local score function, <math>\max_y f_i(x,y)</math>.

In the area of structured predictions, the most commonly-used score functions include the singleton score function <math>f_i(y_i, x)</math> and pairwise score function <math>f_{ij} (y_i, y_j, x)</math>. Previous works explored a two-stage architectures (learn local scores independently of the structured prediction goal), and end-to-end architectures (to include the inference algorithm within the computation graph).

==Advantages of using score-based methods==
# Allow for intuitive specification of local dependencies between labels, and how they map to global dependencies
# Linear score functions offer natural convex surrogates
# Inference in large label space is sometimes possible via exact algorithms or empirically accurate approximations

The concern for modelling score functions using deep networks is that learning may no longer be convex. Hence, the paper presents properties for how deep networks can be used for structured predictions by considering architectures that do not require explicit maximization of a score function.

=Background, Notations, and Definitions=
We denote <math>y</math> as a structured label where <math>y = [y_1, \dots, y_n]</math>

'''Score functions:''' for score-based methods, the score is defined as either the sum of a set of singleton scores <math>f_i = f_i(y_i, x)</math> or the sum of pairwise scores <math>f_{ij} = f_{ij}(y_i, y_j, x)</math>.

Let <math>s(x,y)</math> be the score of a score-based method. Then:

<div align="center">
<math>s(x,y) = \begin{cases}
\sum_i f_i ~ \text{if we have a set of singleton scores}\\
\sum_{ij} f_{ij} ~ \text{if we have a set of pairwise scores } \\
\end{cases}</math>
</div>

'''Inference algorithm:''' an inference algorithm takes input set of local scores (either <math>f_i</math> or <math>f_{ij}</math>) and outputs an assignment of labels <math>y_1, \dots, y_n</math> that maximizes score function <math>s(x,y)</math>

'''Graph labeling function:''' a graph labeling function <math>\mathcal{F} : (V,E) \rightarrow Y</math> is a function that takes input of: an ordered set of node features <math>V = [z_1, \dots, z_n]</math> and an ordered set of edge features <math>E = [z_{1,2},\dots,z_{i,j},\dots,z_{n,n-1}]</math> to output set of node labels <math>\mathbf{y} = [y_1, \dots, y_n]</math>. For instance, <math>z_i</math> can be set equal to <math>f_i</math> and <math>z_{ij}</math> can be set equal to <math>f_{ij}</math>.

For convenience, the joint set of nodes and edges will be denoted as <math>\mathbf{z}</math> to be a size <math>n^2</math> vector (<math>n</math> nodes and <math>n(n-1)</math> edges).

'''Permutation:''' Let <math>z</math> be a set of node and edge features. Given a permutation <math>\sigma</math> of <math>\{1,\dots,n\}</math>, let <math>\sigma(z)</math> be a new set of node and edge features given by [<math>\sigma(z)]_i = z_{\sigma(i)}</math> and <math>[\sigma(z)]_{i,j} = z_{\sigma(i), \sigma(j)}</math>

'''One-hot representation:''' <math>\mathbf{1}[j]</math> be a one-hot vector with 1 in the <math>j^{th}</math> coordinate

=Permutation-Invariant Structured prediction=

With permutation-invariant structured prediction, we would expect the algorithm to produce the same result given the same score function. For instance, consider the case where we have label space for 3 variables <math>y_1, y_2, y_3</math> with input <math>\mathbf{z} = (f_1, f_2, f_3, f_{12}, f_{13}, f_{23})</math> that outputs label <math>\mathbf{y} = (y_1^*, y_2^*, y_3^*)</math>. Then if the algorithm is run on a permuted version input <math>z' = (f_2, f_1, f_3, f_{21}, f_{23}, f_{13})</math>, we would expect <math>\mathbf{y} = (y_2^*, y_1^*, y_3^*)</math> given the same score function.

'''Graph permutation invariance (GPI):''' a graph labeling function <math>\mathcal{F}</math> is graph-permutation invariant, if for all permutations <math>\sigma</math> of <math>\{1, \dots, n\}</math> and for all nodes <math>z</math>, <math>\mathcal{F}(\sigma(\mathbf{z})) = \sigma(\mathcal{F}(\mathbf{z}))</math>

The paper presents a theorem on the necessary and sufficient conditions for a function <math>\mathcal{F}</math> to be graph permutation invariant. Intuitively, because <math>\mathcal{F}</math> is a function that takes an ordered set <math>z</math> as input, the output on <math>\mathbf{z}</math> could very well be different from <math>\sigma(\mathbf{z})</math>, which means <math>\mathcal{F}</math> needs to have some sort of symmetry in order to sustain <math>[\mathcal{F}(\sigma(\mathbf{z}))]]_k = [\mathcal{F}(\mathbf{z})]_{\sigma(k)}</math>.

[[File:graph_permutation_invariance.jpg|400px|center]]

==Theorem 1==
Let <math>\mathcal{F}</math> be a graph labeling function. Then <math>\mathcal{F}</math> is graph-permutation invariant if and only if there exist functions <math>\alpha, \rho, \phi</math> such that for all <math>k=1, .., n</math>:
\begin{align}
[\mathcal{F}(\mathbf{z})]_k = \rho(\mathbf{z}_k, \sum_{i=1}^n \alpha(\mathbf{z}_i, \sum_{i\neq j} \phi(\mathbf{z}_i, \mathbf{z}_{i,j}, \mathbf{z}_j)))
\end{align}
where <math>\phi: \mathbb{R}^{2d+e} \rightarrow \mathbb{R}^L, \alpha: \mathbb{R}^{d + L} \rightarrow \mathbb{R}^{W}, p: \mathbb{R}^{W+d} \rightarrow \mathbb{R}</math>.

Notice that for the dimensions of inputs and outputs, <math>d</math> refers to the number of singleton features in <math>z</math> and <math>e</math> refers to the number of edges.

[[File:GPI_architecture.jpg|thumb|A schematic representation of the GPI architecture. Singleton features <math>z_i</math> are omitted for simplicity. First, the features <math>z_{i,j}</math> are processed element-wise by <math>\phi</math>. Next, they are summed to create a vector <math>s_i</math>, which is concatenated with <math>z_i</math>. Third, a representation of the entire graph is created by applying <math>\alpha\ n</math> times and summing the created vector. The graph representation is then finally processed by <math>\rho</math> together with <math>z_k</math>.|600px|center]]

==Proof Sketch for Theorem 1==
The proof of this theorem can be found in the paper. A proof sketch below is provided:

'''For the forward direction''' (function that follows the form set out in equation (1) is GPI):
# Using definition of permutation <math>\sigma</math>, and rewriting <math>[F(z)]_{\sigma(k)}</math> in the form from equation (1)
# Second argument of <math>\rho</math> is invariant under <math>\sigma</math>, since it takes the sum of all indices <math>i</math> and all other indices <math>j \neq i </math>.

'''For the backward direction''' (any black-box GPI function can be expressed in the form of equation 1):
# Construct <math>\phi, \alpha</math> such that second argument of <math>\rho</math> contains all information about graph features of <math>z</math>, including edges that the features originate from
# Assume each <math>z_k</math> uniquely identifies the node and <math>\mathcal{F}</math> is a function only of pairwise features <math>z_{i,j}</math>
# Construct <math>H</math> be a perfect hash function with <math>L</math> buckets, and <math>\phi</math> which maps '''pairwise features''' to a vector of size <math>L</math>
# <math>*</math>Construct <math>\phi(z_i, z_{i,j}, z_j) = \mathbf{1}[H(z_j)] z_{i,j}</math>, which intuitively means that <math>\phi</math> stores <math>z_{i,j}</math> in the unique bucket for node <math>j</math>
# Construct function <math>\alpha</math> to output a matrix <math>\mathbb{R}^{L \times L}</math> that maps each pairwise feature into unique positions (<math>\alpha(z_i, s_i) = \mathbf{1}[H(z_i)]s_i^T</math>)
# Construct matrix <math>M = \sum_i \alpha(z_i,s_i)</math> by discarding rows/columns in <math>M</math> that do not correspond to original nodes (which reduces dimension to <math>n\times n</math>; set <math>\rho</math> to have same outcome as <math>\mathcal{F}</math>, and set the output of <math>\mathcal{F}</math> on <math>M</math> to be the labels <math>\mathbf{y} = y_1, \dots, y_n</math>

<math>*</math>The paper presents the proof for the edge features <math>z_{ij}</math> being scalar (<math>e = 1</math>) for simplicity, which can be extended easily to vectors with additional indexing.

Although the results discussed previously apply to complete graphs (edges apply to all feature pairs), it can be easily extended to incomplete graphs. However, in place of permutation-invariance, it is now an automorphism-invariance.

==Implications and Applications of Theorem 1==
===Key Implications of Theorem 1===
# Architecture "collects" information from the different edges of the graph, and does so in an invariant fashion using <math>\alpha</math> and <math>\phi</math>
# Architecture is parallelizable, since all <math>\phi</math> functions can be applied simultaneously

===Some applications of Theorem 1===
# '''Attention:''' the concept of attention can be implemented in the GPI characterization, with slight alterations to the functions <math>\alpha</math> and <math>\phi</math>. In attention each node aggregates features of neighbours through a function of neighbour's relevance. Which means the lable of an entity could depend strongly on its close entity. The complete details can be found in the supplementary materials of the paper.

# '''RNN:''' recurrent architectures can maintain GPI property, since all GPI function <math>\mathcal{F}</math> are closed under composition. The output of one step after running <math>\mathcal{F}</math> will act as input for the next step, but maintain the GPI property throughout.

=Related Work=
# '''Architectural invariance:''' suggested recently in a 2017 paper called Deep Sets by Zaheer et al., which considers the case of invariance that is more restrictive.
# '''Deep structured prediction:''' previous work applied deep learning to structured prediction, for instance, semantic segmentation. Some algorithms include message passing algorithms, gradient descent for maximizing score functions, greedy decoding (inference of labels based on time of previous labels). Apart from those algorithms, deep learning has been applied to other graph-based problems such as the Travelling Salesman Problem (Bello et al., 2016; Gilmer et al., 2017; Khalil et al., 2017). However, none of the previous work specifically address the notion of invariance in the general architecture, but rather focus on message passing architectures that can be generalized by this paper.
# '''Scene graph prediction:''' scene graph extraction allows for reasoning, question answering, and image retrieval (Johnson et al., 2015; Lu et al., 2016; Raposo et al., 2017). Some other works in this area include object detection, action recognition, and even detection of human-object interactions (Liao et al., 2016; Plummer et al., 2017). Additional work has been done with the use of message passing algorithms (Xu et al., 2017), word embeddings (Lu et al., 2016), and end-to-end prediction directly from pixels (Newell & Deng, 2017). A notable mention is NeuralMotif (Zellers et al., 2017), which the authors describe as the current state-of-the-art model for scene graph predictions on Visual Genome dataset.
# '''Burst Image Deblurring Using Permutation Invariant Convolutional Neural Networks:''' similar ideas were applied, where Permutation Invariant CNN, are used to restore sharp and noise-free images from bursts of photographs affected by hand tremor and noise. This presented good quality images with lots of details for challenging datasets.

=Experimental Results=
==Synthetic Graph Labeling==
The authors created a synthetic problem to study GPI. This involved using an input graph <math>G = (V,E)</math> where each node <math>i</math> belongs to the set <math>\Gamma(i) \in \{1, \dots, K\}</math> where <math>K</math> is the number of samples. The task is to compute for each node, the number of neighbours that belong to the same set (i.e. finding the label of the node <math>i</math> if <math>y_i = \sum_{j \in N(i)} \mathbf{1}[\Gamma(i) = \Gamma(j)]</math>) . Then, random graphs (each with 10 nodes) were generated by sampling edges, and the set <math>\Gamma(i) \in \{1, \dots, K\}</math>for each node independently and uniformly.
The node features of the graph <math>z_i \in \{0,1\}^K</math> are one-hot vectors of <math>\Gamma(i)</math>, and each pairwise edge feature <math>z_{ij} \in \{0, 1\}</math> denote whether the edge <math>ij</math> is in the edge set <math>E</math>.
3 architectures were studied in this paper:
# '''GPI-architecture for graph prediction''' (without attention and RNN)
# '''LSTM''': replacing <math>\sum \phi(\cdot)</math> and <math>\sum \alpha(\cdot)</math> in the form of Theorem 1 using two LSTMs with state size 200, reading their input in random order
# '''Fully connected feed-forward network''': with 2 hidden layers, each layer containing 1,000 nodes; the input is a concatenation of all nodes and pairwise features, and the output is all node predictions

The results show that the GPI architecture requires far fewer samples to converge to the correct solution.
[[File:GPI_synthetic_example.jpg|450px|center]]

==Scene-Graph Classification==
Applying the concept of GPI to Scene-Graph Prediction (SGP) is the main task of this paper. The input to this problem is an image, along with a set of annotated bounding boxes for the entities in the image. The goal is to correctly label each entity within the bounding boxes and the relationship between every pair of entities, resulting in a coherent scene graph.

The authors describe two different types of variables to predict. The first type is entity variables <math>[y_1, \dots, y_n]</math> for all bounding boxes, where each <math>y_i</math> can take one of L values and refers to objects such as "dog" or "man". The second type is relation variables <math>[y_{n+1}, \cdots, y_{n^2}]</math>, where each <math>y_i</math> represents the relation (e.g. "on", "below") between a pair of bounding boxes (entities).

The scene graph and contain two types of edges:
# '''Entity-entity edge''': connecting two entities <math>y_i</math> and <math>y_j</math> for <math>1 \leq i \neq j \leq n</math>
# '''Entity-relation edges''': connecting every relation variable <math>y_k</math> for <math>k > n</math> to two entities

The feature set <math>\mathbf{z}</math> is based on the baseline model from Zellers et al. (2017). For entity variables <math>y_i</math>, the vector <math>\mathbf{z}_i \in \mathbb{R}^L</math> models the probability of the entity appearing in <math>y_i</math>. <math>\mathbf{z}_i</math> is augmented by the coordinates of the bounding box. Similarly for relation variables <math>y_j</math>, the vector <math>\mathbf{z}_j \in \mathbb{R}^R</math>, models the probability of the relations between the two entities in <math>j</math>. For entity-entity pairwise features <math>\mathbf{z}_{i,j}</math>, there is a similar representation of the probabilities for the pair. The SGP outputs probability distributions over all entities and relations, which will then be used as input recurrently to maintain GPI. Finally, word embeddings are used and concatenated for the most probable entity-relation labels.

'''Components of the GPI architecture''' (ent for entity, rel for relation)
# <math>\phi_{ent}</math>: network that integrates two entity variables <math>y_i</math> and <math>y_j</math>, with input <math>z_i, z_j, z_{i,j}</math> and output vector of <math>\mathbb{R}^{n_1}</math>
# <math>\alpha_{ent}</math>: network with inputs from <math>\phi_{ent}</math> for all neighbours of an entity, and uses attention mechanism to output vector <math>\mathbb{R}^{n_2}</math>
# <math>\rho_{ent}</math>: network with inputs from the various <math>\mathbb{R}^{n_2}</math> vectors, and outputs <math>L</math> logits to predict entity value
# <math>\rho_{rel}</math>: network with inputs <math>\alpha_{ent}</math> of two entities and <math>z_{i,j}</math>, and output into <math>R</math> logits

==Set-up and Results==
'''Dataset''': based on Visual Genome (VG) by (Krishna et al., 2017), which contains a total of 108,077 images annotated with bounding boxes, entities, and relations. An average of 12 entities and 7 relations exist per image. For a fair comparison with previous works, data from (Xu et al., 2017) for train and test splits were used. The authors used the same 150 entities and 50 relations as in (Xu et al., 2017; Newell & Deng, 2017; Zellers et al., 2017). Hyperparameters were tuned using a 70K/5K/32K split for training, validation, and testing respectively.

'''Training''': all networks were trained using the Adam optimizer, with a batch size of 20. The loss function was the sum of cross-entropy losses over all of entities and relations. Penalties for misclassified entities were 4 times stronger than that of relations. Penalties for misclassified negative relations were 10 times weaker than that of positive relations.

'''Evaluation''': there are three major tasks when inferring from the scene graph. The authors focus on the following:
# '''SGCIs''': given ground-truth entity bounding boxes, predict all entity and relations categories
# '''PredCIs''': given annotated bounding boxes with entity labels, predict all relations

The evaluation metric Recall@K (shortened to R@K) is drawn from (Lu et al., 2016). This metric is the fraction of correct ground-truth triplets that appear within the <math>K</math> most confident triplets predicted by the model. Graph-constrained protocol requires the top-<math>K</math> triplets to assign one consistent class per entity and relation. The unconstrained protocol does not enforce such constraint.

'''Models and baselines''': The authors compared variants of the GPI approach against four baselines, state-of-the-art models on completing scene graph sub-tasks. To maintain consistency, all models used the same training/testing data split, in addition to the preprocessing as per (Xu et al., 2017).

'''Baselines from existing state-of-the-art models'''
# (Lu et al., 2016): use of word embeddings to fine-tune the likelihood of predicted relations
# (Xu et al., 2017): message passing algorithm between entities and relations to iteratively improve feature map for prediction
# (Newell & Deng, 2017): Pixel2Graph, uses associative embeddings to produce a full graph from image
# (Zellers et al., 2017): NeuralMotif method, encodes global context to capture higher-order motif in scene graphs; Baseline outputs entities and relations distributions without using global context

'''GPI models'''
# '''GPI with no attention mechanism''': simply following Theorem 1's functional form, with summation over features
# '''GPI NeighborAttention''': same GPI model, but considers attention over neighbours features
# '''GPI Linguistic''': similar to NeighborAttention model, but concatenates word embedding vectors

'''Key Results''': The GPI Linguistic approach outperforms all baseline for SGCIs, and has similar performance to the state of the art NeuralMotifs method. The authors argue that PredCI is an easier task with less structure, yielding high performance for the existing state of the art models.

[[File:GPI_table_results.png|700px|center]]

=Conclusion=

A deep learning approach was presented in this paper to structured prediction, which constrains the architecture to be invariant to structurally identical inputs. This approach relies on pairwise features which are capable of describing inter-label correlations and inherits the intuitive aspect of score-based approaches. The output produced is invariant to equivalent representation of the pairwise terms.

As future work, the axiomatic approach can be extended; for example in image labeling, geometric variances such as shifts or rotations may be desired (or in other cases invariance to feature permutations may be desired). Additionally, exploring algorithms that discover symmetries for deep structured prediction when invariant structure is unknown and should be discovered from data is also an interesting extension of this work.

=Critique=
The paper's contribution comes from the novelty of the permutation invariance as a design guideline for structured prediction. Although not explicitly considered in many of the previous works, the idea of invariance in architecture has already been considered in Deep Sets by (Zaheer et al., 2017). This paper characterizes relaxes the condition on the invariance as compared to that of previous works. In the evaluation of the benefit of GPI models, the paper used a synthetic problem to illustrate the fact that far fewer samples are required for the GPI model to converge to 100% accuracy. However, when comparing the true task of scene graph prediction against the state-of-the-art baselines, the GPI variants had only marginal higher Recall@K scores. The true benefit of this paper's discovery is the avoidance of maximizing a score function (leading computationally difficult problem), and instead directly producing output invariant to how we represent the pairwise terms.

=References=
Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson, Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction, 2018.

Additional resources from Moshiko Raboh's [https://github.com/shikorab/SceneGraph GitHub]

conditional neural process

2018-11-22T01:17:37Z

Vrajendr: /* Experimental Result I: Function Regression */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of f. The aim is you minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0} ^{n-1}</math>, and test set be <math display="inline"> T = \{x_i, y_i\}_{i = n} ^ {n + m - 1}</math>.

P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, P defines a joint distribution over the random variables <math display="inline"> {f(x_i)}_{i = 0} ^{n + m - 1}</math>. Therefore, for <math display="inline"> P(f(x)|O, T)</math>, our task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>,

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNP parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes is traded off for functional flexibility and scalability.

CNP is a conditional stochastic process <math display="inline">Q_\theta</math> defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. For stochastic processs, we assume <math display="inline">Q_{\theta}</math> is invariant to permutations, and in this work, we generally enforce permutation invariance with respect to <math display="inline">T</math> be assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>

In detail, we use the following archiecture

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_i * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, <math display="inline">r = r_i * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observation with minimal overhead.

We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution
P given a set of observations. Thus, the targets it scores <math display="inline">Q_\theta</math> on include both the observed
and unobserved values. In practice, we take Monte Carlo
estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>.
This approach shifts the burden of imposing prior knowledge

from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still, we
emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

Classical 1D regression task that used as a common baseline for GP is our first example.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point. on the real line between two functions each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, select
a subset of n points as observations, and a subset of t points as target points. Using the model, the observed points are encoded using a three layer MLP encoder h with a 128 dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>
, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five layer
MLP. The function outputs a Gaussian mean and variance for the target outputs. The model is trained to maximize the log-likelihood of the target points using the Adam optimizer.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on our
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, we see the model learns
to estimate its own uncertainty given the observations very
accurately. Nonetheless it provides a good approximation
that increases in accuracy as the number of context points
increases.
Furthermore the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs whereas in our case we only have to
change the dataset used for training

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and use the test
set to evaluate its performance. As shown in the above figure the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, as we have a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure our model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
its flexibility not only in the number of observations and
targets it receives but also with regards to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
fox example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regards to the target
input values. This means, e.g., we can query the model
at resolutions it has not seen during training. We take a
model that has only been trained using pixel coordinates of
a specific resolution, and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5, with one forward pass we can query the model at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds – it is flexible in regards to the conditioning
and prediction task, and has the capacity to extract domain
knowledge from a training set.

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table CNPs outperform
the latter when number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context exact
methods like GPs and kNN will perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as our testing data set.
This includes cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg|400px|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat state of the art for one-shot classification our accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time
as opposed to O(nm)

== Conclusion ==

In this paper they had introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

We had demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
We compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
Conditional Neural Processes
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Other Sources ==
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

conditional neural process

2018-11-22T01:06:42Z

Vrajendr: /* Model */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of f. The aim is you minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0} ^{n-1}</math>, and test set be <math display="inline"> T = \{x_i, y_i\}_{i = n} ^ {n + m - 1}</math>.

P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, P defines a joint distribution over the random variables <math display="inline"> {f(x_i)}_{i = 0} ^{n + m - 1}</math>. Therefore, for <math display="inline"> P(f(x)|O, T)</math>, our task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>,

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNP parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes is traded off for functional flexibility and scalability.

CNP is a conditional stochastic process <math display="inline">Q_\theta</math> defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. For stochastic processs, we assume <math display="inline">Q_{\theta}</math> is invariant to permutations, and in this work, we generally enforce permutation invariance with respect to <math display="inline">T</math> be assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>

In detail, we use the following archiecture

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_i * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, <math display="inline">r = r_i * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observation with minimal overhead.

We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution
P given a set of observations. Thus, the targets it scores <math display="inline">Q_\theta</math> on include both the observed
and unobserved values. In practice, we take Monte Carlo
estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>.
This approach shifts the burden of imposing prior knowledge

from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still, we
emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

Classical 1D regression task that used as a common baseline for GP is our first example.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point. on the real line between two functions each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, select
a subset of n points as observations, and a subset of t points as target points. Using the model, the observed points are encoded using a three layer MLP encoder h with a 128 dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>
, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five layer
MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on our
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, we see the model learns
to estimate its own uncertainty given the observations very
accurately. Nonetheless it provides a good approximation
that increases in accuracy as the number of context points
increases.
Furthermore the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs whereas in our case we only have to
change the dataset used for training

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and use the test
set to evaluate its performance. As shown in the above figure the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, as we have a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure our model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
its flexibility not only in the number of observations and
targets it receives but also with regards to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
fox example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regards to the target
input values. This means, e.g., we can query the model
at resolutions it has not seen during training. We take a
model that has only been trained using pixel coordinates of
a specific resolution, and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5, with one forward pass we can query the model at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds – it is flexible in regards to the conditioning
and prediction task, and has the capacity to extract domain
knowledge from a training set.

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table CNPs outperform
the latter when number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context exact
methods like GPs and kNN will perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as our testing data set.
This includes cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg|400px|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat state of the art for one-shot classification our accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time
as opposed to O(nm)

== Conclusion ==

In this paper they had introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

We had demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
We compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
Conditional Neural Processes
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Other Sources ==
# Code for this model and a simpler explanation can be found at [https://github.com/deepmind/conditional-neural-process]
# A newer version of the model is described in this paper [https://arxiv.org/pdf/1807.01622.pdf]
# A good blog post on neural processes [https://kasparmartens.rbind.io/post/np/]

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

Learning to Navigate in Cities Without a Map

2018-11-22T00:47:33Z

Vrajendr: /* Agent Interface and the Courier Task */

Paper:
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]
A video of the paper is available here[https://sites.google.com/view/streetlearn].

== Introduction ==
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.

1. Building an explicit map

2. Planning and acting using that map.

In this article, based on this fact that human can learn to navigate through cities without using any special tool such as maps or GPS, authors propose new methods to show that a neural network agent can do the same thing by using visual observations. To do so, an interactive environment using Google StreetView Images and a dual pathway agent architecture is designed. As shown in figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate using visual aids is shown to be successful in some domains such as games and simulated environments using deep reinforcement learning (RL), it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation. That’s why it became the subject of investigation in this paper.
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (which have not being used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]

==Contribution==
This paper has made the following contributions:

1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.

2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.

3. Leveraging a recurrent neural architecture. Using that, not only could navigation through a city be possible, but also the model is scalable for navigation in new cities. This architecture supports both locale-specific learnings and general transferable navigations. The authors achieved these by separating a recurrent neural pathway. This pathway receives and interprets the current goal as well as encapsulates and memorizes features of a single region.

4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.

==Related Work==

1. Localization from real-world imagery. For example, (Weyand et al., 2016), a CNN was able to achieve excellent results on geolocation task. This paper provides novel work by not including supervised training with ground-truth labels, and by including planning as a goal. Some other works also improve by exploiting spatiotemporal continuity or estimating camera pose or depth estimation from pixels. These methods rely on supervised training with ground truth labels, which is not possible in every environment.

2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Some other researches used text descriptions to incorporate goal instructions. Researchers developed realistic, higher-fidelity environment simulations to make the experiment more realistic, but that still came with lack of diversities. This paper makes use of real-world data, in contrast to many related papers in this area. It's diverse and visually realistic but still, it does not contain dynamic elements, and the street topology cannot be regenerated or altered.

3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map via an RL agent with external memory; some other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. Explicit neural mapper and navigation planner with joint training was also used. Among all these works, the target-driven visual navigation with a goal-conditional policy approach was most related to our method.

==Environment==
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m and cover a range of up to
5km chosen (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.

[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]

==Agent Interface and the Courier Task==
In an RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. Also, a first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84*84 pixels. Furthermore, the action space consists of 5 movements: “slow” rotate left or right (±22:5), “fast” rotate left or right (±67.5), or move forward (implemented as a ''noop'' in the case where this is not a viable action). The most central edge is chosen if there are multiple edges in the agents viewing cone.

There are lots of ways to specify the goal to the agent. In this paper, the current goal is chosen to be represented in terms of its proximity to a set L of fixed landmarks <math> L={(Lat_k, Long_k)}</math> which are specified using Latitude and Longitude coordinate system. For distance to the <math> k_{th}</math> landmark <math>{(d_{(t,k)}^g})_k</math> the goal vector contains <math> g_{(t,i)}=\tfrac{exp⁡(-αd_{(t,i)}^g)}{∑_k exp⁡(-αd_{(t,k)}^g)} </math>for <math>i_{th}</math> landmark with <math>α=0.002</math> (Figure 3).

[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]

This form of representation has several advantages:

1. It could easily be extended to new environments.

2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.

3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.

In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. In each episode, which consists of 1000 steps, the agent starts from a random place with random orientation. when an agent gets within 100 meters of goal, the next goal is randomly chosen. An episode ends after 1000 agent steps. Finally, the reward is proportional to the shortest path between agent and goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.

==Methods==

===Goal-dependent Actor-Critic Reinforcement Learning===
In this paper, the learning problem is based on Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} → \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. Expected return is defined as the sum of
discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. Also, the expected return from a state <math>s_t</math> depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>:

\begin{align}
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)
\end{align}

Value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:

\begin{align}
V^{\pi}(s,g)=E[R_t]=E[Σ_{k=0}^{\infty}\gamma^kr_{t+k}|s_t=s, g_t=g]
\end{align}

Also, an architecture with multiple pathways is designed to support two types of learning that is required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene the agent needs to remember unique features of the scene which then help the agent to organize and remember the scenes.

===Architectures===

[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]

The authors use neural networks to parameterize policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the Convolution network is fed to a Long Short-Term Memory (LSTM) as well as the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>.

Three different architectures are described below.

The '''GoalNav''' architecture (Fig. 4a) which consists of a convolutional architecture and policy LSTM. Goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.

The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information.

The '''MultiCityNav''' architecture (Fig. 4c) is an extension of City-Nav for learning in different cities. This is done using the parallel connection of goal LSTMs for encapsulating locale-specific features, for each city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities. So, new goal LSTMs are required to be trained in new cities.

===Curriculum Learning===
In curriculum learning, the model is trained using simple examples in first steps. As soon as the model learns those examples, more complex and difficult examples would be fed to the model. In this paper, this approach is used to teach agent to navigate to further destinations. This courier task suffers from a common problem of RL tasks which is sparse rewards (similar to Montezuma’s Revenge) . To overcome this problem, a natural curriculum scheme is defined, in which sampling each new goal would be within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan)

==Results==
In this section, the performance of the proposed architectures on the courier task is shown.

[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum– we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way
as in London) are comparable and the agent achieved mean goal reward 426.]]

It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:

1. Goal Navigation agent.

2. City Navigation Agent.

3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.

Also, a lower bound (Heuristic) and an upper bound(Oracle) on the performance is considered. As it is said in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search.". As it is clear in Figure 5, CityNav architecture with the previously mentioned architecture attains a higher performance and is more stable than the simpler GoalNav agent.

The trajectories of the trained agent over two 1000 step episodes and the value function of the agent during navigation to a destination is shown in Figure 6.

[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]

Figure 7 shows that navigation policy is learned by agent successfully in St Paul’s Cathedral in London and Washington Square in New York.
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach
a goal (Washington Square in New York or St Paul’s Cathedral in
London) from 100 start locations vs. the straight-line distance to
the goal in meters. One agent step corresponds to a forward movement
of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]

A critical test for this article is to transfer model to new cities by learning a new set of landmarks, but without re-learning visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities besides freezing both the policy LSTM and the convolutional encoder. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: Training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities would increase its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Training the model in a single city using skip connection is useful. However, it is not useful in multi-city transferring.
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 10). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]

Giving early rewards before agent reaches the goal or adding random rewards (coins) to encourage exploration is investigated in this article. Figure 11a suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. Also, as it is clear from Figure 11b, reducing the density of the landmarks does not seem to reduce the performance. Based on the results, authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to asses the importance of the goal-conditioned agents, a Goal-less CityNav agent is trained by removing inputs gt. The poor performance of this agent is clear in Figure 11b. Furthermore, reducing the density of the landmarks by the ratio of 50%, 25%, and 12:5% does not reduce the performance that much. Finally, some alternative for goal representation is investigated:

a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.

b) Binned representation.

The latitude and longitude scalar goal representations perform the best. However, since the all landmarks representation performs well while remaining independent of the coordinate system, we use this representation as the canonical one.

[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]

==Conclusion==
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities is discussed.

==Critique==
1. It is not clear how this model is applicable in the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modelling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.

2. This paper is only using static Google Street View images as its primary source of data. But the authors must at least complement this with other dynamic data like traffic and road blockage information for a realistic model of navigation in the world.

3. The 'Transfer in Multi-City Experiments' results could be strengthened significantly via cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.

Learning to Navigate in Cities Without a Map

2018-11-22T00:44:30Z

Vrajendr: /* Critique */

Paper:
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]
A video of the paper is available here[https://sites.google.com/view/streetlearn].

== Introduction ==
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.

1. Building an explicit map

2. Planning and acting using that map.

In this article, based on this fact that human can learn to navigate through cities without using any special tool such as maps or GPS, authors propose new methods to show that a neural network agent can do the same thing by using visual observations. To do so, an interactive environment using Google StreetView Images and a dual pathway agent architecture is designed. As shown in figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate using visual aids is shown to be successful in some domains such as games and simulated environments using deep reinforcement learning (RL), it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation. That’s why it became the subject of investigation in this paper.
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (which have not being used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]

==Contribution==
This paper has made the following contributions:

1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.

2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.

3. Leveraging a recurrent neural architecture. Using that, not only could navigation through a city be possible, but also the model is scalable for navigation in new cities. This architecture supports both locale-specific learnings and general transferable navigations. The authors achieved these by separating a recurrent neural pathway. This pathway receives and interprets the current goal as well as encapsulates and memorizes features of a single region.

4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.

==Related Work==

1. Localization from real-world imagery. For example, (Weyand et al., 2016), a CNN was able to achieve excellent results on geolocation task. This paper provides novel work by not including supervised training with ground-truth labels, and by including planning as a goal. Some other works also improve by exploiting spatiotemporal continuity or estimating camera pose or depth estimation from pixels. These methods rely on supervised training with ground truth labels, which is not possible in every environment.

2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Some other researches used text descriptions to incorporate goal instructions. Researchers developed realistic, higher-fidelity environment simulations to make the experiment more realistic, but that still came with lack of diversities. This paper makes use of real-world data, in contrast to many related papers in this area. It's diverse and visually realistic but still, it does not contain dynamic elements, and the street topology cannot be regenerated or altered.

3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map via an RL agent with external memory; some other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. Explicit neural mapper and navigation planner with joint training was also used. Among all these works, the target-driven visual navigation with a goal-conditional policy approach was most related to our method.

==Environment==
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m and cover a range of up to
5km chosen (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.

[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]

==Agent Interface and the Courier Task==
In RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. Also, a first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84*84 pixels. Furthermore, the action space consists of 5 movements: “slow” rotate left or right (±22:5), “fast” rotate left or right (±67.5), or move forward (implemented as a ''noop'' in the case where this is not a viable action).

There are lots of ways to specify the goal to the agent. In this paper, the current goal is chosen to be represented in terms of its proximity to a set L of fixed landmarks <math> L={(Lat_k, Long_k)}</math> which are specified using Latitude and Longitude coordinate system. For distance to the <math> k_{th}</math> landmark <math>{(d_{(t,k)}^g})_k</math> the goal vector contains <math> g_{(t,i)}=\tfrac{exp⁡(-αd_{(t,i)}^g)}{∑_k exp⁡(-αd_{(t,k)}^g)} </math>for <math>i_{th}</math> landmark with <math>α=0.002</math> (Figure 3).

[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]

This form of representation has several advantages:

1. It could easily be extended to new environments.

2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.

3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.

In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. In each episode, which consists of 1000 steps, the agent starts from a random place with random orientation. when an agent gets within 100 meters of goal, the next goal is randomly chosen. An episode ends after 1000 agent steps. Finally, the reward is proportional to the shortest path between agent and goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.

==Methods==

===Goal-dependent Actor-Critic Reinforcement Learning===
In this paper, the learning problem is based on Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} → \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. Expected return is defined as the sum of
discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. Also, the expected return from a state <math>s_t</math> depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>:

\begin{align}
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)
\end{align}

Value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:

\begin{align}
V^{\pi}(s,g)=E[R_t]=E[Σ_{k=0}^{\infty}\gamma^kr_{t+k}|s_t=s, g_t=g]
\end{align}

Also, an architecture with multiple pathways is designed to support two types of learning that is required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene the agent needs to remember unique features of the scene which then help the agent to organize and remember the scenes.

===Architectures===

[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]

The authors use neural networks to parameterize policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the Convolution network is fed to a Long Short-Term Memory (LSTM) as well as the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>.

Three different architectures are described below.

The '''GoalNav''' architecture (Fig. 4a) which consists of a convolutional architecture and policy LSTM. Goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.

The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information.

The '''MultiCityNav''' architecture (Fig. 4c) is an extension of City-Nav for learning in different cities. This is done using the parallel connection of goal LSTMs for encapsulating locale-specific features, for each city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities. So, new goal LSTMs are required to be trained in new cities.

===Curriculum Learning===
In curriculum learning, the model is trained using simple examples in first steps. As soon as the model learns those examples, more complex and difficult examples would be fed to the model. In this paper, this approach is used to teach agent to navigate to further destinations. This courier task suffers from a common problem of RL tasks which is sparse rewards (similar to Montezuma’s Revenge) . To overcome this problem, a natural curriculum scheme is defined, in which sampling each new goal would be within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan)

==Results==
In this section, the performance of the proposed architectures on the courier task is shown.

[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum– we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way
as in London) are comparable and the agent achieved mean goal reward 426.]]

It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:

1. Goal Navigation agent.

2. City Navigation Agent.

3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.

Also, a lower bound (Heuristic) and an upper bound(Oracle) on the performance is considered. As it is said in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search.". As it is clear in Figure 5, CityNav architecture with the previously mentioned architecture attains a higher performance and is more stable than the simpler GoalNav agent.

The trajectories of the trained agent over two 1000 step episodes and the value function of the agent during navigation to a destination is shown in Figure 6.

[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]

Figure 7 shows that navigation policy is learned by agent successfully in St Paul’s Cathedral in London and Washington Square in New York.
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach
a goal (Washington Square in New York or St Paul’s Cathedral in
London) from 100 start locations vs. the straight-line distance to
the goal in meters. One agent step corresponds to a forward movement
of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]

A critical test for this article is to transfer model to new cities by learning a new set of landmarks, but without re-learning visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities besides freezing both the policy LSTM and the convolutional encoder. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: Training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities would increase its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Training the model in a single city using skip connection is useful. However, it is not useful in multi-city transferring.
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 10). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]

Giving early rewards before agent reaches the goal or adding random rewards (coins) to encourage exploration is investigated in this article. Figure 11a suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. Also, as it is clear from Figure 11b, reducing the density of the landmarks does not seem to reduce the performance. Based on the results, authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to asses the importance of the goal-conditioned agents, a Goal-less CityNav agent is trained by removing inputs gt. The poor performance of this agent is clear in Figure 11b. Furthermore, reducing the density of the landmarks by the ratio of 50%, 25%, and 12:5% does not reduce the performance that much. Finally, some alternative for goal representation is investigated:

a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.

b) Binned representation.

The latitude and longitude scalar goal representations perform the best. However, since the all landmarks representation performs well while remaining independent of the coordinate system, we use this representation as the canonical one.

[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]

==Conclusion==
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities is discussed.

==Critique==
1. It is not clear how this model is applicable in the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modelling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.

2. This paper is only using static Google Street View images as its primary source of data. But the authors must at least complement this with other dynamic data like traffic and road blockage information for a realistic model of navigation in the world.

3. The 'Transfer in Multi-City Experiments' results could be strengthened significantly via cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.

Learning to Navigate in Cities Without a Map

2018-11-22T00:42:22Z

Vrajendr: /* Curriculum Learning */

Paper:
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]
A video of the paper is available here[https://sites.google.com/view/streetlearn].

== Introduction ==
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.

1. Building an explicit map

2. Planning and acting using that map.

In this article, based on this fact that human can learn to navigate through cities without using any special tool such as maps or GPS, authors propose new methods to show that a neural network agent can do the same thing by using visual observations. To do so, an interactive environment using Google StreetView Images and a dual pathway agent architecture is designed. As shown in figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate using visual aids is shown to be successful in some domains such as games and simulated environments using deep reinforcement learning (RL), it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation. That’s why it became the subject of investigation in this paper.
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (which have not being used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]

==Contribution==
This paper has made the following contributions:

1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.

2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.

3. Leveraging a recurrent neural architecture. Using that, not only could navigation through a city be possible, but also the model is scalable for navigation in new cities. This architecture supports both locale-specific learnings and general transferable navigations. The authors achieved these by separating a recurrent neural pathway. This pathway receives and interprets the current goal as well as encapsulates and memorizes features of a single region.

4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.

==Related Work==

1. Localization from real-world imagery. For example, (Weyand et al., 2016), a CNN was able to achieve excellent results on geolocation task. This paper provides novel work by not including supervised training with ground-truth labels, and by including planning as a goal. Some other works also improve by exploiting spatiotemporal continuity or estimating camera pose or depth estimation from pixels. These methods rely on supervised training with ground truth labels, which is not possible in every environment.

2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Some other researches used text descriptions to incorporate goal instructions. Researchers developed realistic, higher-fidelity environment simulations to make the experiment more realistic, but that still came with lack of diversities. This paper makes use of real-world data, in contrast to many related papers in this area. It's diverse and visually realistic but still, it does not contain dynamic elements, and the street topology cannot be regenerated or altered.

3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map via an RL agent with external memory; some other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. Explicit neural mapper and navigation planner with joint training was also used. Among all these works, the target-driven visual navigation with a goal-conditional policy approach was most related to our method.

==Environment==
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m and cover a range of up to
5km chosen (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.

[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]

==Agent Interface and the Courier Task==
In RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. Also, a first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84*84 pixels. Furthermore, the action space consists of 5 movements: “slow” rotate left or right (±22:5), “fast” rotate left or right (±67.5), or move forward (implemented as a ''noop'' in the case where this is not a viable action).

There are lots of ways to specify the goal to the agent. In this paper, the current goal is chosen to be represented in terms of its proximity to a set L of fixed landmarks <math> L={(Lat_k, Long_k)}</math> which are specified using Latitude and Longitude coordinate system. For distance to the <math> k_{th}</math> landmark <math>{(d_{(t,k)}^g})_k</math> the goal vector contains <math> g_{(t,i)}=\tfrac{exp⁡(-αd_{(t,i)}^g)}{∑_k exp⁡(-αd_{(t,k)}^g)} </math>for <math>i_{th}</math> landmark with <math>α=0.002</math> (Figure 3).

[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]

This form of representation has several advantages:

1. It could easily be extended to new environments.

2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.

3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.

In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. In each episode, which consists of 1000 steps, the agent starts from a random place with random orientation. when an agent gets within 100 meters of goal, the next goal is randomly chosen. An episode ends after 1000 agent steps. Finally, the reward is proportional to the shortest path between agent and goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.

==Methods==

===Goal-dependent Actor-Critic Reinforcement Learning===
In this paper, the learning problem is based on Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} → \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. Expected return is defined as the sum of
discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. Also, the expected return from a state <math>s_t</math> depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>:

\begin{align}
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)
\end{align}

Value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:

\begin{align}
V^{\pi}(s,g)=E[R_t]=E[Σ_{k=0}^{\infty}\gamma^kr_{t+k}|s_t=s, g_t=g]
\end{align}

Also, an architecture with multiple pathways is designed to support two types of learning that is required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene the agent needs to remember unique features of the scene which then help the agent to organize and remember the scenes.

===Architectures===

[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]

The authors use neural networks to parameterize policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the Convolution network is fed to a Long Short-Term Memory (LSTM) as well as the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>.

Three different architectures are described below.

The '''GoalNav''' architecture (Fig. 4a) which consists of a convolutional architecture and policy LSTM. Goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.

The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information.

The '''MultiCityNav''' architecture (Fig. 4c) is an extension of City-Nav for learning in different cities. This is done using the parallel connection of goal LSTMs for encapsulating locale-specific features, for each city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities. So, new goal LSTMs are required to be trained in new cities.

===Curriculum Learning===
In curriculum learning, the model is trained using simple examples in first steps. As soon as the model learns those examples, more complex and difficult examples would be fed to the model. In this paper, this approach is used to teach agent to navigate to further destinations. This courier task suffers from a common problem of RL tasks which is sparse rewards (similar to Montezuma’s Revenge) . To overcome this problem, a natural curriculum scheme is defined, in which sampling each new goal would be within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan)

==Results==
In this section, the performance of the proposed architectures on the courier task is shown.

[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum– we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way
as in London) are comparable and the agent achieved mean goal reward 426.]]

It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:

1. Goal Navigation agent.

2. City Navigation Agent.

3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.

Also, a lower bound (Heuristic) and an upper bound(Oracle) on the performance is considered. As it is said in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search.". As it is clear in Figure 5, CityNav architecture with the previously mentioned architecture attains a higher performance and is more stable than the simpler GoalNav agent.

The trajectories of the trained agent over two 1000 step episodes and the value function of the agent during navigation to a destination is shown in Figure 6.

[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]

Figure 7 shows that navigation policy is learned by agent successfully in St Paul’s Cathedral in London and Washington Square in New York.
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach
a goal (Washington Square in New York or St Paul’s Cathedral in
London) from 100 start locations vs. the straight-line distance to
the goal in meters. One agent step corresponds to a forward movement
of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]

A critical test for this article is to transfer model to new cities by learning a new set of landmarks, but without re-learning visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities besides freezing both the policy LSTM and the convolutional encoder. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: Training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities would increase its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Training the model in a single city using skip connection is useful. However, it is not useful in multi-city transferring.
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 10). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]

Giving early rewards before agent reaches the goal or adding random rewards (coins) to encourage exploration is investigated in this article. Figure 11a suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. Also, as it is clear from Figure 11b, reducing the density of the landmarks does not seem to reduce the performance. Based on the results, authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to asses the importance of the goal-conditioned agents, a Goal-less CityNav agent is trained by removing inputs gt. The poor performance of this agent is clear in Figure 11b. Furthermore, reducing the density of the landmarks by the ratio of 50%, 25%, and 12:5% does not reduce the performance that much. Finally, some alternative for goal representation is investigated:

a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.

b) Binned representation.

The latitude and longitude scalar goal representations perform the best. However, since the all landmarks representation performs well while remaining independent of the coordinate system, we use this representation as the canonical one.

[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]

==Conclusion==
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities is discussed.

==Critique==
1. It is not clear that how this model is applicable in the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modeling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.

2. This paper is only using static google street view images as its primary source of data. But the authors must at least complement this with other dynamic data like traffic and road blockage information for a realistic model of navigation in the world.

3. The 'Transfer in Multi-City Experiments' results could strengthened significantly from cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.

Learning to Navigate in Cities Without a Map

2018-11-22T00:38:53Z

Vrajendr: /* Related Work */

Paper:
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]
A video of the paper is available here[https://sites.google.com/view/streetlearn].

== Introduction ==
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.

1. Building an explicit map

2. Planning and acting using that map.

In this article, based on this fact that human can learn to navigate through cities without using any special tool such as maps or GPS, authors propose new methods to show that a neural network agent can do the same thing by using visual observations. To do so, an interactive environment using Google StreetView Images and a dual pathway agent architecture is designed. As shown in figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate using visual aids is shown to be successful in some domains such as games and simulated environments using deep reinforcement learning (RL), it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation. That’s why it became the subject of investigation in this paper.
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (which have not being used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]

==Contribution==
This paper has made the following contributions:

1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.

2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.

3. Leveraging a recurrent neural architecture. Using that, not only could navigation through a city be possible, but also the model is scalable for navigation in new cities. This architecture supports both locale-specific learnings and general transferable navigations. The authors achieved these by separating a recurrent neural pathway. This pathway receives and interprets the current goal as well as encapsulates and memorizes features of a single region.

4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.

==Related Work==

1. Localization from real-world imagery. For example, (Weyand et al., 2016), a CNN was able to achieve excellent results on geolocation task. This paper provides novel work by not including supervised training with ground-truth labels, and by including planning as a goal. Some other works also improve by exploiting spatiotemporal continuity or estimating camera pose or depth estimation from pixels. These methods rely on supervised training with ground truth labels, which is not possible in every environment.

2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Some other researches used text descriptions to incorporate goal instructions. Researchers developed realistic, higher-fidelity environment simulations to make the experiment more realistic, but that still came with lack of diversities. This paper makes use of real-world data, in contrast to many related papers in this area. It's diverse and visually realistic but still, it does not contain dynamic elements, and the street topology cannot be regenerated or altered.

3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map via an RL agent with external memory; some other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. Explicit neural mapper and navigation planner with joint training was also used. Among all these works, the target-driven visual navigation with a goal-conditional policy approach was most related to our method.

==Environment==
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m and cover a range of up to
5km chosen (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.

[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]

==Agent Interface and the Courier Task==
In RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. Also, a first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84*84 pixels. Furthermore, the action space consists of 5 movements: “slow” rotate left or right (±22:5), “fast” rotate left or right (±67.5), or move forward (implemented as a ''noop'' in the case where this is not a viable action).

There are lots of ways to specify the goal to the agent. In this paper, the current goal is chosen to be represented in terms of its proximity to a set L of fixed landmarks <math> L={(Lat_k, Long_k)}</math> which are specified using Latitude and Longitude coordinate system. For distance to the <math> k_{th}</math> landmark <math>{(d_{(t,k)}^g})_k</math> the goal vector contains <math> g_{(t,i)}=\tfrac{exp⁡(-αd_{(t,i)}^g)}{∑_k exp⁡(-αd_{(t,k)}^g)} </math>for <math>i_{th}</math> landmark with <math>α=0.002</math> (Figure 3).

[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]

This form of representation has several advantages:

1. It could easily be extended to new environments.

2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.

3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.

In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. In each episode, which consists of 1000 steps, the agent starts from a random place with random orientation. when an agent gets within 100 meters of goal, the next goal is randomly chosen. An episode ends after 1000 agent steps. Finally, the reward is proportional to the shortest path between agent and goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.

==Methods==

===Goal-dependent Actor-Critic Reinforcement Learning===
In this paper, the learning problem is based on Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} → \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. Expected return is defined as the sum of
discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. Also, the expected return from a state <math>s_t</math> depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>:

\begin{align}
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)
\end{align}

Value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:

\begin{align}
V^{\pi}(s,g)=E[R_t]=E[Σ_{k=0}^{\infty}\gamma^kr_{t+k}|s_t=s, g_t=g]
\end{align}

Also, an architecture with multiple pathways is designed to support two types of learning that is required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene the agent needs to remember unique features of the scene which then help the agent to organize and remember the scenes.

===Architectures===

[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]

The authors use neural networks to parameterize policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the Convolution network is fed to a Long Short-Term Memory (LSTM) as well as the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>.

Three different architectures are described below.

The '''GoalNav''' architecture (Fig. 4a) which consists of a convolutional architecture and policy LSTM. Goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.

The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information.

The '''MultiCityNav''' architecture (Fig. 4c) is an extension of City-Nav for learning in different cities. This is done using the parallel connection of goal LSTMs for encapsulating locale-specific features, for each city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities. So, new goal LSTMs are required to be trained in new cities.

===Curriculum Learning===
In curriculum learning, the model is trained using simple examples in first steps. As soon as the model learns those examples, more complex and difficult examples would be fed to the model. In this paper, this approach is used to teach agent to navigate to further destinations. This courier task suffers from a common problem of RL tasks which is sparse rewards. To overcome this problem, a natural curriculum scheme is defined, in which sampling each new goal would be within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan)

==Results==
In this section, the performance of the proposed architectures on the courier task is shown.

[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum– we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way
as in London) are comparable and the agent achieved mean goal reward 426.]]

It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:

1. Goal Navigation agent.

2. City Navigation Agent.

3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.

Also, a lower bound (Heuristic) and an upper bound(Oracle) on the performance is considered. As it is said in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search.". As it is clear in Figure 5, CityNav architecture with the previously mentioned architecture attains a higher performance and is more stable than the simpler GoalNav agent.

The trajectories of the trained agent over two 1000 step episodes and the value function of the agent during navigation to a destination is shown in Figure 6.

[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]

Figure 7 shows that navigation policy is learned by agent successfully in St Paul’s Cathedral in London and Washington Square in New York.
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach
a goal (Washington Square in New York or St Paul’s Cathedral in
London) from 100 start locations vs. the straight-line distance to
the goal in meters. One agent step corresponds to a forward movement
of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]

A critical test for this article is to transfer model to new cities by learning a new set of landmarks, but without re-learning visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities besides freezing both the policy LSTM and the convolutional encoder. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: Training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities would increase its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Training the model in a single city using skip connection is useful. However, it is not useful in multi-city transferring.
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 10). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]

Giving early rewards before agent reaches the goal or adding random rewards (coins) to encourage exploration is investigated in this article. Figure 11a suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. Also, as it is clear from Figure 11b, reducing the density of the landmarks does not seem to reduce the performance. Based on the results, authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to asses the importance of the goal-conditioned agents, a Goal-less CityNav agent is trained by removing inputs gt. The poor performance of this agent is clear in Figure 11b. Furthermore, reducing the density of the landmarks by the ratio of 50%, 25%, and 12:5% does not reduce the performance that much. Finally, some alternative for goal representation is investigated:

a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.

b) Binned representation.

The latitude and longitude scalar goal representations perform the best. However, since the all landmarks representation performs well while remaining independent of the coordinate system, we use this representation as the canonical one.

[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]

==Conclusion==
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities is discussed.

==Critique==
1. It is not clear that how this model is applicable in the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modeling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.

2. This paper is only using static google street view images as its primary source of data. But the authors must at least complement this with other dynamic data like traffic and road blockage information for a realistic model of navigation in the world.

3. The 'Transfer in Multi-City Experiments' results could strengthened significantly from cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.

Learning to Navigate in Cities Without a Map

2018-11-22T00:25:30Z

Vrajendr: /* Contribution */

Paper:
Learning to Navigate in Cities Without a Map[https://arxiv.org/pdf/1804.00168.pdf]
A video of the paper is available here[https://sites.google.com/view/streetlearn].

== Introduction ==
Navigation is an attractive topic in many research disciplines and technology related domains such as neuroscience and robotics. The majority of algorithms are based on the following steps.

1. Building an explicit map

2. Planning and acting using that map.

In this article, based on this fact that human can learn to navigate through cities without using any special tool such as maps or GPS, authors propose new methods to show that a neural network agent can do the same thing by using visual observations. To do so, an interactive environment using Google StreetView Images and a dual pathway agent architecture is designed. As shown in figure 1, some parts of the environment are built using Google StreetView images of New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation. Although learning to navigate using visual aids is shown to be successful in some domains such as games and simulated environments using deep reinforcement learning (RL), it suffers from data inefficiency and sensitivity to changes in the environment. Thus, it is unclear whether this method could be used for large-scale navigation. That’s why it became the subject of investigation in this paper.
[[File:figure1-soroush.png|600px|thumb|center|Figure 1. Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps (which have not being used by the agent) in New York City (Times Square, Central Park) and London (St. Paul’s Cathedral). The green cone represents the agent’s location and orientation.]]

==Contribution==
This paper has made the following contributions:

1. Designing a dual pathway agent architecture. This agent can navigate through a real city and is trained with end-to-end reinforcement learning to handle real-world navigations.

2. Using Goal-dependent learning. This means that the policy and value functions must adapt themselves to a sequence of goals that are provided as input.

3. Leveraging a recurrent neural architecture. Using that, not only could navigation through a city be possible, but also the model is scalable for navigation in new cities. This architecture supports both locale-specific learnings and general transferable navigations. The authors achieved these by separating a recurrent neural pathway. This pathway receives and interprets the current goal as well as encapsulates and memorizes features of a single region.

4. Using a new environment which is built on top of Google StreetView images. This provides real-world images for agent’s observation. Using this environment, the agent can navigate from an arbitrary starting point to a goal and then to another goal etc. Also, London, Paris, and New York City are chosen for navigation.

==Related Work==

1. Localization from real-world imagery. For example, (Weyand et al., 2016), a CNN was able to achieve excellent results on geolocation task. This paper provides novel work by not including supervised training with ground-truth labels, and by including planning as a goal. Some other works also improve by exploiting spatiotemporal continuity or estimating camera pose or depth estimation from pixels. These methods rely on supervised training with ground truth labels, which is not possible in every environment.

2. Deep RL methods for navigation. For instance, (Mirowski et al., 2016; Jaderberg et al., 2016) used self-supervised auxiliary tasks to produce visual navigation in several created mazes. Some other researches used text descriptions to incorporate goal instructions. Researchers developed realistic, higher-fidelity environment simulations to make the experiment more realistic, but that still came with lack of diversities. This paper makes use of real-world data, in contrast to many related papers in this area. It's diverse and visually realistic but still, it does not contain dynamic elements, and the street topology cannot be regenerated or altered.

3. Deep RL for path planning and mapping. For example, (Zhang et al., 2017) created an agent that represented a global map; Some other work uses a hierarchical control strategy to propose a structured memory and Memory Augmented Control Maps. Explicit neural mapper and navigation planner with joint training was also used. Among all these works, the target-driven visual navigation with a goal-conditional policy approach was most related to our method.

==Environment==
Google StreetView consists of both high-resolution 360-degree imagery and graph connectivity. Also, it provides a public API. These features make it a valuable resource. In this work, large areas of New York, Paris, and London that contain between 7,000 and 65,500 nodes
(and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m and cover a range of up to
5km chosen (Figure 2), without simplifying the underlying connections. This means that there are many areas 'congested' with nodes, occlusions, available footpaths, etc. The agent only sees RGB images that are visible in StreetView images (Figure 1) and is not aware of the underlying graph.

[[File:figure2-soroush.png|700px|thumb|center|Figure 2. Map of the 5 environments in New York City; our experiments focus on the NYU area as well as on transfer learning from the other areas to Wall Street (see Section 5.3). In the zoomed in area, each green dot corresponds to a unique panorama, the goal is marked in blue, and landmark locations are marked with red pins.]]

==Agent Interface and the Courier Task==
In RL environment, we need to define observations and actions in addition to tasks. The inputs to the agent are the image <math>x_t</math> and the goal <math>g_t</math>. Also, a first-person view of the 3D environment is simulated by cropping <math>x_t</math> to a 60-degree square RGB image that is scaled to 84*84 pixels. Furthermore, the action space consists of 5 movements: “slow” rotate left or right (±22:5), “fast” rotate left or right (±67.5), or move forward (implemented as a ''noop'' in the case where this is not a viable action).

There are lots of ways to specify the goal to the agent. In this paper, the current goal is chosen to be represented in terms of its proximity to a set L of fixed landmarks <math> L={(Lat_k, Long_k)}</math> which are specified using Latitude and Longitude coordinate system. For distance to the <math> k_{th}</math> landmark <math>{(d_{(t,k)}^g})_k</math> the goal vector contains <math> g_{(t,i)}=\tfrac{exp⁡(-αd_{(t,i)}^g)}{∑_k exp⁡(-αd_{(t,k)}^g)} </math>for <math>i_{th}</math> landmark with <math>α=0.002</math> (Figure 3).

[[File:figure3-soroush.PNG|400px|thumb|center|Figure 3. We illustrate the goal description by showing a goal and a set of 5 landmarks that are nearby, plus 4 that are more distant. The code <math>g_i</math> is a vector with a softmax-normalised distance to each landmark.]]

This form of representation has several advantages:

1. It could easily be extended to new environments.

2. It is intuitive. Even humans and animals use landmarks to be able to move from one place to another.

3. It does not rely on arbitrary map coordinates, and provides an absolute (as opposed to relative) goal.

In this work, 644 landmarks for New York, Paris, and London are manually defined. The courier task is the problem of navigating to a list of random locations within a city. In each episode, which consists of 1000 steps, the agent starts from a random place with random orientation. when an agent gets within 100 meters of goal, the next goal is randomly chosen. An episode ends after 1000 agent steps. Finally, the reward is proportional to the shortest path between agent and goal when the goal is first assigned (providing more reward for longer journeys). Thus the agent needs to learn the mapping between the images observed at the goal location and the goal vector in order to solve the courier task problem. Furthermore, the agent must learn the association between the images observed at its current location and the policy to reach the goal destination.

==Methods==

===Goal-dependent Actor-Critic Reinforcement Learning===
In this paper, the learning problem is based on Markov Decision Process, with state space <math>\mathcal{S}</math>, action space <math>\mathcal{A}</math>, environment <math>\mathcal{E}</math>, and a set of possible goals <math>\mathcal{G}</math>. The reward function depends on the current goal and state: <math>\mathcal{R}: \mathcal{S} \times \mathcal{G} \times \mathcal{A} → \mathbb{R}</math>. Typically, in reinforcement learning the main goal is to find the policy which maximizes the expected return. Expected return is defined as the sum of
discounted rewards starting from state <math>s_0</math> with discount <math>\gamma</math>. Also, the expected return from a state <math>s_t</math> depends on the goals that are sampled. The policy is defined as a distribution over the actions, given the current state <math>s_t</math> and the goal <math>g_t</math>:

\begin{align}
\pi(\alpha|s,g)=Pr(\alpha_t=\alpha|s_t=s, g_t=g)
\end{align}

Value function is defined as the expected return obtained by sampling actions from policy <math>\pi</math> from state <math>s_t</math> with goal <math>g_t</math>:

\begin{align}
V^{\pi}(s,g)=E[R_t]=E[Σ_{k=0}^{\infty}\gamma^kr_{t+k}|s_t=s, g_t=g]
\end{align}

Also, an architecture with multiple pathways is designed to support two types of learning that is required for this problem. First, an agent needs an internal representation which is general and gives an understanding of a scene. Second, to better understand a scene the agent needs to remember unique features of the scene which then help the agent to organize and remember the scenes.

===Architectures===

[[File:figure4-soroush.png|400px|thumb|center|Figure 4. Comparison of architectures. Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading (θ). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.]]

The authors use neural networks to parameterize policy and value functions. These neural networks share weights in all layers except the final linear layer. The agent takes image pixels as input. These pixels are passed through a convolutional network. The output of the Convolution network is fed to a Long Short-Term Memory (LSTM) as well as the past reward <math>r_{t-1}</math> and previous action <math>\alpha_{t-1}</math>.

Three different architectures are described below.

The '''GoalNav''' architecture (Fig. 4a) which consists of a convolutional architecture and policy LSTM. Goal description <math>g_t</math>, previous action, and reward are the inputs of this LSTM.

The '''CityNav''' architecture (Fig. 4b) consists of the previous architecture alongside an additional LSTM, called the goal LSTM. Inputs of this LSTM are visual features and the goal description. The CityNav agent also adds an auxiliary heading (θ) prediction task which is defined as an angle between the north direction and the agent’s pose. This auxiliary task can speed up learning and provides relevant information.

The '''MultiCityNav''' architecture (Fig. 4c) is an extension of City-Nav for learning in different cities. This is done using the parallel connection of goal LSTMs for encapsulating locale-specific features, for each city. Moreover, the convolutional architecture and the policy LSTM become general after training on a number of cities. So, new goal LSTMs are required to be trained in new cities.

===Curriculum Learning===
In curriculum learning, the model is trained using simple examples in first steps. As soon as the model learns those examples, more complex and difficult examples would be fed to the model. In this paper, this approach is used to teach agent to navigate to further destinations. This courier task suffers from a common problem of RL tasks which is sparse rewards. To overcome this problem, a natural curriculum scheme is defined, in which sampling each new goal would be within 500m of the agent’s position. This is called phase 1. In phase 2, the maximum range is gradually increased to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Downtown Manhattan)

==Results==
In this section, the performance of the proposed architectures on the courier task is shown.

[[File:figure5-2.png|600px|thumb|center|Figure 5. Average per-episode goal rewards (y-axis) are plotted vs. learning steps (x-axis) for the courier task in the NYU (New York City) environment (top), and in central London (bottom). We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment, and the CityNav agent in London. We also compare the Oracle performance and a Heuristic agent, described below. The London agents were trained with a 2-phase curriculum– we indicate the end of phase 1 (500m only) and the end of phase 2 (500m to 5000m). Results on the Rive Gauche part of Paris (trained in the same way
as in London) are comparable and the agent achieved mean goal reward 426.]]

It is first shown that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. Figure 5 compares the following agents:

1. Goal Navigation agent.

2. City Navigation Agent.

3. A City Navigation agent without the skip connection from the vision layers to the policy LSTM. This is needed to regularise the interface between the goal LSTM and the policy LSTM in multi-city transfer scenario.

Also, a lower bound (Heuristic) and an upper bound(Oracle) on the performance is considered. As it is said in the paper: "Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; if at an intersection it will turn with a probability <math>P=0.95</math>. Oracle uses the full graph to compute the optimal path using breadth-first search.". As it is clear in Figure 5, CityNav architecture with the previously mentioned architecture attains a higher performance and is more stable than the simpler GoalNav agent.

The trajectories of the trained agent over two 1000 step episodes and the value function of the agent during navigation to a destination is shown in Figure 6.

[[File:figure6-soroush.png|400px|thumb|center|Figure 6. Trained CityNav agent’s performance in two environments: Central London (left panes), and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualize the value function of the agent during 100 trajectories with random starting points and the same goal (respectively St Paul’s Cathedral and Washington Square). Thicker and warmer color lines correspond to higher value functions.]]

Figure 7 shows that navigation policy is learned by agent successfully in St Paul’s Cathedral in London and Washington Square in New York.
[[File:figure7-soroush.png|400px|thumb|center|Figure 7. Number of steps required for the CityNav agent to reach
a goal (Washington Square in New York or St Paul’s Cathedral in
London) from 100 start locations vs. the straight-line distance to
the goal in meters. One agent step corresponds to a forward movement
of about 10m or a left/right turn by 22.5 or 67.5 degrees.]]

A critical test for this article is to transfer model to new cities by learning a new set of landmarks, but without re-learning visual representation, behaviors, etc. Therefore, the MultiCityNav agent is trained on a number of cities besides freezing both the policy LSTM and the convolutional encoder. Then a new locale-specific goal LSTM is trained. The performance is compared using three different training regimes, illustrated in Fig. 9: Training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). Figure 10 shows that transferring to other cities is possible. Also, training the model on more cities would increase its effectiveness. According to the paper: "Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone". Training the model in a single city using skip connection is useful. However, it is not useful in multi-city transferring.
[[File:figure9-soroush.png|400px|thumb|center|Figure 9. Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimized).]]
[[File:figure10-soroush.png|400px|thumb|center|Figure 10. Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent evaluated only on the target city (Wall Street). We compare single-city training on the target environment alone vs. joint training on multiple cities (3, 4, or 5-way joint training including Wall Street), vs. pre-training on multiple cities and then transferring to Wall Street while freezing the entire agent except for the new pathway (see Fig. 10). One variant has skip connections between the convolutional encoder and the policy LSTM, the other does not (no-skip).]]

Giving early rewards before agent reaches the goal or adding random rewards (coins) to encourage exploration is investigated in this article. Figure 11a suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. Also, as it is clear from Figure 11b, reducing the density of the landmarks does not seem to reduce the performance. Based on the results, authors chose to start sampling the goal within a radius of 500m from the agent’s location, and then progressively extend it to the maximum distance an agent could travel within the environment. In addition, to asses the importance of the goal-conditioned agents, a Goal-less CityNav agent is trained by removing inputs gt. The poor performance of this agent is clear in Figure 11b. Furthermore, reducing the density of the landmarks by the ratio of 50%, 25%, and 12:5% does not reduce the performance that much. Finally, some alternative for goal representation is investigated:

a) Latitude and longitude scalar coordinates normalized to be between 0 and 1.

b) Binned representation.

The latitude and longitude scalar goal representations perform the best. However, since the all landmarks representation performs well while remaining independent of the coordinate system, we use this representation as the canonical one.

[[File:figure11-soroush.PNG|300px|thumb|center|Figure 11. Top: Learning curves of the CityNav agent on NYU, comparing reward shaping with different radii of early rewards (ER) vs. ER with random coins vs. curriculum learning with ER 200m and no coins (ER 200m, Curr.). Bottom: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.]]

==Conclusion==
In this paper, a deep reinforcement learning approach that enables navigation in cities is presented. Furthermore, a new courier task and a multi-city neural network agent architecture that is able to be transferred to new cities is discussed.

==Critique==
1. It is not clear that how this model is applicable in the real world. A real-world navigation problem needs to detect objects, people, and cars. However, it is not clear whether they are modeling them or not. From what I understood, they did not care about the collision, which is against their claim that it is a real-world problem.

2. This paper is only using static google street view images as its primary source of data. But the authors must at least complement this with other dynamic data like traffic and road blockage information for a realistic model of navigation in the world.

3. The 'Transfer in Multi-City Experiments' results could strengthened significantly from cross-validation (only Wall Street, which covers the smallest area of the four regions, is used as the test case). Additionally, the results do not show true 'multi-city' transfer learning, since all regions are within New York City. It is stated in the paper that not having to re-learn visual representations when transferring between cities is one of the outcomes, but the tests do not actually check for this. There are likely significant differences in the features that would be learned in NYC vs. Waterloo, for example, and this type of transfer has not been evaluated.

Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias

2018-11-20T05:35:44Z

Vrajendr: /* Related work */

==Introduction==

The use of data-driven approaches in robotics has increased in the last decade. Instead of using hand-designed models, these data-driven approaches work on large-scale datasets and learn appropriate policies that map from high-dimensional observations to actions. Since collecting data using an actual robot in real-time is very expensive, most of the data-driven approaches in robotics use simulators in order to collect simulated data. The concern here is whether these approaches have the capability to be robust enough to domain shift and to be used for real-world data. It is an undeniable fact that there is a wide reality gap between simulators and the real world.

This has motivated the robotics community to increase their efforts in collecting real-world physical interaction data for a variety of tasks. This effort has been accelerated by the declining costs of hardware. This approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau fast. Furthermore, robotic action data did not lead to similar gains in other areas such as computer vision and natural language processing. As the paper claimed, the solution for all of these obstacles is using “real data”. Current robotic datasets lack diversity of environment. Learning-based approaches need to move out of simulators in the labs and go to real environments such as real homes so that they can learn from real datasets.

Like every other process, the process of collecting real world data is made difficult by a number of problems. First, there is a need for cheap and compact robots to collect data in homes but current industrial robots (i.e. Sawyer and Baxter) are too expensive. Secondly, cheap robots are not accurate enough to collect reliable data. Also, there is a lack of constant supervision for data collection in homes. Finally, there is also a circular dependency problem in home-robotics: there is a lack of real-world data which are needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges in addition to some other external factors will likely result in noisy data collection. In this paper, a first systematic effort has been presented for collecting a dataset inside homes. In accomplishing this goal, the authors:

1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes

2. Collect training data in 6 different homes and testing data in 3 homes

3. Propose a method for modelling the noise in the labelled data

4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation

[[File:aa1.PNG|600px|thumb|center|]]

==Overview==

This paper emphasizes the importance of diversifying the data for robotic learning in order to have a greater generalization, by focusing on the task of grasping. A diverse dataset also allows for removing biases in the data. By considering these facts, the paper argues that even for simple tasks like grasping, datasets which are collected in labs suffer from strong biases such as simple backgrounds and same environment dynamics. Hence, the learning approaches cannot generalize the models and work well on real datasets.

As a future possibility, there would be a need for having a low-cost robot to collect large-scale data inside a huge number of homes. For this reason, they introduced a customized mobile manipulator. They used a Dobot Magician which is a robotic arm mounted on a Kobuki which is a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. They also add an Intel R200 RGBD camera to their robot which is at a height of 1m above the ground. An Intel Core i5 processor is also used as an on-board laptop to perform all the processing. The whole system can run for 1.5 hours with a single charge.

As there is always a trade-off, when we gain a low-cost robot, we are actually losing accuracy for controlling it. So, the low-cost robot which is built from cheaper components than the expensive setups such as Baxter and Sawyer suffers from higher calibration errors and execution errors. This means that the dataset collected with this approach is diverse and huge but it has noisy labels. To illustrate, consider when the robot wants to grasp at location <math> {(x, y)}</math>. Since there is a noise in the execution, the robot may perform this action in the location <math> {(x + \delta_{x}, y+ \delta_{y})}</math> which would assign the success or failure label of this action to a wrong place. Therefore, to solve the problem, they used an approach to learn from noisy data. They modeled noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.

==Learning on low-cost robot data==

This paper uses patch grasping framework in its proposed architecture. Also, as mentioned before, there is a high tendency for noisy labels in the datasets which are collected by inaccurate and cheap robots. The cause of the noise in the labels could be due to the hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear, and tear, etc. Here are more explanations about different parts of the architecture in order to disentangle the noise of the low-cost robot’s actual and commanded executions.

===Grasping Formulation===

Planar grasping is the object of interest in this architecture. It means that all the objects are grasped at the same height and vertical to the ground (ie: a fixed end-effector pitch). The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degrees of freedom (roll of the end-effector). For the purpose of comparison, they used a model which does not predict the <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math>. Thus, the angle of grasp <math> {\theta}</math> is predicted from these patches. Also, in order to have multi-modal predictions, discrete steps of the angle <math> {\theta}</math>, <math> {\theta_{D}}</math> is used.

Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label g. Then, the image <math> {I}</math> and the angle <math> {\theta}</math> are converted to image patch <math> {I_{P}}</math> and angle <math> {\theta_{D}}</math>. Then, to minimize the classification error, a binary cross entropy loss is used which minimizes the error between the predicted and ground truth label <math> g </math>. A convolutional neural network with weight initialization from pre-training on Imagenet is used for this formulation.

(Note: On Cross Entropy:

If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool y . This is optimal, in that we can't encode the symbols using fewer bits on average.
In contrast, cross entropy is the number of bits we'll need if we encode symbols from y using the wrong tool <math> {\hat h}</math> . This consists of encoding the <math> {i_{th}}</math> symbol using <math> {\log(\frac{1}{{\hat h_i}})}</math> bits instead of <math> {\log(\frac{1}{{ h_i}})}</math> bits. We of course still take the expected value to the true distribution y , since it's the distribution that truly generates the symbols:

\begin{align}
H(y,\hat y) = \sum_i{y_i\log{\frac{1}{\hat y_i}}}
\end{align}

Cross entropy is always larger than entropy; encoding symbols according to the wrong distribution <math> {\hat y}</math> will always make us use more bits. The only exception is the trivial case where y and <math> {\hat y}</math> are equal, and in this case entropy and cross entropy are equal.)

===Modeling noise as latent variable===

In order to tackle the problem of inaccurate position control and calibration due to cheap robot, they found a structure in the noise which is dependent on the robot and the design. They modeled this structure of noise as a latent variable and decoupled during training. The approach is shown in figure 2:

[[File:aa2.PNG|600px|thumb|center|]]

The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}; \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.

The conditional probability of grasping at a noisy image patch <math>I_P</math> for this model is computed by:

\[ { P(g|I_{P},\theta_{D}, \mathcal{R} ) = ∑_{( \widehat{I_P} \in \mathcal{P})} P(g│z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} | \theta_{D},I_P,\mathcal{R})} \]

Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>.<math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R})</math> shows the noise which can be caused by <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g│z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> shows the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN and NMN.

===Learning the latent noise model===

They assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math>. To estimate the latent variable <math> {z}</math> given the global information <math>\mathcal{R}</math>, i.e <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>. They used direct optimization to learn both NMN and GPN with noisy labels. The entire image of the scene and the environment information are the inputs of the NMN, as well as robot ID and raw-pixel grasp location.. The output of the NMN is the probability distribution of the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied to the marginalized output of these two networks and the true grasp label g.

===Training details===

They implemented their model in PyTorch using a pretrained ResNet-18 model. They concatenated 512 dimensional ResNet feature with a 1-hot vector of robot ID and the raw pixel location of the grasp for their NMN. Also, the inputs of the GPN are the original noisy patch plus 8 other equidistant patches from the original one.
Their training process starts with training only GPN over 5 epochs of the data. Then, the NMN and the marginalization operator are added to the model. So, they train NMN and GPN simultaneously for the other 25 epochs.

==Results==

In the results part of the paper, they show that collecting dataset in homes is essential for generalizing learning from unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. They used an object detector (tiny-YOLO) as the input data were unstructured due to LCA limited memory and computational capabilities. With an object location detected, class information was discarded, and a grasp was attempted. The grasp location in 3D was computed using PointCloud data. They scattered different objects in homes within 2m area to prevent collision of the robot with obstacles and let the robot move randomly and grasp objects. Finally, they collected a dataset with 28K grasp results.

[[File:aa3.PNG|600px|thumb|center|]]

To evaluate their approach in a more quantitative way, they used three test settings:

- The first one is a binary classification or held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of grasping, given a location and the angle. Using binary classification allows for testing a lot of models without running them on real robots. They collected two held-out datasets using LCA in lab and homes and the dataset for Baxter robot.

- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this tests could measure the generalization of the model.

- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running the model on the Sawyer robot which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.

They used baselines for both their data which is collected in homes and their model which is Robust-Grasp. They used two datasets for the baseline. The dataset collected by (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).
They compared their Robust-Grasp model with the noise independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) for a strong real-world grasping baseline.

===Experiment 1: Performance on held-out data===

Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA has a good performance on both lab data and home environment.

[[File:aa4.PNG|600px|thumb|center|]]

===Experiment 2: Performance on Real LCA Robot===

In table 2, the performance of the Home-LCA is compared against a pre-trained DexNet and the model trained on the Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be described by the possible noise in the depth images that are caused by the natural light. DexNet, which requires high quality depth sensing, cannot perform well in these scenarios. By using cheap commodity RGBD cameras in LCA, the noise in the depth images is not a matter of concern, as the model has no expectation of high quality sensing.

[[File:aa5.PNG|600px|thumb|center|]]

===Performance on Real Sawyer===

To compare the performance of the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they used Lab-Baxter for bench-marking, which is an accurate and better calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to several recent papers, however, this model was trained and tested in different environment. The Robust-Grasp model also outperforms the Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 shows that the corrections depend on both the pixel locations of the noisy grasp and the robot.

[[File:aa6.PNG|600px|thumb|center|]]

[[File:aa7.PNG|600px|thumb|center|]]

==Related work==

Over the last few years, the interest of scaling up robot learning with large scale datasets has been increased. Hence, many papers were published in this area. A hand annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large scale datasets for grasping. The work mentioned above used high-cost hardware and data labeling mechanisms. There were also many papers that worked on other robotic tasks like material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes, they all used lab data.

Furthermore, since grasping is one of the basic problems in robotics, there were some efforts to improve grasping. Classical approaches focused on physics-based issues of grasping and required 3D models of the objects. However, recent works focused on data-driven approaches which learn from visual observations to grasp objects. Simulation and real-world robots are both required for large-scale data collection. A versatile grasping model was proposed to achieve a 90% performance for a bin-picking task. The point here is that they usually require high quality depth as input which seems to be a barrier for practical use of robots in real environments. High quality depth sensing means a high cost to implement in hardware and thus is a barrier for practical use.

Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, there is few research that used low cost robots. One of the examples is learning using a cheap inaccurate robot for stack multiple blocks. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether learning approaches are used in it alongside mapping and planning.

Learning from noisy inputs is another challenge specifically in computer vision. A controversial question which is often raised in this area is whether learning from noise can improve the performance. Some works show it could have bad effects on the performance; however, some other works find it valuable when the noise is independent or statistically dependent on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.

==Conclusion==

All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach by using a mobile manipulator which is a lot cheaper than the existing industrial robots. They collected a dataset of 28K grasps in six different homes. In order to solve the problem of noisy labels which were caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with home dataset showed 43.7% improvement over the models trained with lab data. Their results also showed that their model can improve the grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.

==Critiques==

This paper does not contain a significant algorithmic contribution. They are just combining a large number of data engineering techniques for the robot learning problem. The authors claim that they have obtained 43.7% more accuracy than baseline models, but it does not seem to be a fair comparison as the data collection happened in simulated settings in the lab for other methods, whereas the authors use the home dataset. The authors must have also discussed safety issues when training robots in real environments as against simulated environments like labs. The authors are encouraging other researchers to look outside the labs, but are not discussing the critical safety issues in this approach.

Another strange finding is that the paper mentions that they "follow a model architecture similar to [Pinto and Gupta [4]]," however, the proposed model is in fact a fine tuned resnet-18 architecture. Pinto and Gupta, implement a version similar to AlexNet as shown below in Figure 5.

[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]

The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that in order to handle the noise in the dataset, they can model the noise as a latent variable and their model can improve the performance of grasping. Although learning from noisy data and achieving a good performance is valuable, it is better that they test their noise modeling network for other robots as well. Since their noise modelling network takes robot information as an input, it would be a good idea to generalize it by testing it using different inaccurate robots to ensure that it would perform well.

They did not mention other aspects of their comparison, for example they could mention their training time compared to other models or the size of other datasets.

==References==

#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537,2017.
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor critic for image-based robot learning." Robotics Science and Systems, 2018.
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509. 06825.
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.
# Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.
#Pulkit Agarwal, Ashwin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/ abs/1606.07419
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.

Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias

2018-11-20T05:31:50Z

Vrajendr: /* Experiment 2: Performance on Real LCA Robot */

==Introduction==

The use of data-driven approaches in robotics has increased in the last decade. Instead of using hand-designed models, these data-driven approaches work on large-scale datasets and learn appropriate policies that map from high-dimensional observations to actions. Since collecting data using an actual robot in real-time is very expensive, most of the data-driven approaches in robotics use simulators in order to collect simulated data. The concern here is whether these approaches have the capability to be robust enough to domain shift and to be used for real-world data. It is an undeniable fact that there is a wide reality gap between simulators and the real world.

This has motivated the robotics community to increase their efforts in collecting real-world physical interaction data for a variety of tasks. This effort has been accelerated by the declining costs of hardware. This approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau fast. Furthermore, robotic action data did not lead to similar gains in other areas such as computer vision and natural language processing. As the paper claimed, the solution for all of these obstacles is using “real data”. Current robotic datasets lack diversity of environment. Learning-based approaches need to move out of simulators in the labs and go to real environments such as real homes so that they can learn from real datasets.

Like every other process, the process of collecting real world data is made difficult by a number of problems. First, there is a need for cheap and compact robots to collect data in homes but current industrial robots (i.e. Sawyer and Baxter) are too expensive. Secondly, cheap robots are not accurate enough to collect reliable data. Also, there is a lack of constant supervision for data collection in homes. Finally, there is also a circular dependency problem in home-robotics: there is a lack of real-world data which are needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges in addition to some other external factors will likely result in noisy data collection. In this paper, a first systematic effort has been presented for collecting a dataset inside homes. In accomplishing this goal, the authors:

1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes

2. Collect training data in 6 different homes and testing data in 3 homes

3. Propose a method for modelling the noise in the labelled data

4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation

[[File:aa1.PNG|600px|thumb|center|]]

==Overview==

This paper emphasizes the importance of diversifying the data for robotic learning in order to have a greater generalization, by focusing on the task of grasping. A diverse dataset also allows for removing biases in the data. By considering these facts, the paper argues that even for simple tasks like grasping, datasets which are collected in labs suffer from strong biases such as simple backgrounds and same environment dynamics. Hence, the learning approaches cannot generalize the models and work well on real datasets.

As a future possibility, there would be a need for having a low-cost robot to collect large-scale data inside a huge number of homes. For this reason, they introduced a customized mobile manipulator. They used a Dobot Magician which is a robotic arm mounted on a Kobuki which is a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. They also add an Intel R200 RGBD camera to their robot which is at a height of 1m above the ground. An Intel Core i5 processor is also used as an on-board laptop to perform all the processing. The whole system can run for 1.5 hours with a single charge.

As there is always a trade-off, when we gain a low-cost robot, we are actually losing accuracy for controlling it. So, the low-cost robot which is built from cheaper components than the expensive setups such as Baxter and Sawyer suffers from higher calibration errors and execution errors. This means that the dataset collected with this approach is diverse and huge but it has noisy labels. To illustrate, consider when the robot wants to grasp at location <math> {(x, y)}</math>. Since there is a noise in the execution, the robot may perform this action in the location <math> {(x + \delta_{x}, y+ \delta_{y})}</math> which would assign the success or failure label of this action to a wrong place. Therefore, to solve the problem, they used an approach to learn from noisy data. They modeled noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.

==Learning on low-cost robot data==

This paper uses patch grasping framework in its proposed architecture. Also, as mentioned before, there is a high tendency for noisy labels in the datasets which are collected by inaccurate and cheap robots. The cause of the noise in the labels could be due to the hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear, and tear, etc. Here are more explanations about different parts of the architecture in order to disentangle the noise of the low-cost robot’s actual and commanded executions.

===Grasping Formulation===

Planar grasping is the object of interest in this architecture. It means that all the objects are grasped at the same height and vertical to the ground (ie: a fixed end-effector pitch). The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degrees of freedom (roll of the end-effector). For the purpose of comparison, they used a model which does not predict the <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math>. Thus, the angle of grasp <math> {\theta}</math> is predicted from these patches. Also, in order to have multi-modal predictions, discrete steps of the angle <math> {\theta}</math>, <math> {\theta_{D}}</math> is used.

Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label g. Then, the image <math> {I}</math> and the angle <math> {\theta}</math> are converted to image patch <math> {I_{P}}</math> and angle <math> {\theta_{D}}</math>. Then, to minimize the classification error, a binary cross entropy loss is used which minimizes the error between the predicted and ground truth label <math> g </math>. A convolutional neural network with weight initialization from pre-training on Imagenet is used for this formulation.

(Note: On Cross Entropy:

If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool y . This is optimal, in that we can't encode the symbols using fewer bits on average.
In contrast, cross entropy is the number of bits we'll need if we encode symbols from y using the wrong tool <math> {\hat h}</math> . This consists of encoding the <math> {i_{th}}</math> symbol using <math> {\log(\frac{1}{{\hat h_i}})}</math> bits instead of <math> {\log(\frac{1}{{ h_i}})}</math> bits. We of course still take the expected value to the true distribution y , since it's the distribution that truly generates the symbols:

\begin{align}
H(y,\hat y) = \sum_i{y_i\log{\frac{1}{\hat y_i}}}
\end{align}

Cross entropy is always larger than entropy; encoding symbols according to the wrong distribution <math> {\hat y}</math> will always make us use more bits. The only exception is the trivial case where y and <math> {\hat y}</math> are equal, and in this case entropy and cross entropy are equal.)

===Modeling noise as latent variable===

In order to tackle the problem of inaccurate position control and calibration due to cheap robot, they found a structure in the noise which is dependent on the robot and the design. They modeled this structure of noise as a latent variable and decoupled during training. The approach is shown in figure 2:

[[File:aa2.PNG|600px|thumb|center|]]

The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}; \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.

The conditional probability of grasping at a noisy image patch <math>I_P</math> for this model is computed by:

\[ { P(g|I_{P},\theta_{D}, \mathcal{R} ) = ∑_{( \widehat{I_P} \in \mathcal{P})} P(g│z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} | \theta_{D},I_P,\mathcal{R})} \]

Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>.<math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R})</math> shows the noise which can be caused by <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g│z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> shows the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN and NMN.

===Learning the latent noise model===

They assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math>. To estimate the latent variable <math> {z}</math> given the global information <math>\mathcal{R}</math>, i.e <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>. They used direct optimization to learn both NMN and GPN with noisy labels. The entire image of the scene and the environment information are the inputs of the NMN, as well as robot ID and raw-pixel grasp location.. The output of the NMN is the probability distribution of the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied to the marginalized output of these two networks and the true grasp label g.

===Training details===

They implemented their model in PyTorch using a pretrained ResNet-18 model. They concatenated 512 dimensional ResNet feature with a 1-hot vector of robot ID and the raw pixel location of the grasp for their NMN. Also, the inputs of the GPN are the original noisy patch plus 8 other equidistant patches from the original one.
Their training process starts with training only GPN over 5 epochs of the data. Then, the NMN and the marginalization operator are added to the model. So, they train NMN and GPN simultaneously for the other 25 epochs.

==Results==

In the results part of the paper, they show that collecting dataset in homes is essential for generalizing learning from unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. They used an object detector (tiny-YOLO) as the input data were unstructured due to LCA limited memory and computational capabilities. With an object location detected, class information was discarded, and a grasp was attempted. The grasp location in 3D was computed using PointCloud data. They scattered different objects in homes within 2m area to prevent collision of the robot with obstacles and let the robot move randomly and grasp objects. Finally, they collected a dataset with 28K grasp results.

[[File:aa3.PNG|600px|thumb|center|]]

To evaluate their approach in a more quantitative way, they used three test settings:

- The first one is a binary classification or held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of grasping, given a location and the angle. Using binary classification allows for testing a lot of models without running them on real robots. They collected two held-out datasets using LCA in lab and homes and the dataset for Baxter robot.

- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this tests could measure the generalization of the model.

- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running the model on the Sawyer robot which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.

They used baselines for both their data which is collected in homes and their model which is Robust-Grasp. They used two datasets for the baseline. The dataset collected by (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).
They compared their Robust-Grasp model with the noise independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) for a strong real-world grasping baseline.

===Experiment 1: Performance on held-out data===

Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA has a good performance on both lab data and home environment.

[[File:aa4.PNG|600px|thumb|center|]]

===Experiment 2: Performance on Real LCA Robot===

In table 2, the performance of the Home-LCA is compared against a pre-trained DexNet and the model trained on the Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be described by the possible noise in the depth images that are caused by the natural light. DexNet, which requires high quality depth sensing, cannot perform well in these scenarios. By using cheap commodity RGBD cameras in LCA, the noise in the depth images is not a matter of concern, as the model has no expectation of high quality sensing.

[[File:aa5.PNG|600px|thumb|center|]]

===Performance on Real Sawyer===

To compare the performance of the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they used Lab-Baxter for bench-marking, which is an accurate and better calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to several recent papers, however, this model was trained and tested in different environment. The Robust-Grasp model also outperforms the Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 shows that the corrections depend on both the pixel locations of the noisy grasp and the robot.

[[File:aa6.PNG|600px|thumb|center|]]

[[File:aa7.PNG|600px|thumb|center|]]

==Related work==

Over the last few years, the interest of scaling up robot learning with large scale datasets has been increased. Hence, many papers were published in this area. A hand annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large scale datasets for grasping. The work mentioned above used high-cost hardware and data labeling mechanisms. There were also many papers that worked on other robotic tasks like material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes, they all used lab data.

Furthermore, since grasping is one of the basic problems of robotic, there were some efforts to improve grasping. Classic approaches focused on physics-based issues of grasping and required 3D models of the objects. However, recent works focused on data-driven approaches which learn from visual observations to grasp objects. Simulation and real-world robots are both required for large-scale data collection. A versatile grasping model was proposed to achieve a 90% performance for a bin-picking task. The point here is that they usually require high quality depth as input which seems to be a barrier for practical use of robots in real environments.

Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, there is few research that used low cost robots. One of the examples is learning using a cheap inaccurate robot for stack multiple blocks. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether learning approaches are used in it alongside mapping and planning.

Learning from noisy inputs is another challenge specifically in computer vision. A controversial question which is often raised in this area is whether learning from noise can improve the performance. Some works show it could have bad effects on the performance; however, some other works find it valuable when the noise is independent or statistically dependent on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.

==Conclusion==

All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach by using a mobile manipulator which is a lot cheaper than the existing industrial robots. They collected a dataset of 28K grasps in six different homes. In order to solve the problem of noisy labels which were caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with home dataset showed 43.7% improvement over the models trained with lab data. Their results also showed that their model can improve the grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.

==Critiques==

This paper does not contain a significant algorithmic contribution. They are just combining a large number of data engineering techniques for the robot learning problem. The authors claim that they have obtained 43.7% more accuracy than baseline models, but it does not seem to be a fair comparison as the data collection happened in simulated settings in the lab for other methods, whereas the authors use the home dataset. The authors must have also discussed safety issues when training robots in real environments as against simulated environments like labs. The authors are encouraging other researchers to look outside the labs, but are not discussing the critical safety issues in this approach.

Another strange finding is that the paper mentions that they "follow a model architecture similar to [Pinto and Gupta [4]]," however, the proposed model is in fact a fine tuned resnet-18 architecture. Pinto and Gupta, implement a version similar to AlexNet as shown below in Figure 5.

[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]

The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that in order to handle the noise in the dataset, they can model the noise as a latent variable and their model can improve the performance of grasping. Although learning from noisy data and achieving a good performance is valuable, it is better that they test their noise modeling network for other robots as well. Since their noise modelling network takes robot information as an input, it would be a good idea to generalize it by testing it using different inaccurate robots to ensure that it would perform well.

They did not mention other aspects of their comparison, for example they could mention their training time compared to other models or the size of other datasets.

==References==

#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537,2017.
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor critic for image-based robot learning." Robotics Science and Systems, 2018.
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509. 06825.
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.
# Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.
#Pulkit Agarwal, Ashwin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/ abs/1606.07419
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.

Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias

2018-11-20T05:30:01Z

Vrajendr: /* Experiment 1: Performance on held-out data */

==Introduction==

The use of data-driven approaches in robotics has increased in the last decade. Instead of using hand-designed models, these data-driven approaches work on large-scale datasets and learn appropriate policies that map from high-dimensional observations to actions. Since collecting data using an actual robot in real-time is very expensive, most of the data-driven approaches in robotics use simulators in order to collect simulated data. The concern here is whether these approaches have the capability to be robust enough to domain shift and to be used for real-world data. It is an undeniable fact that there is a wide reality gap between simulators and the real world.

This has motivated the robotics community to increase their efforts in collecting real-world physical interaction data for a variety of tasks. This effort has been accelerated by the declining costs of hardware. This approach has been quite successful at tasks such as grasping, pushing, poking and imitation learning. However, the major problem is that the performance of these learning models is not good enough and tends to plateau fast. Furthermore, robotic action data did not lead to similar gains in other areas such as computer vision and natural language processing. As the paper claimed, the solution for all of these obstacles is using “real data”. Current robotic datasets lack diversity of environment. Learning-based approaches need to move out of simulators in the labs and go to real environments such as real homes so that they can learn from real datasets.

Like every other process, the process of collecting real world data is made difficult by a number of problems. First, there is a need for cheap and compact robots to collect data in homes but current industrial robots (i.e. Sawyer and Baxter) are too expensive. Secondly, cheap robots are not accurate enough to collect reliable data. Also, there is a lack of constant supervision for data collection in homes. Finally, there is also a circular dependency problem in home-robotics: there is a lack of real-world data which are needed to improve current robots, but current robots are not good enough to collect reliable data in homes. These challenges in addition to some other external factors will likely result in noisy data collection. In this paper, a first systematic effort has been presented for collecting a dataset inside homes. In accomplishing this goal, the authors:

1. Build a cheap robot costing less than USD 3K which is appropriate for use in homes

2. Collect training data in 6 different homes and testing data in 3 homes

3. Propose a method for modelling the noise in the labelled data

4. Demonstrate that the diversity in the collected data provides superior performance and requires little-to-no domain adaptation

[[File:aa1.PNG|600px|thumb|center|]]

==Overview==

This paper emphasizes the importance of diversifying the data for robotic learning in order to have a greater generalization, by focusing on the task of grasping. A diverse dataset also allows for removing biases in the data. By considering these facts, the paper argues that even for simple tasks like grasping, datasets which are collected in labs suffer from strong biases such as simple backgrounds and same environment dynamics. Hence, the learning approaches cannot generalize the models and work well on real datasets.

As a future possibility, there would be a need for having a low-cost robot to collect large-scale data inside a huge number of homes. For this reason, they introduced a customized mobile manipulator. They used a Dobot Magician which is a robotic arm mounted on a Kobuki which is a low-cost mobile robot base equipped with sensors such as bumper contact sensors and wheel encoders. The resulting robot arm has five degrees of freedom (DOF) (x, y, z, roll, pitch). The gripper is a two-fingered electric gripper with a 0.3kg payload. They also add an Intel R200 RGBD camera to their robot which is at a height of 1m above the ground. An Intel Core i5 processor is also used as an on-board laptop to perform all the processing. The whole system can run for 1.5 hours with a single charge.

As there is always a trade-off, when we gain a low-cost robot, we are actually losing accuracy for controlling it. So, the low-cost robot which is built from cheaper components than the expensive setups such as Baxter and Sawyer suffers from higher calibration errors and execution errors. This means that the dataset collected with this approach is diverse and huge but it has noisy labels. To illustrate, consider when the robot wants to grasp at location <math> {(x, y)}</math>. Since there is a noise in the execution, the robot may perform this action in the location <math> {(x + \delta_{x}, y+ \delta_{y})}</math> which would assign the success or failure label of this action to a wrong place. Therefore, to solve the problem, they used an approach to learn from noisy data. They modeled noise as a latent variable and used two networks, one for predicting the noise and one for predicting the action to execute.

==Learning on low-cost robot data==

This paper uses patch grasping framework in its proposed architecture. Also, as mentioned before, there is a high tendency for noisy labels in the datasets which are collected by inaccurate and cheap robots. The cause of the noise in the labels could be due to the hardware execution error, inaccurate kinematics, camera calibration, proprioception, wear, and tear, etc. Here are more explanations about different parts of the architecture in order to disentangle the noise of the low-cost robot’s actual and commanded executions.

===Grasping Formulation===

Planar grasping is the object of interest in this architecture. It means that all the objects are grasped at the same height and vertical to the ground (ie: a fixed end-effector pitch). The final goal is to find <math>{(x, y, \theta)}</math> given an observation <math> {I}</math> of the object, where <math> {x}</math> and <math> {y}</math> are the translational degrees of freedom and <math> {\theta}</math> is the rotational degrees of freedom (roll of the end-effector). For the purpose of comparison, they used a model which does not predict the <math>{(x, y, \theta)}</math> directly from the image <math> {I}</math>, but samples several smaller patches <math> {I_{P}}</math> at different locations <math>{(x, y)}</math>. Thus, the angle of grasp <math> {\theta}</math> is predicted from these patches. Also, in order to have multi-modal predictions, discrete steps of the angle <math> {\theta}</math>, <math> {\theta_{D}}</math> is used.

Hence, each datapoint consists of an image <math> {I}</math>, the executed grasp <math>{(x, y, \theta)}</math> and the grasp success/failure label g. Then, the image <math> {I}</math> and the angle <math> {\theta}</math> are converted to image patch <math> {I_{P}}</math> and angle <math> {\theta_{D}}</math>. Then, to minimize the classification error, a binary cross entropy loss is used which minimizes the error between the predicted and ground truth label <math> g </math>. A convolutional neural network with weight initialization from pre-training on Imagenet is used for this formulation.

===Modeling noise as latent variable===

In order to tackle the problem of inaccurate position control and calibration due to cheap robot, they found a structure in the noise which is dependent on the robot and the design. They modeled this structure of noise as a latent variable and decoupled during training. The approach is shown in figure 2:

[[File:aa2.PNG|600px|thumb|center|]]

The grasp success probability for image patch <math> {I_{P}}</math> at angle <math> {\theta_{D}}</math> is represented as <math> {P(g|I_{P},\theta_{D}; \mathcal{R} )}</math> where <math> \mathcal{R}</math> represents environment variables that can add noise to the system.

The conditional probability of grasping at a noisy image patch <math>I_P</math> for this model is computed by:

\[ { P(g|I_{P},\theta_{D}, \mathcal{R} ) = ∑_{( \widehat{I_P} \in \mathcal{P})} P(g│z=\widehat{I_P},\theta_{D},\mathcal{R}) \cdot P(z=\widehat{I_P} | \theta_{D},I_P,\mathcal{R})} \]

Here, <math> {z}</math> models the latent variable of the actual patch executed, and <math>\widehat{I_P}</math> belongs to a set of possible neighboring patches <math> \mathcal{P}</math>.<math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R})</math> shows the noise which can be caused by <math>\mathcal{R}</math> variables and is implemented as the Noise Modelling Network (NMN). <math> {P(g│z=\widehat{I_P},\theta_{D}, \mathcal{R} )}</math> shows the grasp prediction probability given the true patch and is implemented as the Grasp Prediction Network (GPN). The overall Robust-Grasp model is computed by marginalizing GPN and NMN.

===Learning the latent noise model===

They assume that <math> {z}</math> is conditionally independent of the local patch-specific variables <math> {(I_{P}, \theta_{D})}</math>. To estimate the latent variable <math> {z}</math> given the global information <math>\mathcal{R}</math>, i.e <math> P(z=\widehat{I_P}|\theta_D,I_P,\mathcal{R}) \equiv P(z=\widehat{I_P}|\mathcal{R})</math>. They used direct optimization to learn both NMN and GPN with noisy labels. The entire image of the scene and the environment information are the inputs of the NMN, as well as robot ID and raw-pixel grasp location.. The output of the NMN is the probability distribution of the actual patches where the grasps are executed. Finally, a binary cross entropy loss is applied to the marginalized output of these two networks and the true grasp label g.

===Training details===

They implemented their model in PyTorch using a pretrained ResNet-18 model. They concatenated 512 dimensional ResNet feature with a 1-hot vector of robot ID and the raw pixel location of the grasp for their NMN. Also, the inputs of the GPN are the original noisy patch plus 8 other equidistant patches from the original one.
Their training process starts with training only GPN over 5 epochs of the data. Then, the NMN and the marginalization operator are added to the model. So, they train NMN and GPN simultaneously for the other 25 epochs.

==Results==

In the results part of the paper, they show that collecting dataset in homes is essential for generalizing learning from unseen environments. They also show that modelling the noise in their Low-Cost Arm (LCA) can improve grasping performance.
They collected data in parallel using multiple robots in 6 different homes, as shown in Figure 3. They used an object detector (tiny-YOLO) as the input data were unstructured due to LCA limited memory and computational capabilities. With an object location detected, class information was discarded, and a grasp was attempted. The grasp location in 3D was computed using PointCloud data. They scattered different objects in homes within 2m area to prevent collision of the robot with obstacles and let the robot move randomly and grasp objects. Finally, they collected a dataset with 28K grasp results.

[[File:aa3.PNG|600px|thumb|center|]]

To evaluate their approach in a more quantitative way, they used three test settings:

- The first one is a binary classification or held-out data. The test set is collected by performing random grasps on objects. They measure the performance of binary classification by predicting the success or failure of grasping, given a location and the angle. Using binary classification allows for testing a lot of models without running them on real robots. They collected two held-out datasets using LCA in lab and homes and the dataset for Baxter robot.

- The second one is Real Low-Cost Arm (Real-LCA). Here, they evaluate their model by running it in three unseen homes. They put 20 new objects in these three homes in different orientations. Since the objects and the environments are completely new, this tests could measure the generalization of the model.

- The third one is Real Sawyer (Real-Sawyer). They evaluate the performance of their model by running the model on the Sawyer robot which is more accurate than the LCA. They tested their model in the lab environment to show that training models with the datasets collected from homes can improve the performance of models even in lab environments.

They used baselines for both their data which is collected in homes and their model which is Robust-Grasp. They used two datasets for the baseline. The dataset collected by (Lab-Baxter) and the dataset collected by their LCA in the lab (Lab-LCA).
They compared their Robust-Grasp model with the noise independent patch grasping model (Patch-Grasp) [4]. They also compared their data and model with DexNet-3.0 (DexNet) for a strong real-world grasping baseline.

===Experiment 1: Performance on held-out data===

Table 1 shows that the models trained on lab data cannot generalize to the Home-LCA environment (i.e. they overfit to their respective environments and attain a lower binary classification score). However, the model trained on Home-LCA has a good performance on both lab data and home environment.

[[File:aa4.PNG|600px|thumb|center|]]

===Experiment 2: Performance on Real LCA Robot===

In table 2, the performance of the Home-LCA is compared against a pre-trained DexNet and the model trained on the Lab-Baxter. Training on the Home-LCA dataset performs 43.7% better than training on the Lab-Baxter dataset and 33% better than DexNet. The low performance of DexNet can be described by the possible noise in the depth images that are caused by the natural light. DexNet, which requires high quality depth sensing, cannot perform well. By using cheap commodity RGBD cameras in LCA, the noise in the depth images is not a matter of concern, as the model has no expectation of high quality.

[[File:aa5.PNG|600px|thumb|center|]]

===Performance on Real Sawyer===

To compare the performance of the Robust-Grasp model against the Patch-Grasp model without collecting noise-free data, they used Lab-Baxter for bench-marking, which is an accurate and better calibrated robot. The Sawyer robot is used for testing to ensure that the testing robot is different from both training robots. As shown in Table 3, the Robust-Grasp model trained on Home-LCA outperforms the Patch-Grasp model and achieves 77.5% accuracy. This accuracy is similar to several recent papers, however, this model was trained and tested in different environment. The Robust-Grasp model also outperforms the Patch-Grasp by about 4% on binary classification. Furthermore, the visualizations of predicted noise corrections in Figure 4 shows that the corrections depend on both the pixel locations of the noisy grasp and the robot.

[[File:aa6.PNG|600px|thumb|center|]]

[[File:aa7.PNG|600px|thumb|center|]]

==Related work==

Over the last few years, the interest of scaling up robot learning with large scale datasets has been increased. Hence, many papers were published in this area. A hand annotated grasping dataset, a self-supervised grasping dataset, and grasping using reinforcement learning are some examples of using large scale datasets for grasping. The work mentioned above used high-cost hardware and data labeling mechanisms. There were also many papers that worked on other robotic tasks like material recognition, pushing objects and manipulating a rope. However, none of these papers worked on real data in real environments like homes, they all used lab data.

Furthermore, since grasping is one of the basic problems of robotic, there were some efforts to improve grasping. Classic approaches focused on physics-based issues of grasping and required 3D models of the objects. However, recent works focused on data-driven approaches which learn from visual observations to grasp objects. Simulation and real-world robots are both required for large-scale data collection. A versatile grasping model was proposed to achieve a 90% performance for a bin-picking task. The point here is that they usually require high quality depth as input which seems to be a barrier for practical use of robots in real environments.

Most labs use industrial robots or standard collaborative hardware for their experiments. Therefore, there is few research that used low cost robots. One of the examples is learning using a cheap inaccurate robot for stack multiple blocks. Although mobile robots like iRobot’s Roomba have been in the home consumer electronics market for a decade, it is not clear whether learning approaches are used in it alongside mapping and planning.

Learning from noisy inputs is another challenge specifically in computer vision. A controversial question which is often raised in this area is whether learning from noise can improve the performance. Some works show it could have bad effects on the performance; however, some other works find it valuable when the noise is independent or statistically dependent on the environment. In this paper, they used a model that can exploit the noise and learn a better grasping model.

==Conclusion==

All in all, the paper presents an approach for collecting large-scale robot data in real home environments. They implemented their approach by using a mobile manipulator which is a lot cheaper than the existing industrial robots. They collected a dataset of 28K grasps in six different homes. In order to solve the problem of noisy labels which were caused by their inaccurate robots, they presented a framework to factor out the noise in the data. They tested their model by physically grasping 20 new objects in three new homes and in the lab. The model trained with home dataset showed 43.7% improvement over the models trained with lab data. Their results also showed that their model can improve the grasping performance even in lab environments. They also demonstrated that their architecture for modeling the noise improved the performance by about 10%.

==Critiques==

This paper does not contain a significant algorithmic contribution. They are just combining a large number of data engineering techniques for the robot learning problem. The authors claim that they have obtained 43.7% more accuracy than baseline models, but it does not seem to be a fair comparison as the data collection happened in simulated settings in the lab for other methods, whereas the authors use the home dataset. The authors must have also discussed safety issues when training robots in real environments as against simulated environments like labs. The authors are encouraging other researchers to look outside the labs, but are not discussing the critical safety issues in this approach.

Another strange finding is that the paper mentions that they "follow a model architecture similar to [Pinto and Gupta [4]]," however, the proposed model is in fact a fine tuned resnet-18 architecture. Pinto and Gupta, implement a version similar to AlexNet as shown below in Figure 5.

[[File:Figure_5_PandG.JPG | 450px|thumb|center|Figure 5: AlexNet architecture implemented in Pinto and Gupta [4].]]

The paper argues that the dataset collected by the LCA is noisy, since the robot is cheap and inaccurate. It further asserts that in order to handle the noise in the dataset, they can model the noise as a latent variable and their model can improve the performance of grasping. Although learning from noisy data and achieving a good performance is valuable, it is better that they test their noise modeling network for other robots as well. Since their noise modelling network takes robot information as an input, it would be a good idea to generalize it by testing it using different inaccurate robots to ensure that it would perform well.

They did not mention other aspects of their comparison, for example they could mention their training time compared to other models or the size of other datasets.

==References==

#Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017. URL https://arxiv.org/abs/1703.06907.
#Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537,2017.
#Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. "Asymmetric actor critic for image-based robot learning." Robotics Science and Systems, 2018.
#Lerrel Pinto and Abhinav Gupta. "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours." CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509. 06825.
#Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. "CASSL: Curriculum accelerated self-supervised learning." International Conference on Robotics and Automation, 2018.
# Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-end training of deep visuomotor policies." The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
#Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye coordination for robotic grasping with deep learning and large scale data collection." CoRR, abs/1603.02199, 2016. URL http://arxiv.org/abs/1603.02199.
#Pulkit Agarwal, Ashwin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Learning to poke by poking: Experiential learning of intuitive physics." 2016. URL http://arxiv.org/ abs/1606.07419
#Chelsea Finn, Ian Goodfellow, and Sergey Levine. "Unsupervised learning for physical interaction through video prediction." In Advances in neural information processing systems, 2016.
#Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. "Combining self-supervised learning and imitation for vision-based rope manipulation." International Conference on Robotics and Automation, 2017.
#Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. "Revisiting unreasonable effectiveness of data in deep learning era." ICCV, 2017.

MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION

2018-11-15T02:14:08Z

Vrajendr: /* Experiments and Results */

This page contains a summary of the paper "[https://openreview.net/forum?id=ryRh0bb0Z Multi-View Data Generation without Supervision]" by Mickael Chen, Ludovic Denoyer, Thierry Artieres. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==

===Motivation===
High Dimensional Generative models have seen a surge of interest of late with the introduction of Variational Auto-Encoders and Generative Adversarial Networks. This paper focuses on a particular problem where one aims at generating samples corresponding to a number of objects under various views. The distribution of the data is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles of the same object). The paper proposes two models using this disentanglement of latent space - a generative model and a conditional variant of the same.

===Related Work===

The problem of handling multi-view inputs has mainly been studied from the predictive point of view where one wants, for example, to learn a model able to predict/classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variations from multiview datasets. The underlying idea is to consider that a particular data sample may be thought as the mix of a content information (e.g. related to its class label like a given person in a face dataset) and of a side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with/wo glasses...). So, all the samples of the same class contain the same content but different view. A number of approaches have been proposed to disentangle the content from the view (i.e. methods based on unlabeled samples), also referred as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). The two common limitations the earlier approaches pose - as claimed by the paper - are that (i) they usually
consider discrete views that are characterized by a domain or a set of discrete (binary/categorical) attributes (e.g. face with/wo glasses, the color of the hair, etc.) and could not easily scale to a large number of attributes or to continuous views. (ii) most models are trained using view supervision (e.g. the view attributes), which of course greatly helps in the learning of such model, yet prevents their use on many datasets where this information is not available.

===Contributions===

The contributions that authors claim are the following: (i) A new generative model able to generate data with various content and high view diversity using a supervision on the content information only. (ii) Extend the generative model to a conditional model that allows generating new views over any input sample. (iii) Report experimental results on four different images datasets to prove that the models can generate realistic samples and capture (and generate with) the diversity of views.

==Paper Overview==

===Background===

The paper uses the concept of the popular GAN (Generative Adverserial Networks) proposed by Goodfellow et al.(2014).

GENERATIVE ADVERSARIAL NETWORK:

Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other (thus the “adversarial”). GANs was introduced in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs, Facebook’s AI research director Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML.”

Let us denote <math>X</math> an input space composed of multidimensional samples x e.g. vector, matrix or tensor. Given a latent space <math>R^n</math> and a prior distribution <math>p_z(z)</math> over this latent space, any generator function <math>G : R^n → X</math> defines a distribution <math>p_G </math> on <math> X</math> which is the distribution of samples G(z) where <math>z ∼ p_z</math>. A GAN defines, in addition to G, a discriminator function D : X → [0; 1] which aims at differentiating between real inputs sampled from the training set and fake inputs sampled following <math>p_G</math>, while the generator is learned to fool the discriminator D. Usually both G and D are implemented with neural networks. The objective function is based on the following adversarial criterion:

<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x)] + Ep_z[log(1 − D(G(z)))]</math></div>

where <math>p_x</math> is the empirical data distribution on X .
It has been shown in Goodfellow et al. (2014) that if G∗ and D∗ are optimal for the above criterion, the Jensen-Shannon divergence between <math>p_{G∗}</math> and the empirical distribution of the data <math>p_x</math> in the dataset is minimized, making GAN able to estimate complex continuous data distributions.

CONDITIONAL GENERATIVE ADVERSARIAL NETWORK:

In the Conditional GAN (CGAN), the generator learns to generate a fake sample with a specific condition or characteristics (such as a label associated with an image or more detailed tag) rather than a generic sample from unknown noise distribution. Now, to add such a condition to both generator and discriminator, we will simply feed some vector y, into both networks. Hence, both the discriminator D(X,y) and generator G(z,y) are jointly distributed with y.

Now, the objective function of CGAN is:

<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x,y)] + Ep_z[log(1 − D(G(y,z)))]</math></div>

The paper also suggests that many studies have reported that when dealing with high-dimensional input spaces, CGAN tends to collapse the modes of the data distribution, mostly ignoring the latent factor z and generating x only based on the condition y, exhibiting an almost deterministic behavior. At this point, the CGAN also fails to produce a satisfying amount of diversity in generated samples.

===Generative Multi-View Model===

''' Objective and Notations: ''' The distribution of the data x ∈ X is assumed to be driven by two latent factors: a content factor denoted c which corresponds to the invariant proprieties of the object and a view factor denoted v which corresponds to the factor of variations. Typically, if X is the space of people’s faces, c stands for the intrinsic features of a person’s face while v stands for the transient features and the viewpoint of a particular photo of the face, including the photo exposure
and additional elements like a hat, glasses, etc.... These two factors c and v are assumed to be independent and these are the factors needed to learn.

The paper defines two tasks here to be done:
(i) '''Multi View Generation''': we want to be able to sample over X by controlling the two factors c and v. Given two priors, p(c) and p(v), this sampling will be possible if we are able to estimate p(x|c, v) from a training set.
(ii) '''Conditional Multi-View Generation''': the second objective is to be able to sample different views of a given object. Given a prior p(v), this sampling will be achieved by learning the probability p(c|x), in addition to p(x|c, v). Ability to learn generative models able to generate from a disentangled latent space would allow controlling the sampling on the two different axes,
the content and the view. The authors claim the originality of work is to learn such generative models without using any view labeling information.

The paper introduces the vectors '''c''' and '''v''' to represent latent vectors in R<sup>c</sup> and R<sup>v</sup>

''' Generative Multi-view Model: '''

Consider two prior distributions over the content and view factors denoted as <math>p_c</math> and <math>p_v</math>, corresponding to the prior distribution over content and latent factors. Moreover, we consider a generator G that implements a distribution over samples x, denoted as <math>p_G</math> by computing G(c, v) with <math>c ∼ p_c</math> and <math>v ∼ p_v</math>. The objective is to learn this generator so that its first input c corresponds to the content of the generated sample while its second input v, captures the underlying view of the sample. Doing so would allow one to control the output sample of the generator by tuning its content or its view (i.e. c and v).

The key idea that authors propose is to focus on the distribution of pairs of inputs rather than on the distribution over individual samples. When no view supervision is available the only valuable pairs of samples that one may build from the dataset consist of two samples of a given object under two different views. When we choose any two samples randomly from the dataset from the same object, it is most likely that we get two different views. The paper explains that there are three goals here, (i) As in regular GAN, each sample generated by G needs to look realistic. (ii) As real pairs are composed of two views of the same object, the generator should generate pairs of the same object. Since the two sampled view factors v1 and v2 are different, the only way this can be achieved is by encoding the content vector c which is invariant. (iii) It is expected that the discriminator should easily discriminate between a pair of samples corresponding to the same object under different views from a pair of samples corresponding to a same object under the same view. Because the pair shares the same content factor c, this should force the generator to use the view factors v1 and v2 to produce diversity in the generated pair.

Now, the objective function of GMV Model is:

<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2}[log D(x_1,x_2)] + E_{v_1,v_2}[log(1 − D(G(c,v_1),G(c,v_2)))]</math></div>

Once the model is learned, generator G that generates single samples by first sampling c and v following <math>p_c</math> and <math>p_v</math>, then by computing G(c, v). By freezing c or v, one may then generate samples corresponding to multiple views of any particular content, or corresponding to many contents under a particular view. One can also make interpolations between two given views over a particular content, or between two contents using a particular view

<div style="text-align: center;font-size:100%">[[File:GMV.png]]</div>

===Conditional Generative Model (C-GMV)===

C-GMV is proposed by the authors to be able to change the view of a given object that would be provided as an input to the model. This model extends the generative model's the ability to extract the content factor from any given input and to use this extracted content in order to generate new views of the corresponding object. To achieve such a goal, we must add to our generative model an encoder function denoted <math>E : X → R^C</math> that will map any input in X to the content space <math>R^C</math>

Input sample x is encoded in the content space using an encoder function, noted E (implemented as a neural network).
This encoder serves to generate a content vector c = E(x) that will be combined with a randomly sampled view <math>v ∼ p_v</math> to generate an artificial example. The artificial sample is then combined with the original input x to form a negative pair. The issue with this approach is that CGAN is known to easily miss modes of the underlying distribution. The generator enters in a state where it ignores the noisy component v. To overcome this phenomenon, we use the same idea as in GMV. We build negative pairs <math>(G(c, v_1), G(c, v_2))</math> by randomly sampling two views <math>v_1</math> and <math>v_2</math> that are combined to get a unique content c. c is computed from a sample x using the encoder E, i.e. c= E(x). By doing so, the ability of our approach to generating pairs with view diversity is preserved. Since this diversity can only be captured by taking into account the two different view vectors provided to the model (<math>v_1</math> and <math>v_2</math>), this will encourage G(c, v) to generate samples containing both the content information c, and the view v. Positive pairs are sampled from the training set and correspond to two views of a given object.

The Objective function for C-GMV will be:

<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2 ~ p_x|l(x_1)=l(x_2)}[log D(x_1,x_2)] + E_{v_1,v_2 ~ p_v,x~p_x}[log(1 − D(G(E(x),v_1),G(E(x),v_2)))]+E_{v∼p_v,x∼p_x}[log(1 − D(G(E(x), v), x))] </math></div>

<div style="text-align: center;font-size:100%">[[File:CGMV.png]]</div>

==Experiments and Results==

The authors have given an exhaustive set of results and experiments.

Datasets: The two models were evaluated by performing experiments over four image datasets of various domains. Note that when supervision is available on the views (like CelebA for example where images are labeled with attributes) it is not used for learning models. The only supervision that is used is if two samples correspond to the same object or not.

<div style="text-align: center;font-size:100%">[[File:table_data.png]]</div>

Model Architecture: Same architectures for every dataset. The images were rescaled to 3×64×64 tensors. The generator G and the discriminator D follow that of the DCGAN implementation proposed in Radford et al. (2015). The encoder E is similar to D with the only differences being the batch-normalization in the first layer and the last layer which doesn't have a non-linearity. The Adam optimizer was used, with a batch size of 128. The learning rates for G and D were set to 1*10<sup>-3</sup> and 2*10<sup>-4</sup> respectively for the GMV experiments. In the C-GMV experiments, learning rates of 5*10<sup>-5</sup> were used. Alternating gradient descent was used to optimize the different objectives of the network components (generator, encoder and discriminator).

Baselines: Most existing methods are learned on datasets with view labeling. To fairly compare with alternative models, authors have built baselines working in the same conditions as the models in this paper. In addition, models are compared with the model from Mathieu et al. (2016). Results gained with two implementations are reported, the first one based on the implementation provided by the authors2 (denoted Mathieu et al. (2016)), and the second one (denoted Mathieu et al. (2016) (DCGAN) ) that implements the same model using architectures inspired from DCGAN Radford et al. (2015), which is more stable and that was tuned to allow a fair comparison with our approach. For pure multi-view generative setting, generative model(GMV) is compared with standard GANs that are learned to approximate the joint generation of multiple samples: DCGANx2 is learned to output pairs of views over the same object, DCGANx4 is trained on quadruplets, and DCGANx8 on eight different views.

===Generating Multiple Contents and Views===

Figure 1 shows examples of generated images by our model and Figure 4 shows images sampled by the DCGAN based models (DCGANx2, DCGANx4, and DCGANx8) on 3DChairs and CelebA datasets.

<div style="text-align: center;font-size:100%">[[File:fig1_gmv.png]]</div>

<div style="text-align: center;font-size:100%">[[File:fig4_gmv.png]]</div>

Figure 5 shows additional results, using the same presentation, for the GMV model only on two other datasets

<div style="text-align: center;font-size:100%">[[File:fig5_gmv.png]]</div>

Figure 6 shows generated samples obtained by interpolation between two different view factors (left) or two content factors (right). It allows us to have a better idea of the underlying view/content structure captured by GMV. We can see that our approach is able to smoothly move from one content/view to another content/view while keeping the other factor constant. This also illustrates that content and view factors are well independently handled by the generator i.e. changing the view
does not modify the content and vice versa.

<div style="text-align: center;font-size:100%">[[File:fig6_gmv.png]]</div>

===Generating Multiple Views of a Given Object===

The second set of experiments evaluates the ability of C-GMV to capture a particular content from an input sample and to use this content to generate multiple views of the same object. Figure 7 and 8 illustrate the diversity of views in samples generated by our model and compare our results with those obtained with the CGAN model and to models from Mathieu et al. (2016). For each row, the input sample is shown in the left column. New views are generated from that input and shown to the right, with those generated from C_GMV in the centre, and those generated from CGAN on the far right.

<div style="text-align: center;font-size:100%">[[File:fig7_gmv.png]]</div>

<div style="text-align: center;font-size:100%">[[File:fig8_gmv.png]]</div>

=== Evaluation of the Quality of Generated Samples ===

There are usually several metrics to evaluate generative models. Some of them are:
<ol>
<li>Inception Score</li>
<li>Latent Space Interpolation</li>
<li>log-likelihood (LL) score</li>
<li> minimum description length (MDL) score</li>
<li>minimum message length (MML) score</li>
<li>Akaike Information Criterion (AIC) score</li>
<li>Bayesian Information Criterion (BIC) score</li>
</ol>

The authors did sets of experiments aimed at evaluating the quality of the generated samples. They have been made on the CelebA dataset and evaluate (i) the ability of the models to preserve the identity of a person in multiple generated views, (ii) to generate realistic samples, (iii) to preserve the diversity in the generated views and (iv) to capture the view distributions of the original dataset.

<div style="text-align: center;font-size:100%">[[File:tab3.png]]</div>

<div style="text-align: center;font-size:100%">[[File:tab4.png]]</div>

==Conclusion==

The paper proposed a generative model, which can be learnt from multi-view data without any supervision. Moreover, it introduced a conditional version that allows generating new views of an input image. Using experiments, they proved that the model can capture content and view factors. Here, the paper showed that the application of architecture search to dense image prediction was achieved through a) The construction of a recursive search space leveraging innovation in the dense prediction literature b) construction of a fast proxy predictive of a large task. The learned architecture was shown to surpass human invented architectures across three dense image prediction tasks i.e scene parsing, person part segmentation and semantic segmentation.

==Future Work==
The authors of the papers mentioned that they plan to explore using their model for data augmentation, as it can produce other data views for training, in both semi-supervised and one-shot/few-shot learning settings.

==Critique==

The main idea is to train the model with pairs of images with different views. It is not that clear as to what defines a view in particular. The algorithms are largely based on earlier concepts of GAN and CGAN The authors give reference to the previous papers tackling the same problem and clearly define that the novelty in this approach is not making use of view labels. The authors give a very thorough list of experiments which clearly establish the superiority of the proposed models to baselines.

However, this paper only tested the model on rather constrained examples. As was observed in the results the proposed approach seems to have a high sample complexity relying on training samples covering the full range of variations for both specified and unspecified variations. Also, the proposed model does not attempt to disentangle variations within the specified and unspecified components.

==References==

[1] Mickael Chen, Ludovic Denoyer, Thierry Artieres. MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION. Published as a conference paper at ICLR 2018

[2] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.

[3] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.

[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

[5] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.

MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION

2018-11-15T02:10:42Z

Vrajendr: /* Introduction */

This page contains a summary of the paper "[https://openreview.net/forum?id=ryRh0bb0Z Multi-View Data Generation without Supervision]" by Mickael Chen, Ludovic Denoyer, Thierry Artieres. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==

===Motivation===
High Dimensional Generative models have seen a surge of interest of late with the introduction of Variational Auto-Encoders and Generative Adversarial Networks. This paper focuses on a particular problem where one aims at generating samples corresponding to a number of objects under various views. The distribution of the data is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles of the same object). The paper proposes two models using this disentanglement of latent space - a generative model and a conditional variant of the same.

===Related Work===

The problem of handling multi-view inputs has mainly been studied from the predictive point of view where one wants, for example, to learn a model able to predict/classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variations from multiview datasets. The underlying idea is to consider that a particular data sample may be thought as the mix of a content information (e.g. related to its class label like a given person in a face dataset) and of a side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with/wo glasses...). So, all the samples of the same class contain the same content but different view. A number of approaches have been proposed to disentangle the content from the view (i.e. methods based on unlabeled samples), also referred as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). The two common limitations the earlier approaches pose - as claimed by the paper - are that (i) they usually
consider discrete views that are characterized by a domain or a set of discrete (binary/categorical) attributes (e.g. face with/wo glasses, the color of the hair, etc.) and could not easily scale to a large number of attributes or to continuous views. (ii) most models are trained using view supervision (e.g. the view attributes), which of course greatly helps in the learning of such model, yet prevents their use on many datasets where this information is not available.

===Contributions===

The contributions that authors claim are the following: (i) A new generative model able to generate data with various content and high view diversity using a supervision on the content information only. (ii) Extend the generative model to a conditional model that allows generating new views over any input sample. (iii) Report experimental results on four different images datasets to prove that the models can generate realistic samples and capture (and generate with) the diversity of views.

==Paper Overview==

===Background===

The paper uses the concept of the popular GAN (Generative Adverserial Networks) proposed by Goodfellow et al.(2014).

GENERATIVE ADVERSARIAL NETWORK:

Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other (thus the “adversarial”). GANs was introduced in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs, Facebook’s AI research director Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML.”

Let us denote <math>X</math> an input space composed of multidimensional samples x e.g. vector, matrix or tensor. Given a latent space <math>R^n</math> and a prior distribution <math>p_z(z)</math> over this latent space, any generator function <math>G : R^n → X</math> defines a distribution <math>p_G </math> on <math> X</math> which is the distribution of samples G(z) where <math>z ∼ p_z</math>. A GAN defines, in addition to G, a discriminator function D : X → [0; 1] which aims at differentiating between real inputs sampled from the training set and fake inputs sampled following <math>p_G</math>, while the generator is learned to fool the discriminator D. Usually both G and D are implemented with neural networks. The objective function is based on the following adversarial criterion:

<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x)] + Ep_z[log(1 − D(G(z)))]</math></div>

where <math>p_x</math> is the empirical data distribution on X .
It has been shown in Goodfellow et al. (2014) that if G∗ and D∗ are optimal for the above criterion, the Jensen-Shannon divergence between <math>p_{G∗}</math> and the empirical distribution of the data <math>p_x</math> in the dataset is minimized, making GAN able to estimate complex continuous data distributions.

CONDITIONAL GENERATIVE ADVERSARIAL NETWORK:

In the Conditional GAN (CGAN), the generator learns to generate a fake sample with a specific condition or characteristics (such as a label associated with an image or more detailed tag) rather than a generic sample from unknown noise distribution. Now, to add such a condition to both generator and discriminator, we will simply feed some vector y, into both networks. Hence, both the discriminator D(X,y) and generator G(z,y) are jointly distributed with y.

Now, the objective function of CGAN is:

<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{p_x}[log D(x,y)] + Ep_z[log(1 − D(G(y,z)))]</math></div>

The paper also suggests that many studies have reported that when dealing with high-dimensional input spaces, CGAN tends to collapse the modes of the data distribution, mostly ignoring the latent factor z and generating x only based on the condition y, exhibiting an almost deterministic behavior. At this point, the CGAN also fails to produce a satisfying amount of diversity in generated samples.

===Generative Multi-View Model===

''' Objective and Notations: ''' The distribution of the data x ∈ X is assumed to be driven by two latent factors: a content factor denoted c which corresponds to the invariant proprieties of the object and a view factor denoted v which corresponds to the factor of variations. Typically, if X is the space of people’s faces, c stands for the intrinsic features of a person’s face while v stands for the transient features and the viewpoint of a particular photo of the face, including the photo exposure
and additional elements like a hat, glasses, etc.... These two factors c and v are assumed to be independent and these are the factors needed to learn.

The paper defines two tasks here to be done:
(i) '''Multi View Generation''': we want to be able to sample over X by controlling the two factors c and v. Given two priors, p(c) and p(v), this sampling will be possible if we are able to estimate p(x|c, v) from a training set.
(ii) '''Conditional Multi-View Generation''': the second objective is to be able to sample different views of a given object. Given a prior p(v), this sampling will be achieved by learning the probability p(c|x), in addition to p(x|c, v). Ability to learn generative models able to generate from a disentangled latent space would allow controlling the sampling on the two different axes,
the content and the view. The authors claim the originality of work is to learn such generative models without using any view labeling information.

The paper introduces the vectors '''c''' and '''v''' to represent latent vectors in R<sup>c</sup> and R<sup>v</sup>

''' Generative Multi-view Model: '''

Consider two prior distributions over the content and view factors denoted as <math>p_c</math> and <math>p_v</math>, corresponding to the prior distribution over content and latent factors. Moreover, we consider a generator G that implements a distribution over samples x, denoted as <math>p_G</math> by computing G(c, v) with <math>c ∼ p_c</math> and <math>v ∼ p_v</math>. The objective is to learn this generator so that its first input c corresponds to the content of the generated sample while its second input v, captures the underlying view of the sample. Doing so would allow one to control the output sample of the generator by tuning its content or its view (i.e. c and v).

The key idea that authors propose is to focus on the distribution of pairs of inputs rather than on the distribution over individual samples. When no view supervision is available the only valuable pairs of samples that one may build from the dataset consist of two samples of a given object under two different views. When we choose any two samples randomly from the dataset from the same object, it is most likely that we get two different views. The paper explains that there are three goals here, (i) As in regular GAN, each sample generated by G needs to look realistic. (ii) As real pairs are composed of two views of the same object, the generator should generate pairs of the same object. Since the two sampled view factors v1 and v2 are different, the only way this can be achieved is by encoding the content vector c which is invariant. (iii) It is expected that the discriminator should easily discriminate between a pair of samples corresponding to the same object under different views from a pair of samples corresponding to a same object under the same view. Because the pair shares the same content factor c, this should force the generator to use the view factors v1 and v2 to produce diversity in the generated pair.

Now, the objective function of GMV Model is:

<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2}[log D(x_1,x_2)] + E_{v_1,v_2}[log(1 − D(G(c,v_1),G(c,v_2)))]</math></div>

Once the model is learned, generator G that generates single samples by first sampling c and v following <math>p_c</math> and <math>p_v</math>, then by computing G(c, v). By freezing c or v, one may then generate samples corresponding to multiple views of any particular content, or corresponding to many contents under a particular view. One can also make interpolations between two given views over a particular content, or between two contents using a particular view

<div style="text-align: center;font-size:100%">[[File:GMV.png]]</div>

===Conditional Generative Model (C-GMV)===

C-GMV is proposed by the authors to be able to change the view of a given object that would be provided as an input to the model. This model extends the generative model's the ability to extract the content factor from any given input and to use this extracted content in order to generate new views of the corresponding object. To achieve such a goal, we must add to our generative model an encoder function denoted <math>E : X → R^C</math> that will map any input in X to the content space <math>R^C</math>

Input sample x is encoded in the content space using an encoder function, noted E (implemented as a neural network).
This encoder serves to generate a content vector c = E(x) that will be combined with a randomly sampled view <math>v ∼ p_v</math> to generate an artificial example. The artificial sample is then combined with the original input x to form a negative pair. The issue with this approach is that CGAN is known to easily miss modes of the underlying distribution. The generator enters in a state where it ignores the noisy component v. To overcome this phenomenon, we use the same idea as in GMV. We build negative pairs <math>(G(c, v_1), G(c, v_2))</math> by randomly sampling two views <math>v_1</math> and <math>v_2</math> that are combined to get a unique content c. c is computed from a sample x using the encoder E, i.e. c= E(x). By doing so, the ability of our approach to generating pairs with view diversity is preserved. Since this diversity can only be captured by taking into account the two different view vectors provided to the model (<math>v_1</math> and <math>v_2</math>), this will encourage G(c, v) to generate samples containing both the content information c, and the view v. Positive pairs are sampled from the training set and correspond to two views of a given object.

The Objective function for C-GMV will be:

<div style="text-align: center;font-size:100%"><math>\underset{G}{min} \ \underset{D}{max}</math> <math>E_{x_1,x_2 ~ p_x|l(x_1)=l(x_2)}[log D(x_1,x_2)] + E_{v_1,v_2 ~ p_v,x~p_x}[log(1 − D(G(E(x),v_1),G(E(x),v_2)))]+E_{v∼p_v,x∼p_x}[log(1 − D(G(E(x), v), x))] </math></div>

<div style="text-align: center;font-size:100%">[[File:CGMV.png]]</div>

==Experiments and Results==

The authors have given an exhaustive set of results and experiments.

Datasets: The two models were evaluated by performing experiments over four image datasets of various domains. Note that when supervision is available on the views (like CelebA for example where images are labeled with attributes) it is not used for learning models. The only supervision that is used is if two samples correspond to the same object or not.

<div style="text-align: center;font-size:100%">[[File:table_data.png]]</div>

Model Architecture: Same architectures for every dataset. The images were rescaled to 3×64×64 tensors. The generator G and the discriminator D follow that of the DCGAN implementation proposed in Radford et al. (2015). The Adam optimiser was used, with a batch size of 128. the learning rates for G and D were set to 1*10<sup>-3</sup> and 2*10<sup>-4</sup> respectively for the GMV experiments. In the C-GMV experiments, learning rates of 5*10<sup>-5</sup> were used. Alternating gradient descent was used to optimize the different objectives of the network components.

Baselines: Most existing methods are learned on datasets with view labeling. To fairly compare with alternative models, authors have built baselines working in the same conditions as the models in this paper. In addition, models are compared with the model from Mathieu et al. (2016). Results gained with two implementations are reported, the first one based on the implementation provided by the authors2 (denoted Mathieu et al. (2016)), and the second one (denoted Mathieu et al. (2016) (DCGAN) ) that implements the same model using architectures inspired from DCGAN Radford et al. (2015), which is more stable and that was tuned to allow a fair comparison with our approach. For pure multi-view generative setting, generative model(GMV) is compared with standard GANs that are learned to approximate the joint generation of multiple samples: DCGANx2 is learned to output pairs of views over the same object, DCGANx4 is trained on quadruplets, and DCGANx8 on eight different views.

===Generating Multiple Contents and Views===

Figure 1 shows examples of generated images by our model and Figure 4 shows images sampled by the DCGAN based models (DCGANx2, DCGANx4, and DCGANx8) on 3DChairs and CelebA datasets.

<div style="text-align: center;font-size:100%">[[File:fig1_gmv.png]]</div>

<div style="text-align: center;font-size:100%">[[File:fig4_gmv.png]]</div>

Figure 5 shows additional results, using the same presentation, for the GMV model only on two other datasets

<div style="text-align: center;font-size:100%">[[File:fig5_gmv.png]]</div>

Figure 6 shows generated samples obtained by interpolation between two different view factors (left) or two content factors (right). It allows us to have a better idea of the underlying view/content structure captured by GMV. We can see that our approach is able to smoothly move from one content/view to another content/view while keeping the other factor constant. This also illustrates that content and view factors are well independently handled by the generator i.e. changing the view
does not modify the content and vice versa.

<div style="text-align: center;font-size:100%">[[File:fig6_gmv.png]]</div>

===Generating Multiple Views of a Given Object===

The second set of experiments evaluates the ability of C-GMV to capture a particular content from an input sample and to use this content to generate multiple views of the same object. Figure 7 and 8 illustrate the diversity of views in samples generated by our model and compare our results with those obtained with the CGAN model and to models from Mathieu et al. (2016). For each row, the input sample is shown in the left column. New views are generated from that input and shown to the right, with those generated from C_GMV in the centre, and those generated from CGAN on the far right.

<div style="text-align: center;font-size:100%">[[File:fig7_gmv.png]]</div>

<div style="text-align: center;font-size:100%">[[File:fig8_gmv.png]]</div>

=== Evaluation of the Quality of Generated Samples ===

There are usually several metrics to evaluate generative models. Some of them are:
<ol>
<li>Inception Score</li>
<li>Latent Space Interpolation</li>
<li>log-likelihood (LL) score</li>
<li> minimum description length (MDL) score</li>
<li>minimum message length (MML) score</li>
<li>Akaike Information Criterion (AIC) score</li>
<li>Bayesian Information Criterion (BIC) score</li>
</ol>

The authors did sets of experiments aimed at evaluating the quality of the generated samples. They have been made on the CelebA dataset and evaluate (i) the ability of the models to preserve the identity of a person in multiple generated views, (ii) to generate realistic samples, (iii) to preserve the diversity in the generated views and (iv) to capture the view distributions of the original dataset.

<div style="text-align: center;font-size:100%">[[File:tab3.png]]</div>

<div style="text-align: center;font-size:100%">[[File:tab4.png]]</div>

==Conclusion==

The paper proposed a generative model, which can be learnt from multi-view data without any supervision. Moreover, it introduced a conditional version that allows generating new views of an input image. Using experiments, they proved that the model can capture content and view factors. Here, the paper showed that the application of architecture search to dense image prediction was achieved through a) The construction of a recursive search space leveraging innovation in the dense prediction literature b) construction of a fast proxy predictive of a large task. The learned architecture was shown to surpass human invented architectures across three dense image prediction tasks i.e scene parsing, person part segmentation and semantic segmentation.

==Future Work==
The authors of the papers mentioned that they plan to explore using their model for data augmentation, as it can produce other data views for training, in both semi-supervised and one-shot/few-shot learning settings.

==Critique==

The main idea is to train the model with pairs of images with different views. It is not that clear as to what defines a view in particular. The algorithms are largely based on earlier concepts of GAN and CGAN The authors give reference to the previous papers tackling the same problem and clearly define that the novelty in this approach is not making use of view labels. The authors give a very thorough list of experiments which clearly establish the superiority of the proposed models to baselines.

However, this paper only tested the model on rather constrained examples. As was observed in the results the proposed approach seems to have a high sample complexity relying on training samples covering the full range of variations for both specified and unspecified variations. Also, the proposed model does not attempt to disentangle variations within the specified and unspecified components.

==References==

[1] Mickael Chen, Ludovic Denoyer, Thierry Artieres. MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION. Published as a conference paper at ICLR 2018

[2] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.

[3] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.

[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

[5] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.

Reinforcement Learning of Theorem Proving

2018-11-15T00:52:49Z

Vrajendr: /* Conclusions */

== Introduction ==
Automated reasoning over mathematical proof was a major motivation for the development of computer science. Automated theorem provers (ATP) can in principle be used to attack any formally stated mathematical problem and is a research area that has been present since the early 20th century [1]. As of today, state-of-art ATP systems rely on the fast implementation of complete proof calculi. such as resolution and tableau. However, they are still far weaker than trained mathematicians. Within current ATP systems, many heuristics are essential for their performance. As a result,
in recent years machine learning has been used to replace such heuristics and improve the performance of ATPs.

In this paper, the authors propose a reinforcement learning based ATP, rlCoP. The proposed ATP reasons within first-order logic. The underlying proof calculi are the connection calculi [2], and the reinforcement learning method is Monte Carlo tree search along with policy and value learning. It is shown that reinforcement learning results in a 42.1% performance increase compared to the base prover (without learning).

== Related Work ==
C. Kalizyk and J. Urban proposed a supervised learning based ATP, FEMaLeCoP, whose underlying proof calculi is the same as this paper in 2015 [3]. Their algorithm learns from existing proofs to choose the next tableau extension step. Such systems are known to only learn a high-level selection of relevant facts from a large knowledge base and delegate the internal proof search to standard ATP systems. S. Loos, et al. developed an supervised learning ATP system in 2017 [4], with superposition as their proof calculi. However, they chose deep neural network (CNNs and RNNs) as feature extractor. These systems are treated as black boxes in literature with not much understanding of their performances possible.

Some other works add Monte Carlo tree search to connection tableau, without reinforcement learning iterations, with complete backtracking and without learned value. This is closest to the authors' approach but the performance is poorer than this paper.

On a different note, A. Alemi, et al. proposed a deep sequence model for premise selection in 2016 [5], and they claim to be the first team to involve deep neural networks in ATPs. Although premise selection is not directly linked to automated reasoning, it is still an important component in ATPs, and their paper provides some insights into how to process datasets of formally stated mathematical problems.

== First Order Logic and Connection Calculi ==
Here we assume basic first-order logic and theorem proving terminology, and we will offer a brief introduction of the bare prover and connection calculi. Let us try to prove the following first-order sentence.

[[file:fof_sentence.png|frameless|center]]

This sentence can be transformed into a formula in Skolemized Disjunctive Normal Form (DNF), which is referred to as the "matrix".

[[file:skolemized_dnf.png|frameless|center]]
[[file:matrix.png|frameless|center]]

The original first-order sentence is valid if and only if the Skolemized DNF formula is a tautology. The connection calculi attempt to show that the Skolemized DNF formula is a tautology by constructing a tableau. We will start at the special node, root, which is an open leaf. At each step, we select a clause (for example, clause <math display="inline">P \wedge R</math> is selected in the first step), and add the literals as children for an existing open leaf. For every open leaf, examine the path from the root to this leaf. If two literals on this path are unifiable (for example, <math display="inline">Qx'</math> is unifiable with <math display="inline">\neg Qc</math>), this leaf is then closed. In standard terminology, it states that a connection is found on this branch.

[[file:tableaux_example.png|thumb|center|Figure 1. An example of closed tableaux. Adapted from [2]]]

The paper's goal is to close every leaf, i.e. on every branch, there exists a connection. If such state is reached, the paper has shown that the Skolemized DNF formula is a tautology, thus proving the original first-order sentence. As we can see from the constructed tableaux, the example sentence is indeed valid.

In formal terms, the rules of connection calculi is shown in Figure 2, and the formal tableaux for the example sentence is shown in Figure 3. Each leaf is denoted as <math display="inline">subgoal, M, path</math> where <math display="inline">subgoal</math> is a list of literals that we need to find connection later, <math display="inline">M</math> stands for the matrix, and <math display="inline">path</math> stands for the path leading to this leaf.

[[file:formal_calculi.png|thumb|center|Figure 2. Formal connection calculi. Adapted from [2].]]
[[file:formal_tableaux.png|thumb|center|Figure 3. Formal tableaux constructed from the example sentence. Adapted from [2].]]

To sum up, the bare prover follows a very simple algorithm. given a matrix, a non-negated clause is chosen as the first subgoal. The function ''prove(subgoal, M, path)'' is stated as follows:
* If ''subgoal'' is empty
** return ''TRUE''
* If reduction is possible
** Perform reduction, generating ''new_subgoal'', ''new_path''
** return ''prove(new_subgoal, M, new_path)''
* For all clauses in ''M''
** If a clause can do extension with ''subgoal''
** Perform extension, generating ''new_subgoal1'', ''new_path'', ''new_subgoal2''
** return ''prove(new_subgoal1, M, new_path)'' and ''prove(new_subgoal2, M, path)''
* return ''FALSE''

It is important to note that the bare prover implemented in this paper is incomplete. Here is a pathological example. Suppose the following matrix (which is trivially a tautology) is feed into the bare prover. Let clause <math display="inline">P(0)</math> be the first subgoal. Clearly choosing <math display="inline">\neg P(0)</math> to extend will complete the proof.

[[file:pathological.png|frameless|center]]

However, if we choose <math display="inline">\neg P(x) \lor P(s(x))</math> to do extension, the algorithm will generate an infinite branch <math display="inline">P(0), P(s(0)), P(s(s(0))) ...</math>. It is the task of reinforcement learning to guide the prover in such scenarios towards a successful proof.

In addition, the provability of first-order sentences is generally undecidable (this result is named the Church-Turing Thesis), which sheds light on the difficulty of automated theorem proving.

== Mizar Math Library ==
Mizar Math Library (MML) is a library of mathematical theories. The axioms behind the library is the Tarski-Grothendieck set theory, written in first-order logic. The library contains 57,000+ theorems and their proofs, along with many other lemmas, as well as unproven conjectures. Figure 4 shows a Mizar article of the theorem "If <math display="inline"> p </math> is prime, then <math display="inline"> \sqrt p </math> is irrational."

[[file:mizar_article.png|thumb|center|Figure 3. An article from MML. Adapted from [6].]]

The training and testing data for this paper is a subset of MML, the Mizar40, which is 32,524 theorems proved by automated theorem provers. Below is an example from the Mizar40 library, it states that with ''d3_xboole_0'' and ''t3_xboole_0'' as premises, we can prove ''t5_xboole_0''.

[[file:mizar40_0.png|frameless|center]]
[[file:mizar40_1.png|frameless|center]]
[[file:mizar40_2.png|frameless|center]]
[[file:mizar40_3.png|frameless|center]]

== Monte Carlo Guidance ==

Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes. The focus of Monte Carlo tree search is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. Then the expansion will then be used to weight the node in the search tree.

In the reinforcement learning setting, the action is defined as one inference (either reduction or extension). The proof state is defined as the whole tableaux. To implement Monte-Carlo tree search, each proof state <math display="inline"> i </math> needs to maintain three parameters, its prior probability <math display="inline"> p_i </math>, its total reward <math display="inline"> w_i </math>, and number of its visits <math display="inline"> n_i </math>. If no policy learning is used, the prior probabilities are all equal to one.

A simple heuristic is used to estimate the future reward of leaf states: suppose leaf state <math display="inline"> i </math> has <math display="inline"> G_i </math> open subgoals, the reward is computed as <math display="inline"> 0.95 ^ {G_i} </math>. This will be replaced once value learning is implemented.

The standard UCT formula is chosen to select the next actions in the playouts
\begin{align}
{\frac{w_i}{n_i}} + 2 \cdot p_i \cdot {\sqrt{\frac{\log N}{n_i}}}
\end{align}
where <math display="inline"> N </math> stands for the total number of visits of the parent node.

The bare prover is asked to play <math display="inline"> b </math> playouts of length <math display="inline"> d </math> from the empty tableaux, each playout backpropagates the values of proof states it visits. After these <math display="inline"> b </math> playouts a special action (inference) is made, corresponding to an actual move, resulting in a new bigstep tableaux. The next <math display="inline"> b </math> playouts will start from this tableaux, followed by another bigstep, etc.

== Policy Learning and Guidance ==

From many runs of MCT, we will know the prior probability of actions in particular proof states, we can extract the frequency of each action <math display="inline"> a </math>, and normalize it by dividing with the average action frequency at that state, resulting in a relative proportion <math display="inline"> r_a \in (0, \infty) </math>. We characterize the proof states for policy learning by extracting human-engineered features. Also, we characterize actions by extracting features from the clause chosen and literal chosen as well. Thus we will have a feature vector <math display="inline"> (f_s, f_a) </math>.

The feature vector <math display="inline"> (f_s, f_a) </math> is regressed against the associated <math display="inline"> r_a </math>.

During the proof search, the prior probabilities <math display="inline"> p_i </math> of available actions <math display="inline"> a_i </math> in a state <math display="inline"> s </math> is computed as the softmax of their predictions.

Training examples are only extracted from big step states, making the amount of training data manageable.

== Value Learning and Guidance ==

Bigstep states are also used for proof state evaluation. For a proof state <math display="inline"> s </math>, if it corresponds to a successful proof, the value is assigned as <math display="inline"> v_s = 1 </math>. If it corresponds to a failed proof, the value is assigned as <math display="inline"> v_s = 0 </math>. For other scenarios, denote the distance between state <math display="inline"> s </math> and a successful state as <math display="inline"> d_s </math>, then the value is assigned as <math display="inline"> v_s = 0.99^{d_s} </math>

Proof state feature <math display="inline"> f_s </math> is regressed against the value <math display="inline"> v_s </math>. During the proof search, the reward of leaf states are computed from this prediction.

== Features and Learners ==
For proof states, features are collected from the whole tableaux (subgoals, matrix, and paths). Each unique symbol is represented by an integer, and the tableaux can be represented as a sequence of integers. Term walk is implemented to combine a sequence of integers into a single integer by multiplying components by a fixed large prime and adding them up. Then the resulting integer is reduced to a smaller feature space by taking modulo by a large prime.

For actions the feature extraction process is similar, but the term walk is over the chosen literal and the chosen clause.

In addition to the term walks, they also added several common features: number of goals, total symbol size of all goals, length of active paths, number of current variable instantiations, most common symbols.

The whole project is implemented in OCaml, and XGBoost is ported into OCaml as the learner.

== Experimental Results ==
The authors split Mizar40 dataset into 90% training examples and 10% testing examples. 200,000 inferences are allowed for each problem. 10 iterations of policy and value learning are performed (based on MCT). The training and testing results are shown as follows. In the table, ''mlCoP'' represents for the bare prover with iterative deepening (i.e. a complete automated theorem prover with connection calculi), and ''bare prover'' stands for the prover implemented in this paper, without MCT guidance.

[[file:atp_result0.jpg|frane|center]]
[[file:atp_result1.jpg|frame|center|Figure 4. Experimental result on Mizar40 dataset]]

As shown by these results, reinforcement learning leads to a significant performance increase for automated theorem proving, the 42.1% performance improvement is unusually high, since the published improvement in this field is typically between 3% and 10%. [1]

== Conclusions ==
In this work, the authors developed an automated theorem prover that uses no domain engineering and instead replies on MCT guided by reinforcement learning. The resulting system is more than 40% stronger than the baseline system. The authors believe that this is a landmark in the field of automated reasoning, demonstrating that building general problem solvers by reinforcement learning is a viable approach. [1]

The authors pose that some future research could include strong learning algorithms to characterize mathematical data. The development of suitable deep learning architectures will help the algorithm characterize semantic and syntactic features of mathematical objects which will be crucial to create strong assistants for mathematics and hard sciences.

== Critiques ==
Until now, automated reasoning is relatively new to the field of machine learning, and this paper shows a lot of promise in this research area.

The feature extraction part of this paper is less than optimal. It is my opinion that with proper neural network architecture, deep learning extracted features will be superior to human-engineered features, which is also shown in [4, 5].

Also, the policy-value learning iteration is quite inefficient. The learning loop is:
* Loop
** Run MCT with the previous model on an entire dataset
** Collect MCT data
** Train a new model
If we adopt this to an online learning scheme by learning as soon as MCT generates new data, and update the model immediately, there might be some performance increase.

The experimental design of this paper has some flaws. The authors compare the performance of ''mlCoP'' and ''rlCoP'' by limiting them to the same number of inference steps. However, every inference step of ''rlCoP'' requires additional machine learning prediction, which costs more time. A better way to compare their performance is to set a time limit.

It would also be interesting to study automated theorem proving in another logic system, like high order logic.

== References ==
[1] C. Kaliszyk, et al. Reinforcement Learning of Theorem Proving. NIPS 2018.

[2] J. Otten and W. Bibel. leanCoP: Lean Connection-Based Theorem Proving. Journal of Symbolic Computation, vol. 36, pp. 139-161, 2003.

[3] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly Efficient Machine Learning Connection Prover. Lecture Notes in Computer Science. vol. 9450. pp. 88-96, 2015.

[4] S. Loos, et al. Deep Network Guided Proof Search. LPAR-21, 2017.

[5] A. Alemi, et al. DeepMath-Deep Sequence Models for Premise Selection. NIPS 2016.

[6] Mizar Math Library. http://mizar.org/library/

Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data

2018-11-08T04:15:27Z

Vrajendr: /* 1) Classifying Indoor/Outdoor */

=Introduction=

In highly populated cities, where there are many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.

The motivation for this problem in the context of 911 calls: victims trapped in a tall building who seeks immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult.

In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.

In large cities with tall buildings, relying on GPS or Wi-Fi signals are not able to provide an accurate location of a caller.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]
[[File:19floor.png|250px]]</div>

In this work, there are two major contributions. The first is that they trained a recurrent neural network to classify whether a smartphone was either inside or outside of buildings. The second contribution is that they used the output of their previously trained classifier to aid in predicting the change in the barometric pressure of the smartphone from once it entered the building to its current location. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.

=Related Work=

In general, previous work falls under two categories. The first category of methods is classification methods based on the user's activity.
Therefore, some current methods leverage the user's activity to predict which is based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.
The second set of methods focus more on the use of a barometer which measures the atmospheric pressure. As a result, utilizing a barometer can provide the changes in altitude.

Avinash Parnandi and his coauthors used multiple classifiers in the predicting the floor level [2]. The steps in their algorithmic process are:
<ol>
<li> Classifier to predict whether the user is indoors or outdoors</li>
<li> Classifier to identify if the activity of the user, i.e. walking, standing still etc. </li>
<li> Classifier to measure the displacement</li>
</ol>

One of the downsides of this work is that in order to achieve high accuracy the user's step size is needed, therefore heavily relying on pre-training to the specific user. In a real world application of this method this would not be practical.

Song and his colleagues model the way or cause of ascent. That is, was the ascent a result of taking the elevator, stairs or escalator [3]. Then by using infrastructure support of the buildings and as well as additional tuning they are able to predict floor level.
This method also suffers from relying on data specific to the building.

Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.

=Method=

In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].

To collect data they designed and developed an iOS application (called Sensory) that ran on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second and each datum contained the different sensor measurements mentions earlier along with environment context, environment mean bldg floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.

From [4] the data was collected as follows:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:collection.png|600px]] </div>

Their algorithm used to predict floor level is a 3 part process:

<ol>
<li> Classifying whether smartphone is indoor or outdoor </li>
<li> Indoor/Outdoor Transition detector</li>
<li> Estimating vertical height and resolving to absolute floor level </li>
</ol>

==1) Classifying Indoor/Outdoor ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div>

From [5] they are using 6 features which were found through forests of trees feature reduction. The features are smartphone's barometric pressure, GPS vertical accuracy, GPS horizontal accuracy, GPS speed, device RSSI level, and magnetometer total reading.

The magnetometer total reading was calculated from given the 3-dimensional reading <math>x, y, z </math>

<div style="text-align: center;">Total Magnetic field strength <math>= \sqrt{x^{2} + y^{2} + z^{2}}</math></div>

They used a 3 layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if smartphone is indoor and <math> y = 0 </math> if smartphone is outdoor.

In their design they set <math> d = 3</math> by random search [6]. The point to make is that they wanted the network to learn the relationship given a little bit of information from both the past and future.

For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.

This is a critical part of their system and they only focus on the predictions in the subspace of being indoors.

They have trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math>.

The cost function is shown below:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div>

The final output of the LSTM is a time-series <math> T = {t_1, t_2, ..., t_i, t_n} </math> where each <math> t_i = 0, t_i = 1 </math> if the point is outside or inside respectively.

==2) Transition Detector ==

Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div>
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions T and they pick a subset <math>s_i </math> such that the Jacard distance (defined below) is <math> >= 0.4 </math>
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div>
Jacard Distance:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div>

After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.

==3) Vertical height and floor level ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div>

In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.

The second parameter is <math> p_1 </math> which is the current pressure reading. In order to generate the relative change in height <math> m_\Delta</math>

After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].

In order to resolve to an absolute floor level they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.

=Experiments and Results=

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div>

All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation.
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006.
Although the baselines performed considerably well the objective here was to show that an LSTM can be used in the future to model the entire system with an LSTM.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|500px]] </div>

The above chart shows the success that their system is able to achieve in floor level prediction.

=Future Work=
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate floor level predictions solely from sensor reading data.

=Critique=

In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which has not been done. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit to their system is that they don't need any additional infrastructure support in advance making it a practical solution for deployment.

A weakness is that they claim that they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%.

It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.

They also go against their claim stated at the beginning of the paper where they say they "..does not require the use of beacons, prior knowledge of the building infrastructure..." as in their clustering step they are in a way using prior knowledge from previous visits [4].

The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm can be implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.

Overall this paper is not too novel, as they don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out of the box solutions. There is no clear intuition why the proposed work well for the authors. This application could be solved using simpler methods like having an emergency push button on each floor. Authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.

=References=

[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):
1735–1780, 1997.

[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.

[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.

[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018

[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).

[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.

[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India

Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data

2018-11-08T04:14:58Z

Vrajendr: /* 1) Classifying Indoor/Outdoor */

=Introduction=

In highly populated cities, where there are many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.

The motivation for this problem in the context of 911 calls: victims trapped in a tall building who seeks immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult.

In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.

In large cities with tall buildings, relying on GPS or Wi-Fi signals are not able to provide an accurate location of a caller.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]
[[File:19floor.png|250px]]</div>

In this work, there are two major contributions. The first is that they trained a recurrent neural network to classify whether a smartphone was either inside or outside of buildings. The second contribution is that they used the output of their previously trained classifier to aid in predicting the change in the barometric pressure of the smartphone from once it entered the building to its current location. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.

=Related Work=

In general, previous work falls under two categories. The first category of methods is classification methods based on the user's activity.
Therefore, some current methods leverage the user's activity to predict which is based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.
The second set of methods focus more on the use of a barometer which measures the atmospheric pressure. As a result, utilizing a barometer can provide the changes in altitude.

Avinash Parnandi and his coauthors used multiple classifiers in the predicting the floor level [2]. The steps in their algorithmic process are:
<ol>
<li> Classifier to predict whether the user is indoors or outdoors</li>
<li> Classifier to identify if the activity of the user, i.e. walking, standing still etc. </li>
<li> Classifier to measure the displacement</li>
</ol>

One of the downsides of this work is that in order to achieve high accuracy the user's step size is needed, therefore heavily relying on pre-training to the specific user. In a real world application of this method this would not be practical.

Song and his colleagues model the way or cause of ascent. That is, was the ascent a result of taking the elevator, stairs or escalator [3]. Then by using infrastructure support of the buildings and as well as additional tuning they are able to predict floor level.
This method also suffers from relying on data specific to the building.

Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.

=Method=

In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].

To collect data they designed and developed an iOS application (called Sensory) that ran on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second and each datum contained the different sensor measurements mentions earlier along with environment context, environment mean bldg floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.

From [4] the data was collected as follows:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:collection.png|600px]] </div>

Their algorithm used to predict floor level is a 3 part process:

<ol>
<li> Classifying whether smartphone is indoor or outdoor </li>
<li> Indoor/Outdoor Transition detector</li>
<li> Estimating vertical height and resolving to absolute floor level </li>
</ol>

==1) Classifying Indoor/Outdoor ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div>

From [5] they are using 6 features which were found through forests of trees feature reduction. The features are smartphone's barometric pressure, GPS vertical accuracy, GPS horizontal accuracy, GPS speed, device RSSI level, and magnetometer total reading.

The magnetometer total reading was calculated from given the 3-dimensional reading <math>x, y, z </math>

<div style="text-align: center;">Total Magnetic field strength <math>= \sqrt{x^{2} + y^{2} + z^{2}}</math></div>

They used a 3 layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if smartphone is indoor and <math> y = 0 </math> if smartphone is outdoor.

In their design they set <math> d = 3</math> by random search [6]. The point to make is that they wanted the network to learn the relationship given a little bit of information from both the past and future.

For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.

This is a critical part of their system and they only focus on the predictions in the subspace of being indoors.

They have trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math>.

The cost function is shown below:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div>

The final output of the LSTM is a time-series <math> T = {t1, t2, ..., ti, tn} </math> where each <math> ti = 0, ti = 1 </math> if the point is outside or inside respectively.

==2) Transition Detector ==

Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div>
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions T and they pick a subset <math>s_i </math> such that the Jacard distance (defined below) is <math> >= 0.4 </math>
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div>
Jacard Distance:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div>

After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.

==3) Vertical height and floor level ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div>

In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.

The second parameter is <math> p_1 </math> which is the current pressure reading. In order to generate the relative change in height <math> m_\Delta</math>

After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].

In order to resolve to an absolute floor level they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.

=Experiments and Results=

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div>

All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation.
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006.
Although the baselines performed considerably well the objective here was to show that an LSTM can be used in the future to model the entire system with an LSTM.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|500px]] </div>

The above chart shows the success that their system is able to achieve in floor level prediction.

=Future Work=
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate floor level predictions solely from sensor reading data.

=Critique=

In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which has not been done. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit to their system is that they don't need any additional infrastructure support in advance making it a practical solution for deployment.

A weakness is that they claim that they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%.

It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.

They also go against their claim stated at the beginning of the paper where they say they "..does not require the use of beacons, prior knowledge of the building infrastructure..." as in their clustering step they are in a way using prior knowledge from previous visits [4].

The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm can be implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.

Overall this paper is not too novel, as they don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out of the box solutions. There is no clear intuition why the proposed work well for the authors. This application could be solved using simpler methods like having an emergency push button on each floor. Authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.

=References=

[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):
1735–1780, 1997.

[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.

[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.

[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018

[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).

[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.

[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India

Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data

2018-11-08T04:08:32Z

Vrajendr: /* Method */

=Introduction=

In highly populated cities, where there are many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.

The motivation for this problem in the context of 911 calls: victims trapped in a tall building who seeks immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult.

In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.

In large cities with tall buildings, relying on GPS or Wi-Fi signals are not able to provide an accurate location of a caller.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]
[[File:19floor.png|250px]]</div>

In this work, there are two major contributions. The first is that they trained a recurrent neural network to classify whether a smartphone was either inside or outside of buildings. The second contribution is that they used the output of their previously trained classifier to aid in predicting the change in the barometric pressure of the smartphone from once it entered the building to its current location. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.

=Related Work=

In general, previous work falls under two categories. The first category of methods is classification methods based on the user's activity.
Therefore, some current methods leverage the user's activity to predict which is based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.
The second set of methods focus more on the use of a barometer which measures the atmospheric pressure. As a result, utilizing a barometer can provide the changes in altitude.

Avinash Parnandi and his coauthors used multiple classifiers in the predicting the floor level [2]. The steps in their algorithmic process are:
<ol>
<li> Classifier to predict whether the user is indoors or outdoors</li>
<li> Classifier to identify if the activity of the user, i.e. walking, standing still etc. </li>
<li> Classifier to measure the displacement</li>
</ol>

One of the downsides of this work is that in order to achieve high accuracy the user's step size is needed, therefore heavily relying on pre-training to the specific user. In a real world application of this method this would not be practical.

Song and his colleagues model the way or cause of ascent. That is, was the ascent a result of taking the elevator, stairs or escalator [3]. Then by using infrastructure support of the buildings and as well as additional tuning they are able to predict floor level.
This method also suffers from relying on data specific to the building.

Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.

=Method=

In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].

To collect data they designed and developed an iOS application (called Sensory) that ran on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength GPS longitude, GPS latitude, and altitude. The app streamed data at 1 sample per second and each datum contained the different sensor measurements mentions earlier along with environment context, environment mean bldg floors, environment activity, city name, country name, magnet x, magnet y, magnet z, magnet total.

From [4] the data was collected as follows:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:collection.png|600px]] </div>

Their algorithm used to predict floor level is a 3 part process:

<ol>
<li> Classifying whether smartphone is indoor or outdoor </li>
<li> Indoor/Outdoor Transition detector</li>
<li> Estimating vertical height and resolving to absolute floor level </li>
</ol>

==1) Classifying Indoor/Outdoor ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div>

From [5] they are using 6 features which were found through forests of trees feature reduction. The features are smartphone's barometric pressure, GPS vertical accuracy, GPS horizontal accuracy, GPS speed, device RSSI level, and magnetometer total reading.

The magnetometer total reading was calculated from given the 3-dimensional reading <math>x, y, z </math>

<div style="text-align: center;">Total Magnetic field strength <math>= \sqrt{x^{2} + y^{2} + z^{2}}</math></div>

They used a 3 layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if smartphone is indoor and <math> y = 0 </math> if smartphone is outdoor.

In their design they set <math> d = 3</math> by random search [6]. The point to make is that they wanted the network to learn the relationship given a little bit of information from both the past and future.

For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.

This is a critical part of their system and they only focus on the predictions in the subspace of being indoors.

They have trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math>.

The cost function is shown below:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div>

==2) Transition Detector ==

Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div>
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions T and they pick a subset <math>s_i </math> such that the Jacard distance (defined below) is <math> >= 0.4 </math>
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div>
Jacard Distance:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div>

After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.

==3) Vertical height and floor level ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div>

In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.

The second parameter is <math> p_1 </math> which is the current pressure reading. In order to generate the relative change in height <math> m_\Delta</math>

After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].

In order to resolve to an absolute floor level they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.

=Experiments and Results=

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div>

All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation.
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006.
Although the baselines performed considerably well the objective here was to show that an LSTM can be used in the future to model the entire system with an LSTM.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|500px]] </div>

The above chart shows the success that their system is able to achieve in floor level prediction.

=Future Work=
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate floor level predictions solely from sensor reading data.

=Critique=

In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which has not been done. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit to their system is that they don't need any additional infrastructure support in advance making it a practical solution for deployment.

A weakness is that they claim that they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%.

It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.

They also go against their claim stated at the beginning of the paper where they say they "..does not require the use of beacons, prior knowledge of the building infrastructure..." as in their clustering step they are in a way using prior knowledge from previous visits [4].

The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm can be implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.

Overall this paper is not too novel, as they don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out of the box solutions. There is no clear intuition why the proposed work well for the authors. This application could be solved using simpler methods like having an emergency push button on each floor. Authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.

=References=

[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):
1735–1780, 1997.

[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.

[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.

[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018

[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).

[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.

[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India

Predicting Floor Level For 911 Calls with Neural Network and Smartphone Sensor Data

2018-11-08T04:07:07Z

Vrajendr: /* Method */

=Introduction=

In highly populated cities, where there are many buildings, locating individuals in the case of an emergency is an important task. For emergency responders, time is of essence. Therefore, accurately locating a 911 caller plays an integral role in this important process.

The motivation for this problem in the context of 911 calls: victims trapped in a tall building who seeks immediate medical attention, locating emergency personnel such as firefighters or paramedics, or a minor calling on behalf of an incapacitated adult.

In this paper, a novel approach is presented to accurately predict floor level for 911 calls by leveraging neural networks and sensor data from smartphones.

In large cities with tall buildings, relying on GPS or Wi-Fi signals are not able to provide an accurate location of a caller.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:17floor.png|250px]]
[[File:19floor.png|250px]]</div>

In this work, there are two major contributions. The first is that they trained a recurrent neural network to classify whether a smartphone was either inside or outside of buildings. The second contribution is that they used the output of their previously trained classifier to aid in predicting the change in the barometric pressure of the smartphone from once it entered the building to its current location. In the final part of their algorithm, they are able to predict the floor level by clustering the measurements of height.

=Related Work=

In general, previous work falls under two categories. The first category of methods is classification methods based on the user's activity.
Therefore, some current methods leverage the user's activity to predict which is based on the offset in their movement [2]. These activities include running, walking, and moving through the elevator.
The second set of methods focus more on the use of a barometer which measures the atmospheric pressure. As a result, utilizing a barometer can provide the changes in altitude.

Avinash Parnandi and his coauthors used multiple classifiers in the predicting the floor level [2]. The steps in their algorithmic process are:
<ol>
<li> Classifier to predict whether the user is indoors or outdoors</li>
<li> Classifier to identify if the activity of the user, i.e. walking, standing still etc. </li>
<li> Classifier to measure the displacement</li>
</ol>

One of the downsides of this work is that in order to achieve high accuracy the user's step size is needed, therefore heavily relying on pre-training to the specific user. In a real world application of this method this would not be practical.

Song and his colleagues model the way or cause of ascent. That is, was the ascent a result of taking the elevator, stairs or escalator [3]. Then by using infrastructure support of the buildings and as well as additional tuning they are able to predict floor level.
This method also suffers from relying on data specific to the building.

Overall, these methods suffer from relying on pre-training to a specific user, needing additional infrastructure support, or data specific to the building. The method proposed in this paper aims to predict floor level without these constraints.

=Method=

In their paper, the authors claim that to their knowledge "there does not exist a dataset for predicting floor heights" [4].

To collect data they designed and developed an iOS application that ran on an iPhone 6s to aggregate the data. They used the smartphone's sensors to record different features such as barometric pressure, GPS course, GPS speed, RSSI strength GPS longitude, GPS latitude, and altitude.

From [4] the data was collected as follows:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:collection.png|600px]] </div>

Their algorithm used to predict floor level is a 3 part process:

<ol>
<li> Classifying whether smartphone is indoor or outdoor </li>
<li> Indoor/Outdoor Transition detector</li>
<li> Estimating vertical height and resolving to absolute floor level </li>
</ol>

==1) Classifying Indoor/Outdoor ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:classifierfloor.png|800px]] </div>

From [5] they are using 6 features which were found through forests of trees feature reduction. The features are smartphone's barometric pressure, GPS vertical accuracy, GPS horizontal accuracy, GPS speed, device RSSI level, and magnetometer total reading.

The magnetometer total reading was calculated from given the 3-dimensional reading <math>x, y, z </math>

<div style="text-align: center;">Total Magnetic field strength <math>= \sqrt{x^{2} + y^{2} + z^{2}}</math></div>

They used a 3 layer LSTM where the inputs are <math> d </math> consecutive time steps. The output <math> y = 1 </math> if smartphone is indoor and <math> y = 0 </math> if smartphone is outdoor.

In their design they set <math> d = 3</math> by random search [6]. The point to make is that they wanted the network to learn the relationship given a little bit of information from both the past and future.

For the overall signal sequence: <math> \{x_1, x_2,x_j, ... , x_n\}</math> the aim is to classify <math> d </math> consecutive sensor readings <math> X_i = \{x_1, x_2, ..., x_d \} </math> as <math> y = 1 </math> or <math> y = 0 </math> as noted above.

This is a critical part of their system and they only focus on the predictions in the subspace of being indoors.

They have trained the LSTM to minimize the binary cross entropy between the true indoor state <math> y </math> of example <math> i </math>.

The cost function is shown below:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:costfunction.png|500px]] </div>

==2) Transition Detector ==

Given the predictions from the previous step, now the next part is to find when the transition of going in or out of a building has occurred.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:transition.png|400px]] </div>
In this figure, they convolve filters <math> V_1, V_2</math> across the predictions T and they pick a subset <math>s_i </math> such that the Jacard distance (defined below) is <math> >= 0.4 </math>
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:v1v2.png|300px]] </div>
Jacard Distance:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:jacard.png|500px]]</div>

After this process, we are now left with a set of <math> b_i</math>'s describing the index of each indoor/outdoor transition. The process is shown in the first figure.

==3) Vertical height and floor level ==

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:resolvefloor.png|700px]] </div>

In the final part of the system, the vertical offset needs to be computed given the smartphone's last known location i.e. the last known transition which can easily be computed given the set of transitions from the previous step. All that needs to be done is to pull the index of most recent transition from the previous step and set <math> p_0</math> to the lowest pressure within a ~ 15-second window around that index.

The second parameter is <math> p_1 </math> which is the current pressure reading. In order to generate the relative change in height <math> m_\Delta</math>

After plugging this into the formula defined above we are now left with a scalar value which represents the height displacement between the entrance and the smartphone's current location of the building [7].

In order to resolve to an absolute floor level they use the index number of the clusters of <math> m_\Delta</math> 's. As seen above <math> 5.1 </math> is the third cluster implying floor number 3.

=Experiments and Results=

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ioaccuracy.png|500px]] </div>

All of these classifiers were trained and validated on data from a total of 5082 data points. The set split was 80% training and 20% validation.
For the LSTM the network was trained for a total of 24 epochs with a batch size of 128 and using an Adam optimizer where the learning rate was 0.006.
Although the baselines performed considerably well the objective here was to show that an LSTM can be used in the future to model the entire system with an LSTM.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:flooraccuracy.png|500px]] </div>

The above chart shows the success that their system is able to achieve in floor level prediction.

=Future Work=
The first part of the system used an LSTM for indoor/outdoor classification. Therefore, this separate module can be used in many other location problems. Working on this separate problem seems to be an approach that the authors will take. They also would like to aim towards modeling the whole problem within the LSTM in order to generate floor level predictions solely from sensor reading data.

=Critique=

In this paper, the authors presented a novel system which can predict a smartphone's floor level with 100% accuracy, which has not been done. Previous work relied heavily on pre-training and information regarding the building or users beforehand. Their work can generalize well to many types of tall buildings which are more than 19 stories. Another benefit to their system is that they don't need any additional infrastructure support in advance making it a practical solution for deployment.

A weakness is that they claim that they can get 100% accuracy, but this is only if they know the floor to ceiling height, and their accuracy relies on this key piece of information. Otherwise, when conditioned on the height of the building their accuracy drops by 35% to 65%.

It is also not clear that the LSTM is the best approach especially since a simple feedforward network achieved the same accuracy in their experiments.

They also go against their claim stated at the beginning of the paper where they say they "..does not require the use of beacons, prior knowledge of the building infrastructure..." as in their clustering step they are in a way using prior knowledge from previous visits [4].

The authors also recognize several potential failings of their method. One is that their algorithm will not differentiate based on the floor of the building the user entered on (if there are entrances on multiple floors). In addition, they state that a user on the roof could be detected as being on the ground floor. It was not mentioned/explored in the paper, but a person being on a balcony (ex: attached to an apartment) may have the same effect. These sources of error will need to be corrected before this or a similar algorithm can be implemented; otherwise, the algorithm may provide misleading data to rescue crews, etc.

Overall this paper is not too novel, as they don't provide any algorithmic improvement over the state of the art. Their methods are fairly standard ML techniques and they have only used out of the box solutions. There is no clear intuition why the proposed work well for the authors. This application could be solved using simpler methods like having an emergency push button on each floor. Authors don't provide sufficient motivation for why deep learning would be a good solution to this problem.

=References=

[1] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):
1735–1780, 1997.

[2] Parnandi, A., Le, K., Vaghela, P., Kolli, A., Dantu, K., Poduri, S., & Sukhatme, G. S. (2009, October). Coarse in-building localization with smartphones. In International Conference on Mobile Computing, Applications, and Services (pp. 343-354). Springer, Berlin, Heidelberg.

[3] Wonsang Song, Jae Woo Lee, Byung Suk Lee, Henning Schulzrinne. "Finding 9-1-1 Callers in Tall Buildings". IEEE WoWMoM '14. Sydney, Australia, June 2014.

[4] W Falcon, H Schulzrinne, Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data, 2018

[5] Kawakubo, Hideko and Hiroaki Yoshida. “Rapid Feature Selection Based on Random Forests for High-Dimensional Data.” (2012).

[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (February 2012), 281-305.

[7] Greg Milette, Adam Stroud: Professional Android Sensor Programming, 2012, Wiley India

Co-Teaching

2018-11-06T17:35:28Z

Vrajendr: /* Conclusions */

=Introduction=
==Title of Paper==
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels
==Contributions==
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed architecture, named ‘co-teaching’, maintains two networks simultaneously. The experiments are conducted on noisy versions of MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness
of deep learning models trained by the Co-teaching approach is much superior to state-of-the-art
baselines.

==Terminology==
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data.

Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels.

=Intuition=
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually while learning on a batch of noisy data only the error from the network itself is transferred back to facilitate learning. But in the case of Co-Teaching the two networks that are used are able to filter the different type of errors which flows back to themselves as well as the other network. Therefore the models learn mutually, i.e., from themselves as well as from the partner network.

=Motivation=
The paper draws motivation from two key facts:
• That many data collection processes yield noisy labels.
• That deep neural networks have a high capacity to overfit to noisy labels.
Because of these facts, it is challenging to train deep networks to be robust with noisy labels.
=Related Works=

Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into 3 strands: surrogate loss, noise estimation, and probabilistic modeling.

In the surrogate loss category, one work proposes an unbiased estimator to provide the noise corrected loss approach. Another work presented a robust non-convex loss, which is the special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of ROC curve. In the probabilistic modeling category, there is a two coin model proposed to handle noise labels from multiple annotators.

There are also deep learning approaches that can be used to approach data with noisy labels. One work proposed a unified framework to distill knowledge from clean labels and knowledge graphs. Another work trained a label cleaning network by a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously.
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper ads a crowd layer after the output layer for noisy labels from multiple annotators.

Learning to teach methods is another approach to this problem. The methods are made up by the teacher and student networks. The teacher network selects more informative instances for better training of student networks.

=Co-Teaching Algorithm=

[[File:Co-Teaching_Algorithm.png|600px|center]]

The idea as shown in the algorithm above is to train two deep networks simultaneously. In each mini-batch, each network selects its small-loss instances as useful knowledge and then teaches these useful instances to the peer network.

=Summary of Experiment=
==Proposed Method==
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network.
[[File:Co-Teaching Fig 1.png|600px|center]]
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate).
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself.
==Dataset Corruption==
To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transformation matrix. Two methods are used for generating such noise transformation matrices: pair flipping and symmetry.
[[File:Co-Teaching Fig 2.png|600px|center]]
Three noise conditions are simulated for comparing co-teaching with baseline methods.
{| class="wikitable"
{| border="1" cellpadding="3"
|-
|width="60pt"|Method
|width="100pt"|Noise Rate
|width="700pt"|Rationale
|-
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels.
|-
| Symmetry || 50% || Half of the instances have noisy labels. Further rationale can be found at [1].
|-
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario.
|}
|}
==Baseline Comparisons==
The co-teaching method is compared with several baseline approaches, which have varying:
• proficiency in dealing with a large number of classes,
• ability to resist heavy noise,
• need to combine with specific network architectures, and
• need to be pretrained.

[[File:Co-Teaching Fig 3.png|600px|center]]
===Bootstrap===
A method that deems a weighted combination of predicted and original labels as correct, and then solves kernels by backpropagation [2].
===S-Model===
Using an additional softmax layer to model the noise transition matrix [3].
===F-Correction===
Correcting the prediction by using a noise transition matrix which is estimated by a standard network [4].
===Decoupling===
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].
===MentorNet===
A teacher network is trained to identify and discard noisy instances in order to train the student network on cleaner instances [6].
==Implementation Details==
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. The networks also utilize dropout and batch normalization.

[[File: Co-Teaching Table 3.png|center]]
=Results and Discussion=
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows.
==MNIST==
The results of testing on the MNIST dataset are shown below. The Symmetry-20% case can be taken as a near-baseline; all methods perform well. However, under the Symmetry-50% case, all methods except MentorNet and Co-Teaching drop below 90% accuracy. Under the Pair-45% case, all methods except MentorNet and Co-Teaching drop below 60%. Under both high-noise conditions, the Co-Teaching method produces the highest accuracy. Similar patterns can be seen in the two additional sets of test results, though the specific accuracy values are different. Co-Teaching performs best under the high-noise situations

The images labelled 'Figure 3' show test accuracy with respect to epoch of the various algorithms. Many algorithms show evidence of over-fitting or being influenced by noisy data, after reaching initial high accuracy. MentorNet and Co-Teaching experience this less than other methods, and Co-Teaching generally achieves higher accuracy than MentorNet.

[[File:Co-Teaching Table 4.png|550px|center]]

[[File:Co-Teaching Graphs MNIST.PNG|center]]

==CIFAR10==
[[File:Co-Teaching Table 5.png|550px|center]]

[[File:Co-Teaching Graphs CIFAR10.PNG|center]]
==CIFAR100==
[[File:Co-Teaching Table 6.png|550px|center]]

[[File: Co-Teaching Graphs CIFAR100.PNG|center]]

=Conclusions=
The main goal of the paper is to introduce the “Co-teaching” learning paradigm that uses two deep neural networks learning simultaneously to avoid noisy labels. Experiments are performed on several datasets such as MNIST, CIFAR-10, and CIFAR-100. The performance varied depending on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios, (pair-45% and symmetry-50%), the co-teaching methods outperforms baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).

=Critique=
==Lack of Task Diversity==
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality.
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm.
==Lack of Theoretical Development (Mentioned in conclusion)==
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly.
=References=
[1] A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. Neuroimage, 86:446–460, 2014.

[2] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

[3] M. Jas, T. Dupré La Tour, U. Şimşekli, and A. Gramfort. Learning the morphology of brain signals using alpha-stable convolutional sparse coding. In Advances in Neural Information Processing Systems (NIPS), pages 1–15, 2017.

[4] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[5] R. Chalasani, J. C. Principe, and N. Ramakrishnan. A fast proximal method for convolutional sparse coding. In International Joint Conference on Neural Networks (IJCNN), pages 1–5, 2013. ISBN 9781467361293.

[6] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In Computer Vision and Pattern Recognition (CVPR), pages 5135–5143. IEEE, 2015.

Co-Teaching

2018-11-06T17:33:28Z

Vrajendr: /* Co-Teaching Algorithm */

=Introduction=
==Title of Paper==
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels
==Contributions==
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed architecture, named ‘co-teaching’, maintains two networks simultaneously. The experiments are conducted on noisy versions of MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness
of deep learning models trained by the Co-teaching approach is much superior to state-of-the-art
baselines.

==Terminology==
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data.

Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels.

=Intuition=
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually while learning on a batch of noisy data only the error from the network itself is transferred back to facilitate learning. But in the case of Co-Teaching the two networks that are used are able to filter the different type of errors which flows back to themselves as well as the other network. Therefore the models learn mutually, i.e., from themselves as well as from the partner network.

=Motivation=
The paper draws motivation from two key facts:
• That many data collection processes yield noisy labels.
• That deep neural networks have a high capacity to overfit to noisy labels.
Because of these facts, it is challenging to train deep networks to be robust with noisy labels.
=Related Works=

Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into 3 strands: surrogate loss, noise estimation, and probabilistic modeling.

In the surrogate loss category, one work proposes an unbiased estimator to provide the noise corrected loss approach. Another work presented a robust non-convex loss, which is the special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of ROC curve. In the probabilistic modeling category, there is a two coin model proposed to handle noise labels from multiple annotators.

There are also deep learning approaches that can be used to approach data with noisy labels. One work proposed a unified framework to distill knowledge from clean labels and knowledge graphs. Another work trained a label cleaning network by a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously.
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper ads a crowd layer after the output layer for noisy labels from multiple annotators.

Learning to teach methods is another approach to this problem. The methods are made up by the teacher and student networks. The teacher network selects more informative instances for better training of student networks.

=Co-Teaching Algorithm=

[[File:Co-Teaching_Algorithm.png|600px|center]]

The idea as shown in the algorithm above is to train two deep networks simultaneously. In each mini-batch, each network selects its small-loss instances as useful knowledge and then teaches these useful instances to the peer network.

=Summary of Experiment=
==Proposed Method==
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network.
[[File:Co-Teaching Fig 1.png|600px|center]]
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate).
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself.
==Dataset Corruption==
To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transformation matrix. Two methods are used for generating such noise transformation matrices: pair flipping and symmetry.
[[File:Co-Teaching Fig 2.png|600px|center]]
Three noise conditions are simulated for comparing co-teaching with baseline methods.
{| class="wikitable"
{| border="1" cellpadding="3"
|-
|width="60pt"|Method
|width="100pt"|Noise Rate
|width="700pt"|Rationale
|-
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels.
|-
| Symmetry || 50% || Half of the instances have noisy labels. Further rationale can be found at [1].
|-
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario.
|}
|}
==Baseline Comparisons==
The co-teaching method is compared with several baseline approaches, which have varying:
• proficiency in dealing with a large number of classes,
• ability to resist heavy noise,
• need to combine with specific network architectures, and
• need to be pretrained.

[[File:Co-Teaching Fig 3.png|600px|center]]
===Bootstrap===
A method that deems a weighted combination of predicted and original labels as correct, and then solves kernels by backpropagation [2].
===S-Model===
Using an additional softmax layer to model the noise transition matrix [3].
===F-Correction===
Correcting the prediction by using a noise transition matrix which is estimated by a standard network [4].
===Decoupling===
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].
===MentorNet===
A teacher network is trained to identify and discard noisy instances in order to train the student network on cleaner instances [6].
==Implementation Details==
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. The networks also utilize dropout and batch normalization.

[[File: Co-Teaching Table 3.png|center]]
=Results and Discussion=
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows.
==MNIST==
The results of testing on the MNIST dataset are shown below. The Symmetry-20% case can be taken as a near-baseline; all methods perform well. However, under the Symmetry-50% case, all methods except MentorNet and Co-Teaching drop below 90% accuracy. Under the Pair-45% case, all methods except MentorNet and Co-Teaching drop below 60%. Under both high-noise conditions, the Co-Teaching method produces the highest accuracy. Similar patterns can be seen in the two additional sets of test results, though the specific accuracy values are different. Co-Teaching performs best under the high-noise situations

The images labelled 'Figure 3' show test accuracy with respect to epoch of the various algorithms. Many algorithms show evidence of over-fitting or being influenced by noisy data, after reaching initial high accuracy. MentorNet and Co-Teaching experience this less than other methods, and Co-Teaching generally achieves higher accuracy than MentorNet.

[[File:Co-Teaching Table 4.png|550px|center]]

[[File:Co-Teaching Graphs MNIST.PNG|center]]

==CIFAR10==
[[File:Co-Teaching Table 5.png|550px|center]]

[[File:Co-Teaching Graphs CIFAR10.PNG|center]]
==CIFAR100==
[[File:Co-Teaching Table 6.png|550px|center]]

[[File: Co-Teaching Graphs CIFAR100.PNG|center]]

=Conclusions=
The main goal of the paper is to introduce “Co-teaching” learning paradigm that uses two deep neural networks learning at the same time to avoid noisy labels. Experiments are performed on several datasets such as MNIST, CIFAR-10, and CIFAR-100. The performance varied depends on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios, (pair-45% and symmetry-50%), the co-teaching methods outperforms baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).

=Critique=
==Lack of Task Diversity==
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality.
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm.
==Lack of Theoretical Development (Mentioned in conclusion)==
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly.
=References=
[1] A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. Neuroimage, 86:446–460, 2014.

[2] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

[3] M. Jas, T. Dupré La Tour, U. Şimşekli, and A. Gramfort. Learning the morphology of brain signals using alpha-stable convolutional sparse coding. In Advances in Neural Information Processing Systems (NIPS), pages 1–15, 2017.

[4] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[5] R. Chalasani, J. C. Principe, and N. Ramakrishnan. A fast proximal method for convolutional sparse coding. In International Joint Conference on Neural Networks (IJCNN), pages 1–5, 2013. ISBN 9781467361293.

[6] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In Computer Vision and Pattern Recognition (CVPR), pages 5135–5143. IEEE, 2015.

File:Co-Teaching Algorithm.png

2018-11-06T17:19:16Z

Vrajendr: Vrajendr uploaded a new version of File:Co-Teaching Algorithm.png

File:Co-Teaching Algorithm.png

2018-11-06T17:18:44Z

Vrajendr:

Co-Teaching

2018-11-06T17:18:30Z

Vrajendr:

=Introduction=
==Title of Paper==
Co-teaching: Robust Training Deep Neural Networks with Extremely Noisy Labels
==Contributions==
The paper proposes a novel approach to training deep neural networks on data with noisy labels. The proposed architecture, named ‘co-teaching’, maintains two networks simultaneously. The experiments are conducted on noisy versions of MNIST, CIFAR-10 and CIFAR-100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness
of deep learning models trained by the Co-teaching approach is much superior to state-of-the-art
baselines.

==Terminology==
Ground-Truth Labels: The proper objective labels (i.e. the real, or ‘true’, labels) of the data.

Noisy Labels: Labels that are corrupted (either manually or through the data collection process) from ground-truth labels.

=Intuition=
The Co-teaching architecture maintains two networks with different learning abilities simultaneously. The reason why Co-teaching is more robust can be explained as follows. Usually while learning on a batch of noisy data only the error from the network itself is transferred back to facilitate learning. But in the case of Co-Teaching the two networks that are used are able to filter the different type of errors which flows back to themselves as well as the other network. Therefore the models learn mutually, i.e., from themselves as well as from the partner network.

=Motivation=
The paper draws motivation from two key facts:
• That many data collection processes yield noisy labels.
• That deep neural networks have a high capacity to overfit to noisy labels.
Because of these facts, it is challenging to train deep networks to be robust with noisy labels.
=Related Works=

Some approaches use statistical learning methods for the problem of learning from extremely noisy labels. These approaches can be divided into 3 strands: surrogate loss, noise estimation, and probabilistic modeling.

In the surrogate loss category, one work proposes an unbiased estimator to provide the noise corrected loss approach. Another work presented a robust non-convex loss, which is the special case in a family of robust losses. In the noise rate estimation category, some authors propose a class-probability estimator using order statistics on the range of scores. Another work presented the same estimator using the slope of ROC curve. In the probabilistic modeling category, there is a two coin model proposed to handle noise labels from multiple annotators.

There are also deep learning approaches that can be used to approach data with noisy labels. One work proposed a unified framework to distill knowledge from clean labels and knowledge graphs. Another work trained a label cleaning network by a small set of clean labels and used it to reduce the noise in large-scale noisy labels. There is also a proposed joint optimization framework to learn parameters and estimate true labels simultaneously.
Another work leverages an additional validation set to adaptively assign weights to training examples in every iteration. One particular paper ads a crowd layer after the output layer for noisy labels from multiple annotators.

Learning to teach methods is another approach to this problem. The methods are made up by the teacher and student networks. The teacher network selects more informative instances for better training of student networks.

=Co-Teaching Algorithm=

[[File:Co-Teaching_Algorithm.png|600px|center]]

=Summary of Experiment=
==Proposed Method==
The proposed co-teaching method maintains two networks simultaneously, and samples instances with small loss at each mini batch. The sample of small-loss instances is then taught to the peer network.
[[File:Co-Teaching Fig 1.png|600px|center]]
The co-teaching method relies on research that suggests deep networks learn clean and easy patterns in initial epochs, but are susceptible to overfitting noisy labels as the number of epochs grows. To counteract this, the co-teaching method reduces the mini-batch size by gradually increasing a drop rate (i.e., noisy instances with higher loss will be dropped at an increasing rate).
The mini-batches are swapped between peer networks due to the underlying intuition that different classifiers will generate different decision boundaries. Swapping the mini-batches constitutes a sort of ‘peer-reviewing’ that promotes noise reduction since the error from a network is not directly transferred back to itself.
==Dataset Corruption==
To simulate learning with noisy labels, the datasets (which are clean by default) are manually corrupted by applying a noise transformation matrix. Two methods are used for generating such noise transformation matrices: pair flipping and symmetry.
[[File:Co-Teaching Fig 2.png|600px|center]]
Three noise conditions are simulated for comparing co-teaching with baseline methods.
{| class="wikitable"
{| border="1" cellpadding="3"
|-
|width="60pt"|Method
|width="100pt"|Noise Rate
|width="700pt"|Rationale
|-
| Pair Flipping || 45% || Almost half of the instances have noisy labels. Simulates erroneous labels which are similar to true labels.
|-
| Symmetry || 50% || Half of the instances have noisy labels. Further rationale can be found at [1].
|-
| Symmetry || 20% || Verify the robustness of co-teaching in a low-level noise scenario.
|}
|}
==Baseline Comparisons==
The co-teaching method is compared with several baseline approaches, which have varying:
• proficiency in dealing with a large number of classes,
• ability to resist heavy noise,
• need to combine with specific network architectures, and
• need to be pretrained.

[[File:Co-Teaching Fig 3.png|600px|center]]
===Bootstrap===
A method that deems a weighted combination of predicted and original labels as correct, and then solves kernels by backpropagation [2].
===S-Model===
Using an additional softmax layer to model the noise transition matrix [3].
===F-Correction===
Correcting the prediction by using a noise transition matrix which is estimated by a standard network [4].
===Decoupling===
Two separate classifiers are used in this technique. Parameters are updated using only the samples that are classified differently between the two models [5].
===MentorNet===
A teacher network is trained to identify and discard noisy instances in order to train the student network on cleaner instances [6].
==Implementation Details==
Two CNN models using the same architecture (shown below) are used as the peer networks for the co-teaching method. They are initialized with different parameters in order to be significantly different from one another (different initial parameters can lead to different local minima). An Adam optimizer (momentum=0.9), a learning rate of 0.001, a batch size of 128, and 200 epochs are used for each dataset. The networks also utilize dropout and batch normalization.

[[File: Co-Teaching Table 3.png|center]]
=Results and Discussion=
The co-teaching algorithm is compared to the baseline approaches under the noise conditions previously described. The results are as follows.
==MNIST==
The results of testing on the MNIST dataset are shown below. The Symmetry-20% case can be taken as a near-baseline; all methods perform well. However, under the Symmetry-50% case, all methods except MentorNet and Co-Teaching drop below 90% accuracy. Under the Pair-45% case, all methods except MentorNet and Co-Teaching drop below 60%. Under both high-noise conditions, the Co-Teaching method produces the highest accuracy. Similar patterns can be seen in the two additional sets of test results, though the specific accuracy values are different. Co-Teaching performs best under the high-noise situations

The images labelled 'Figure 3' show test accuracy with respect to epoch of the various algorithms. Many algorithms show evidence of over-fitting or being influenced by noisy data, after reaching initial high accuracy. MentorNet and Co-Teaching experience this less than other methods, and Co-Teaching generally achieves higher accuracy than MentorNet.

[[File:Co-Teaching Table 4.png|550px|center]]

[[File:Co-Teaching Graphs MNIST.PNG|center]]

==CIFAR10==
[[File:Co-Teaching Table 5.png|550px|center]]

[[File:Co-Teaching Graphs CIFAR10.PNG|center]]
==CIFAR100==
[[File:Co-Teaching Table 6.png|550px|center]]

[[File: Co-Teaching Graphs CIFAR100.PNG|center]]

=Conclusions=
The main goal of the paper is to introduce “Co-teaching” learning paradigm that uses two deep neural networks learning at the same time to avoid noisy labels. Experiments are performed on several datasets such as MNIST, CIFAR-10, and CIFAR-100. The performance varied depends on the noise level in different scenarios. In the simulated ‘extreme noise’ scenarios, (pair-45% and symmetry-50%), the co-teaching methods outperforms baseline methods in terms of accuracy. This suggests that the co-teaching method is superior to the baseline methods in scenarios of extreme noise. The co-teaching method also performs competitively in the low-noise scenario (symmetry-20%).

=Critique=
==Lack of Task Diversity==
The datasets used in this experiment are all image classification tasks – these results may not generalize to other deep learning applications, such as classifications from data with lower or higher dimensionality.
==Needs to be expanded to other weak supervisions (Mentioned in conclusion)==
Adaptation of the co-teaching method to train under other weak supervision (e.g. positive and unlabeled data) could expand the applicability of the paradigm.
==Lack of Theoretical Development (Mentioned in conclusion)==
This paper lacks any theoretical guarantees for co-teaching. Proving that the results shown in this study are generalizable would bolster the findings significantly.
=References=
[1] A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. Neuroimage, 86:446–460, 2014.

[2] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

[3] M. Jas, T. Dupré La Tour, U. Şimşekli, and A. Gramfort. Learning the morphology of brain signals using alpha-stable convolutional sparse coding. In Advances in Neural Information Processing Systems (NIPS), pages 1–15, 2017.

[4] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[5] R. Chalasani, J. C. Principe, and N. Ramakrishnan. A fast proximal method for convolutional sparse coding. In International Joint Conference on Neural Networks (IJCNN), pages 1–5, 2013. ISBN 9781467361293.

[6] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In Computer Vision and Pattern Recognition (CVPR), pages 5135–5143. IEEE, 2015.

Annotating Object Instances with a Polygon RNN

2018-11-06T05:10:58Z

Vrajendr: /* Conclusion */

Summary of the CVPR '17 best [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf ''paper'']

The presentation video of paper is available here[https://www.youtube.com/watch?v=S1UUR4FlJ84].

= Introduction =

If a snapshot of an image is given to a human, how will he/she describe a scene? He/she might identify that there is a car parked near the curb, or that the car is parked right beside a street light. This ability to decompose objects in scenes into separate entities is key to understanding what is around us and it helps to reason about the behavior of objects in the scene.

Automating this process is a classic computer vision problem and is often termed "object detection". There are four distinct levels of detection (refer to Figure 1 for a visual cue):

1. Classification + Localization: This is the most basic method that detects whether '''an''' object is either present or absent in the image and then identifies the position of the object within the image in the form of a bounding box overlayed on the image.

2. Object Detection: The classic definition of object detection points to the detection and localization of '''multiple''' objects of interest in the image. The output of the detection is still a bounding box overlayed on the image at the position corresponding to the location of the objects in the image.

3. Semantic Segmentation: This is a pixel level approach, i.e., each pixel in the image is assigned to a category label. Here, there is no difference between instances; this is to say that there are objects present from three distinct categories in the image, without tracking or reporting the number of appearances of each instance within a category.

4. Instance Segmentation: The goal, here, is to not only to assign pixel-level categorical labels, but also to identify each entity separately as sheep 1, sheep 2, sheep 3, grass, and so on.

[[File:Figure_1.jpeg | 450px|thumb|center|Figure 1: Different levels of detection in an image.]]

== Motivation ==

Semantic segmentation helps us achieve a deeper understanding of images than image classification or object detection. Over and above this, instance segmentation is crucial in applications where multiple objects of the same category are to be tracked, especially in autonomous driving, mobile robotics, and medical image processing. This paper deals with a novel method to tackle the instance segmentation problem pertaining specifically to the field of autonomous driving, but shown to generalize well in other fields such as medical image processing.

Most of the recent approaches to on instance segmentation are based on deep neural networks and have demonstrated impressive performance. Given that these approaches require a lot of computational resources and that their performance depends on the amount of accessible training data, there has been an increase in the demand to label/annotate large-scale datasets. This is both expensive and time-consuming. Thus, the main goal of the paper is to enable '''semi-automatic''' annotation of object instances.

Most of the datasets available pass through a stage where annotators manually outline the objects with a closed polygon. Polygons allow annotation of objects with a small number of clicks (30 - 40) compared to other methods; this approach works as the silhouette of an object is typically connected without holes. Thus, the authors' intuition behind the success of this method is the sparse nature of these polygons that allow representation of an object through a cluster of pixels rather than a pixel level description.

= Related Works =

The related works are related to the fields of semi-automatic image annotation and object instance segmentation.

The critical advances in the field of semi-automatic annotation are as follows:

1) Some researchers use scribbles as seeds to model the appearance of foreground and background.

2) Other works use multiple scribbles on the object and exploit motion cues to annotate an object in a video.

3) Scribbles are also used to train CNNs for semantic image segmentation.

4) Some methods exploit annotations in the form of 2D bounding boxes and perform per-pixel labeling with foreground/background models using EM.

The following are the critical advances in the field of Object instance segmentation:

1) Most of the approaches operate on pixel-level and exploit a CNN inside a box or a patch to perform the labeling.

2) Some approaches aim to produce a polygon around an object. These approaches start by detecting edge fragments and find an optimal cycle that links the edges into a coherent region.

3) One particular work uses superpixels in the form of small polygons which is combined into object regions with the aim to label aerial images.

= Model =

As an input to the model, an annotator or perhaps another neural network provides a ground-truth bounding box containing an object of interest and the model auto-generates a polygon outlining the object instance using a Recurrent Neural Network which they call: Polygon-RNN.

The RNN model predicts the vertices of the polygon at each time step given a CNN representation of the image, the last two time steps, and the first vertex location. The location of the first vertex is defined differently and will be defined shortly. The information regarding the previous two-time steps helps the RNN create a polygon in a specific direction and the first vertex provides a cue for loop closure of the polygon edges.

The polygon is parametrized as a sequence of 2D vertices and it is assumed that the polygon is closed. In addition, the polygon generation is fixed to follow a clockwise orientation since there are multiple ways to create a polygon given that it is cyclic structure. However, the starting point of the sequence is defined so that it can be any of the vertices of the polygon.

== Architecture ==

There are two primary networks at play: 1. CNN with skip connections, and 2. One-to-many type RNN.

[[File:Figure_2_Neel.JPG | 800px|thumb|center|Figure 2: Model architecture for Polygon-RNN depicting a CNN with skip connections feeding into a 2 layer ConvLSTM (One-to-many type).]]

1. '''CNN with skip connections''':

The authors have adopted the VGG16 feature extractor architecture with a few modifications pertaining to the preservation of feature fused together in a tensor that can feed into the RNN (refer to Figure 2). Namely, the last max-pool layer present in the VGG16 CNN has been removed. The image fed into the CNN is pre-shrunk to a 224x224x3 tensor(3 being the Red, Green, and Blue channels). The image passing through 2 pooling layers with 128 and 2 convolutional layers. At each of these four steps, the idea is to have a width of 512 and so the output tensor at pool2 is convolved with 4 3x3x128 filters and the output tensor at pool3 is convolved with 2 3x3x256 filters. The skip connections from the four layers allow the CNN to extract low-level edge and corner features as well as boundary/semantic information about the instances. Finally, a 3x3 convolution applied along with a ReLU non-linearity results in a 28x28x128 tensor that contains semantic information pertinent to the image frame and is taken as an input by the RNN.

2. '''RNN - 2 Layer ConvLSTM'''

The RNN is employed to capture information about the previous vertices in the time-series. Specifically, a Convolutional LSTM is used as a decoder. The ConvLSTM allows preservation of the spatial information in 2D and reduces the number of parameters compared to a Fully Connected RNN. The polygon is modeled with a kernel size of 3x3 and 16 channels outputting a vertex at each time step. The ConvLSTM gets as input a tensor step t which
concatenates 4 features: the CNN feature representation of the image, one-hot encodingof the previous predicted vertex and the vertex predicted
from two time steps ago, as well as the one-hot encoding of the first predicted vertex.

The authors have treated the vertex prediction task as a classification task in that the location of the vertices is through a one-hot representation of dimension DxD + 1 (D chosen to be 28 by the authors in tests). The one additional dimension is the storage cue for loop closure for the polygon. Given that, the one-hot representation of the two previously predicted vertices and the first vertex are taken in as an input, a clockwise (or for that reason any fixed direction) direction can be forced for the creation of the polygon. Coming back to the prediction of the first vertex, this is done through further modification of the CNN by adding two DxD layers with one branch predicting object instance boundaries while the other takes in this output as well as the image features to predict the first vertex.

== Training ==

The training of the model is done as follows:

1. Cross-entropy is used for the RNN cost function.

2. Instead of Stochastic Gradient Descent, Adam is used for optimization: batch size = 8, learning rate = 1e^-4 (learning rate decays after 10 epochs by a factor of 10)

3. For the first vertex prediction, the modified CNN mentioned previously, is trained using a multi-task cost function.

The reported time for training is one day on a Nvidia Titan-X GPU.

=== Human Annotator in the Loop ===

The model allows for the prediction at a given time step to be corrected and this corrected vertex is then fed into the next time step of the RNN, effectively rejecting the network predicted vertex. This has the simple effect of putting the model "back on the right track". The typical inference time as quoted by the paper is 250ms per object.

== Results ==

The evaluation of the model performance was conducted based on the Cityscapes and KITTI Datasets. The standard Intersection over Union (IoU) measure is used for comparison. The calculation for IoU takes both the predicted and ground-truth object boundaries. The intersection (area contained in both boundaries at once) is divided by the union (the area contained by at least one, or both, of the boundaries). A low score of this metric would mean that there is little overlap between the boundaries, or large areas on non-overlap, and a score of 1.0 would indicate that the two boundaries contain the same area.

[[File:Table_1_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]

Compared to other instance segmentation techniques, the Polygon-RNN method performs significantly better in the person, car, and rider categories and above average in other categories. In addition, with the help of the annotator, the speedup factor was 7.3 times with under 5 clicks which the authors claim is the main advantage of this method.

[[File:Table_2_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]

In addition, most of the comparisons with human annotators show that the method is at par with human-level annotation.

<gallery widths=500px heights=500px perrow=2 mode="packed">
File:Figure_3_Neel.JPG|Figure 3: Qualitative results without human annotator in the loop.|alt=alt language
File:Figure_4_Neel.JPG|Figure 4: Qualitative results: comparison with human annotator.|alt=alt language
</gallery>

=Conclusion=

The important conclusions from this paper are:

1. The paper presented a powerful generic annotation tool that works on different unseen datasets.

2. Significant improvement in annotation time can be achieved with the Polygon-RNN method itself (speed-up factor of 4.74).

3. However, the flexibility of having inputs from a human annotator helps increase the IoU for a certain range of clicks.

4. The model architecture has a down-sampling factor of 16 and the final output resolution and accuracy is sensitive to object size.

5. Another downside of the model architecture is that training time is increased due to the training of the CNN for the first vertex.

=Critique=

1. This paper requires training of an entire CNN for the first vertex and is inefficient in that sense as it introduces additional parameters adding to the computation time and resource demand.

2. The method outperforms other methods only in the three categories mentioned but isn't a significant improvement in other categories.

3. The baseline methods have an upper hand compared to this model when it comes to larger objects since the nature of the down-scaled structure adopted by this model.

4. With the human annotator in the loop, the model speeds up the process of annotation by over 7 times which is perhaps a big cost and time cutting improvement for companies.

5. In terms of future work, elimination of the additional CNN for the first vertex as well as an enhanced architecture to remain insensitive to the size of the object to be annotated should be implemented.

Zero-Shot Visual Imitation

2018-11-06T04:55:55Z

Vrajendr: /* Forward Consistency Loss */

This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach a desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time which the agent will form into observation-action pairs to then distill into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable especially looking at the fact that new tasks need a set of new demonstrations for the robot to learn from.

===Paper Overview===
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.

[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Mutli-step GSP with previous action history; (c) Mutli-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Mutli-step GSP with forward consistency loss proposed in this work.]]

This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'' and it is learned by re-labeling states that the agent has visited as goals and the actions taken as prediction targets. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.

A challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; that is, there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'' which essentially says that reaching the goal is more important than how it is reached. First, a forward model is learned that predicts the next observation from the given action and current observation. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss has the effect of not inadvertently penalizing actions that are ''consistent'' with the ground-truth action but not exactly the same.

As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.

Of course, when introducing something like this forward-consistent loss, issues related to the number of steps needed to reach a certain goal become prevalent. To address this, the paper pairs the GSP with a goal recognizer that determines if the goal has been satisfied with respect to some metrics. Figure 1 shows various GSPs along with diagram d) showing the forward-consistent loss proposed in this paper.

The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation and navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.

===Related Work===
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.

==Learning to Imitate Without Expert Supervision==

In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described.

Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment using the policy <math display="inline">a = π_E(s)</math>. This exploration data is used to learn the GSP.

<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div>

The above equation represents the learned GSP. <math display="inline">π</math> takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs the sequence of required actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> from the current observation <math display="inline">x_i</math>. The states <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and the number of actions <math display="inline">K</math> is inferred by the model. <math display="inline">π</math> can be thought of as a general deep network with parameters <math display="inline">θ_π</math>. It is good to note that <math display="inline">x_g</math> could be an intermediate subtask of the overall goal. So in essence, subtasks can be strung together to achieve an overall goal (i.e. go to position 1, then go to position 2, then go to final destination).

Let the sequence of images <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math> be the task to be imitated which is captured when the expert demonstrates the task. The sequence has at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses the learned GSP <math display="inline">π</math> to start from initial state <math display="inline">x_0</math> and follow the actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to imitate the observations in <math display="inline">D</math>.

A separate ''goal recognizer'' network is needed to ascertain if the current observation is close to the goal or not. This is because multiple actions might be required to reach close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and if not, the agent executes
<math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached.

===Learning the Goal-Conditioned Skill Policy (GSP)===

It is easy to first describe the one-step version of GSP and then extend it to a multi-step version. The one-step trajectory can be generalized as <math display="inline">(x_t; a_t; x_{t+1})</math> with GSP <math display="inline">\hat{a}_t = π(x_t; x_{t+1}; θ_π)</math> and is trained by the standard cross-entropy loss given below with respect to the GSP parameters <math display="inline">θ_π</math>:

<div style="text-align: center;"><math>L(a_t; \hat{a}_t) = p(a_t|x_t; x_{t+1}) log( \hat{a}_t)</math></div>

<math display="inline">p</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted action distributions respectively. <math display="inline">p</math> is not readily available so it is empirically approximated using samples from the distribution <math display="inline">a_t</math>. In a standard deep learning problem it is common to assume <math display="inline">p</math> as a delta function at <math display="inline">a_t</math> but this is violated if <math display="inline">p</math> is multi-modal and high-dimensional. That is, the same inputs would be presented with different targets leading to high variance in gradients. This would make learning challenging, leading to the further developments presented in sections 2.2, 2.3 and 2.4.

===Forward Consistency Loss===

To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (prediction from executing <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients and aid the learning process. This is what is denoted as ''forward consistency loss''.

To operationalize the forward consistency loss... The forward dynamics <math display="inline">f</math> are learned from the data and is defined as <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math> so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters of <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground truth transitions while the second term ensures consistency. In summary, the loss function is given below:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π θ_f}{min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div>
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div>

Past works have shown that learning forward dynamics in the feature space as opposed to raw observation space is more robust so this paper follows in extending the GSP to make predictions on feature representations denoted <math>\phi(x_t), \phi(x_{t+1})</math>. The forward consistency loss is then computed to make predictions in the feature space \phi.

The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div>

<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{a}_t) = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div>

The forward consistency loss is computed at each time step, t, and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network with inputs current state, goal states, actions at previous time step and the internal hidden representation denoted <math> h_{t-1}</math>, and outputs the actions to take.

===Goal Recognizer===

The goal recognizer network was introduced to figure out if the current goal is reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition was taken as a binary classification problem; given an observation and the goal, is the observation close to the goal or not. The goal classifier was trained using the standard cross-entropy loss.

===Ablations and Baselines===

To summarize, the GSP formulation is composed of (a) recurrent variable-length skill policy network, (b) explicitly encoding previous action in the recurrence, (c) goal recognizer, (d) forward consistency loss function, and (w) learning forward dynamics in the feature space instead of raw observation space.

To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:

# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM.
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup.
# '''GSP-NoPrevAction-NoFwdConst''' is the removal of the paper's recurrent GSP without previous action history and without forward consistency loss.
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without forward consistency objective.
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
# '''GSP''' refers to the complete method with all the components.

==Experiments==

The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.

===Rope Manipulation===

Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.

In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration the robot picked up the robot at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper.

For this experiment, the Baxter robot is setup exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.

The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, direction of displacement and the magnitude of displacement. All models were optimized using Asam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.

The approach of this paper is compared to (Nair et al., 2017) where they did similar experiments using an inverse model. The results in Figure 3 show that for the 'S' shape construction, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.

[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The
robotics system setup. (b) The sequence of human demonstration images provided by the human
during inference for the task of knot-tying (top row), and the sequences of observation states reached
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’
shape. Our agent is able to successfully imitate the demonstration.]]

[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]

===Navigation in Indoor Office Environments===
In this experiment, the robot was shown a single image of multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided I the appendix of the paper). The robot was then placed into an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.

The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.

Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fairs much better than all previous baselines.

[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image
(top-left). Since the initial and goal image have no overlap, the robot first explores the environment
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and
such exploratory behavior naturally emerged from the self-supervised learning.]]

[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image
of goal in an unseen environment. Each column represents a different run of our system for a
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given
a successful run but reaches the goal successfully at a much higher rate.]]

Figure 5 and table 1 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. It is good to note that zero-shot visual imitation is robust to a changing environment where every frame need not match the demonstrated frame. This is achieved by providing sparse landmarks.

[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of
images (top row). The TurtleBot is positioned in a manner such that the first image in demonstration
has no overlap with its current observation. Even under this condition the robot is able to move close
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from
WayPoint-1.]]

[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three
runs of two different demonstrations. Results show that our method outperforms the baselines. Note
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions
and neither model succeeded. Detailed results are available in the supplementary materials.]]

===3D Navigation in VizDoom===

To round off the experiments, a VizDoom simulation environment was used to test the GSP. The goal was to measure robustness of each method with proper error bars, the role of initial self-supervised data collection and the quantitative difference in modelling forward consistency loss in feature space in comparison to raw visual space.

Data was collected using two method: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.

Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.

[[File:8-Table3.png | 550px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]

==Discussion==

This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.

A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending to the third-person may yield a learning of a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen.

The expert demonstrations are also purely imitated; that is, the agent does not learn the demonstrations. Future work could look into learning the demonstration so as to richen its exploration techniques.

This work used a sequence of images to provide a demonstration but the work in general does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work.

==References==

[1] D.Pathak, P.Mahmoudieh, G.Luo, P.Agrawal, D.Chen, Y.Shentu, E.Shelhamer, J.Malik, A.A.Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.

[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning
from demonstration. Robotics and autonomous systems, 2009.

[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood
Cliffs, NJ, 1977.

[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke
by poking: Experiential learning of intuitive physics. NIPS, 2016.

[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination
for robotic grasping with large-scale data collection. In ISER, 2016.

[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and
700 robot hours. ICRA, 2016.

[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.
ICRA, 2017.

[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration
by self-supervised prediction. In ICML, 2017.

Annotating Object Instances with a Polygon RNN

2018-11-06T04:53:32Z

Vrajendr: /* Human Annotator in the Loop */

Summary of the CVPR '17 best [https://www.cs.utoronto.ca/~fidler/papers/paper_polyrnn.pdf ''paper'']

The presentation video of paper is available here[https://www.youtube.com/watch?v=S1UUR4FlJ84].

= Introduction =

If a snapshot of an image is given to a human, how will he/she describe a scene? He/she might identify that there is a car parked near the curb, or that the car is parked right beside a street light. This ability to decompose objects in scenes into separate entities is key to understanding what is around us and it helps to reason about the behavior of objects in the scene.

Automating this process is a classic computer vision problem and is often termed "object detection". There are four distinct levels of detection (refer to Figure 1 for a visual cue):

1. Classification + Localization: This is the most basic method that detects whether '''an''' object is either present or absent in the image and then identifies the position of the object within the image in the form of a bounding box overlayed on the image.

2. Object Detection: The classic definition of object detection points to the detection and localization of '''multiple''' objects of interest in the image. The output of the detection is still a bounding box overlayed on the image at the position corresponding to the location of the objects in the image.

3. Semantic Segmentation: This is a pixel level approach, i.e., each pixel in the image is assigned to a category label. Here, there is no difference between instances; this is to say that there are objects present from three distinct categories in the image, without tracking or reporting the number of appearances of each instance within a category.

4. Instance Segmentation: The goal, here, is to not only to assign pixel-level categorical labels, but also to identify each entity separately as sheep 1, sheep 2, sheep 3, grass, and so on.

[[File:Figure_1.jpeg | 450px|thumb|center|Figure 1: Different levels of detection in an image.]]

== Motivation ==

Semantic segmentation helps us achieve a deeper understanding of images than image classification or object detection. Over and above this, instance segmentation is crucial in applications where multiple objects of the same category are to be tracked, especially in autonomous driving, mobile robotics, and medical image processing. This paper deals with a novel method to tackle the instance segmentation problem pertaining specifically to the field of autonomous driving, but shown to generalize well in other fields such as medical image processing.

Most of the recent approaches to on instance segmentation are based on deep neural networks and have demonstrated impressive performance. Given that these approaches require a lot of computational resources and that their performance depends on the amount of accessible training data, there has been an increase in the demand to label/annotate large-scale datasets. This is both expensive and time-consuming. Thus, the main goal of the paper is to enable '''semi-automatic''' annotation of object instances.

Most of the datasets available pass through a stage where annotators manually outline the objects with a closed polygon. Polygons allow annotation of objects with a small number of clicks (30 - 40) compared to other methods; this approach works as the silhouette of an object is typically connected without holes. Thus, the authors' intuition behind the success of this method is the sparse nature of these polygons that allow representation of an object through a cluster of pixels rather than a pixel level description.

= Related Works =

The related works are related to the fields of semi-automatic image annotation and object instance segmentation.

The critical advances in the field of semi-automatic annotation are as follows:

1) Some researchers use scribbles as seeds to model the appearance of foreground and background.

2) Other works use multiple scribbles on the object and exploit motion cues to annotate an object in a video.

3) Scribbles are also used to train CNNs for semantic image segmentation.

4) Some methods exploit annotations in the form of 2D bounding boxes and perform per-pixel labeling with foreground/background models using EM.

The following are the critical advances in the field of Object instance segmentation:

1) Most of the approaches operate on pixel-level and exploit a CNN inside a box or a patch to perform the labeling.

2) Some approaches aim to produce a polygon around an object. These approaches start by detecting edge fragments and find an optimal cycle that links the edges into a coherent region.

3) One particular work uses superpixels in the form of small polygons which is combined into object regions with the aim to label aerial images.

= Model =

As an input to the model, an annotator or perhaps another neural network provides a ground-truth bounding box containing an object of interest and the model auto-generates a polygon outlining the object instance using a Recurrent Neural Network which they call: Polygon-RNN.

The RNN model predicts the vertices of the polygon at each time step given a CNN representation of the image, the last two time steps, and the first vertex location. The location of the first vertex is defined differently and will be defined shortly. The information regarding the previous two-time steps helps the RNN create a polygon in a specific direction and the first vertex provides a cue for loop closure of the polygon edges.

The polygon is parametrized as a sequence of 2D vertices and it is assumed that the polygon is closed. In addition, the polygon generation is fixed to follow a clockwise orientation since there are multiple ways to create a polygon given that it is cyclic structure. However, the starting point of the sequence is defined so that it can be any of the vertices of the polygon.

== Architecture ==

There are two primary networks at play: 1. CNN with skip connections, and 2. One-to-many type RNN.

[[File:Figure_2_Neel.JPG | 800px|thumb|center|Figure 2: Model architecture for Polygon-RNN depicting a CNN with skip connections feeding into a 2 layer ConvLSTM (One-to-many type).]]

1. '''CNN with skip connections''':

The authors have adopted the VGG16 feature extractor architecture with a few modifications pertaining to the preservation of feature fused together in a tensor that can feed into the RNN (refer to Figure 2). Namely, the last max-pool layer present in the VGG16 CNN has been removed. The image fed into the CNN is pre-shrunk to a 224x224x3 tensor(3 being the Red, Green, and Blue channels). The image passing through 2 pooling layers with 128 and 2 convolutional layers. At each of these four steps, the idea is to have a width of 512 and so the output tensor at pool2 is convolved with 4 3x3x128 filters and the output tensor at pool3 is convolved with 2 3x3x256 filters. The skip connections from the four layers allow the CNN to extract low-level edge and corner features as well as boundary/semantic information about the instances. Finally, a 3x3 convolution applied along with a ReLU non-linearity results in a 28x28x128 tensor that contains semantic information pertinent to the image frame and is taken as an input by the RNN.

2. '''RNN - 2 Layer ConvLSTM'''

The RNN is employed to capture information about the previous vertices in the time-series. Specifically, a Convolutional LSTM is used as a decoder. The ConvLSTM allows preservation of the spatial information in 2D and reduces the number of parameters compared to a Fully Connected RNN. The polygon is modeled with a kernel size of 3x3 and 16 channels outputting a vertex at each time step. The ConvLSTM gets as input a tensor step t which
concatenates 4 features: the CNN feature representation of the image, one-hot encodingof the previous predicted vertex and the vertex predicted
from two time steps ago, as well as the one-hot encoding of the first predicted vertex.

The authors have treated the vertex prediction task as a classification task in that the location of the vertices is through a one-hot representation of dimension DxD + 1 (D chosen to be 28 by the authors in tests). The one additional dimension is the storage cue for loop closure for the polygon. Given that, the one-hot representation of the two previously predicted vertices and the first vertex are taken in as an input, a clockwise (or for that reason any fixed direction) direction can be forced for the creation of the polygon. Coming back to the prediction of the first vertex, this is done through further modification of the CNN by adding two DxD layers with one branch predicting object instance boundaries while the other takes in this output as well as the image features to predict the first vertex.

== Training ==

The training of the model is done as follows:

1. Cross-entropy is used for the RNN cost function.

2. Instead of Stochastic Gradient Descent, Adam is used for optimization: batch size = 8, learning rate = 1e^-4 (learning rate decays after 10 epochs by a factor of 10)

3. For the first vertex prediction, the modified CNN mentioned previously, is trained using a multi-task cost function.

The reported time for training is one day on a Nvidia Titan-X GPU.

=== Human Annotator in the Loop ===

The model allows for the prediction at a given time step to be corrected and this corrected vertex is then fed into the next time step of the RNN, effectively rejecting the network predicted vertex. This has the simple effect of putting the model "back on the right track". The typical inference time as quoted by the paper is 250ms per object.

== Results ==

The evaluation of the model performance was conducted based on the Cityscapes and KITTI Datasets. The standard Intersection over Union (IoU) measure is used for comparison. The calculation for IoU takes both the predicted and ground-truth object boundaries. The intersection (area contained in both boundaries at once) is divided by the union (the area contained by at least one, or both, of the boundaries). A low score of this metric would mean that there is little overlap between the boundaries, or large areas on non-overlap, and a score of 1.0 would indicate that the two boundaries contain the same area.

[[File:Table_1_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]

Compared to other instance segmentation techniques, the Polygon-RNN method performs significantly better in the person, car, and rider categories and above average in other categories. In addition, with the help of the annotator, the speedup factor was 7.3 times with under 5 clicks which the authors claim is the main advantage of this method.

[[File:Table_2_Neel.JPG | 800px|thumb|center|Table 1: IoU performance on Cityscapes data without any annotator intervention.]]

In addition, most of the comparisons with human annotators show that the method is at par with human-level annotation.

<gallery widths=500px heights=500px perrow=2 mode="packed">
File:Figure_3_Neel.JPG|Figure 3: Qualitative results without human annotator in the loop.|alt=alt language
File:Figure_4_Neel.JPG|Figure 4: Qualitative results: comparison with human annotator.|alt=alt language
</gallery>

=Conclusion=

The important conclusions from this paper are:

1. The paper presented a powerful generic annotation tool that works on different unseen datasets.

2. Significant improvement in annotation time can be achieved with the Polygon-RNN method itself.

3. However, the flexibility of having inputs from a human annotator helps increase the IoU for a certain range of clicks.

4. The model architecture has a down-sampling factor of 16 and the final output resolution and accuracy is sensitive to object size.

5. Another downside of the model architecture is that training time is increased due to the training of the CNN for the first vertex.

=Critique=

1. This paper requires training of an entire CNN for the first vertex and is inefficient in that sense as it introduces additional parameters adding to the computation time and resource demand.

2. The method outperforms other methods only in the three categories mentioned but isn't a significant improvement in other categories.

3. The baseline methods have an upper hand compared to this model when it comes to larger objects since the nature of the down-scaled structure adopted by this model.

4. With the human annotator in the loop, the model speeds up the process of annotation by over 7 times which is perhaps a big cost and time cutting improvement for companies.

5. In terms of future work, elimination of the additional CNN for the first vertex as well as an enhanced architecture to remain insensitive to the size of the object to be annotated should be implemented.

Zero-Shot Visual Imitation

2018-11-06T04:50:19Z

Vrajendr: /* Paper Overview */

This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach a desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time which the agent will form into observation-action pairs to then distill into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable especially looking at the fact that new tasks need a set of new demonstrations for the robot to learn from.

===Paper Overview===
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.

[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Mutli-step GSP with previous action history; (c) Mutli-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Mutli-step GSP with forward consistency loss proposed in this work.]]

This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'' and it is learned by re-labeling states that the agent has visited as goals and the actions taken as prediction targets. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.

A challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; that is, there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'' which essentially says that reaching the goal is more important than how it is reached. First, a forward model is learned that predicts the next observation from the given action and current observation. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss has the effect of not inadvertently penalizing actions that are ''consistent'' with the ground-truth action but not exactly the same.

As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.

Of course, when introducing something like this forward-consistent loss, issues related to the number of steps needed to reach a certain goal become prevalent. To address this, the paper pairs the GSP with a goal recognizer that determines if the goal has been satisfied with respect to some metrics. Figure 1 shows various GSPs along with diagram d) showing the forward-consistent loss proposed in this paper.

The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation and navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.

===Related Work===
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.

==Learning to Imitate Without Expert Supervision==

In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described.

Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment using the policy <math display="inline">a = π_E(s)</math>. This exploration data is used to learn the GSP.

<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div>

The above equation represents the learned GSP. <math display="inline">π</math> takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs the sequence of required actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> from the current observation <math display="inline">x_i</math>. The states <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and the number of actions <math display="inline">K</math> is inferred by the model. <math display="inline">π</math> can be thought of as a general deep network with parameters <math display="inline">θ_π</math>. It is good to note that <math display="inline">x_g</math> could be an intermediate subtask of the overall goal. So in essence, subtasks can be strung together to achieve an overall goal (i.e. go to position 1, then go to position 2, then go to final destination).

Let the sequence of images <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math> be the task to be imitated which is captured when the expert demonstrates the task. The sequence has at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses the learned GSP <math display="inline">π</math> to start from initial state <math display="inline">x_0</math> and follow the actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to imitate the observations in <math display="inline">D</math>.

A separate ''goal recognizer'' network is needed to ascertain if the current observation is close to the goal or not. This is because multiple actions might be required to reach close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and if not, the agent executes
<math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached.

===Learning the Goal-Conditioned Skill Policy (GSP)===

It is easy to first describe the one-step version of GSP and then extend it to a multi-step version. The one-step trajectory can be generalized as <math display="inline">(x_t; a_t; x_{t+1})</math> with GSP <math display="inline">\hat{a}_t = π(x_t; x_{t+1}; θ_π)</math> and is trained by the standard cross-entropy loss given below with respect to the GSP parameters <math display="inline">θ_π</math>:

<div style="text-align: center;"><math>L(a_t; \hat{a}_t) = p(a_t|x_t; x_{t+1}) log( \hat{a}_t)</math></div>

<math display="inline">p</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted action distributions respectively. <math display="inline">p</math> is not readily available so it is empirically approximated using samples from the distribution <math display="inline">a_t</math>. In a standard deep learning problem it is common to assume <math display="inline">p</math> as a delta function at <math display="inline">a_t</math> but this is violated if <math display="inline">p</math> is multi-modal and high-dimensional. That is, the same inputs would be presented with different targets leading to high variance in gradients. This would make learning challenging, leading to the further developments presented in sections 2.2, 2.3 and 2.4.

===Forward Consistency Loss===

To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (prediction from executing <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients and aid the learning process. This is what is denoted as ''forward consistency loss''.

To operationalize the forward consistency loss... The forward dynamics <math display="inline">f</math> are learned from the data and is defined as <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math> so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters of <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground truth transitions while the second term ensures consistency. In summary, the loss function is given below:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π θ_f}{min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div>
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div>

Past works have show that learning forward dynamics in the feature space as opposed to raw observation space is more robust so this paper follows in extending the GSP to make predictions on feature representations denoted <math>\phi(x_t), \phi(x_{t+1})</math>. The forward consistency loss is then computed to make predictions in the feature space \phi.

The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div>

<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{a}_t) = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div>

The forward consistency loss is computed at each time step, t, and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network with inputs current state, goal states, actions at previous time step and the internal hidden representation denoted <math> h_{t-1}</math>, and outputs the actions to take.

===Goal Recognizer===

The goal recognizer network was introduced to figure out if the current goal is reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition was taken as a binary classification problem; given an observation and the goal, is the observation close to the goal or not. The goal classifier was trained using the standard cross-entropy loss.

===Ablations and Baselines===

To summarize, the GSP formulation is composed of (a) recurrent variable-length skill policy network, (b) explicitly encoding previous action in the recurrence, (c) goal recognizer, (d) forward consistency loss function, and (w) learning forward dynamics in the feature space instead of raw observation space.

To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:

# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM.
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup.
# '''GSP-NoPrevAction-NoFwdConst''' is the removal of the paper's recurrent GSP without previous action history and without forward consistency loss.
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without forward consistency objective.
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
# '''GSP''' refers to the complete method with all the components.

==Experiments==

The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.

===Rope Manipulation===

Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.

In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration the robot picked up the robot at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper.

For this experiment, the Baxter robot is setup exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.

The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, direction of displacement and the magnitude of displacement. All models were optimized using Asam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.

The approach of this paper is compared to (Nair et al., 2017) where they did similar experiments using an inverse model. The results in Figure 3 show that for the 'S' shape construction, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.

[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The
robotics system setup. (b) The sequence of human demonstration images provided by the human
during inference for the task of knot-tying (top row), and the sequences of observation states reached
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’
shape. Our agent is able to successfully imitate the demonstration.]]

[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]

===Navigation in Indoor Office Environments===
In this experiment, the robot was shown a single image of multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided I the appendix of the paper). The robot was then placed into an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.

The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.

Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fairs much better than all previous baselines.

[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image
(top-left). Since the initial and goal image have no overlap, the robot first explores the environment
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and
such exploratory behavior naturally emerged from the self-supervised learning.]]

[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image
of goal in an unseen environment. Each column represents a different run of our system for a
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given
a successful run but reaches the goal successfully at a much higher rate.]]

Figure 5 and table 1 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. It is good to note that zero-shot visual imitation is robust to a changing environment where every frame need not match the demonstrated frame. This is achieved by providing sparse landmarks.

[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of
images (top row). The TurtleBot is positioned in a manner such that the first image in demonstration
has no overlap with its current observation. Even under this condition the robot is able to move close
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from
WayPoint-1.]]

[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three
runs of two different demonstrations. Results show that our method outperforms the baselines. Note
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions
and neither model succeeded. Detailed results are available in the supplementary materials.]]

===3D Navigation in VizDoom===

To round off the experiments, a VizDoom simulation environment was used to test the GSP. The goal was to measure robustness of each method with proper error bars, the role of initial self-supervised data collection and the quantitative difference in modelling forward consistency loss in feature space in comparison to raw visual space.

Data was collected using two method: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.

Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.

[[File:8-Table3.png | 550px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]

==Discussion==

This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.

A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending to the third-person may yield a learning of a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen.

The expert demonstrations are also purely imitated; that is, the agent does not learn the demonstrations. Future work could look into learning the demonstration so as to richen its exploration techniques.

This work used a sequence of images to provide a demonstration but the work in general does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work.

==References==

[1] D.Pathak, P.Mahmoudieh, G.Luo, P.Agrawal, D.Chen, Y.Shentu, E.Shelhamer, J.Malik, A.A.Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.

[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning
from demonstration. Robotics and autonomous systems, 2009.

[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood
Cliffs, NJ, 1977.

[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke
by poking: Experiential learning of intuitive physics. NIPS, 2016.

[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination
for robotic grasping with large-scale data collection. In ISER, 2016.

[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and
700 robot hours. ICRA, 2016.

[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.
ICRA, 2017.

[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration
by self-supervised prediction. In ICML, 2017.

Zero-Shot Visual Imitation

2018-11-06T04:38:56Z

Vrajendr: /* References */

This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach a desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time which the agent will form into observation-action pairs to then distill into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable especially looking at the fact that new tasks need a set of new demonstrations for the robot to learn from.

===Paper Overview===
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.

[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Mutli-step GSP with previous action history; (c) Mutli-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Mutli-step GSP with forward consistency loss proposed in this work.]]

This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'' and it is learned by re-labeling states that the agent has visited as goals and the actions taken as prediction targets. During inference/prediction, the GSP recreates the task step-by-step given the goal observations from the demonstration.

A challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; that is, there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'' which essentially says that reaching the goal is more important than how it is reached. First, a forward model is learned that predicts the next observation from the given action and current observation. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss has the effect of not inadvertently penalizing actions that are ''consistent'' with the ground-truth action but not exactly the same.

As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.

Of course, when introducing something like this forward-consistent loss, issues related to the number of steps needed to reach a certain goal become prevalent. To address this, the paper pairs the GSP with a goal recognizer that determines if the goal has been satisfied with respect to some metrics. Figure 1 shows various GSPs along with diagram d) showing the forward-consistent loss proposed in this paper.

The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation and navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.

===Related Work===
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.

==Learning to Imitate Without Expert Supervision==

In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described.

Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment using the policy <math display="inline">a = π_E(s)</math>. This exploration data is used to learn the GSP.

<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div>

The above equation represents the learned GSP. <math display="inline">π</math> takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs the sequence of required actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> from the current observation <math display="inline">x_i</math>. The states <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and the number of actions <math display="inline">K</math> is inferred by the model. <math display="inline">π</math> can be thought of as a general deep network with parameters <math display="inline">θ_π</math>. It is good to note that <math display="inline">x_g</math> could be an intermediate subtask of the overall goal. So in essence, subtasks can be strung together to achieve an overall goal (i.e. go to position 1, then go to position 2, then go to final destination).

Let the sequence of images <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math> be the task to be imitated which is captured when the expert demonstrates the task. The sequence has at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses the learned GSP <math display="inline">π</math> to start from initial state <math display="inline">x_0</math> and follow the actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to imitate the observations in <math display="inline">D</math>.

A separate ''goal recognizer'' network is needed to ascertain if the current observation is close to the goal or not. This is because multiple actions might be required to reach close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and if not, the agent executes
<math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached.

===Learning the Goal-Conditioned Skill Policy (GSP)===

It is easy to first describe the one-step version of GSP and then extend it to a multi-step version. The one-step trajectory can be generalized as <math display="inline">(x_t; a_t; x_{t+1})</math> with GSP <math display="inline">\hat{a}_t = π(x_t; x_{t+1}; θ_π)</math> and is trained by the standard cross-entropy loss given below with respect to the GSP parameters <math display="inline">θ_π</math>:

<div style="text-align: center;"><math>L(a_t; \hat{a}_t) = p(a_t|x_t; x_{t+1}) log( \hat{a}_t)</math></div>

<math display="inline">p</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted action distributions respectively. <math display="inline">p</math> is not readily available so it is empirically approximated using samples from the distribution <math display="inline">a_t</math>. In a standard deep learning problem it is common to assume <math display="inline">p</math> as a delta function at <math display="inline">a_t</math> but this is violated if <math display="inline">p</math> is multi-modal and high-dimensional. That is, the same inputs would be presented with different targets leading to high variance in gradients. This would make learning challenging, leading to the further developments presented in sections 2.2, 2.3 and 2.4.

===Forward Consistency Loss===

To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (prediction from executing <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients and aid the learning process. This is what is denoted as ''forward consistency loss''.

To operationalize the forward consistency loss... The forward dynamics <math display="inline">f</math> are learned from the data and is defined as <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math> so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters of <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground truth transitions while the second term ensures consistency. In summary, the loss function is given below:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π θ_f}{min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div>
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div>

Past works have show that learning forward dynamics in the feature space as opposed to raw observation space is more robust so this paper follows in extending the GSP to make predictions on feature representations denoted <math>\phi(x_t), \phi(x_{t+1})</math>. The forward consistency loss is then computed to make predictions in the feature space \phi.

The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div>

<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{a}_t) = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div>

The forward consistency loss is computed at each time step, t, and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network with inputs current state, goal states, actions at previous time step and the internal hidden representation denoted <math> h_{t-1}</math>, and outputs the actions to take.

===Goal Recognizer===

The goal recognizer network was introduced to figure out if the current goal is reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition was taken as a binary classification problem; given an observation and the goal, is the observation close to the goal or not. The goal classifier was trained using the standard cross-entropy loss.

===Ablations and Baselines===

To summarize, the GSP formulation is composed of (a) recurrent variable-length skill policy network, (b) explicitly encoding previous action in the recurrence, (c) goal recognizer, (d) forward consistency loss function, and (w) learning forward dynamics in the feature space instead of raw observation space.

To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:

# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM.
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup.
# '''GSP-NoPrevAction-NoFwdConst''' is the removal of the paper's recurrent GSP without previous action history and without forward consistency loss.
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without forward consistency objective.
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
# '''GSP''' refers to the complete method with all the components.

==Experiments==

The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.

===Rope Manipulation===

Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.

In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration the robot picked up the robot at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper.

For this experiment, the Baxter robot is setup exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.

The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, direction of displacement and the magnitude of displacement. All models were optimized using Asam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.

The approach of this paper is compared to (Nair et al., 2017) where they did similar experiments using an inverse model. The results in Figure 3 show that for the 'S' shape construction, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.

[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The
robotics system setup. (b) The sequence of human demonstration images provided by the human
during inference for the task of knot-tying (top row), and the sequences of observation states reached
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’
shape. Our agent is able to successfully imitate the demonstration.]]

[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]

===Navigation in Indoor Office Environments===
In this experiment, the robot was shown a single image of multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided I the appendix of the paper). The robot was then placed into an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.

The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.

Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fairs much better than all previous baselines.

[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image
(top-left). Since the initial and goal image have no overlap, the robot first explores the environment
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and
such exploratory behavior naturally emerged from the self-supervised learning.]]

[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image
of goal in an unseen environment. Each column represents a different run of our system for a
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given
a successful run but reaches the goal successfully at a much higher rate.]]

Figure 5 and table 1 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. It is good to note that zero-shot visual imitation is robust to a changing environment where every frame need not match the demonstrated frame. This is achieved by providing sparse landmarks.

[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of
images (top row). The TurtleBot is positioned in a manner such that the first image in demonstration
has no overlap with its current observation. Even under this condition the robot is able to move close
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from
WayPoint-1.]]

[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three
runs of two different demonstrations. Results show that our method outperforms the baselines. Note
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions
and neither model succeeded. Detailed results are available in the supplementary materials.]]

===3D Navigation in VizDoom===

To round off the experiments, a VizDoom simulation environment was used to test the GSP. The goal was to measure robustness of each method with proper error bars, the role of initial self-supervised data collection and the quantitative difference in modelling forward consistency loss in feature space in comparison to raw visual space.

Data was collected using two method: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.

Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.

[[File:8-Table3.png | 550px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]

==Discussion==

This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.

A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending to the third-person may yield a learning of a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen.

The expert demonstrations are also purely imitated; that is, the agent does not learn the demonstrations. Future work could look into learning the demonstration so as to richen its exploration techniques.

This work used a sequence of images to provide a demonstration but the work in general does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work.

==References==

[1] D.Pathak, P.Mahmoudieh, G.Luo, P.Agrawal, D.Chen, Y.Shentu, E.Shelhamer, J.Malik, A.A.Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.

[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning
from demonstration. Robotics and autonomous systems, 2009.

[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood
Cliffs, NJ, 1977.

[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke
by poking: Experiential learning of intuitive physics. NIPS, 2016.

[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination
for robotic grasping with large-scale data collection. In ISER, 2016.

[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and
700 robot hours. ICRA, 2016.

[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.
ICRA, 2017.

[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration
by self-supervised prediction. In ICML, 2017.

Zero-Shot Visual Imitation

2018-11-06T04:37:56Z

Vrajendr: /* References */

This page contains a summary of the paper "[https://openreview.net/pdf?id=BkisuzWRW Zero-Shot Visual Imitation]" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018.

==Introduction==
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach a desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time which the agent will form into observation-action pairs to then distill into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable especially looking at the fact that new tasks need a set of new demonstrations for the robot to learn from.

===Paper Overview===
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.

[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Mutli-step GSP with previous action history; (c) Mutli-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Mutli-step GSP with forward consistency loss proposed in this work.]]

This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'' and it is learned by re-labeling states that the agent has visited as goals and the actions taken as prediction targets. During inference/prediction, the GSP recreates the task step-by-step given the goal observations from the demonstration.

A challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; that is, there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'' which essentially says that reaching the goal is more important than how it is reached. First, a forward model is learned that predicts the next observation from the given action and current observation. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss has the effect of not inadvertently penalizing actions that are ''consistent'' with the ground-truth action but not exactly the same.

As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.

Of course, when introducing something like this forward-consistent loss, issues related to the number of steps needed to reach a certain goal become prevalent. To address this, the paper pairs the GSP with a goal recognizer that determines if the goal has been satisfied with respect to some metrics. Figure 1 shows various GSPs along with diagram d) showing the forward-consistent loss proposed in this paper.

The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation and navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.

===Related Work===
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.

==Learning to Imitate Without Expert Supervision==

In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described.

Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment using the policy <math display="inline">a = π_E(s)</math>. This exploration data is used to learn the GSP.

<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div>

The above equation represents the learned GSP. <math display="inline">π</math> takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs the sequence of required actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> from the current observation <math display="inline">x_i</math>. The states <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and the number of actions <math display="inline">K</math> is inferred by the model. <math display="inline">π</math> can be thought of as a general deep network with parameters <math display="inline">θ_π</math>. It is good to note that <math display="inline">x_g</math> could be an intermediate subtask of the overall goal. So in essence, subtasks can be strung together to achieve an overall goal (i.e. go to position 1, then go to position 2, then go to final destination).

Let the sequence of images <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math> be the task to be imitated which is captured when the expert demonstrates the task. The sequence has at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses the learned GSP <math display="inline">π</math> to start from initial state <math display="inline">x_0</math> and follow the actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to imitate the observations in <math display="inline">D</math>.

A separate ''goal recognizer'' network is needed to ascertain if the current observation is close to the goal or not. This is because multiple actions might be required to reach close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and if not, the agent executes
<math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached.

===Learning the Goal-Conditioned Skill Policy (GSP)===

It is easy to first describe the one-step version of GSP and then extend it to a multi-step version. The one-step trajectory can be generalized as <math display="inline">(x_t; a_t; x_{t+1})</math> with GSP <math display="inline">\hat{a}_t = π(x_t; x_{t+1}; θ_π)</math> and is trained by the standard cross-entropy loss given below with respect to the GSP parameters <math display="inline">θ_π</math>:

<div style="text-align: center;"><math>L(a_t; \hat{a}_t) = p(a_t|x_t; x_{t+1}) log( \hat{a}_t)</math></div>

<math display="inline">p</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted action distributions respectively. <math display="inline">p</math> is not readily available so it is empirically approximated using samples from the distribution <math display="inline">a_t</math>. In a standard deep learning problem it is common to assume <math display="inline">p</math> as a delta function at <math display="inline">a_t</math> but this is violated if <math display="inline">p</math> is multi-modal and high-dimensional. That is, the same inputs would be presented with different targets leading to high variance in gradients. This would make learning challenging, leading to the further developments presented in sections 2.2, 2.3 and 2.4.

===Forward Consistency Loss===

To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (prediction from executing <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients and aid the learning process. This is what is denoted as ''forward consistency loss''.

To operationalize the forward consistency loss... The forward dynamics <math display="inline">f</math> are learned from the data and is defined as <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math> so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters of <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground truth transitions while the second term ensures consistency. In summary, the loss function is given below:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π θ_f}{min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div>
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div>

Past works have show that learning forward dynamics in the feature space as opposed to raw observation space is more robust so this paper follows in extending the GSP to make predictions on feature representations denoted <math>\phi(x_t), \phi(x_{t+1})</math>. The forward consistency loss is then computed to make predictions in the feature space \phi.

The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:

<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div>

<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{a}_t) = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div>

The forward consistency loss is computed at each time step, t, and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network with inputs current state, goal states, actions at previous time step and the internal hidden representation denoted <math> h_{t-1}</math>, and outputs the actions to take.

===Goal Recognizer===

The goal recognizer network was introduced to figure out if the current goal is reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition was taken as a binary classification problem; given an observation and the goal, is the observation close to the goal or not. The goal classifier was trained using the standard cross-entropy loss.

===Ablations and Baselines===

To summarize, the GSP formulation is composed of (a) recurrent variable-length skill policy network, (b) explicitly encoding previous action in the recurrence, (c) goal recognizer, (d) forward consistency loss function, and (w) learning forward dynamics in the feature space instead of raw observation space.

To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:

# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM.
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup.
# '''GSP-NoPrevAction-NoFwdConst''' is the removal of the paper's recurrent GSP without previous action history and without forward consistency loss.
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without forward consistency objective.
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
# '''GSP''' refers to the complete method with all the components.

==Experiments==

The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.

===Rope Manipulation===

Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.

In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration the robot picked up the robot at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper.

For this experiment, the Baxter robot is setup exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.

The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, direction of displacement and the magnitude of displacement. All models were optimized using Asam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.

The approach of this paper is compared to (Nair et al., 2017) where they did similar experiments using an inverse model. The results in Figure 3 show that for the 'S' shape construction, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.

[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The
robotics system setup. (b) The sequence of human demonstration images provided by the human
during inference for the task of knot-tying (top row), and the sequences of observation states reached
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’
shape. Our agent is able to successfully imitate the demonstration.]]

[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]

===Navigation in Indoor Office Environments===
In this experiment, the robot was shown a single image of multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided I the appendix of the paper). The robot was then placed into an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.

The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.

Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fairs much better than all previous baselines.

[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image
(top-left). Since the initial and goal image have no overlap, the robot first explores the environment
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and
such exploratory behavior naturally emerged from the self-supervised learning.]]

[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image
of goal in an unseen environment. Each column represents a different run of our system for a
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given
a successful run but reaches the goal successfully at a much higher rate.]]

Figure 5 and table 1 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. It is good to note that zero-shot visual imitation is robust to a changing environment where every frame need not match the demonstrated frame. This is achieved by providing sparse landmarks.

[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of
images (top row). The TurtleBot is positioned in a manner such that the first image in demonstration
has no overlap with its current observation. Even under this condition the robot is able to move close
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from
WayPoint-1.]]

[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three
runs of two different demonstrations. Results show that our method outperforms the baselines. Note
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions
and neither model succeeded. Detailed results are available in the supplementary materials.]]

===3D Navigation in VizDoom===

To round off the experiments, a VizDoom simulation environment was used to test the GSP. The goal was to measure robustness of each method with proper error bars, the role of initial self-supervised data collection and the quantitative difference in modelling forward consistency loss in feature space in comparison to raw visual space.

Data was collected using two method: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.

Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.

[[File:8-Table3.png | 550px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]

==Discussion==

This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.

A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending to the third-person may yield a learning of a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen.

The expert demonstrations are also purely imitated; that is, the agent does not learn the demonstrations. Future work could look into learning the demonstration so as to richen its exploration techniques.

This work used a sequence of images to provide a demonstration but the work in general does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work.

==References==

[1] D.Pathak, P.Mahmoudieh, G.Luo, P.Agrawal, D.Chen, Y.Shentu, E.Shelhamer, J.Malik, A.A.Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.

[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning
from demonstration. Robotics and autonomous systems, 2009.

[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood
Cliffs, NJ, 1977.

[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke
by poking: Experiential learning of intuitive physics. NIPS, 2016.

[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination
for robotic grasping with large-scale data collection. In ISER, 2016.

[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and
700 robot hours. ICRA, 2016.

[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.
ICRA, 2017.