statwiki - User contributions [US]

F18-STAT946-Proposal

2018-12-12T02:10:23Z

S366chen:

'''Project # 0'''
Group members:

Last name, First name

Last name, First name

Last name, First name

Last name, First name

'''Title:''' Making a String Telephone

'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).

--------------------------------------------------------------------

'''Project # 1'''
Group members:

Zhang, Xinyue

Zhang, Junyi

Chen, Shala

'''Title:''' Airbus Ship Detection Challenge

'''Description:''' The idea and data for this project are taken from https://www.kaggle.com/c/airbus-ship-detection#description. The goal of this project is to build a model that detects all ships in satellite images and puts an aligned bounding box segment around each ship we locate. We are going to extract the segmentation map for the ships first, augment the images, and train a simple CNN model to detect them.

--------------------------------------------------------------------

'''Project # 2'''
Group members:

Nekoei, Hadi

Afify, Ahmed

Carrillo, Juan

Ganapathi Subramanian, Sriram

'''Title:''' Algorithmic Analysis and Improvements in Multi-Agent Reinforcement Learning in Partially Observable Settings

'''Description:''' Reinforcement learning (RL) is a branch of Machine Learning in which an agent learns to act optimally in an environment using weak reward signals, which is different from strong labels in supervised learning. Multi-Agent Reinforcement Learning (MARL) is composed of multiple agents that can be competing against each other or cooperating together to achieve a common goal.

Our project aims to investigate the performance of several state-of-the-art Multi-Agent Reinforcement Learning (MARL) algorithms in playing the game of Pommerman. This game will be used as a benchmark during a competition that will be held at NIPS 2018 (https://www.pommerman.com). We plan to participate and compare the performance of our agents against agents created by other researchers. Our project also aims to make algorithmic improvements to state-of-the-art MARL algorithms and to come up with a new algorithm that achieves the best performance in this partially observable multi-agent setting of Pommerman.

In Pommerman, there are two competing teams, each with two agents who work together to defeat the opposing team. The agents move around the board leaving bombs that can eliminate other agents when exploding in their horizontal or vertical vicinity. The agents can obtain bonuses such as extra bombs, increased bomb range, or the ability to kick installed bombs. Each of our two agents can choose one of the following actions: stop, move up, move left, move down, move right, or lay a bomb. Each agent will receive an 11x11 grid of integer values representing the board state. Additional information will be provided to each agent, such as its own position, the positions of its teammate and enemies, available bombs, blast strength, kicking ability, and surrounding walls and bombs.

The algorithms that we are considering are:

- Monte Carlo Tree Search and Reinforcement Learning: Combining MCTS with deep neural networks.

- Multi-Agent Deep Deterministic Policy Gradient (DDPG): A technique developed by OpenAI, based on the Deep Deterministic Policy Gradient technique that outperforms traditional Reinforcement Learning algorithms (DQN/DDPG/TRPO) in several environments.

- Opponent Modelling in Deep Reinforcement Learning: based on DQN to model opponents through a Deep Reinforcement Opponent Network (DRON).

We will use Convolutional Neural Networks for data pre-processing, where we extract features from inputs. We will also be using Feed Forward Deep Networks along with Reinforcement learning frameworks in all the algorithms we implement (Deep Reinforcement learning).

--------------------------------------------------------------------

'''Project # 3'''
Group members:

Fisher, Wesley

Pafla, Marvin

Rajendran, Vidyasagar

'''Title:''' Deep Reinforcement Learning for Angry Birds

'''Description:''' According to Artificial Intelligence (AI) researchers, AI’s performance in the game Angry Birds will exceed human performance in the next 3-4 years [1]. We propose a final project that will hopefully bring us closer to this goal by developing an AI model based on deep reinforcement learning to play the game Angry Birds. While AI has been applied to Angry Birds in the past, there are only a few approaches that utilize deep learning such as in [3]. We plan to implement Yuan et al.’s recommendations by creating an Angry Birds reinforcement learning model with more learning dimensions [3]. To further add novelty to our research, we want to explore the potential of extending our model with evolutionary algorithms [2]. To realize this project, we plan to use an existing implementation of Angry Birds (either https://github.com/estevaofon/angry-birds-python or the one provided for the Angry Birds AI competition which can be found at https://aibirds.org).

References:

[1] Grace, K., Salvatier, J., Dafoe, A., Zhang, B., & Evans, O. (2017). When will AI exceed human performance? Evidence from AI experts. arXiv preprint arXiv:1705.08807.

[2] Risi, S., & Togelius, J. (2017). Neuroevolution in games: State of the art and open challenges. IEEE Transactions on Computational Intelligence and AI in Games, 9(1), 25-41.

[3] Yuan, Y., Chen, Z., Wu, P., & Chang, L. Enhancing Deep Reinforcement Learning Agent for Angry Birds. https://aibirds.org/2017/aibirds_BNU.pdf

--------------------------------------------------------------------
'''Project # 4'''
Group members:

Heydari, Nargess

Manuel, Jacob

Ravi, Aravind

'''Title:''' Deep Learning for Detection of Steady State Visually Evoked Potentials

'''Description:''' Brain Computer Interfaces (BCIs) enable users to control an external device by modulating their neuronal activity. Steady state visual evoked potential (SSVEP) based BCIs are of particular interest due to their high information transfer rate (ITR) and the relatively low amount of training required for use. SSVEP responses are elicited when a user focuses on a flickering light source and are observed prominently in the occipitoparietal area of the cortex. These responses manifest as an increase in amplitude of the frequency components of the EEG signal at the stimulus frequency and its harmonic frequencies. Therefore, by analyzing the dominant frequency component in the EEG signals recorded from the occipitoparietal area, the stimulus the user is visually attending to can be identified. The goal of this project is to identify and compare deep learning architectures for classifying SSVEP responses for use in BCIs. Different architectures will be compared with state-of-the-art classification methods (e.g. Canonical Correlation Analysis) through a sensitivity analysis of their accuracy across multiple BCI variables (e.g. analysis window size, subject variability, size of training data, etc.). The goal of this comparison is to establish a new system design to support the application of Deep Neural Networks in SSVEP-based BCIs. The proposed study will be performed on the SSVEP dataset collected by the eBionics Lab at the University of Waterloo.

References

N. S. Kwak, K. R. Müller, and S. W. Lee, “A convolutional neural network for steady state visual evoked potential classification under ambulatory environment,” PLoS One, 2017.

--------------------------------------------------------------------
'''Project # 5'''
Group members:

Khan, Salman

Naik, Abdul

Koundinya, Shubham

'''Title:''' Deep Learning for Image Captioning

'''Description:''' Image captioning is the automatic generation of textual descriptions from images. It involves identifying the contents of an image, understanding relationships between what has been detected and generating textual descriptions.

It is a challenging task as it includes both Computer Vision and Natural Language Processing components. Furthermore, an image can be described by multiple text statements. We will explore various state-of-the-art translation models focussing primarily on different ways of describing an image.

References

StyleNet: Generating Attractive Visual Captions with Styles. https://ieeexplore.ieee.org/document/8099591

-------------------------------------------------------------------

'''Project # 6'''
Group members:

Amirpasha Ghabussi

Kumar, Dhruv

Sahu, Gaurav

Khan, Kashif

'''Title:''' Deep learning model for Question Answering & Machine Comprehension

'''Description:''' Question answering is a computer science discipline within the fields of information retrieval and natural language processing, which is concerned with building systems that automatically answer questions posed by humans in a natural language.

We will try to improve the accuracy of the models that have shown promising results in most of the highly active datasets such as SQuAD or MS-Marco.

References

[1] Bi-Directional Attention Flow For Machine Comprehension - https://arxiv.org/abs/1611.01603

[2] QANet : Combining Local Convolution With Global Self - Attention For Reading Comprehension - https://arxiv.org/abs/1804.09541

--------------------------------------------------------------------

'''Project # 7'''
Group members:

Minhas Manpreet Singh

Budnarain Neil

Ameli Soroush

Rezapour Zahra

'''Title:''' SELECT VIA PROXY: EFFICIENT DATA SELECTION FOR TRAINING DEEP NETWORKS

'''Description:''' We shall be participating in the ICLR Reproducibility Challenge 2019. Abstract: At internet scale, applications collect a tremendous amount of data by logging user events, analyzing text, and collecting images. This data powers a variety of machine learning models for tasks such as image classification, language modeling, content recommendation, and advertising. However, training large models over all available data can be computationally expensive, creating a bottleneck in the development of new machine learning models. In this work, we develop a novel approach to efficiently select a subset of training data to achieve faster training with no loss in model predictive performance. In our approach, we first train a small proxy model quickly, which we then use to estimate the utility of individual training data points, and then select the most informative ones for training the large target model. Extensive experiments show that our approach leads to a 1.6× and 1.8× speed-up on CIFAR10 and SVHN by selecting 60% and 50% subsets of the data, while maintaining the predictive performance of the model trained on the entire dataset. Further, our method is robust to design choices.

--------------------------------------------------------------------

'''Project # 8'''
Group members:

Bhatt, Neel

Chen, Henry

Moosa, Johra Muhammad

'''Title:''' Fast and Robust Pedestrian Detection: The Successor Of Fused-DNN+Semantic Segmentation

'''Description:''' Object detection in computer vision and image processing deals with identifying semantic objects such as buildings, cars, or humans in digital images and videos. In particular, pedestrian detection has attracted much research interest in recent years due to its significance in robotics and autonomous driving applications. Consequently, the accuracy of pedestrian detection algorithms has improved significantly, and much of this progress seems to be driven by breakthroughs in Deep Neural Networks (DNNs) and the availability of open source pedestrian datasets. The current state-of-the-art model is the Fused-DNN+Semantic Segmentation mask, which achieves the lowest log-average miss rate (L-AMR) of 8.2 on the Caltech pedestrian dataset [1]. While these advancements are impressive, many improvements can be made. For example, existing deep pedestrian detection models tend to rely on hand-crafted features and are generally hard to train. In addition, they seem to perform poorly when image quality is reduced or background interference is high. For this reason, we propose to survey state-of-the-art pedestrian detection algorithms in more depth and ultimately propose an improved DNN model that addresses some of these limitations.

Reference:

[1] Du, Xianzhi, et al. "Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection." Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017.
--------------------------------------------------------------------

'''Project # 9'''
Group members:

Sigeng Chen

'''Title:''' Humpback Whale Identification

'''Description:''' It is an active Kaggle Challenge https://www.kaggle.com/c/humpback-whale-identification/submit
--------------------------------------------------------------------

'''Project # 10'''
Group Members: Glen Chalatov, Ronnie Feng, Ki Beom Lee, Patrick Li

'''Title:''' Approximation of Lift-and-Project Methods using Large Hidden Layers; A Comparison to Kernel Methods for Manifold Learning

Kernel methods aim to transform features into a higher dimensional space with reasonable computational cost. Using the kernel trick, we can induce nonlinear patterns into our data through linear transformations. We wish to replicate the behaviour of such methods using neural networks. In contrast to autoencoders which perform dimensionality reduction on data, we use the opposite neural network structure to perform dimensionality lift (increase the dimensionality of our data) as follows:

-Feed d-dimensional data into a neural network

-Using hidden layer(s) of dimension p >> d, we attempt to project our data into higher dimensions

-Using a d-dimensional output layer and a loss function that penalizes difference between the output and input, we tune our network and output the p-dimensional hidden layer to arrive at our lifted feature space.
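
The three steps above can be sketched as follows. This is only a minimal NumPy illustration under stated assumptions, not the group's implementation: the function name <code>lift_features</code>, the single tanh hidden layer, the learning rate, and the use of plain gradient descent are all illustrative choices.

```python
import numpy as np

def lift_features(X, p, epochs=300, lr=0.05, seed=0):
    """Train a d -> p -> d network to reconstruct X (hypothetical helper).

    Returns the p-dimensional hidden activations (the lifted features)
    and the reconstruction loss recorded at every epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, p))
    b1 = np.zeros(p)
    W2 = rng.normal(scale=0.1, size=(p, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)      # hidden layer with p >> d units
        X_hat = H @ W2 + b2           # d-dimensional output layer
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / (n * d)
        gW2 = H.T @ g_out
        gb2 = g_out.sum(axis=0)
        g_hid = (g_out @ W2.T) * (1.0 - H ** 2)
        gW1 = X.T @ g_hid
        gb1 = g_hid.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    lifted = np.tanh(X @ W1 + b1)     # output the hidden layer, not X_hat
    return lifted, losses

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(50, 3))
    lifted, losses = lift_features(X, p=20)
    print(lifted.shape, losses[0], losses[-1])
```

The lifted matrix has one row per data point and p columns, and could then be passed to any downstream learner in place of the original d-dimensional features.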

Our project will analyze and contrast the performance of our lifted feature space under a variety of conditions and applications. Our current hypothesis is that these methods will allow for greater flexibility in pattern recognition, but will be more prone to overfitting. If time allows, we will compare our method against traditional lift-and-project ideas from semidefinite optimization.

--------------------------------------------------------------------

'''Project # 11'''
Group Members: Zheng Ma, Jiazhen Chen, Ruijie Zhang, Charupriya Sharma

'''Title:''' Deep Learning Based Automatic Theorem Prover

'''Description:'''

“Formal logic is the science of deduction. It aims to provide systematic means for telling whether or not given conclusions follow from given premises, i.e., whether arguments are valid or invalid” [JEFFREY]

Automatic theorem provers (ATPs) for first order logic have been an active research area in mathematics and computer science. In recent years, several efforts have been made to incorporate machine learning into ATPs, boosting their performance. Rocktäschel and Riedel [1] gave an implementation of a Neural Theorem Prover (NTP), which is an end-to-end differentiable version of an automated theorem prover. NTPs are differentiable with respect to symbol representations in a knowledge base, which enables us to learn representations of symbols in ground atoms and parameters of first-order rules of predefined structure using backpropagation. S. Loos, et al. [2] tried to incorporate a convolutional neural network into a refutation-based ATP, "E". In their work, the convolutional neural network is used to provide heuristics for the ATP in place of human-engineered heuristics. C. Kaliszyk, et al. [3] modified a tableaux-based ATP (leanCoP), implementing reinforcement learning and Monte-Carlo search as the guidance method.

In our project we aim to investigate ATPs that use deep learning as guidance. In particular, we plan to adapt C. Kaliszyk's reinforcement learning implementation, build a deep reinforcement learning based ATP, and compare its performance with other contemporary ATPs. We would like to investigate if and how deep learning can be used to help automatic reasoning.

'''References'''

[1] T. Rocktäschel, and S. Riedel. "Learning knowledge base inference with neural theorem provers." Proceedings of the 5th Workshop on Automated Knowledge Base Construction. 2016.

[2] S. Loos, et al. "Deep Network Guided Proof Search." Proceedings of the 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning. 2017.

[3] C. Kaliszyk, et al. "Reinforcement Learning of Theorem Proving." NIPS 2018.
--------------------------------------------------------------------

'''Project # 12'''
Group Members: Travis Bender, Ivan Li, Aileen Li, Xudong Peng

'''Title:''' Airbus Ship Detection Challenge

'''Description:'''

Our project is based on the kaggle competition https://www.kaggle.com/c/airbus-ship-detection. The competition requires accurate identification of ships in 768x768 images by providing a bounded box for every ship in an image. Convolution Neural Networks are an existing class of model that can effectively perform this task. Once accurate classification is achieved, optimization is the next step. Reducing the footprint of our model is a key component of a practical algorithm. To that end, model compression is required to balance both model accuracy and speed. Pruning techniques, such as https://openreview.net/forum?id=r1g5b2RcKm, will be used to create a model that accurately bounds the ships in the image in computationally efficient manner.

Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling

2018-11-30T21:30:52Z

S366chen: Undo revision 42117 by S366chen (talk)

This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here].

= Introduction and Motivation =

In recent years, reinforcement learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to outperform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states is finite. Deep CNNs are not directly applicable to large, non-convex continuous action spaces. To address this, the authors conduct a policy search with an efficient stochastic continuous action search on top of policy samples generated from a deep CNN. The deep CNN still discretizes the state space and the action space. In the stochastic continuous action search, however, the restriction of the deterministic discretization is lifted and a local search procedure is conducted in a physical simulator with continuous action samples. In this way, the benefits of both deep neural networks and physical simulators can be realized.

Interacting with the real world (e.g., a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems, because deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement.

This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.

Curling is chosen as the test domain due to its large action space, potential for complicated strategies, and need for precise interactions.

== Curling ==

Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections provide a background on the game play and potential challenges/concerns for learning algorithms. A terminology section follows.

=== Game play ===

A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.

When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).

Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:

* Draw: Throw a rock to a target location
* Freeze: Draw a rock up against another rock
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)

=== Challenges for AI ===

Curling offers many challenges for AI based on its physics and rules. This section lists a few concerns.

The effect of changing actions can be highly nonlinear and discontinuous. For example, a 1-cm deviation in a path can make the difference between a high-speed collision and no collision at all.

Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.

Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It can therefore be a good strategy to avoid scoring only a single point in an end (for example, when already ahead in points), as scoring would hand the hammer, and the advantage, to the opposing team.

Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.

=== Terminology ===

* End: A round of the game
* House: The end of the sheet of ice, which contains the concentric scoring rings
* Hammer: The team that throws the last rock of an end 'has the hammer'
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.
* Back Line: thick line drawn just behind the house. Rocks that pass this line are removed from play.

== Related Work ==

=== AlphaGo Lee ===

AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol.

Go game:
* Start with an empty 19x19 board
* One player takes the black stones and the other takes the white stones
* The two players take turns placing stones on the board
* Rules:
** If a connected group of stones is completely surrounded by the opponent's stones, it is removed from the board
** Ko rule: a play may not repeat a previous board position
* The game ends when there are no valuable moves left on the board
* Count the territory of both players
* Add 7.5 points to white's score (called komi)
[[File:go.JPG|700px|center]]

Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.

The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert game play data and improved through self-play.

The value network evaluates the probability of winning given a board configuration. It consists of a CNN with 14 hidden layers, and it is trained using self-play data from the policy network.

Finally, the two networks are combined using Monte-Carlo Tree Search, which performs look ahead search to select the actions for game play.

The use of both policy and value networks is reflected in this paper's work.

=== AlphaGo Zero ===

AlphaGo Zero (Silver et al., 2017, [6]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need of expert training.
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.

This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.

This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it is able to learn tabula rasa from the strongest player in the world: AlphaGo itself.

Other differences from the previous AlphaGo iterations are as follows. AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features. It uses one neural network rather than two. Earlier versions of AlphaGo used a “policy network” to select the next move to play and a “value network” to predict the winner of the game from each position. These are combined in AlphaGo Zero, allowing it to be trained and evaluated more efficiently. AlphaGo Zero does not use “rollouts”, the fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions. All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change that makes the system much more powerful and efficient.

The unification of networks and self-play are also reflected in this paper.

=== Curling Algorithms ===

Some past algorithms have been proposed to deal with continuous action spaces. For example, (Yamamoto et al., 2015, [7]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.

=== Monte Carlo Tree Search ===

Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states with knowledge of paths of execution through past games. One such MCTS variant, KR-UCT, has been applied to continuous action spaces; it finds effective selections by using kernel regression (KR) and kernel density estimation (KDE) to estimate rewards from neighborhood information.

For bandit problems, hierarchical optimistic optimization (HOO) has been used to create a cover tree that divides the action space into small ranges at different depths, so that the most promising nodes yield fine-grained estimates.

=== Curling Physics and Simulation ===

Several references in the paper concern the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. Because modelling the changes in friction on ice is not possible, a fixed friction coefficient was predefined in the simulation. The behavior of the stones was also modeled. Important parameters were trained from professional players, and the authors used the same parameters in this paper.

== General Background of Algorithms ==

=== Policy and Value Functions ===

A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.

POLICY IMPROVEMENT: LEARNING ACTION POLICY

Action policy <math> p_{\rho}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. Here <math> \rho </math> denotes the weights of a neural network that approximates the policy, <math>s</math> denotes a state, and <math>a</math> denotes an action taken in the environment. The policy is a function that returns an action given the state the agent is in. Policy gradient reinforcement learning can be used to train the action policy. The weights are updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step <math>t</math>,
\[ \Delta \rho \propto \frac{\partial \log p_{\rho}(a_t|s_t)}{\partial \rho} r(s_t) \]
where <math> r(s_t) </math> is the return.
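
As a rough illustration of this update, the sketch below applies one stochastic gradient ascent step in the standard log-likelihood (REINFORCE) form to a linear softmax policy. The linear parameterization, the function names, and the learning rate are assumptions made for the example; the paper itself uses a deep CNN.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(rho, state, action, ret, lr=0.1):
    """One ascent step on ret * log p_rho(action | state).

    rho has shape (n_features, n_actions); the policy is a linear
    softmax over the logits state @ rho (an illustrative choice).
    """
    probs = softmax(state @ rho)
    grad_logits = -probs
    grad_logits[action] += 1.0       # d log p(a|s) / d logits = onehot(a) - p
    return rho + lr * ret * np.outer(state, grad_logits)

if __name__ == "__main__":
    rho = np.zeros((3, 4))
    s = np.array([1.0, 0.5, -0.2])
    before = softmax(s @ rho)[2]
    rho = policy_gradient_step(rho, s, action=2, ret=1.0)
    after = softmax(s @ rho)[2]
    print(before, after)             # a positive return raises p(a=2 | s)
```

A positive return makes the chosen action more likely in that state, and a negative return makes it less likely.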

POLICY EVALUATION: LEARNING VALUE FUNCTIONS

A value function <math> v_{\theta}(s) </math> with parameters <math> \theta </math> is trained to estimate the value of being in a certain state. It is trained on records of state-outcome pairs <math> (s, r(s)) </math> by using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]
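
For a concrete reading of this rule, the sketch below assumes a linear value function <math> v_{\theta}(s) = \theta \cdot s </math>, for which <math> \partial v_{\theta}(s) / \partial \theta = s </math>. The linear form, the function name, and the learning rate are illustrative assumptions, not the paper's network.

```python
import numpy as np

def value_update(theta, state, outcome, lr=0.05):
    """One SGD step on the squared error between outcome and v_theta(state).

    Uses a linear value function v_theta(s) = theta . s, so the gradient
    of v with respect to theta is simply the state vector.
    """
    prediction = theta @ state
    return theta + lr * (outcome - prediction) * state

if __name__ == "__main__":
    theta = np.zeros(2)
    s = np.array([1.0, 2.0])
    for _ in range(200):
        theta = value_update(theta, s, outcome=3.0)
    print(theta @ s)                 # approaches the recorded outcome
```

Each step moves the prediction a fraction of the way towards the observed outcome, which is the behavior the proportionality above describes.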

=== Monte Carlo Tree Search ===

Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (e.g., in curling, only 16 moves, i.e. stone throws, are made each end).

MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as Go and curling.

Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node has been considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.

MCTS begins with the '''selection''' phase, which involves traversing known states/actions. Beginning at the root node, the algorithm selects the child with the highest 'score'. From each successive node, a path down to a leaf node is explored in a similar fashion.

The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (ie: the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' run from their states.

Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.

The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (ie: win/lose) is updated in the statistics of all parent nodes.
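
The four phases can be put together in a short sketch. The toy game used here (a single pile of stones, each player removes 1 to 3, whoever takes the last stone wins) is an assumption chosen only because it is small; it is not related to curling. Selection uses the UCT score described later in this section, and all class and function names are hypothetical.

```python
import math
import random

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones      # stones left in the pile
        self.player = player      # player to move at this node (0 or 1)
        self.parent = parent
        self.move = move          # move that led to this node
        self.children = []
        self.visits = 0
        self.wins = 0.0           # wins for the player who moved into this node

    def untried_moves(self):
        tried = {c.move for c in self.children}
        return [m for m in (1, 2, 3) if m <= self.stones and m not in tried]

def uct(child, parent_visits, c=math.sqrt(2)):
    return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def playout(stones, player):
    """Random playout; returns the player who takes the last stone."""
    while True:
        stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            return player
        player = 1 - player

def mcts(root_stones, iterations, seed=0):
    random.seed(seed)
    root = Node(root_stones, player=0)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded and not terminal.
        while node.stones > 0 and not node.untried_moves():
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # 2. Expansion: add one child that has not been visited yet.
        if node.stones > 0:
            move = random.choice(node.untried_moves())
            child = Node(node.stones - move, 1 - node.player, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: play the game out (or read off the result if terminal).
        if node.stones == 0:
            winner = 1 - node.player  # the previous player took the last stone
        else:
            winner = playout(node.stones, node.player)
        # 4. Back-propagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.player:
                node.wins += 1
            node = node.parent
    return root

root = mcts(root_stones=5, iterations=200)
best = max(root.children, key=lambda ch: ch.visits)
print("most visited move:", best.move)
```

The move played is usually taken to be the most visited child of the root rather than the one with the highest win rate, since visit counts are less noisy.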

A selection function known as Upper Confidence Bound applied to Trees (UCT) can be used for choosing which node to select. The formula is shown below [[https://www.baeldung.com/java-monte-carlo-tree-search source]]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, will grow when sibling nodes are expanded. This means that unexplored nodes will gradually increase their UCT score, and be selected in the future.

<math> \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}} </math>

In which

* <math> w_i = </math> number of wins after <math> i</math>th move
* <math> n_i = </math> number of simulations after <math> i</math>th move
* <math> c = </math> exploration parameter (theoretically equal to <math> \sqrt{2}</math>)
* <math> t = </math> total number of simulations for the parent node
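
A small numeric example may help show how the two terms trade off. The win and visit counts below are made up for illustration, and treating an unvisited node as having an infinite score is a common convention assumed here rather than something stated in the sources.

```python
import math

def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT score: average result plus an exploration bonus."""
    if visits == 0:
        return math.inf  # assumption: unvisited nodes are always tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Two sibling nodes under a parent with 12 simulations in total.
well_explored = uct_score(wins=6, visits=10, parent_visits=12)
rarely_tried = uct_score(wins=1, visits=2, parent_visits=12)
print(well_explored, rarely_tried)
```

Although the first node has the better win rate (0.6 against 0.5), the second node receives the higher score because it has been tried so few times.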

Sources: 2,3,4

[[File:MCTS_Diagram.jpg | 500px|center]]

=== Kernel Regression ===

Kernel regression is a form of weighted averaging which uses a kernel function as a weight to estimate the conditional expectation of a random variable. Given two data points '''x''', each of which has an associated value '''y''', and a choice of kernel '''K''', the kernel function outputs a weighting factor. An estimate of the value of a new, unseen point is then calculated as the weighted average of the values of surrounding points.

A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).

[[File:gaussian_kernel.png | 400 px]]

[[File:kernel_regression.png | 250 px]]

The denominator of the conditional expectation is related to kernel density estimation, which is defined as <math display="inline">W(x)=\sum_{i=0}^n K(x,x_i)</math>.

In this case, the combination of the two acts to weight the scores of samples closest to '''x''' more strongly.
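
The following is a minimal sketch of kernel regression and kernel density estimation for one-dimensional data. The exact kernel in the figure above is not reproduced here; a standard Gaussian kernel with a hypothetical <code>bandwidth</code> parameter is assumed.

```python
import math

def gaussian_kernel(x, xi, bandwidth=1.0):
    """Similarity between two points; equals 1 when they coincide."""
    return math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2))

def kernel_density(x, xs, bandwidth=1.0):
    """W(x): sum of kernel weights of all samples around x."""
    return sum(gaussian_kernel(x, xi, bandwidth) for xi in xs)

def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Weighted average of ys, with weights given by closeness to x."""
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]
estimate = kernel_regression(1.0, xs, ys, bandwidth=0.5)
print(estimate, kernel_density(1.0, xs, bandwidth=0.5))
```

A smaller bandwidth makes the estimate follow the nearest sample more closely, while a larger one averages over more neighbours.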

= Methods =

== Variable Definitions ==

The following variables are used often in the paper:

* <math>s</math>: A state in the game, as described below as the input to the network.
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game
* <math>a_t</math>: The action taken in state <math>s_t</math>
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS
* <math>n_{a_t}</math>: The number of visits to node <math>a_t</math> in MCTS
* <math>v_{a_t}</math>: The MCTS value estimate of a node

== Network Design ==

The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.

=== Shared Structure ===

The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:

[[File:curling_network_layers.png|600px|thumb|center|Figure 2. A detail description of our policy-value network. The shared network is composed of one convolutional layer and nine residual blocks. Each residual block (explained in b) has two convolutional layer with batch normalization (Ioffe & Szegedy, 2015[11]) followed by the addition of the input and the residual block. Each layer in the shared network uses 3x3 filters. The policy head has two more convolutional layers, while the value head has two fully connected layers on top of a convolutional layer. For the activation function of each convolutional layer, ReLU (Nair & Hinton[12]) is used.]]

The input to this network is the following:
* Location of stones
* Order to tee (the center of the sheet)
* A 32x32 grid representation of the ice sheet, representing which stones are present in each grid cell.

The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.

=== Policy Network ===

The policy head is created by adding 2 convolutional layers with two 3x3 filters to the main body of the network. The output of the policy head is a probability distribution over a 32x32x2 set of actions, from which the best shot is selected. The actions represent target locations in the grid and the spin direction of the stone.

[[File:policy-value-net.PNG | 700px]]
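
To make the size of the discretized action space concrete, the sketch below maps a flat index over the 32x32x2 outputs to a grid cell and a spin direction. The ordering of the axes (spin first, then row, then column) is an assumption for illustration; the summary does not state how the authors lay out the output.

```python
GRID = 32   # target locations along each axis of the grid
SPINS = 2   # two spin directions

def encode_action(row, col, spin):
    """Flatten (row, col, spin) into a single index."""
    return spin * GRID * GRID + row * GRID + col

def decode_action(index):
    """Recover (row, col, spin) from a flat index."""
    if not 0 <= index < GRID * GRID * SPINS:
        raise ValueError("index outside the action space")
    spin, cell = divmod(index, GRID * GRID)
    row, col = divmod(cell, GRID)
    return row, col, spin

print(GRID * GRID * SPINS)                      # number of discrete actions
print(decode_action(encode_action(5, 7, 1)))
```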

=== Value Network ===

The value head is created by adding a convolutional layer with one 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent the probabilities of scores in the range [-8,8], which are the possible scores in each end of a curling game.
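
Since the value head outputs a distribution rather than a single number, an expected end score can be read off it. The sketch below assumes that output unit <code>k</code> corresponds to a score of <code>k - 8</code>, which is a natural but unconfirmed ordering, and the probabilities used are made up.

```python
SCORES = list(range(-8, 9))  # the 17 possible scores of one end

def expected_score(probs):
    """Expected end score under a 17-way score distribution."""
    if len(probs) != len(SCORES):
        raise ValueError("expected one probability per possible score")
    return sum(p * s for p, s in zip(probs, SCORES))

probs = [0.0] * 17
probs[SCORES.index(-1)] = 0.25
probs[SCORES.index(1)] = 0.5
probs[SCORES.index(2)] = 0.25
print(expected_score(probs))
```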

== Continuous Action Search ==

The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.

Actions in the continuous space are generated using an MCTS algorithm, with the following steps:

=== Selection ===

From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.

[[File:curling_kernel_equations.png | 400px]]

The UCB formula is then used to select an action to expand.

The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.
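
The selection step can be sketched as follows. This is not a transcription of the equations in the figure above; it assumes the estimates take the usual kernel-regression form, in which the value of an action is a visit-weighted kernel average over all visited actions and its effective visit count is the kernel density. One-dimensional actions, the Gaussian kernel, and all names are assumptions.

```python
import math

def kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def kr_estimates(a, actions, values, visits, bandwidth=1.0):
    """Return (kernel-regressed value, kernel-density visit count) for action a."""
    weights = [kernel(a, b, bandwidth) * n for b, n in zip(actions, visits)]
    density = sum(weights)
    value = sum(w * v for w, v in zip(weights, values)) / density
    return value, density

def select_action(actions, values, visits, c=1.0, bandwidth=1.0):
    """Pick the visited action with the highest UCB score on the smoothed statistics."""
    stats = [kr_estimates(a, actions, values, visits, bandwidth) for a in actions]
    total = sum(density for _, density in stats)

    def ucb(i):
        value, density = stats[i]
        return value + c * math.sqrt(math.log(total) / density)

    return actions[max(range(len(actions)), key=ucb)]

actions = [0.0, 0.1, 5.0]
values = [1.0, 1.0, 0.0]
visits = [10, 10, 1]
print(select_action(actions, values, visits, c=0.0))
print(select_action(actions, values, visits, c=100.0))
```

With no exploration bonus the search stays in the well-scored region near 0, while a large bonus sends it to the isolated, rarely visited action.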

=== Expansion ===

The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times. The authors utilize a widening approach to overcome problems with standard UCT performing a shallow search when there is a large action space.
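
The widening idea can be illustrated with a common progressive-widening rule, in which a node may add a child only while its number of children is below <code>c * N^alpha</code> for <code>N</code> visits. This particular rule and its constants are assumptions for the sketch; the summary does not give the exact criterion used in the paper.

```python
def should_expand(num_children, total_visits, c=1.0, alpha=0.5):
    """Allow a new child only while the node has fewer than c * N^alpha children."""
    return num_children < c * total_visits ** alpha

def simulate_widening(iterations, c=1.0, alpha=0.5):
    """Track how the number of children grows as a node is visited."""
    children = 0
    history = []
    for visits in range(1, iterations + 1):
        if should_expand(children, visits, c, alpha):
            children += 1
        history.append(children)
    return history

history = simulate_widening(99)
print(history[0], history[-1])
```

The number of children grows roughly with the square root of the visit count, so existing actions are revisited many times before a new one is added.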

=== Simulation ===

Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.

=== Backpropagation ===

Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.

== Supervised Learning ==

During supervised training, data is gathered from the program AyumuGAT'16 ([8]), a high-performance AI curling program that is also based on an MCTS algorithm. 400 000 state-action pairs were generated for this training.

=== Policy Network ===

The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.

=== Value Network ===

The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state ''d'' steps ahead, <math>s_{t+d}</math>, was generated. This process dealt with uncertainty by considering all actions in this rollout to have no uncertainty, and allowing uncertainty in the last action, ''a<sub>t+d-1</sub>''. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.

=== Policy-Value Network ===

The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:

* Algorithm: stochastic gradient descent
* Batch size: 256
* Momentum: 0.9
* L2 regularization: 0.0001
* Training time: ~100 epochs
* Learning rate: initialized at 0.01, reduced twice

A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:

[[File:curling_loss_function.png | 300px]]
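
As described above, the loss is the sum of two cross-entropy terms. The sketch below shows that structure on plain Python lists; the small <code>eps</code> added inside the logarithm and the function names are assumptions, and any regularization term is left out.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def policy_value_loss(pi_target, p_pred, z_target, v_pred):
    """Multi-task loss: policy cross-entropy plus value cross-entropy."""
    return cross_entropy(pi_target, p_pred) + cross_entropy(z_target, v_pred)

loss = policy_value_loss(
    pi_target=[1.0, 0.0], p_pred=[0.5, 0.5],   # policy: taken action vs. prediction
    z_target=[0.0, 1.0], v_pred=[0.5, 0.5],    # value: observed score vs. prediction
)
print(loss)
```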

== Self-Play Reinforcement Learning ==

After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.

At a game state ''s<sub>t</sub>'':

1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is an estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.

2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.

It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.

The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.

It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.

== Long-Term Strategy Learning ==

Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.

The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.
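
The iterative computation can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions: the same one-end score distribution is used for every end (ignoring who holds the hammer), a tied game counts as half a win, and score differences are clipped to a fixed range. The authors' actual table may be built differently.

```python
def win_table(score_probs, max_ends, max_diff=20):
    """table[n][d] = probability of winning with n ends left and score difference d."""
    def terminal(d):
        return 1.0 if d > 0 else 0.0 if d < 0 else 0.5  # assumption: a tie is half a win

    diffs = range(-max_diff, max_diff + 1)
    table = [{d: terminal(d) for d in diffs}]  # no ends left
    for _ in range(max_ends):
        prev = table[-1]
        row = {}
        for d in diffs:
            total = 0.0
            for k, p in score_probs.items():
                nd = max(-max_diff, min(max_diff, d + k))  # clip to the table range
                total += p * prev[nd]
            row[d] = total
        table.append(row)
    return table

# Hypothetical one-end score distribution: win or lose the end by one point.
table = win_table({-1: 0.5, 1: 0.5}, max_ends=2)
print(table[1][1], table[2][0])
```

With such a table, a shot can be chosen to maximize the looked-up winning percentage after the end rather than the expected score of the end itself.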

== Final Algorithms ==

The authors make use of the following versions of their algorithm:

=== KR-DL ===

''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.

=== KR-DRL ===

''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (ie: initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUs and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.

=== KR-DRL-MES ===

''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.

= Testing and Results =
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [9]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.

== Comparison of KR-DL-UCT and DL-UCT ==

The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.

As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first shot, and 1000 with it taking the second. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.

<center>[[File:curling_KR_test.png | 400px]]</center>

== Matches ==

Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'', also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform this. Figure 4 shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors. They also played some games between their KR-DRL-MES and notable programs. Table 1 shows the details of the match results. ''JiritsukunGAT'17'' shows a similar level of performance but KR-DRL-MES is still the winner.

[[File:curling_ratings.png|600px|thumb|center|Figure 4. Elo rating and winning percentages of our models and GAT rankers. Each match has 200 games (each program plays 100 pre-ordered games), because the player which has the last shot (the hammer shot) in each end would have an advantage.]]

[[File:ttt.png|600px|thumb|center|Table 1. The 8-end game results for KR-DRL-MES against other programs alternating the opening player each game. The matches are held by following the rules of the latest GAT competition.]]

= Conclusion & Critique =

The authors have presented a new framework which incorporates a deep neural network for learning game strategy with a kernel-based Monte Carlo tree search from a continuous space. Without the use of any hand-crafted features, their policy-value network is successfully trained using supervised learning followed by reinforcement learning with a high-fidelity simulator for the Olympic sport of curling. The following are my critiques of the paper:

== Strengths ==

This algorithm out-performs other high-performance algorithms (including past competition champions).

I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.

The authors do seem to be able to adopt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.

== Weaknesses ==

Sometimes, I found this paper difficult to follow. One problem was that the algorithms were introduced first, and then how they were used was described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run and at what stage of the algorithm (ex: MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.

While I think the comparison of different algorithms was done well, I believe it still lacked significant details. There were one-off statements in the paper which would have been nice to see backed by results. These include the statement that having a policy-value network in place of two networks led to better performance.

At this point, the algorithms used still rely on initialization by a pre-made program.

There was little theoretical development or justification done in this paper.

While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key points of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges but allowed higher-fidelity simulations/training.

While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.

=References=
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)
# https://www.baeldung.com/java-monte-carlo-tree-search
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
# https://int8.io/monte-carlo-tree-search-beginners-guide/
# https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, pp. 484–489, 2016.
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nature, pp. 354–359, 2017.
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.
# Ohto, K. and Tanaka, T. A curling agent based on the Monte-Carlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.
# Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, ICML, pp. 448–456, 2015.
# Nair, V. and Hinton, G. Rectified linear units improve restricted boltzmann machines.

Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling

2018-11-30T21:30:27Z

S366chen: /* Introduction and Motivation */

This page provides a summary and critique of the paper '''Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling''' [[http://proceedings.mlr.press/v80/lee18b/lee18b.pdf Online Source]], published in ICML 2018. The source code for this paper is available [https://github.com/leekwoon/KR-DL-UCT here]

= Introduction and Motivation =

In recent years, Reinforcement Learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to out-perform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space; the number of actions a player can take and/or the number of possible game states are finite.

Interacting with the real world (e.g., a scenario that involves moving physical objects) typically involves working with a continuous action space. It is thus important to develop strategies for dealing with continuous action spaces. Deep neural networks that are designed to succeed in finite action spaces are not necessarily suitable for continuous action space problems. This is because deterministic discretization of a continuous action space causes strong biases in policy evaluation and improvement.

This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and then a continuous action search is performed on these discrete results.

Curling is chosen as a domain to test the network on. Curling was chosen due to its large action space, potential for complicated strategies, and need for precise interactions.

== Curling ==

Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target on the other end of the sheet than the other team. The next sections will provide a background on the game play, and potential challenges/concerns for learning algorithms. A terminology section follows.

=== Game play ===

A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks to the other end of the ice sheet, known as the house. Rocks must land in a certain area in order to stay in play, and must touch or be inside concentric rings (12ft diameter and smaller) in order to score points. At the end of each end, the team with rocks closest to the center of the house scores points.

When throwing a rock, the curler can spin the rock. This allows the rock to 'curl' its path towards the house and can allow rocks to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock in order to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).

Curling offers many possible high-level actions, which are directed by a team member to the throwing member. An example set of these includes:

* Draw: Throw a rock to a target location
* Freeze: Draw a rock up against another rock
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions
* Guard: Place a rock in front of another, to block other rocks (ex: takeouts)

=== Challenges for AI ===

Curling offers many challenges for AI based on its physics and rules. This section lists a few concerns.

The effect of changing actions can be highly nonlinear and discontinuous. This can be seen when considering that a 1-cm deviation in a path can make the difference between a high-speed collision, or lack of collision.

Curling will require both offensive and defensive strategies. For example, consider the fact that the last team to throw a rock each end only needs to place that rock closer than the opposing team's rocks to score a point and invalidate any opposing rocks in the house. The opposing team should thus be considering how to prevent this from happening, in addition to scoring points themselves.

Curling also has a concept known as 'the hammer'. The hammer belongs to the team which throws the last rock each end, providing an advantage, and is given to the team that does not score points each end. It could very well be a good strategy to try not to win a single point in an end (if already ahead in points, etc), as this would give the advantage to the opposing team.

Finally, curling has a rule known as the 'Free Guard Zone'. This applies to the first 4 rocks thrown (2 from each team). If they land short of the house, but still in play, then the rocks are not allowed to be removed (via collisions) until all of the first 4 rocks have been thrown.

=== Terminology ===

* End: A round of the game
* House: The end of the sheet of ice, which contains
* Hammer: The team that throws the last rock of an end 'has the hammer'
* Hog Line: thick line that is drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.
* Back Line: thick line drawn just behind the house. Rocks that pass this line are removed from play.

== Related Work ==

=== AlphaGo Lee ===

AlphaGo Lee (Silver et al., 2016, [5]) refers to an algorithm used to play the game Go, which was able to defeat international champion Lee Sedol.

Go game:
* Start with 19x19 empty board
* One player takes black stones and the other takes white stones
* Two players take turns to put stones on the board
* Rules:
1. If one connected part is completely surrounded by the opponent's stones, remove it from the board

2. Ko rule: Forbids a board play that repeats a board position
* End when there are no valuable moves on the board.
* Count the territory of both players.
* Add 7.5 points to white's points (called Komi).
[[File:go.JPG|700px|center]]

Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.

The AlphaGo Lee policy network predicts the best move given a board configuration. It has a CNN architecture with 13 hidden layers, and it is trained using expert game play data and improved through self-play.

The value network evaluates the probability of winning given a board configuration. It consists of a CNN with 14 hidden layers, and it is trained using self-play data from the policy network.

Finally, the two networks are combined using Monte-Carlo Tree Search, which performs look ahead search to select the actions for game play.

The use of both policy and value networks is reflected in this paper's work.

=== AlphaGo Zero ===

AlphaGo Zero (Silver et al., 2017, [6]) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks and is trained on self-play, without the need of expert training.
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.

This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.

This technique is more powerful than previous versions of AlphaGo because it is no longer constrained by the limits of human knowledge. Instead, it is able to learn tabula rasa from the strongest player in the world: AlphaGo itself.

Other differences from the previous AlphaGo iterations are as follows. AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features. It uses one neural network rather than two. Earlier versions of AlphaGo used a “policy network” to select the next move to play and a “value network” to predict the winner of the game from each position. These are combined in AlphaGo Zero, allowing it to be trained and evaluated more efficiently. AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions. All of these differences help improve the performance of the system and make it more general. But it is the algorithmic change that makes the system much more powerful and efficient.

The unification of networks and self-play are also reflected in this paper.

=== Curling Algorithms ===

Some past algorithms have been proposed to deal with continuous action spaces. For example, (Yamamoto et al., 2015, [7]) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.

=== Monte Carlo Tree Search ===

Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail, balance exploration of different states with knowledge of paths of execution through past games. An MCTS variant called KR-UCT has been applied to continuous action spaces; it is able to find effective selections and uses kernel regression (KR) and kernel density estimation (KDE) to estimate rewards using neighborhood information.

For the bandit problem, scholars have used hierarchical optimistic optimization (HOO) to create a cover tree and divide the action space into small ranges at different depths, where the most promising nodes produce fine-granularity estimates.

=== Curling Physics and Simulation ===

Several references in the paper refer to the study and simulation of curling physics. Scholars have analyzed friction coefficients between curling stones and ice. Because modelling the changes in friction on ice is not possible, a fixed friction coefficient was predefined in the simulation. The behavior of the stones was also modeled. Important parameters were learned from professional players. The authors used the same parameters in this paper.

== General Background of Algorithms ==

=== Policy and Value Functions ===

A policy function is trained to provide the best action to take, given a current state. Policy iteration is an algorithm used to improve a policy over time. This is done by alternating between policy evaluation and policy improvement.

POLICY IMPROVEMENT: LEARNING ACTION POLICY

The action policy <math> p_{\sigma}(a|s) </math> outputs a probability distribution over all eligible moves <math> a </math>. Here <math> \sigma </math> denotes the weights of a neural network that approximates the policy, <math>s</math> denotes a state, and <math>a</math> denotes an action taken in the environment. The policy is a function that returns an action given the state the agent is in. Policy gradient reinforcement learning can be used to train the action policy. It is updated by stochastic gradient ascent in the direction that maximizes the expected outcome at each time step <math>t</math>,
\[ \Delta \sigma \propto \frac{\partial p_{\sigma}(a_t|s_t)}{\partial \sigma} r(s_t) \]
where <math> r(s_t) </math> is the return.

POLICY EVALUATION: LEARNING VALUE FUNCTIONS

A value function <math> v_{\theta}(s) </math> with parameters <math> \theta </math> is trained to estimate the value of being in a certain state. It is trained on records of state-outcome pairs <math> (s, r(s)) </math> by using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted regression value and the corresponding outcome,
\[ \Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}(r(s)-v_{\theta}(s)) \]

=== Monte Carlo Tree Search ===

Monte Carlo Tree Search (MCTS) is a search algorithm used for finite-horizon tasks (e.g., in curling, only 16 moves, or stone throws, are taken each end).

MCTS is a tree search algorithm similar to minimax. However, MCTS is probabilistic and does not need to explore a full game tree or even a tree reduced with alpha-beta pruning. This makes it tractable for games such as Go and curling.

Nodes of the tree are game states, and branches represent actions. Each node stores statistics on how many times it has been visited by the MCTS, as well as the number of wins encountered by playouts from that position. A node is considered 'visited' if a full playout has started from that node. A node is considered 'expanded' if all its children have been visited.

MCTS begins with the '''selection''' phase, which involves traversing known states/actions. Beginning at the root node, the algorithm selects the child with the highest 'score'. From each successive node, a path down toward a leaf node is followed in the same fashion.

The next phase, '''expansion''', begins when the algorithm reaches a node where not all children have been visited (i.e., the node has not been fully expanded). In the expansion phase, children of the node are visited, and '''simulations''' are run from their states.

Once the new child is expanded, '''simulation''' takes place. This refers to a full playout of the game from the point of the current node, and can involve many strategies, such as randomly taken moves, the use of heuristics, etc.

The final phase is '''update''' or '''back-propagation''' (unrelated to the neural network algorithm). In this phase, the result of the '''simulation''' (i.e., win/lose) is updated in the statistics of all parent nodes.

A selection function known as Upper Confidence Bound applied to Trees (UCT) can be used to choose which node to select. The formula is shown below [https://www.baeldung.com/java-monte-carlo-tree-search source]. Note that the first term essentially acts as an average score of games played from a certain node. The second term, meanwhile, grows for a node when its sibling nodes are visited instead of it. This means that rarely explored nodes will gradually increase their UCT score, and be selected in the future.

<math> \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}} </math>

In which

* <math> w_i = </math> number of wins after <math> i</math>th move
* <math> n_i = </math> number of simulations after <math> i</math>th move
* <math> c = </math> exploration parameter (theoretically equal to <math> \sqrt{2}</math>)
* <math> t = </math> total number of simulations for the parent node
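
The UCT formula above can be sketched in a few lines of Python. This is only an illustration: the function names <code>uct_score</code> and <code>select_child</code>, and the convention of giving unvisited nodes an infinite score so that they are tried first, are assumptions and are not taken from the paper.

```python
import math


def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """Return w_i / n_i + c * sqrt(ln(t) / n_i) for one child node."""
    if visits == 0:
        # Assumed convention: an unvisited child is always tried first.
        return math.inf
    exploitation = wins / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, parent_visits):
    """children is a list of (wins, visits) pairs; return the best index."""
    scores = [uct_score(w, n, parent_visits) for w, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

With two children holding (6 wins, 10 visits) and (1 win, 2 visits) under a parent with 12 visits, the second child is selected: its average score is lower, but its exploration term is much larger.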

Sources: 2,3,4

[[File:MCTS_Diagram.jpg | 500px|center]]

=== Kernel Regression ===

Kernel regression is a form of weighted averaging which uses a kernel function as a weight to estimate the conditional expectation of a random variable. Given two data points '''x''', each of which has an associated value '''y''', and a choice of kernel '''K''', the kernel function outputs a weighting factor. An estimate of the value of a new, unseen point is then calculated as the weighted average of the values of surrounding points.

A typical kernel is a Gaussian kernel, shown below. The formula for calculating estimated value is shown below as well (sources: Lee et al.).

[[File:gaussian_kernel.png | 400 px]]

[[File:kernel_regression.png | 250 px]]

The denominator of the conditional expectation is related to kernel density estimation, which is defined as <math display="inline">W(x)=\sum_{i=0}^n K(x,x_i)</math>.

In this case, the combination of the two acts to weight the scores of samples closest to '''x''' more strongly.
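
As a minimal sketch of the idea, the following Python code computes a kernel-weighted average (the Nadaraya-Watson estimator) with a Gaussian kernel. The exact kernel form <math>\exp(-\|x - x_i\|^2 / (2h^2))</math>, the <code>bandwidth</code> parameter, and the function names are assumptions made for illustration; the kernel used in the paper is the one shown in the figure above.

```python
import numpy as np


def gaussian_kernel(x, xs, bandwidth=1.0):
    """Return K(x, x_i) for every sample x_i in xs (assumed Gaussian form)."""
    xs = np.asarray(xs, dtype=float).reshape(len(xs), -1)
    x = np.asarray(x, dtype=float).reshape(1, -1)
    sq_dist = np.sum((xs - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Estimate E[y | x] as a kernel-weighted average of the observed ys."""
    weights = gaussian_kernel(x, xs, bandwidth)
    density = weights.sum()  # W(x), the kernel density term
    return float(np.dot(weights, np.asarray(ys, dtype=float)) / density)
```

For samples at 0 and 1 with values 0 and 2, the estimate at the midpoint 0.5 is 1, and the estimate at 0 is pulled toward the value of the nearer sample.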

= Methods =

== Variable Definitions ==

The following variables are used often in the paper:

* <math>s</math>: A state in the game, as described below as the input to the network.
* <math>s_t</math>: The state at a certain time-step of the game. Time-steps refer to full turns in the game.
* <math>a_t</math>: The action taken in state <math>s_t</math>
* <math>A_t</math>: The actions taken for sibling nodes related to <math>a_t</math> in MCTS
* <math>n_{a_t}</math>: The number of visits to node <math>a_t</math> in MCTS
* <math>v_{a_t}</math>: The MCTS value estimate of a node

== Network Design ==

The authors design a CNN called the 'policy-value' network. The network consists of a common network structure, which is then split into 'policy' and 'value' outputs. This network is trained to learn a probability distribution of actions to take, and expected rewards, given an input state.

=== Shared Structure ===

The network consists of 1 convolutional layer followed by 9 residual blocks, each block consisting of 2 convolutional layers with 32 3x3 filters. The structure of this network is shown below:

[[File:curling_network_layers.png|600px|thumb|center|Figure 2. A detailed description of our policy-value network. The shared network is composed of one convolutional layer and nine residual blocks. Each residual block (explained in b) has two convolutional layers with batch normalization (Ioffe & Szegedy, 2015[11]) followed by the addition of the input and the residual block. Each layer in the shared network uses 3x3 filters. The policy head has two more convolutional layers, while the value head has two fully connected layers on top of a convolutional layer. For the activation function of each convolutional layer, ReLU (Nair & Hinton[12]) is used.]]

The input to this network is the following:
* Location of stones
* Order to tee (the center of the sheet)
* A 32x32 grid representation of the ice sheet, indicating which stones are present in each grid cell.

The authors do not describe how the stone-based information is added to the 32x32 grid as input to the network.

=== Policy Network ===

The policy head is created by adding 2 convolutional layers with two 3x3 filters to the main body of the network. The output of the policy head is a probability distribution over a 32x32x2 set of actions, from which the best shot is selected. The actions represent target locations in the grid and the spin direction of the stone.

[[File:policy-value-net.PNG | 700px]]

=== Value Network ===

The value head is created by adding a convolutional layer with one 3x3 filter, and dense layers of 256 and 17 units, to the shared network. The 17 output units represent a probability distribution over scores in the range [-8,8], which are the possible scores at each end of a curling game.
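
To illustrate how the 17 value outputs can be read, the sketch below turns them into a probability distribution and computes the expected end score. It assumes the outputs are logits normalized with a softmax, which the summary does not state explicitly; the names <code>SCORES</code>, <code>softmax</code>, and <code>expected_end_score</code> are hypothetical.

```python
import numpy as np

# The 17 possible end scores, from -8 to 8.
SCORES = np.arange(-8, 9)


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def expected_end_score(value_logits):
    """Expected score of an end under the predicted score distribution."""
    probs = softmax(value_logits)
    return float(np.dot(probs, SCORES))
```

A uniform prediction gives an expected score of 0, while a prediction concentrated on the unit for +1 gives an expected score close to 1.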

== Continuous Action Search ==

The policy head of the network only outputs actions from a discretized action space. For real-life interactions, and especially in curling, this will not suffice, as very fine adjustments to actions can make significant differences in outcomes.

Actions in the continuous space are generated using an MCTS algorithm, with the following steps:

=== Selection ===

From a given state, the list of already-visited actions is denoted as A<sub>t</sub>. Scores and the number of visits to each node are estimated using the equations below (the first equation shows the expectation of the end value for one-end games). These are likely estimated rather than simply taken from the MCTS statistics to help account for the differences in a continuous action space.

[[File:curling_kernel_equations.png | 400px]]

The UCB formula is then used to select an action to expand.

The actions that are taken in the simulator appear to be drawn from a Gaussian centered around <math>a_t</math>. This allows exploration in the continuous action space.
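
The selection step can be sketched as follows. Since the equations are only available as an image, this sketch assumes the usual kernel-regression UCT form: the value of an action is a kernel-weighted average of the values of all visited actions, its effective visit count is a kernel density estimate, and the UCB bonus uses these smoothed counts. The Gaussian kernel, the <code>bandwidth</code> and <code>c</code> parameters, and the function names are illustrative assumptions, and every action is assumed to have been visited at least once.

```python
import numpy as np


def kr_statistics(actions, values, visits, bandwidth=1.0):
    """Return kernel-smoothed value estimates E and visit counts W per action."""
    A = np.asarray(actions, dtype=float).reshape(len(actions), -1)
    v = np.asarray(values, dtype=float)
    n = np.asarray(visits, dtype=float)
    sq_dist = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * bandwidth ** 2))  # K[a, b]
    W = K @ n                # kernel density estimate of the visit count
    E = (K @ (v * n)) / W    # kernel regression estimate of the value
    return E, W


def kr_ucb_select(actions, values, visits, c=1.0, bandwidth=1.0):
    """Pick the index of the action that maximizes the smoothed UCB score."""
    E, W = kr_statistics(actions, values, visits, bandwidth)
    bonus = c * np.sqrt(np.log(W.sum()) / W)
    return int(np.argmax(E + bonus))
```

When two visited actions are far apart relative to the bandwidth, the smoothed statistics reduce to the plain MCTS statistics; when they are close, they share information.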

=== Expansion ===

The authors use a variant of regular UCT for expansion. In this case, they expand a new node only when existing nodes have been visited a certain number of times. The authors utilize a widening approach to overcome problems with standard UCT performing a shallow search when there is a large action space.
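
A generic progressive-widening rule gives a feel for this kind of criterion: a new child is added only while the number of children is small relative to the visit count of the node. The rule below and its constants <code>k</code> and <code>alpha</code> are a common textbook form and are assumptions; the exact criterion used by the authors may differ.

```python
def should_expand(num_children, total_visits, k=1.0, alpha=0.5):
    """Generic progressive widening: expand while |children| < k * N**alpha.

    A node with no children is always allowed to expand.
    """
    return num_children == 0 or num_children < k * total_visits ** alpha
```

With the default constants, a node visited 4 times may hold at most 2 children, and a node visited 100 times at most 10.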

=== Simulation ===

Instead of simulating with a random game playout, the authors use the value network to estimate the likely score associated with a state. This speeds up simulation (assuming the network is well trained), as the game does not actually need to be simulated.

=== Backpropagation ===

Standard backpropagation is used, updating both the values and number of visits stored in the path of parent nodes.

== Supervised Learning ==

During supervised training, data is gathered from the program AyumuGAT'16 ([8]). This program is also based on both an MCTS algorithm, and a high-performance AI curling program. 400 000 state-action pairs were generated during this training.

=== Policy Network ===

The policy network was trained to learn the action taken in each state. Here, the likelihood of the taken action was set to be 1, and the likelihood of other actions to be 0.

=== Value Network ===

The value network was trained by 'd-depth simulations and bootstrapping of the prediction to handle the high variance in rewards resulting from a sequence of stochastic moves' (quote taken from paper). In this case, ''m'' state-action pairs were sampled from the training data. For each pair, <math>(s_t, a_t)</math>, a state ''d'' steps ahead, <math>s_{t+d}</math>, was generated. This process dealt with uncertainty by considering all actions in this rollout to have no uncertainty, and allowing uncertainty only in the last action, ''a<sub>t+d-1</sub>''. The value network is used to predict the value for this state, <math>z_t</math>, and the value is used for learning the value at ''s<sub>t</sub>''.

=== Policy-Value Network ===

The policy-value network was trained to maximize the similarity of the predicted policy and value, and the actual policy and value from a state. The learning algorithm parameters are:

* Algorithm: stochastic gradient descent
* Batch size: 256
* Momentum: 0.9
* L2 regularization: 0.0001
* Training time: ~100 epochs
* Learning rate: initialized at 0.01, reduced twice

A multi-task loss function was used. This takes the summation of the cross-entropy losses of each prediction:

[[File:curling_loss_function.png | 300px]]
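
As a rough sketch of such a multi-task loss, the code below sums the cross-entropy of the policy prediction and the cross-entropy of the value (score distribution) prediction. The names are hypothetical, and the L2 regularization term listed among the hyperparameters is left out for brevity.

```python
import numpy as np


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy -sum(t * log(p)) between two discrete distributions."""
    t = np.asarray(target, dtype=float)
    p = np.clip(np.asarray(predicted, dtype=float), eps, 1.0)
    return float(-np.sum(t * np.log(p)))


def policy_value_loss(pi_target, p_pred, z_target, v_pred):
    """Sum of the policy and value cross-entropy losses (no regularization)."""
    return cross_entropy(pi_target, p_pred) + cross_entropy(z_target, v_pred)
```

During supervised learning the policy target would be the one-hot vector of the recorded action, while during self-play it would be the distribution derived from the search.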

== Self-Play Reinforcement Learning ==

After initialization by supervised learning, the algorithm uses self-play to further train itself. During this training, the policy network learns probabilities from the MCTS process, while the value network learns from game outcomes.

At a game state ''s<sub>t</sub>'':

1) the algorithm outputs a prediction ''z<sub>t</sub>''. This is an estimate of game score probabilities. It is based on similar past actions, and computed using kernel regression.

2) the algorithm outputs a prediction <math>\pi_t</math>, representing a probability distribution of actions. These are proportional to estimated visit counts from MCTS, based on kernel density estimation.

It is not clear how these predictions are created. It would seem likely that the policy-value network generates these, but the wording of the paper suggests they are generated from MCTS statistics.

The policy-value network is updated by sampling data <math>(s, \pi, z)</math> from recent history of self-play. The same loss function is used as before.

It is not clear how the improved network is used, as MCTS seems to be the driving process at this point.

== Long-Term Strategy Learning ==

Finally, the authors implement a new strategy to augment their algorithm for long-term play. In this context, this refers to playing a game over many ends, where the strategy to win a single end may not be a good strategy to win a full game. For example, scoring one point in an end, while being one point ahead, gives the advantage to the other team in the next round (as they will throw the last stone). The other team could then use the advantage to score two points, taking the lead.

The authors build a 'winning percentage' table. This table stores the percentage of games won, based on the number of ends left, and the difference in score (current team - opposing team). This can be computed iteratively and using the probability distribution estimation of one-end scores.
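
The iterative computation can be sketched as a small dynamic program. This sketch makes several simplifying assumptions that are not from the paper: the same one-end score distribution is used in every end (so the effect of who holds the last stone is ignored), a tie after the final end counts as half a win, and score differences are clipped to <code>[-max_diff, max_diff]</code>. The name <code>win_table</code> is hypothetical.

```python
import numpy as np


def win_table(score_probs, num_ends, max_diff=20):
    """Return W with W[n, d + max_diff] = P(win | n ends left, difference d).

    score_probs[k + 8] is the probability that the team scores k in one end,
    for k in -8..8 (negative k means the opponent scores).
    """
    probs = np.asarray(score_probs, dtype=float)
    diffs = np.arange(-max_diff, max_diff + 1)
    W = np.zeros((num_ends + 1, diffs.size))
    # No ends left: the team wins if ahead; a tie counts as half a win.
    W[0] = np.where(diffs > 0, 1.0, np.where(diffs == 0, 0.5, 0.0))
    for n in range(1, num_ends + 1):
        for j, d in enumerate(diffs):
            for k, p in zip(range(-8, 9), probs):
                nd = int(np.clip(d + k, -max_diff, max_diff))
                W[n, j] += p * W[n - 1, nd + max_diff]
    return W
```

With a symmetric score distribution, a tied game has a winning percentage of 0.5 regardless of the number of ends left, and a lead raises it above 0.5.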

== Final Algorithms ==

The authors make use of the following versions of their algorithm:

=== KR-DL ===

''Kernel regression-deep learning'': This algorithm is trained only by supervised learning.

=== KR-DRL ===

''Kernel regression-deep reinforcement learning'': This algorithm is trained by supervised learning (i.e., initialized as the KR-DL algorithm), and again on self-play. During self-play, each shot is selected after 400 MCTS simulations of k=20 randomly selected actions. Data for self-play was collected over a week on 5 GPUs and generated 5 million game positions. The policy-value network was continually updated using samples from the latest 1 million game positions.

=== KR-DRL-MES ===

''Kernel regression-deep reinforcement learning-multi-ends-strategy'': This algorithm makes use of the winning percentage table generated from self-play.

= Testing and Results =
The authors use data from the public program AyumuGAT’16 to test. Testing is done with a simulated curling program [9]. This simulator does not deal with changing ice conditions, or sweeping, but does deal with stone trajectories and collisions.

== Comparison of KR-DL-UCT and DL-UCT ==

The first test compares an algorithm trained with kernel regression with an algorithm trained without kernel regression, to show the contribution that kernel regression adds to the performance. Both algorithms have networks initialised with the supervised learning, and then trained with two different algorithms for self-play. KR-DL-UCT uses the algorithm described above. The authors do not go into detail on how DL-UCT selects shots, but state that a constant is set to allow exploration.

As an evaluation, both algorithms play 2000 games against the DL-UCT algorithm, which is frozen after supervised training. 1000 games are played with the algorithm taking the first shot, and 1000 with it taking the second shot. The games were two-end games. The figure below shows each algorithm's winning percentage given different amounts of training data. While the DL-UCT outperforms the supervised-training-only DL-UCT algorithm, the KR-DL-UCT algorithm performs much better.

<center>[[File:curling_KR_test.png | 400px]]</center>

== Matches ==

Finally, to test the performance of their multiple algorithms, the authors run matches between their algorithms and other existing programs. Each algorithm plays 200 matches against each other program, 100 of which are played as the first-playing team, and 100 as the second-playing team. Only 1 program was able to out-perform the KR-DRL algorithm. The authors state that this program, ''JiritsukunGAT'17'', also uses a deep network and hand-crafted features. However, the KR-DRL-MES algorithm was still able to out-perform it. Figure 4 shows the Elo ratings of the different programs. Note that the programs in blue are those created by the authors. They also played some games between their KR-DRL-MES and notable programs. Table 1 shows the details of the match results. ''JiritsukunGAT'17'' shows a similar level of performance, but KR-DRL-MES is still the winner.

[[File:curling_ratings.png|600px|thumb|center|Figure 4. Elo rating and winning percentages of our models and GAT rankers. Each match has 200 games (each program plays 100 pre-ordered games), because the player which has the last shot (the hammer shot) in each end would have an advantage.]]

[[File:ttt.png|600px|thumb|center|Table 1. The 8-end game results for KR-DRL-MES against other programs alternating the opening player each game. The matches are held by following the rules of the latest GAT competition.]]

= Conclusion & Critique =

The authors have presented a new framework which incorporates a deep neural network for learning game strategy with a kernel-based Monte Carlo tree search from a continuous space. Without the use of any hand-crafted feature, their policy-value network is successfully trained using supervised learning followed by reinforcement learning with a high-fidelity simulator for the Olympic sport of curling. Following are my critiques on the paper:

== Strengths ==

This algorithm out-performs other high-performance algorithms (including past competition champions).

I think the paper does a decent job of comparing the performance of their algorithm to others. They are able to clearly show the benefits of many of their additions.

The authors do seem to be able to adopt strategies similar to those used in Go and other games to the continuous action-space domain. In addition, the final strategy needs no hand-crafted features for learning.

== Weaknesses ==

Sometimes, I found this paper difficult to follow. One problem was that the algorithms were introduced first, and then how they were used was described. So when the paper stated that self-play shots were taken after 400 simulations, it seemed unclear what simulations were being run and at what stage of the algorithm (e.g., MCTS simulations, simulations sped up by using the value network, full simulations on the curling simulator). In particular, both the MCTS statistics and the policy-value network could be used to estimate both action probabilities and state values, so it is difficult to tell which is used in which case. There was also no clear distinction between discrete-space actions and continuous-space actions.

While I think the comparison of different algorithms was done well, I believe it still lacked significant details. There were one-off claims mentioned in the paper which would have been nice to see as results. These include the statement that having a policy-value network in place of two networks led to better performance.

At this point, the algorithms used still rely on initialization by a pre-made program.

There was little theoretical development or justification done in this paper.

While curling is an interesting choice for demonstrating the algorithm, the fact that the simulations used did not support many of the key points of curling (ice conditions, sweeping) seems very limiting. Another game, such as pool, would likely have offered some of the same challenges while allowing higher-fidelity simulations/training.

While the spatial placements of stones were discretized in a grid, the curl of thrown stones was discretized to only +/-1. This seems like it may limit learning high- and low-spin moves. It should be noted that having zero spins is not commonly used, to the best of my knowledge.

=References=
# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)
# https://www.baeldung.com/java-monte-carlo-tree-search
# https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
# https://int8.io/monte-carlo-tree-search-beginners-guide/
# https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
# Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, pp. 484–489, 2016.
# Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, pp. 354–359, 2017.
# Yamamoto, M., Kato, S., and Iizuka, H. Digital curling strategy based on game tree search. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 474–480, 2015.
# Ohto, K. and Tanaka, T. A curling agent based on the Monte Carlo tree search considering the similarity of the best action among similar states. In Proceedings of Advances in Computer Games, ACG, pp. 151–164, 2017.
# Ito, T. and Kitasei, Y. Proposal and implementation of digital curling. In Proceedings of the IEEE Conference on Computational Intelligence and Games, CIG, pp. 469–473, 2015.
# Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, ICML, pp. 448–456, 2015.
# Nair, V. and Hinton, G. Rectified linear units improve restricted boltzmann machines.

Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling

2018-11-30T21:30:11Z

S366chen: /* Introduction and Motivation */

DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS

2018-11-30T20:30:25Z

S366chen: /* Related Work */

=Introduction=

It has been commonly believed that one major advantage of neural networks is their capability of modelling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.

With the growth in available computational power, neural networks have been able to solve many complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. In several areas, such as computational social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression, decision trees, etc., which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.

Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].

In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix. This approach is efficient because it avoids searching over an exponential solution space of interaction candidates by making an approximation of hidden unit importance at the first hidden layer via all weights above and doing a 2D traversal of the input weight matrix.

Note that in this paper, we only consider one specific type of neural network, the feedforward neural network. Based on the methodology discussed here, the authors suggest that an interpretation method can be built for other types of networks as well.

=Related Work=

1. Interaction Detection approaches:
* Conduct individual tests for all combinations of features, as in ANOVA and Additive Groves. Two-way ANOVA has been a standard method of performing pairwise interaction detection that involves conducting hypothesis tests for each interaction candidate by checking each hypothesis with F-statistics (Wonnacott & Wonnacott, 1972). Additive Groves is another method that conducts individual tests for interactions and hence must face the same computational difficulties; however, it is special because the interactions it detects are not constrained to any functional form.
* Define all interaction forms of interest, then later find the important ones.

- The paper's goal is to detect interactions without constraining their functional forms. The method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.

2. Interpretability: A lot of work has also been done in this particular area, and it can be divided into the following broad categories:
* Feature Importance through Decomposition: Methods like Input Gradient (Sundararajan et al., 2017) learn the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al. (2017), Murdoch (2017) and Murdoch (2018) study the interpretability of LSTMs by looking at phrase- and word-level importance scores. Bach et al. (2015) and Shrikumar et al. (2016) (DeepLift) study pixel importance in CNNs.
* Studying Visualizations in Models: Karpathy et al. (2015) worked with character-generating LSTMs and tried to study activation and firing in certain hidden units for meaningful attributes. Yosinski et al. (2015) study feature map visualizations.
* Attention-Based Models: Bahdanau et al. (2014) - These are a different class of models which use attention modules (in different architectures) to help the neural network decide which parts of the input it should look at more closely or give more importance to. Looking at the results of these types of model, an indirect sense of interpretability can be gauged.
* Sum-product networks, Hoifung Poon, Pedro Domingos (2011): A deep architecture that provides clear semantics. At its core, it is a probabilistic model with two types of nodes: sum nodes and product nodes. The sum nodes model mixtures of distributions and the product nodes model joint distributions. It can be trained using gradient descent and other methods as well. The main advantage of the sum-product network is that it has clear semantics, so people can interpret exactly how the network makes decisions. Therefore, it has better interpretability than most of the current deep architectures.

The approach in this paper is to extract non-additive interactions between variables from the neural network weights.

=Notations=
Before we dive into the methodology, we define a few notations here. Most of them are standard.

1. Vector: Vectors are denoted by bold lowercase letters, '''v, w'''

2. Matrix: Matrices are denoted by bold uppercase letters, '''V, W'''

3. Integer Set: For some integer p <math>\in</math> Z, we define [p] := {1,2,3,...,p}

=Interaction=
First of all, in order to explain the model, we need to be able to explain the interactions and their effects on the output. Therefore, we define an 'interaction' between variables as below.

[[File:def_interaction.PNG|900px|center]]

From the definition above, for a function like <math>x_1x_2 + \sin(x_3 + x_4 + x_5)</math>, we have the interactions <math>\{x_1, x_2\}</math> and <math>\{x_3, x_4, x_5\}</math>. We say that the latter is a 3-way interaction.

Note that from the definition above, we can naturally deduce that a d-way interaction can exist only if all of its (d-1)-way sub-interactions exist. For example, the 3-way interaction above implies that we have the 2-way interactions <math>\{3,4\}</math>, <math>\{4,5\}</math> and <math>\{3,5\}</math>.
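
The sub-interactions implied by a d-way interaction can be enumerated directly. The helper name <code>sub_interactions</code> below is hypothetical and only illustrates the statement above.

```python
from itertools import combinations


def sub_interactions(interaction):
    """List all (d-1)-way subsets of a d-way interaction, in sorted order."""
    d = len(interaction)
    return [set(c) for c in combinations(sorted(interaction), d - 1)]
```

For the 3-way interaction between features 3, 4 and 5, this returns the three 2-way interactions {3, 4}, {3, 5} and {4, 5}.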

One thing that we need to keep in mind is that for models like neural networks, most interactions happen within hidden layers. This means that we need a proper way of measuring interaction strength.

The key observation is that for any kind of interaction, there is some hidden unit in some hidden layer that has all of the interacting features as ancestors. In graph-theoretical language, the network can be viewed as a directed graph, and for any interaction <math>\Gamma \subseteq [p]</math>, there exists at least one vertex that has all of the features of <math>\Gamma</math> as ancestors. The statement can be made rigorous as follows:

[[File:prop2.PNG|900px|center]]

The statement above guarantees that we can measure interaction strengths at any hidden layer. For example, if we want to study interactions at some specific hidden layer, we now know that there exist corresponding vertices between that hidden layer and the output layer. Therefore, all we need to do is find an appropriate measure which can summarize the information between those two layers.

Before doing so, let's think about a single-layered neural network. For any one hidden unit, there can be up to <math>2^{\|W_{i,:}\|_0}</math> interactions, where <math>\|W_{i,:}\|_0</math> is the number of nonzero input weights of that unit. This means that the search space might be too large for multi-layered networks, so we need a decent way of approximating it. The authors achieve fast interaction detection by limiting the search to interactions created at the first hidden layer. The figure below illustrates an interaction within a fully connected feedforward neural network, where the box contains later layers in the network.

[[File:network1.PNG|500px|center]]

==Measuring influence in hidden layers==
As discussed above, in order to consider interactions between units in any layer, we need to think about their outgoing paths. However, for a fully-connected multi-layer neural network, the search space might be too large to enumerate. Therefore, we use a gradient upper bound to summarize the outgoing paths. To represent the influence of the outgoing paths at the <math>l</math>-th hidden layer, we define the cumulative impact of the weights between the output layer and layer <math>l+1</math>. We define the aggregated weights as,

[[File:def3.PNG|900px|center]]

Note that <math>z^{(l)} \in R^{p_l}</math>, where <math>p_l</math> is the number of hidden units in the <math>l</math>-th layer.
Moreover, this quantity is related to the Lipschitz constant of the gradients: it upper-bounds the gradient magnitude. The gradient has been an important quantity for measuring the influence of features, especially when we consider that the derivative with respect to the input layer gives the direction normal to decision boundaries.
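
A minimal NumPy sketch of the aggregated weights follows. Because the definition is only available as an image here, the sketch assumes the form <math>z^{(l)} = |w^y|^T |W^{(L)}| |W^{(L-1)}| \cdots |W^{(l+1)}|</math>, where absolute values are taken element-wise. The function name, argument layout, and the convention that each weight matrix has shape (units in layer, units in previous layer) are assumptions.

```python
import numpy as np


def aggregated_weights(hidden_weights, output_weights, layer):
    """Return z^(layer) for layer in 1..L.

    hidden_weights[k - 1] is W^(k) with shape (p_k, p_{k-1});
    output_weights is w^y with shape (p_L,).
    """
    z = np.abs(np.asarray(output_weights, dtype=float))
    num_layers = len(hidden_weights)
    # Multiply by |W^(L)|, |W^(L-1)|, ..., |W^(layer+1)| in turn.
    for k in range(num_layers, layer, -1):
        z = z @ np.abs(np.asarray(hidden_weights[k - 1], dtype=float))
    return z
```

For a network with 2 inputs, 3 units in the first hidden layer and 2 units in the second, <code>aggregated_weights([W1, W2], wy, 1)</code> returns one non-negative number per first-layer unit.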

==Quantifying influence==
For the <math>i</math>-th hidden unit at the first hidden layer, which is the closest layer to the input layer, we define the influence strength of some interaction as,

[[File:measure1.PNG|900px|center]]

The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer.

For the function <math>\mu</math>, any function with non-negative real values, such as max, min, or average, can be a candidate. The effects of those candidates will be tested later.
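
The strength of a single interaction at one hidden unit can be sketched as below, assuming the form <math>\omega_i(\mathcal{I}) = z_i^{(1)} \mu(|W^{(1)}_{i,\mathcal{I}}|)</math> described in the text. The function name and argument names are hypothetical, and <math>\mu</math> defaults to the minimum only as an example.

```python
import numpy as np


def interaction_strength(z1_i, w1_row, interaction, mu=np.min):
    """Strength of `interaction` (a set of input indices) at one hidden unit.

    z1_i is the aggregated weight of the unit and w1_row holds its
    first-layer weights, one per input feature.
    """
    idx = sorted(interaction)
    weights = np.abs(np.asarray(w1_row, dtype=float)[idx])
    return float(z1_i * mu(weights))
```

With an aggregated weight of 3 and input weights 3 and -4, the strength of the pairwise interaction is 9 when <math>\mu</math> is the minimum and 12 when it is the maximum.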

Based on the specifications above, the authors propose the following algorithm for searching for influential interactions between input layer units:

[[File:algorithm1.PNG|850px|center]]

It was pointed out that restricting the search to the first hidden layer might miss some important feature interactions; however, the authors state that it is not straightforward to incorporate hidden units at intermediate layers to get better interaction detection performance.
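
A compact sketch of a greedy ranking of this kind is given below. It is an interpretation rather than a transcription of the authors' algorithm: for each first-layer hidden unit, features are sorted by absolute weight, the top 2, 3, ..., p features are taken as candidate interactions, and each candidate's strength is accumulated over hidden units. The function name and the default choice of the minimum for <math>\mu</math> are assumptions.

```python
import numpy as np


def rank_interactions(W1, z1, mu=np.min):
    """Return (interaction, strength) pairs sorted by decreasing strength.

    W1 has shape (p_1, p): first-layer weights; z1 has shape (p_1,):
    aggregated weights of the first hidden layer.
    """
    W1 = np.abs(np.asarray(W1, dtype=float))
    strengths = {}
    for i in range(W1.shape[0]):
        order = np.argsort(-W1[i])  # features by decreasing |weight|
        for size in range(2, W1.shape[1] + 1):
            top = order[:size]
            key = tuple(sorted(int(j) for j in top))
            score = float(z1[i] * mu(W1[i, top]))
            strengths[key] = strengths.get(key, 0.0) + score
    return sorted(strengths.items(), key=lambda kv: kv[1], reverse=True)
```

Only p - 1 candidates are examined per hidden unit, which is what keeps the search from enumerating all subsets of the input features.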

=Cut-off Model=
Now using the greedy algorithm defined above, we can rank the interactions by their strength. However, in order to access true interactions, we are building the cut-off model which is a generalized additive model (GAM) as below,

<center><math>
c_K(\mathbf{x}) = \sum_{i=1}^{p}g_i(x_i) + \sum_{i=1}^{K}{g_i}^\prime(\mathbf{x}_{\chi_i})
</math></center>

In the above model, <math>\chi_i</math> is the <math>i</math>-th ranked interaction, and each <math>g_i</math> and <math>{g_i}^\prime</math> is a feed-forward neural network. We keep adding interactions until the performance reaches a plateau.

=Experiment=
For the experiment, the authors compared three neural network models with traditional statistical interaction detection algorithms. Among the neural network models, the first is MLP, and the second is MLP-M, which is an MLP with additional univariate networks at the output. The last one is the cut-off model defined above, denoted MLP-cutoff. In the authors' experiments, all the networks that modelled feature interactions consisted of four hidden layers containing 140, 100, 60, and 20 units respectively, whereas all the individual univariate networks contained three hidden layers of 10 units each. All of these networks used ReLU activations and were trained with backpropagation. The MLP-M model is graphically represented below.

[[File:output11.PNG|300px|center]]

The authors study their interaction detection framework in both simulated and real-world experiments. For the simulated experiments, they test on 10 synthetic functions, as shown in Table I.

[[File:synthetic.PNG|900px|center]]

The authors use four real-world datasets, of which two are regression datasets and the other two are binary classification datasets. The datasets are a mixture of common prediction tasks in the cal housing and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of very-high order interaction detection in the letter dataset.

The authors also reported the results of comparisons between the models. As the results below show, neural network based models perform better on average. Compared to traditional methods like ANOVA, the MLP and MLP-M methods show a 20% increase in performance.

[[File:performance_mlpm.PNG|900px|center]]

[[File:performance2_mlpm.PNG|900px|center]]

The above result shows that MLP-M almost perfectly captures the most influential pairwise interactions.

=Higher-order interaction detection=
The authors use their greedy interaction ranking algorithm to perform higher-order interaction detection without an exponential search of interaction candidates.
[[File:higher-order_interaction_detection.png|700px|center]]

=Limitations=
Even though the MLP methods showed superior performance in the above synthetic experiments, the method still has some limitations. For example, for a function like <math>x_1x_2 + x_2x_3 + x_1x_3</math>, the neural network fails to distinguish interlinked pairwise interactions from a single higher-order interaction. Moreover, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are present in most interaction detection algorithms.

Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets are too small or too noisy, which often occurs in practical settings.

=Conclusion=
Here we presented a method for detecting interactions using MLPs. Compared to other state-of-the-art methods like Additive Groves (AG), its performance is competitive, yet the computational power required is far less. Therefore, the method should be extremely useful for practitioners with comparatively little computational power. Moreover, the NID algorithm successfully reduced the size of the computation. Above all, the most important aspect of this algorithm is that users of neural networks can now bring interpretability to their models, which raises usability to another level for most practitioners outside machine learning and deep learning.

For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks. Also, it was pointed out that the neural network weights heavily depend on L1-regularized neural network training, but a group lasso penalty may work better.

=Critique=
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.

2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.

3. A greedy algorithm is implemented, but nothing is mentioned about the speed of this algorithm, which is definitely not fast. So, this has the potential to be a weak point of the study.

=Reference=

[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013.

[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.

[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016.

[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018.

[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.

[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006.

[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.

[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.

[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.

[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.

[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

DETECTING STATISTICAL INTERACTIONS FROM NEURAL NETWORK WEIGHTS

2018-11-30T20:30:09Z

S366chen: /* Related Work */

=Introduction=

It has been commonly believed that one major advantage of neural networks is their capability of modelling complex statistical interactions between features for automatic feature learning. Statistical interactions capture important information on where features often have joint effects with other features on predicting an outcome. The discovery of interactions is especially useful for scientific discoveries and hypothesis validation. For example, physicists may be interested in understanding what joint factors provide evidence for new elementary particles; doctors may want to know what interactions are accounted for in risk prediction models, to compare against known interactions from existing medical literature.

With the growth in available computational power, neural networks have been able to solve many complex tasks in a wide variety of fields. This is mainly due to their ability to model complex and non-linear interactions. Neural networks have traditionally been treated as “black box” models, preventing their adoption in many application domains, such as those where explainability is desirable. It has been noted that complex machine learning models can learn unintended patterns from data, raising significant risks to stakeholders [14]. Therefore, in applications where machine learning models are intended for making critical decisions, such as healthcare or finance, it is paramount to understand how they make predictions [9]. In several areas, such as computational social science, interpretability is of utmost importance. Since we do not understand how a neural network comes to its decision, practitioners in these areas tend to prefer simpler models like linear regression and decision trees, which are much more interpretable. In this paper, we are going to present one way of implementing interpretability in a neural network.

Existing approaches to interpreting neural networks can be summarized into two types. One type is direct interpretation, which focuses on 1) explaining individual feature importance, for example by computing input gradients [13] and decomposing predictions [8], 2) developing attention-based models, which illustrate where neural networks focus during inference [11], and 3) providing model-specific visualizations, such as feature map and gate activation visualizations [15]. The other type is indirect interpretation, for example post-hoc interpretations of feature importance [12] and knowledge distillation to simpler interpretable models [10].

In this paper, the authors propose Neural Interaction Detection (NID), which can detect any order or form of statistical interaction captured by the feedforward neural network by examining its weight matrix. This approach is efficient because it avoids searching over an exponential solution space of interaction candidates by making an approximation of hidden unit importance at the first hidden layer via all weights above and doing a 2D traversal of the input weight matrix.

Note that this paper considers only one specific type of neural network, the feedforward neural network. Based on the methodology discussed here, the authors suggest that interpretation methods can be built for other types of networks as well.

=Related Work=

1. Interaction Detection approaches:
* Conduct individual tests for all combinations of features, as in ANOVA and Additive Groves. Two-way ANOVA has been a standard method of performing pairwise interaction detection that involves conducting hypothesis tests for each interaction candidate by checking each hypothesis with F-statistics (Wonnacott & Wonnacott, 1972). Additive Groves is another method that conducts individual tests for interactions and hence must face the same computational difficulties; however, it is special because the interactions it detects are not constrained to any functional form.
* Define all interaction forms of interest in advance, then find the important ones.

The paper's goal is to detect interactions without compromising their functional forms. The proposed method accomplishes higher-order interaction detection, which has the benefit of avoiding a high false positive or false discovery rate.

2. Interpretability: A lot of work has also been done in this area, and it can be divided into the following broad categories:
* Feature Importance through Decomposition: Methods like Input Gradient (Sundararajan et al., 2017) learn the importance of features through a gradient-based approach similar to backpropagation. Works like Li et al. (2017), Murdoch (2017) and Murdoch (2018) study the interpretability of LSTMs by looking at phrase- and word-level importance scores. Bach et al. (2015) and Shrikumar et al. (2016) (DeepLift) study pixel importance in CNNs.
* Studying Visualizations in Models: Karpathy et al. (2015) worked with character-generating LSTMs and studied activation and firing in certain hidden units for meaningful attributes. Yosinski et al. (2015) study feature map visualizations.
* Attention-Based Models (Bahdanau et al., 2014): These are a different class of models, which use attention modules (with different architectures) to help the neural network decide which parts of the input it should look at more closely or give more importance to. Looking at the results of these types of models, an indirect sense of interpretability can be gauged.
* Sum-Product Networks (Hoifung Poon and Pedro Domingos, 2011): This is a deep architecture that provides clear semantics. At its core, it is a probabilistic model with two types of nodes: sum nodes and product nodes. The sum nodes model mixtures of distributions, and the product nodes model joint distributions. It can be trained using gradient descent and other methods as well. The main advantage of the Sum-Product Network is that it has clear semantics, so people can interpret exactly how the network makes decisions. Therefore, it has better interpretability than most current deep architectures.

The approach in this paper is to extract non-additive interactions between variables from the neural network weights.

=Notations=
Before we dive into the methodology, we define a few notations. Most of them are standard.

1. Vector: Vectors are denoted by bold lowercase letters, '''v, w'''

2. Matrix: Matrices are denoted by bold uppercase letters, '''V, W'''

3. Integer Set: For some integer p <math>\in</math> Z, we define [p] := {1,2,3,...,p}

=Interaction=
First of all, in order to explain the model, we need to be able to explain the interactions and their effects on the output. Therefore, we define an 'interaction' between variables as below.

[[File:def_interaction.PNG|900px|center]]

From the definition above, for a function like <math>x_1x_2 + \sin(x_3 + x_4 + x_5)</math>, we have the interactions <math>{[x_1, x_2]}</math> and <math>{[x_3, x_4, x_5]}</math>. We say that the latter is a 3-way interaction.

Note that from the definition above, we can naturally deduce that a d-way interaction can exist only if all of its (d-1)-way interactions exist. For example, the 3-way interaction above implies the 2-way interactions <math>{[3,4], [4,5]}</math> and <math>{[3,5]}</math>.
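As a small illustration of this subset property, the sketch below enumerates the (d-1)-way sub-interactions of a d-way interaction. The function name <code>sub_interactions</code> is hypothetical and is not taken from the paper.

```python
from itertools import combinations


def sub_interactions(interaction):
    """Return all (d-1)-way subsets of a d-way interaction as sorted tuples."""
    features = sorted(interaction)
    return list(combinations(features, len(features) - 1))


# The 3-way interaction among x_3, x_4, x_5 from sin(x_3 + x_4 + x_5)
print(sub_interactions({3, 4, 5}))  # prints [(3, 4), (3, 5), (4, 5)]
```

If any of the listed pairs were absent, the 3-way interaction could not exist either.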

One thing to keep in mind is that for models like neural networks, most interactions happen within the hidden layers. This means that we need a proper way of measuring interaction strength.

The key observation is that for any kind of interaction, there is some hidden unit of some hidden layer that has all of the interacting features as ancestors. In graph-theoretical language, the interaction map can be viewed as an associated directed graph, and for any interaction <math>\Gamma \subseteq [p]</math>, there exists at least one vertex that has all of the features of <math>\Gamma</math> as ancestors. The statement can be made rigorous as follows:

[[File:prop2.PNG|900px|center]]

The above mathematical statement guarantees that we can measure interaction strengths at any hidden layer. For example, if we want to study interactions at some specific hidden layer, we now know that there exist corresponding vertices between that hidden layer and the output layer. Therefore, all we need to do is find an appropriate measure that can summarize the information between those two layers.

Before doing so, let's think about a single-layer neural network. For any one hidden unit, we can have up to <math>2^{\|W_{i,:}\|_0}</math> interactions, where <math>\|W_{i,:}\|_0</math> is the number of nonzero input weights of hidden unit <math>i</math>. This means that the search space might be too large for multi-layer networks. Therefore, we need a decent way of approximating our search space. To this end, the authors achieve fast interaction detection by limiting the search to interactions created at the first hidden layer. The figure below illustrates an interaction within a fully connected feedforward neural network, where the box contains later layers in the network.

[[File:network1.PNG|500px|center]]

==Measuring influence in hidden layers==
As discussed above, to consider interactions between units in any layer, we need to look at their outgoing paths. However, for a fully-connected multi-layer neural network, the search space of paths can be too large to enumerate. Therefore, we summarize the information about outgoing paths with a gradient upper bound. To represent the influence of outgoing paths at the <math>l</math>-th hidden layer, we define the cumulative impact of the weights between the output layer and layer <math>l+1</math>. We define the aggregated weights as,

[[File:def3.PNG|900px|center]]

Note that <math>z^{(l)} \in R^{(p_l)}</math>, where <math>p_l</math> is the number of hidden units in the <math>l</math>-th layer.
Moreover, the aggregated weight acts as a Lipschitz constant that upper-bounds the gradient magnitudes. The gradient has been an important tool for measuring the influence of features, especially when we consider that the derivative with respect to the input layer gives the direction normal to the decision boundary.
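The aggregated weights can be sketched in a few lines of plain Python. This is a minimal sketch under an assumption: the definition itself is given only in the figure above, so the code assumes that the aggregated weight is the chained product of entrywise absolute weight matrices, starting from the output weights and going down to layer <math>l+1</math>. The names <code>abs_matmul</code> and <code>aggregated_weights</code> are hypothetical.

```python
def abs_matmul(a, b):
    """Multiply two matrices (lists of rows) after taking entrywise absolute values."""
    return [[sum(abs(a[i][k]) * abs(b[k][j]) for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def aggregated_weights(w_out, upper_weights):
    """Aggregated weights for the layer just below `upper_weights` (assumed form).

    w_out: output weight vector of length p_L.
    upper_weights: [W^(l+1), ..., W^(L)], where W^(k) has p_k rows and p_(k-1) columns.
    Returns a list with one aggregated weight per hidden unit of layer l.
    """
    z = [list(w_out)]                      # 1 x p_L row vector
    for W in reversed(upper_weights):      # apply W^(L) first, then down to W^(l+1)
        z = abs_matmul(z, W)
    return [abs(v) for v in z[0]]


# Two units in the last hidden layer, three units in the layer below it.
w_out = [1, -2]
W_last = [[1, -1, 0],
          [0.5, 0, 2]]
print(aggregated_weights(w_out, [W_last]))
```

Signs are discarded on purpose in this sketch: the quantity is meant to bound how strongly a hidden unit can affect the output, not the direction of the effect.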

==Quantifying influence==
For hidden unit <math>i</math> at the first hidden layer, which is the closest layer to the input layer, we define the influence strength of an interaction as,

[[File:measure1.PNG|900px|center]]

The function <math>\mu</math> will be defined later. Essentially, the formula shows that the strength of influence is defined as the product of the aggregated weight on the first hidden layer and some measure of influence between the first hidden layer and the input layer.

For the function <math>\mu</math>, any positive real-valued function such as max, min, or average is a candidate. The effects of these candidates will be tested later.
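To make the role of <math>\mu</math> concrete, here is a hedged sketch of the strength measure as described in the text: the aggregated weight of a first-layer hidden unit multiplied by <math>\mu</math> applied to the absolute input weights of the features in the interaction. The function names and the 0-based feature indexing are assumptions made for illustration.

```python
def average(values):
    """Arithmetic mean, one of the candidate choices for mu."""
    return sum(values) / len(values)


def interaction_strength(z1_i, w1_row, interaction, mu=min):
    """Strength of `interaction` at one first-layer hidden unit.

    z1_i: aggregated weight of the hidden unit.
    w1_row: the unit's row of the first-layer weight matrix.
    interaction: iterable of 0-based input feature indices.
    mu: aggregation over absolute input weights (min, max, average, ...).
    """
    magnitudes = [abs(w1_row[j]) for j in interaction]
    return z1_i * mu(magnitudes)


row = [0.5, -3.0, 0.1]
print(interaction_strength(2.0, row, (0, 1), mu=min))      # 1.0
print(interaction_strength(2.0, row, (0, 1), mu=max))      # 6.0
print(interaction_strength(2.0, row, (0, 1), mu=average))  # 3.5
```

With min, a single weak input weight is enough to make the whole interaction weak at that unit, whereas max lets one large weight dominate.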

Based on the specifications above, the authors suggest the following algorithm for searching for influential interactions between input layer units:

[[File:algorithm1.PNG|850px|center]]

It was pointed out that restricting the search to the first hidden layer might miss some important feature interactions. However, the authors state that it is not straightforward to incorporate hidden units at intermediate layers in order to get better interaction detection performance.
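The greedy ranking described in this section can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: for each first-layer hidden unit, the input features are sorted by absolute weight, only the top-2, top-3, ... feature sets are scored (which avoids the exponential search), and the scores of identical sets are summed across hidden units. The name <code>rank_interactions</code> and the 0-based indexing are assumptions.

```python
from collections import defaultdict


def rank_interactions(W1, z1, mu=min):
    """Rank candidate interactions from first-layer weights.

    W1: first-layer weight matrix, one row per hidden unit.
    z1: aggregated weights, one per first-layer hidden unit.
    Returns (interaction, strength) pairs sorted by decreasing strength,
    where each interaction is a sorted tuple of 0-based feature indices.
    """
    scores = defaultdict(float)
    for row, z in zip(W1, z1):
        # Features of this hidden unit, ordered by decreasing absolute weight.
        order = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        for size in range(2, len(row) + 1):
            candidate = tuple(sorted(order[:size]))
            scores[candidate] += z * mu([abs(row[j]) for j in candidate])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


W1 = [[3.0, 2.0, 0.1],
      [0.2, 1.0, 4.0]]
z1 = [1.0, 0.5]
for interaction, strength in rank_interactions(W1, z1):
    print(interaction, strength)
```

Each hidden unit contributes at most p-1 candidates for p input features, so the number of scored sets grows linearly with the number of hidden units rather than exponentially with p.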

=Cut-off Model=
Using the greedy algorithm defined above, we can rank the interactions by their strength. However, to assess which of the ranked interactions are true interactions, we build a cut-off model, which is a generalized additive model (GAM), as below,

<center><math>
c_K(\mathbf{x}) = \sum_{i=1}^{p}g_i(x_i) + \sum_{i=1}^{K}{g_i}^\prime(\mathbf{x}_{\chi_i})
</math></center>

In the above model, <math>\chi_i</math> is the <math>i</math>-th ranked interaction, and each <math>g_i</math> and <math>{g_i}^\prime</math> is a feed-forward neural network. We keep adding interactions until the performance reaches a plateau.
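The structure of the cut-off model can be illustrated with a small sketch of its forward pass. Plain Python callables stand in for the small feed-forward networks, and the name <code>cutoff_model</code> and its argument layout are assumptions made for illustration; training is not shown.

```python
def cutoff_model(x, univariate_fns, interaction_fns, interactions, K):
    """Evaluate the GAM with the top-K ranked interactions.

    x: list of input feature values.
    univariate_fns: one callable per feature (stand-ins for the univariate networks).
    interaction_fns: one callable per ranked interaction, taking a list of values.
    interactions: ranked list of tuples of 0-based feature indices.
    K: number of top-ranked interactions to include.
    """
    main_effects = sum(g(x_i) for g, x_i in zip(univariate_fns, x))
    top_k = list(zip(interaction_fns, interactions))[:K]
    interaction_effects = sum(g([x[j] for j in idx]) for g, idx in top_k)
    return main_effects + interaction_effects


x = [1.0, 2.0, 3.0]
identity = lambda v: v
product = lambda vs: vs[0] * vs[1]
ranked = [(0, 1), (1, 2)]
for K in range(3):
    print(K, cutoff_model(x, [identity] * 3, [product] * 2, ranked, K))
```

In practice one would refit the model for increasing K and stop once the validation performance no longer improves, which is the plateau mentioned above.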

=Experiment=
For the experiment, the authors compared three neural network models with traditional statistical interaction detection algorithms. Among the neural network models, the first is MLP, and the second is MLP-M, which is an MLP with additional univariate networks at the output. The last one is the cut-off model defined above, denoted MLP-cutoff. In the authors' experiments, all the networks that modelled feature interactions consisted of four hidden layers containing 140, 100, 60, and 20 units respectively, whereas all the individual univariate networks contained three hidden layers of 10 units each. All of these networks used ReLU activations and were trained with backpropagation. The MLP-M model is graphically represented below.

[[File:output11.PNG|300px|center]]

The authors study their interaction detection framework in both simulated and real-world experiments. For the simulated experiments, they test on 10 synthetic functions, as shown in Table I.

[[File:synthetic.PNG|900px|center]]

The authors use four real-world datasets, of which two are regression datasets and the other two are binary classification datasets. The datasets are a mixture of common prediction tasks in the cal housing and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of very-high order interaction detection in the letter dataset.

The authors also reported the results of comparisons between the models. As the results below show, neural network based models perform better on average. Compared to traditional methods like ANOVA, the MLP and MLP-M methods show a 20% increase in performance.

[[File:performance_mlpm.PNG|900px|center]]

[[File:performance2_mlpm.PNG|900px|center]]

The above result shows that MLP-M almost perfectly captures the most influential pairwise interactions.

=Higher-order interaction detection=
The authors use their greedy interaction ranking algorithm to perform higher-order interaction detection without an exponential search of interaction candidates.
[[File:higher-order_interaction_detection.png|700px|center]]

=Limitations=
Even though the MLP methods showed superior performance in the above synthetic experiments, the method still has some limitations. For example, for a function like <math>x_1x_2 + x_2x_3 + x_1x_3</math>, the neural network fails to distinguish interlinked pairwise interactions from a single higher-order interaction. Moreover, correlation between features deteriorates the ability of the network to distinguish interactions. However, correlation issues are present in most interaction detection algorithms.

Because this method relies on the neural network fitting the data well, there are some additional concerns. Notably, if the NN is unable to make an appropriate fit (under/overfitting), the resulting interactions will be flawed. This can occur if the datasets are too small or too noisy, which often occurs in practical settings.

=Conclusion=
Here we presented a method for detecting interactions using MLPs. Compared to other state-of-the-art methods like Additive Groves (AG), its performance is competitive, yet the computational power required is far less. Therefore, the method should be extremely useful for practitioners with comparatively little computational power. Moreover, the NID algorithm successfully reduced the size of the computation. Above all, the most important aspect of this algorithm is that users of neural networks can now bring interpretability to their models, which raises usability to another level for most practitioners outside machine learning and deep learning.

For future work, the authors want to detect feature interactions by using the common units in the intermediate hidden layers of feedforward networks, and also want to use such interaction detection to interpret weights in other deep neural networks. Also, it was pointed out that the neural network weights heavily depend on L1-regularized neural network training, but a group lasso penalty may work better.

=Critique=
1. Authors need to do large-scale experiments, instead of just conducting experiments on some synthetic dataset with small feature dimensionality, to make their claim stronger.

2. Although the method proposed in this paper is interesting, the paper would benefit from providing some more explanations to support its idea and fill the possible gaps in its experimental evaluation. In some parts there are repetitive explanations that could be replaced by other essential clarifications.

3. A greedy algorithm is implemented, but nothing is mentioned about the speed of this algorithm, which is definitely not fast. So, this has the potential to be a weak point of the study.

=Reference=

[1] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of statistics, 41(3):1111, 2013.

[2] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.

[3] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[4] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016.

[5] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. International Conference on Learning Representations, 2018.

[6] Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine Learning: ECML 2007, pp. 323–334, 2007.

[7] Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006.

[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.

[10] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371. American Medical Informatics Association, 2016.

[11] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.

[12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[14] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.

[15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

Fix your classifier: the marginal value of training the last weight layer

2018-11-30T19:30:27Z

S366chen: /* Introduction */

The code for the proposed model is available at https://github.com/eladhoffer/fix_your_classifier.

=Introduction=

Deep neural networks have become widely used for machine learning, achieving state-of-the-art results on many tasks. One of the most common tasks they are used for is classification. For example, convolutional neural networks (CNNs) are used to classify images to a semantic category. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more computational resources. Consequently, an extensive amount of research has been done lately to reduce the size of networks. Han et al. (2015) used weight sharing and specification, and Micikevicius et al. (2017) used mixed precision to reduce the size of the neural networks by half. Tai et al. (2015) and Jaderberg et al. (2014) used low-rank approximations to speed up NNs. Hubara et al. (2016b), Li et al. (2016) and Zhou et al. (2016) used a more aggressive approach, in which weights, activations and gradients were quantized to further reduce computation during training. Although aggressive quantization benefits from a smaller model size, the extreme compression rate comes with a loss of accuracy.
Past work noted that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on several tasks. This study suggests the reverse proposal: that common NN models can learn useful representations even without modifying the final output layer, which often holds a large number of parameters that grows linearly with the number of classes.

=Brief Overview=

In order to alleviate the aforementioned problem, the authors propose that the final layer of the classifier be fixed (up to a global scale constant). They argue that with little or no loss of accuracy for most classification tasks, the method provides significant memory and computational benefits. In addition, they show that by initializing the classifier with a Hadamard matrix the inference could be made faster as well.
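As a rough illustration of a fixed Hadamard classifier, the sketch below builds a Hadamard matrix with Sylvester's construction and uses its first rows as a non-trained final layer with a single global scale. The function names, the choice of Sylvester's construction, and the requirement that the feature dimension be a power of two are assumptions of this sketch, not details confirmed by the summary.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = [[1]]
    while len(H) < n:
        # [[H, H], [H, -H]] doubles the size and keeps all rows orthogonal.
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H


def fixed_logits(features, num_classes, scale=1.0):
    """Per-class scores from a fixed classifier: the first `num_classes` rows
    of a Hadamard matrix applied to `features`, times one global scale."""
    H = hadamard(len(features))
    return [scale * sum(h * f for h, f in zip(row, features))
            for row in H[:num_classes]]


print(hadamard(4))
print(fixed_logits([1.0, 2.0, 3.0, 4.0], num_classes=2))  # [10.0, -2.0]
```

Because every entry is +1 or -1, applying the matrix needs only additions and subtractions, and the matrix can be generated on the fly instead of being stored as trained parameters.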

=Previous Work=

Training NN models and using them for inference requires large amounts of memory and computational resources; thus, an extensive amount of research has been done lately to reduce the size of networks, including the following:

* Weight sharing and specification (Han et al., 2015)

* Mixed precision to reduce the size of the neural networks by half (Micikevicius et al., 2017)

* Low-rank approximations to speed up CNN (Tai et al., 2015)

* Quantization of weights, activations, and gradients to further reduce computation during training (Hubara et al., 2016b; Li et al., 2016 and Zhou et al., 2016)

Some past works have also put forward the fact that predefined (Park & Sandberg, 1991) and random (Huang et al., 2006) projections can be used together with a learned affine transformation to achieve competitive results on many classification tasks. However, the authors' proposal in the current paper is the reverse.

=Background=

A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers (often with a subsampling step) and then followed by one or more fully connected layers as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling which results in translation invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.

A CNN consists of a number of convolutional and subsampling layers optionally followed by fully connected layers. The input to a convolutional layer is an <math>m \times m \times r</math> image, where <math>m</math> is the height and width of the image and <math>r</math> is the number of channels, e.g. an RGB image has <math>r=3</math>. The convolutional layer will have <math>k</math> filters (or kernels) of size <math>n \times n \times q</math>, where <math>n</math> is smaller than the dimension of the image and <math>q</math> can either be the same as the number of channels <math>r</math> or smaller and may vary for each kernel. The size of the filters gives rise to the locally connected structure; each filter is convolved with the image to produce <math>k</math> feature maps of size <math>m-n+1</math>. Each map is then subsampled, typically with mean or max pooling over <math>p \times p</math> contiguous regions, where <math>p</math> ranges from 2 for small images (e.g. MNIST) to usually not more than 5 for larger inputs. Either before or after the subsampling layer, an additive bias and sigmoidal nonlinearity are applied to each feature map.
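The size arithmetic in this paragraph can be checked with a short sketch. It assumes a stride-1 convolution without padding and non-overlapping pooling that drops any remainder; the function names are hypothetical.

```python
def conv_output_size(m, n):
    """Side length of a feature map when an n x n filter is convolved over an
    m x m input with stride 1 and no padding."""
    return m - n + 1


def pooled_size(size, p):
    """Side length after non-overlapping p x p pooling (remainder dropped)."""
    return size // p


# A 28 x 28 input (the MNIST image size) with 5 x 5 filters and 2 x 2 pooling.
feature_map = conv_output_size(28, 5)
print(feature_map)                  # 24
print(pooled_size(feature_map, 2))  # 12
```

With <math>k</math> filters, the layer in this example would produce <math>k</math> feature maps of size 24 x 24, each reduced to 12 x 12 by the pooling step.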

CNNs are commonly used to solve a variety of spatial and temporal tasks. Earlier architectures of CNNs (LeCun et al., 1998; Krizhevsky et al., 2012) used a set of fully-connected layers at later stages of the network, presumably to allow classification based on global features of an image.

== Shortcomings of the Final Classification Layer and its Solution ==

Zeiler & Fergus (2014) showed that, despite the enormous number of trainable parameters these layers add to the model, they have a rather marginal impact on the final performance of the network.

It has been shown previously that these layers can be easily compressed and reduced after a model is trained by simple means such as matrix decomposition and sparsification (Han et al., 2015). Modern architecture choices are characterized by the removal of most of the fully connected layers (Lin et al., 2013; Szegedy et al., 2015; He et al., 2016), which leads to better generalization and overall accuracy, together with a huge decrease in the number of trainable parameters. Additionally, numerous works showed that CNNs can be trained in a metric learning regime (Bromley et al., 1994; Schroff et al., 2015; Hoffer & Ailon, 2015), where no explicit classification layer is introduced and the objective regards only distance measures between intermediate representations. Hardt & Ma (2017) suggested an all-convolutional network variant, where they kept the original initialization of the classification layer fixed with no negative impact on performance on the CIFAR-10 dataset.

=Proposed Method=

The aforementioned works provide evidence that fully-connected layers are in fact redundant and play a small role in learning and generalization. In this work, the authors suggest that the parameters used for the final classification transform are completely redundant and can be replaced with a predetermined linear transform. This holds even for large-scale models and classification tasks, such as recent architectures trained on the ImageNet benchmark (Deng et al., 2009).

==Using a Fixed Classifier==

Suppose the final representation obtained by the network (the last hidden layer) is represented as <math>x = F(z;\theta)</math>, where <math>F</math> is assumed to be a deep neural network with input <math>z</math> and parameters <math>\theta</math>, e.g., a convolutional network, trained by backpropagation.

In common NN models, this representation is followed by an additional affine transformation, <math>y = W^T x + b</math>, where <math>W</math> and <math>b</math> are also trained by back-propagation.

For an input <math>x</math> of length <math>N</math> and <math>C</math> different possible outputs, <math>W</math> is required to be a matrix of size <math>N \times C</math>. Training is done using cross-entropy loss, by feeding the network outputs through a softmax activation

<math>
v_i = \frac{e^{y_i}}{\sum_{j=1}^{C}{e^{y_j}}}, \quad i \in \{1, \dots, C\}
</math>

and reducing the expected negative log likelihood with respect to the ground-truth target <math> t \in \{1, \dots, C\} </math>,
by minimizing the loss function:

<math>
L(x, t) = -\log v_t = -w_t \cdot x - b_t + \log \left( \sum_{j=1}^{C} e^{w_j \cdot x + b_j} \right)
</math>

where <math>w_i</math> is the <math>i</math>-th column of <math>W</math>.
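The identity between the negative log of the softmax output and the expanded loss can be checked numerically. The sketch below is only an illustration with NumPy: the sizes, the random inputs, and the function name `softmax_cross_entropy` are assumptions made for the example.

```python
import numpy as np


def softmax_cross_entropy(x, W, b, t):
    """Loss -w_t.x - b_t + log(sum_j exp(w_j.x + b_j)) for y = W^T x + b."""
    y = W.T @ x + b                       # logits, shape (C,)
    y_max = y.max()                       # shift by the maximum for numerical stability
    log_sum = y_max + np.log(np.exp(y - y_max).sum())
    return -y[t] + log_sum


# Hypothetical sizes and random values, used only to exercise the formula.
rng = np.random.default_rng(0)
N, C = 8, 3
x = rng.normal(size=N)
W = rng.normal(size=(N, C))               # column i of W is w_i
b = rng.normal(size=C)
t = 1

loss = softmax_cross_entropy(x, W, b, t)

# Direct evaluation of -log(v_t) from the softmax definition.
v = np.exp(W.T @ x + b)
v = v / v.sum()
print(np.isclose(loss, -np.log(v[t])))
```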

==Choosing the Projection Matrix==

To evaluate the conjecture regarding the importance of the final classification transformation, the trainable parameter matrix <math>W</math> is replaced with a fixed orthonormal projection <math> Q \in \mathbb{R}^{N \times C} </math>, such that <math> \forall i \neq j : q_i \cdot q_j = 0 </math> and <math> \| q_i \|_{2} = 1 </math>, where <math>q_i</math> is the <math>i</math>th column of <math>Q</math>. This is ensured by simple random sampling followed by a singular-value decomposition.
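One way to obtain such a matrix is sketched below with NumPy. The text only states that random sampling and a singular-value decomposition are used, so the Gaussian sampling, the seed, and the function name `fixed_orthonormal_projection` are assumptions of this example rather than the authors' exact procedure; it also assumes that the number of classes does not exceed the feature dimension.

```python
import numpy as np


def fixed_orthonormal_projection(N, C, seed=0):
    """Return an N x C matrix whose columns are orthonormal (requires C <= N)."""
    if C > N:
        raise ValueError("orthonormal columns require C <= N")
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(N, C))                        # random sample (assumed Gaussian)
    U, _, _ = np.linalg.svd(A, full_matrices=False)    # U has shape (N, C)
    return U


Q = fixed_orthonormal_projection(64, 10)
# Columns are orthonormal, so Q^T Q is the C x C identity.
print(np.allclose(Q.T @ Q, np.eye(10)))
```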

As the columns of the classifier weight matrix are fixed with an equally valued <math>L_{2}</math> norm, the authors find it beneficial
to also restrict the representation <math>x</math> by normalizing it to reside on the <math>N</math>-dimensional unit sphere:

<center><math>
\hat{x} = \frac{x}{||x||_{2}}
</math></center>

This allows faster training and convergence, as the network does not need to account for changes in the scale of its weights. However, <math>q_i \cdot \hat{x} </math> is now bounded between −1 and 1. This causes convergence issues, as the softmax function is scale sensitive and the network is affected by its inability to re-scale its input. This issue can be amended with a fixed scale <math>T</math> applied to the softmax inputs, <math>f(y) = \text{softmax}(\frac{1}{T}y)</math>, also known as the ''softmax temperature''. However, this introduces an additional hyper-parameter which may differ between networks and datasets. So, the authors propose to introduce a single learned scalar parameter <math>\alpha</math> for the softmax scale, effectively functioning as the inverse of the softmax temperature, <math>\frac{1}{T}</math>. Normalized weights and an additional scale coefficient are thus used, specifically with a single scale for all entries in the weight matrix. The additional vector of bias parameters <math>b \in \mathbb{R}^{C}</math> is kept as before and the model is trained using the traditional negative-log-likelihood criterion. Explicitly, the classifier output is now:

<center>
<math>
v_i=\frac{e^{\alpha q_i \cdot \hat{x} + b_i}}{\sum_{j=1}^{C} e^{\alpha q_j \cdot \hat{x} + b_j}}, \quad i \in \{1, \dots, C\}
</math>
</center>

and the loss to be minimized is:

<center><math>
L(x, t) = -\alpha q_t \cdot \frac{x}{\|x\|_{2}} - b_t + \log \left( \sum_{i=1}^{C} \exp \left( \alpha q_i \cdot \frac{x}{\|x\|_{2}} + b_i \right) \right)
</math></center>

where <math>x</math> is the final representation obtained by the network for a specific sample, and <math> t \in \{1, \dots, C\} </math> is the ground-truth label for that sample. The behaviour of the parameter <math> \alpha </math> over time, which is logarithmic in nature and matches the behaviour exhibited by the norm of a learned classifier, is shown in
[[Media: figure1_log_behave.png| Figure 1]].

<center>[[File:figure1_log_behave.png]]</center>

Since <math> -1 \le q_i \cdot \hat{x} \le 1 </math>, a possible alternative is a cosine angle loss:

<center>[[File:caloss.png]]</center>

However, its final validation accuracy is slightly lower than that of the original models.
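The scaled classifier output defined earlier in this subsection, and the reason a scale is needed at all, can be illustrated with a short NumPy sketch. Everything named here is hypothetical: `fixed_classifier_loss` evaluates the negative log of the classifier output for one sample, and `softmax_probability_upper_bound` gives a simple upper bound on any class probability when all logits are confined to the interval from <math>-\alpha</math> to <math>\alpha</math> with zero bias. The QR factorization is used only as a convenient stand-in for a fixed orthonormal matrix.

```python
import numpy as np


def fixed_classifier_loss(x, Q, b, alpha, t):
    """Negative log-likelihood of class t with fixed Q (N x C), bias b and scale alpha."""
    x_hat = x / np.linalg.norm(x)              # normalize the representation
    y = alpha * (Q.T @ x_hat) + b              # scaled scores plus bias, shape (C,)
    y_max = y.max()
    log_sum = y_max + np.log(np.exp(y - y_max).sum())
    return -y[t] + log_sum


def softmax_probability_upper_bound(C, alpha):
    """Upper bound on a class probability when all C logits lie in [-alpha, alpha]."""
    return np.exp(alpha) / (np.exp(alpha) + (C - 1) * np.exp(-alpha))


rng = np.random.default_rng(0)
N, C = 16, 4
Q, _ = np.linalg.qr(rng.normal(size=(N, C)))   # stand-in matrix with orthonormal columns
b = np.zeros(C)
x = rng.normal(size=N)

loss_unit_scale = fixed_classifier_loss(x, Q, b, alpha=1.0, t=2)
bound_alpha_1 = softmax_probability_upper_bound(C=1000, alpha=1.0)
bound_alpha_10 = softmax_probability_upper_bound(C=1000, alpha=10.0)
print(loss_unit_scale, bound_alpha_1, bound_alpha_10)
```

With 1000 classes and a scale of 1, the bound stays below one percent, so the loss cannot approach zero; a larger scale removes this restriction.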

==Using a Hadamard Matrix==

To recall, a Hadamard matrix (Hedayat et al., 1978) <math> H </math> is an <math> n \times n </math> matrix whose entries are all either +1 or −1.
Furthermore, <math> H </math> is orthogonal, such that <math> HH^{T} = nI_n </math>, where <math>I_n</math> is the identity matrix. Instead of using the entire Hadamard matrix <math>H</math>, a truncated version <math> \hat{H} \in \{-1, 1\}^{C \times N}</math>, in which all <math>C</math> rows are orthogonal, is used as the final classification layer:

<center><math>
y = \hat{H} \hat{x} + b
</math></center>

This usage allows two main benefits:
* A deterministic, low-memory and easily generated matrix that can be used for classification.
* Removal of the need to perform a full matrix-matrix multiplication - as multiplying by a Hadamard matrix can be done by simple sign manipulation and addition.

Here, <math>n</math> must be a multiple of 4, but it can be easily truncated to fit normally defined networks. Also, as the classifier weights are fixed to need only 1-bit precision, it is now possible to focus our attention on the features preceding it.
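A truncated Hadamard classifier can be sketched as follows. This example uses Sylvester's construction, which only covers sizes that are powers of 2, and it assumes that the feature dimension is such a power and that the number of classes does not exceed it. The function names are hypothetical, and this is not presented as the authors' implementation.

```python
import numpy as np


def sylvester_hadamard(n):
    """n x n Hadamard matrix via Sylvester's construction (n must be a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])        # doubling step keeps rows orthogonal
    return H


def truncated_hadamard(C, N):
    """First C rows of the N x N Hadamard matrix, shape (C, N), entries +1 or -1."""
    if C > N:
        raise ValueError("this sketch requires C <= N")
    return sylvester_hadamard(N)[:C]


H_hat = truncated_hadamard(10, 64)
# Rows stay mutually orthogonal, each with squared norm N.
print(np.array_equal(H_hat @ H_hat.T, 64 * np.eye(10, dtype=int)))
```

Because every entry is +1 or −1, the product with a feature vector reduces to additions and subtractions of its components.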

=Experimental Results=

The authors have evaluated their proposed model on the following datasets:

==CIFAR-10/100==

===About the Dataset===

CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are in color and contain 32×32 pixels. There are 10 possible classes of various animals and vehicles. CIFAR-100 holds the same number of images of the same size, but contains 100 different classes.

===Training Details===

The authors trained a residual network (He et al., 2016) on the CIFAR-10 dataset. The network depth was 56 and the same hyper-parameters as in the original work were used. A comparison of the two variants, i.e., the learned classifier and the proposed classifier with a fixed transformation, is shown in [[Media: figure1_resnet_cifar10.png | Figure 2]].

<center>[[File: figure1_resnet_cifar10.png]]</center>

These results demonstrate that although the training error is considerably lower for the network with learned classifier, both models achieve the same classification accuracy on the validation set. The authors' conjecture is that with the new fixed parameterization, the network can no longer increase the norm of a given sample’s representation - thus learning its label requires more effort. As this may happen for specific seen samples - it affects only training error.

The authors also compared using a fixed scale variable <math>\alpha </math> at different values vs. the learned parameter. Results for <math> \alpha = </math> {0.1, 1, 10} are depicted in [[Media: figure3_alpha_resnet_cifar.png| Figure 3]] for both training and validation error. As can be seen, similar validation accuracy can be obtained using a fixed scale value (in this case <math>\alpha </math> = 1 or 10 will suffice), at the expense of another hyper-parameter to search for. In all further experiments the scaling parameter <math> \alpha </math> was regularized with the same weight decay coefficient used on the original classifier. Although learning the scale is not necessary, it helps convergence during training.

<center>[[File: figure3_alpha_resnet_cifar.png]]</center>

The authors then trained the model on the CIFAR-100 dataset. They used the DenseNet-BC model from Huang et al. (2017) with a depth of 100 layers and k = 12. The higher number of classes caused the number of classifier parameters to grow, so that they made up about 4% of the whole model. However, the validation accuracy of the fixed-classifier model remained as good as that of the original model, and the same training curve was observed as earlier.

==IMAGENET==

===About the Dataset===

The Imagenet dataset introduced by Deng et al. (2009) spans over 1000 visual classes, and over 1.2 million samples. This is supposedly a more challenging dataset to work on as compared to CIFAR-10/100.

===Experiment Details===

The authors evaluated their fixed classifier method on Imagenet using Resnet50 by He et al. (2016) and the Densenet169 model (Huang et al., 2017) as described in the original work. Using a fixed classifier removed approximately 2 million parameters from the models, accounting for about 8% and 12% of the model parameters, respectively. The experiments revealed similar trends as observed on CIFAR-10.

For a stricter evaluation, the authors also trained a Shufflenet architecture (Zhang et al., 2017b), which was designed for low-memory and limited-compute platforms and in which the final-layer parameters make up the majority of the model. They were able to reduce the number of trainable parameters to 0.86 million by fixing the final layer, which holds 0.96 million parameters in the original model. Again, the proposed modification of the original model gave similar convergence results on validation accuracy. Interestingly, this method allowed Imagenet training in an under-specified regime, where there are more training samples than parameters. This is an unconventional regime for modern deep networks, which are usually over-specified to have many more parameters than training samples (Zhang et al., 2017a).

The overall results of the fixed-classifier are summarized in [[Media: table1_fixed_results.png | Table 1]].

<center>[[File: table1_fixed_results.png]]</center>

==Language Modelling==

Recent works have empirically found that using the same weights for both the word embedding and the classifier can yield equal or better results than using a separate pair of weights. So the authors experimented with fixed classifiers on language modeling, as it also requires classification over all possible tokens in the task vocabulary. They trained a recurrent model with 2 LSTM layers (Hochreiter & Schmidhuber, 1997) and an embedding and hidden size of 512 on the WikiText2 dataset (Merity et al., 2016), using the same settings as in Merity et al. (2017). The WikiText2 dataset contains about 33K different words, so the number of parameters expected in the embedding and classifier layers was about 34 million. This is about 89% of the total number of parameters of the whole model, which is 38 million. However, using a random orthogonal transform yielded poor results compared to a learned embedding. This was suspected to be due to the semantic relationships captured in the embedding layer of language models, which are not present in image classification tasks. The intuition was further confirmed by the much better results obtained when pre-trained embeddings were used, produced either by the word2vec algorithm of Mikolov et al. (2013) or by PMI factorization as suggested by Levy & Goldberg (2014).
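As a rough check of the parameter counts quoted above, assuming a vocabulary of 33,000 tokens, separate embedding and classifier matrices of width 512, and ignoring bias terms:

<center><math>
2 \times 33000 \times 512 \approx 33.8 \times 10^{6}, \qquad \frac{34 \times 10^{6}}{38 \times 10^{6}} \approx 0.89
</math></center>

The rounded vocabulary size is an assumption of this calculation; the result agrees with the figures of about 34 million parameters and 89% given in the text.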

<center>[[File: language.png]]</center>

=Discussion=

==Implications and Use Cases==

With the increasing number of classes in the benchmark datasets, computational demands for the final classifier will increase as well. In order to understand the problem better, the authors observe the work by Sun et al. (2017), which introduced JFT-300M - an internal Google dataset with over 18K different classes. Using a Resnet50 (He et al., 2016), with a 2048 sized representation led to a model with over 36M parameters meaning that over 60% of the model parameters resided in the final classification layer. Sun et al. (2017) also describe the difficulty in distributing so many parameters over the training servers involving a non-trivial overhead during synchronization of the model for update. The authors claim that the fixed-classifier would help considerably in this kind of scenario - where using a fixed classifier removes the need to do any gradient synchronization for the final layer. Furthermore, introduction of Hadamard matrix removes the need to save the transformation altogether, thereby, making it more efficient and allowing considerable memory and computational savings.
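The parameter shares mentioned here can be reproduced with simple arithmetic. In the sketch below, the 2048-dimensional representation comes from the text, while the backbone size of about 23.5 million parameters (a ResNet-50 without its final layer) and the rounded class count of 18,000 are assumptions made for illustration; the function names are hypothetical.

```python
def classifier_parameter_count(feature_dim: int, num_classes: int, bias: bool = True) -> int:
    """Number of parameters in a dense classification layer."""
    return feature_dim * num_classes + (num_classes if bias else 0)


def classifier_share(feature_dim: int, num_classes: int, backbone_params: int) -> float:
    """Fraction of all model parameters that sit in the classification layer."""
    head = classifier_parameter_count(feature_dim, num_classes)
    return head / (head + backbone_params)


# Assumed figure: about 23.5 million parameters in ResNet-50 without its final layer.
BACKBONE_PARAMS = 23_500_000

imagenet_head = classifier_parameter_count(2048, 1_000)
jft_head = classifier_parameter_count(2048, 18_000)   # "over 18K" classes, rounded down
imagenet_share = classifier_share(2048, 1_000, BACKBONE_PARAMS)
jft_share = classifier_share(2048, 18_000, BACKBONE_PARAMS)

print(f"ImageNet: {imagenet_head:,} parameters, {imagenet_share:.1%} of the model")
print(f"JFT-300M: {jft_head:,} parameters, {jft_share:.1%} of the model")
```

Under these assumptions the classifier holds roughly 8% of the parameters with 1000 classes and more than 60% with 18,000 classes, in line with the figures reported in the summary.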

==Possible Caveats==

The good performance of fixed-classifiers relies on the ability of the preceding layers to learn separable representations. This could be affected when the ratio between learned features and number of classes is small – that is, when <math> C > N</math>. However, they tested their method in such cases and their model performed well and provided good results.
Another factor that can affect the performance of a model with a fixed classifier is highly correlated classes. In that case, the fixed classifier cannot represent the correlation between classes, so the network may have difficulty learning. In a language model, word classes tend to have highly correlated instances, which also makes learning difficult.

Also, this proposed approach will only eliminate the computation of the classifier weights, so when the classes are fewer, the computation saving effect will not be readily apparent.

==Future Work==

The use of fixed classifiers might be further simplified in Binarized Neural Networks (Hubara et al., 2016a), where the activations and weights are restricted to ±1 during propagations. In that case, the norm of the last hidden layer would be constant for all samples (equal to the square root of the hidden layer width). The constant could then be absorbed into the scale constant <math>\alpha</math>, and there is no need in a per-sample normalization.
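The constant-norm observation is easy to verify numerically. The sketch below draws random ±1 activations with NumPy; the layer width of 256, the batch of 5 samples, and the function name `binarized_norms` are arbitrary choices for illustration.

```python
import numpy as np


def binarized_norms(width, num_samples=5, seed=0):
    """L2 norms of random +1/-1 activation vectors of the given width."""
    rng = np.random.default_rng(seed)
    h = rng.choice([-1.0, 1.0], size=(num_samples, width))
    return np.linalg.norm(h, axis=1)


norms = binarized_norms(256)
# Every squared entry is 1, so each norm equals sqrt(width).
print(norms)
```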

Additionally, more efficient ways to learn a word embedding should also be explored where similar redundancy in classifier weights may suggest simpler forms of token representations - such as low-rank or sparse versions.

A related paper claims that fixing most of the parameters of a neural network achieves results comparable to learning all of them (Rosenfeld & Tsotsos, 2018).

=Conclusion=

In this work, the authors argue that the final classification layer in deep neural networks is redundant and suggest removing the trainable parameters from the classification layer. The empirical results from experiments on the CIFAR and IMAGENET datasets suggest that such a change leads to little or almost no decline in the performance of the architecture. Furthermore, using a Hadamard matrix as the classifier might lead to some computational benefits when properly implemented, and save memory otherwise spent on a large number of transformation coefficients.

Another possible direction for future research is to find new efficient methods to create pre-defined word embeddings, which require a huge number of parameters that could possibly be avoided when learning a new task. More emphasis should therefore be given to the representations learned by the non-linear parts of neural networks, up to the final classifier, as the classifier itself seems highly redundant.

=Critique=

The paper proposes an interesting idea that has a potential use case when designing memory-efficient neural networks. The experiments shown in the paper are quite rigorous and provide support to the authors' claim. However, it would have been more helpful if the authors had described a bit more about efficient implementation of the Hadamard matrix and how to scale this method for larger datasets (cases with <math> C >N</math>).

Moreover, one of the main motivations of the paper is computational cost, but the paper does not compare a fixed and a learned classifier in terms of computational cost, nor does it investigate whether the savings are worth the drop in performance, given that accuracy cannot always be sacrificed for speed. At least a discussion of this issue would be expected.

On the other hand, the change in computational cost and performance after fixing the classifier could depend on the dataset and on its nature and complexity. For example, the classifier plays a larger role with 1000 classes than with 2 classes. An evaluation of this aspect is also needed.

Another interesting experiment would be to look at how this technique interacts with distillation when used in the teacher network, the student network, or both. For instance, does fixing the features make it more difficult to place more probability on dog than on boat when classifying a cat? Do networks with fixed classifier weights make worse teachers for distillation?

=References=

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

A Hedayat, WD Wallis, et al. Hadamard matrices and their applications. The Annals of Statistics, 6(6):1184–1238, 1978.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. 2017.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29 (NIPS’16), 2016a.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.

Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185, 2014.

Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.

Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. EACL 2017, pp. 157, 2017.

Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. 2018.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint arXiv:1611.03131, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. URL https://arxiv.org/abs/1611.03530.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017b.

Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

A. Rosenfeld and J. K. Tsotsos. Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing. arXiv preprint arXiv:1802.00844, 2018.

learn what not to learn

2018-11-30T19:19:27Z

S366chen: /* Related Work */

=Introduction=

In reinforcement learning, it is often difficult for an agent to learn when the action space is large, mainly because of difficulties with function approximation and exploration. Some previous work has used Monte Carlo Tree Search (MCTS) to help address this problem. MCTS is a heuristic search algorithm that provides some indication of how good an action is, and it works relatively well in problems where the action space is large (like the one in this paper). A famous example is Google's AlphaGo, which defeated the world champion in 2016 and uses MCTS in its reinforcement learning algorithm for the board game Go. In some cases many actions are irrelevant, and it is sometimes easier for the algorithm to learn which actions not to take. The paper proposes a new reinforcement learning approach for dealing with large action spaces based on action elimination, restricting the available actions in each state to a subset of the most likely ones. The core assumption of the proposed method is that it is easier to predict which actions in each state are invalid or inferior, and to use that information for control. More specifically, it proposes a system that learns an approximation of the Q-function and concurrently learns to eliminate actions. The method utilizes an external elimination signal which incorporates domain-specific prior knowledge. For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: "Climb the tree." Parser: "There are no trees to climb"). A machine learning model can then be trained to generalize to unseen states.

The paper focuses on tasks where both the states and the actions are natural language. It introduces a novel deep reinforcement learning approach which has a Deep Q-Network (DQN) and an Action Elimination Network (AEN), both using Convolutional Neural Networks (CNNs) designed for Natural Language Processing (NLP) tasks. The AEN is trained to predict invalid actions, supervised by the elimination signal from the environment. The proposed method uses the final layer activations of the AEN to build a linear contextual bandit model which allows the elimination of sub-optimal actions with high probability. '''Note that the core assumption is that it is easy to predict which actions are invalid or inferior in each state and leverage that information for control.'''

The elimination framework is tested on the text-based game "Zork", which lets players interact with a virtual world through a text-based interface.
In this game, the player explores an environment by imagining it from the text he/she reads. For more info, you can watch this video: [https://www.youtube.com/watch?v=xzUagi41Wo0 Zork].

The AEN algorithm has achieved a faster learning rate than the baseline agents by eliminating irrelevant actions.

An example of the Zork interface is shown below:

[[File:lnottol_fig1.png|500px|center]]

All states and actions are given in natural language. Input for the game contains more than a thousand possible actions in each state since the player can type anything.

=Related Work=

Text-Based Games (TBG): The state of the environment in a TBG is described in simple language. The player interacts with the environment through text commands that respect a pre-defined grammar. A popular example is Zork, which is tested in the paper. TBGs are a good research intersection of RL and NLP: they require language understanding, long-term memory, planning, exploration, affordance extraction, and common sense. They also often introduce stochastic dynamics to increase randomness.

Representations for TBG: Good word representations are necessary in order to learn control policies from high-dimensional complex data such as text. Some previous work on TBG used pre-trained embeddings directly for control, while other works combined pre-trained embeddings with neural networks. For example, He et al. (2015) proposed to use Bag of Words features as the input to a neural network, learn separate embeddings for states and actions, and then compute the Q function from autocorrelations between these embeddings.

DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly because neural networks can learn rich domain representations for the value function and policy. On the other hand, batch reinforcement learning methods with linear representations are more stable and accurate, but require feature engineering. A natural attempt at getting the best of both worlds is to learn a linear control policy on top of the representation of the last layer of a DNN. This approach was shown to refine the performance of DQNs and improve exploration. Similarly, for contextual linear bandits, Riquelme et al. showed that a neuro-linear Thompson sampling approach outperformed deep and linear bandit algorithms in practice.

RL in Large Action Spaces: Prior work concentrated on factorizing the action space into binary subspaces (Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003). Other work proposed to embed the discrete actions into a continuous space and then choose the nearest discrete action according to the optimal action in the continuous space (Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et al. (2015) extended DQN to unbounded (natural language) action spaces.
Learning to eliminate actions was first mentioned by Even-Dar, Mannor, and Mansour (2003), who proposed to learn confidence intervals around the value function in each state. Lipton et al. (2016a) proposed to learn a classifier that detects hazardous states and then use it to shape the reward. Fulda et al. (2017) presented a method for affordance extraction via inner products of pre-trained word embeddings.

=Action Elimination=

The approach in the paper builds on the standard Reinforcement Learning formulation. At each time step <math>t</math>, the agent observes state <math display="inline">s_t </math> and chooses a discrete action <math display="inline">a_t\in\{1,...,|A|\} </math>. Then, after action execution, the agent obtains a reward <math display="inline">r_t(s_t,a_t) </math> and observes next state <math display="inline">s_{t+1} </math> according to a transition kernel <math>P(s_{t+1}|s_t,a_t)</math>. The goal of the algorithm is to learn a policy <math display="inline">\pi(a|s) </math> which maximizes the expected future discounted cumulative return <math display="inline">V^\pi(s)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s]</math>, where <math> 0< \gamma <1 </math>. The Q-function is <math display="inline">Q^\pi(s,a)=E^\pi[\sum_{t=0}^{\infty}\gamma^tr(s_t,a_t)|s_0=s,a_0=a]</math>, and it can be optimized by Q-learning algorithm.

After executing an action, the agent observes a binary elimination signal <math>e(s, a)</math> that indicates which actions not to take. It equals 1 if action <math>a</math> may be eliminated in state <math>s</math> (and 0 otherwise). The signal helps mitigate the problem of large discrete action spaces. We start with the following definitions:

'''Definition 1:'''

Valid state-action pairs with respect to an elimination signal are state-action pairs which the elimination process should not eliminate.

The set of valid state-action pairs contains all of the state-action pairs that are a part of some optimal policy, i.e., only strictly suboptimal state-actions can be invalid.

'''Definition 2:'''

Admissible state-action pairs with respect to an elimination algorithm are state-action pairs which the elimination algorithm does not eliminate.

'''Definition 3:'''

Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions. We allow the base Q-learning algorithm to be any algorithm that converges to <math display="inline">Q^*</math> with probability 1 after observing each state-action infinitely often.
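
Definition 3 can be illustrated with a minimal tabular sketch. The function name <code>ae_q_update</code>, the boolean mask argument, and the step size and discount values are assumptions made for illustration; they are not taken from the paper. The caller is assumed to invoke the update only for admissible state-action pairs and to guarantee at least one admissible action in the next state.

```python
import numpy as np

def ae_q_update(Q, s, a, r, s_next, admissible_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step that bootstraps only from admissible actions.

    Q               : array of shape (n_states, n_actions)
    admissible_next : boolean mask of shape (n_actions,), True = not eliminated
                      in s_next (hypothetical interface, for illustration)
    """
    # Max over admissible actions only; eliminated actions are ignored.
    best_next = np.max(Q[s_next][admissible_next])
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
    return Q

Q = np.zeros((2, 3))
Q[1] = [1.0, 5.0, 2.0]                   # action 1 looks best in state 1 ...
mask = np.array([True, False, True])     # ... but it has been eliminated there
Q = ae_q_update(Q, s=0, a=0, r=0.0, s_next=1, admissible_next=mask)
```

In this toy example the update bootstraps from the value 2.0 of the best admissible action rather than from the value 5.0 of the eliminated one.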

==Advantages of Action Elimination==

The main advantage of action elimination is that it allows the agent to overcome some of the main difficulties in large action spaces which are Function Approximation and Sample Complexity.

Function approximation: Errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, and this phenomenon becomes more noticeable when the action space is large. Action elimination mitigates this effect by taking the max operator only over valid actions, thus reducing potential overestimation errors. Moreover, by ignoring the invalid actions, the function approximator can learn a simpler mapping (i.e., only the Q-values of the valid state-action pairs), leading to faster convergence and a better solution.

Sample complexity: The sample complexity measures the number of steps during learning in which the policy is not <math display="inline">\epsilon</math>-optimal. Assume that there are <math>A'</math> actions that should be eliminated and are <math>\epsilon</math>-optimal, i.e. their value is at least <math>V^*(s)-\epsilon</math>. By the lower bound of Lattimore and Hutter (2012), on the order of <math display="inline">\epsilon^{-2}(1-\gamma)^{-3}\log(1/\delta)</math> samples per state-action pair are needed to converge with probability <math display="inline">1-\delta</math>. An invalid action often returns no reward and doesn't change the state, resulting in an action gap of <math display="inline">\epsilon=(1-\gamma)V^*(s)</math>; substituting this gap into the bound gives <math display="inline">V^*(s)^{-2}(1-\gamma)^{-5}\log(1/\delta)</math> wasted samples for learning each invalid state-action pair. In practice, an elimination algorithm can eliminate these invalid actions and therefore speed up the learning process by a factor of approximately <math display="inline">A/A'</math>.

Because it is difficult to embed the elimination signal into the MDP, the authors use contextual multi-armed bandits to decouple the elimination signal from the MDP. This allows actions to be eliminated correctly while standard Q-learning is applied in the learning process.

==Action elimination with contextual bandits==

The contextual bandit problem is a well-known probability problem and a natural extension of the multi-armed bandit problem.

Let <math display="inline">x(s_t)\in \mathbb{R}^d </math> be the feature representation of <math display="inline">s_t </math>. We assume that under this representation there exists a set of parameters <math display="inline">\theta_a^*\in \mathbb{R}^d </math> such that the elimination signal in state <math display="inline">s_t </math> is <math display="inline">e_t(s_t,a) = \theta_a^{*T}x(s_t)+\eta_t </math>, where <math display="inline"> \Vert\theta_a^*\Vert_2\leq S</math> and <math display="inline">\eta_t</math> is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, R=0. Otherwise, <math display="inline">R\leq 1</math> since the elimination signal is bounded in [0,1]. Assume the elimination signal satisfies <math display="inline">0\leq E[e_t(s_t,a)]\leq l </math> for any valid action and <math display="inline"> u\leq E[e_t(s_t, a)]\leq 1</math> for any invalid action, with <math display="inline"> l< u</math>. Denote by <math display="inline">X_{t,a}</math> the matrix whose rows are the representation vectors of the states in which action a was chosen, up to time t, and by <math display="inline">E_{t,a}</math> the vector whose elements are the elimination signals observed in those states. Denote the solution to the regularized linear regression <math display="inline">\Vert X_{t,a}\theta_{t,a}-E_{t,a}\Vert_2^2+\lambda\Vert \theta_{t,a}\Vert_2^2 </math> (for some <math display="inline">\lambda>0</math>) by <math display="inline">\hat{\theta}_{t,a}=\bar{V}_{t,a}^{-1}X_{t,a}^TE_{t,a} </math>, where <math display="inline">\bar{V}_{t,a}=\lambda I + X_{t,a}^TX_{t,a}</math>.
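
The regularized regression above has a closed form that is easy to check numerically. The following is a minimal NumPy sketch for a single action; the helper name <code>ridge_estimate</code> and the toy data are illustrative assumptions, not part of the paper's code.

```python
import numpy as np

def ridge_estimate(X, E, lam=1.0):
    """Return (theta_hat, V_bar) for one action.

    X : (n, d) features of the states in which the action was taken
    E : (n,)   elimination signals observed in those states
    """
    d = X.shape[1]
    V_bar = lam * np.eye(d) + X.T @ X            # V_bar = lambda*I + X^T X
    theta_hat = np.linalg.solve(V_bar, X.T @ E)  # V_bar^{-1} X^T E without an explicit inverse
    return theta_hat, V_bar

# Toy data (made up): the signal fires whenever the first feature is active.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
E = np.array([1.0, 0.0, 1.0])
theta_hat, V_bar = ridge_estimate(X, E, lam=1.0)
```

Using <code>np.linalg.solve</code> instead of forming the inverse is a numerical design choice of this sketch; it computes the same estimate.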

According to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011), with probability of at least <math display="inline">1-\delta</math>, <math display="inline">|\hat{\theta}_{t,a}^{T}x(s_t)-\theta_a^{*T}x(s_t)|\leq\sqrt{\beta_t(\delta)x(s_t)^T\bar{V}_{t,a}^{-1}x(s_t)}\ \forall t>0</math>, where <math display="inline">\sqrt{\beta_t(\delta)}=R\sqrt{2\ \log(\det(\bar{V}_{t,a})^{1/2}\det(\lambda I)^{-1/2}/\delta)}+\lambda^{1/2}S</math>. If <math display="inline">\forall s\ ,\Vert x(s)\Vert_2 \leq L</math>, then <math display="inline">\beta_t</math> can be bounded by <math display="inline">\sqrt{\beta_t(\delta)} \leq R \sqrt{d\ \log\left(\frac{1+tL^2/\lambda}{\delta}\right)}+\lambda^{1/2}S</math>. Next, define <math display="inline">\tilde{\delta}=\delta/k</math>, where <math display="inline">k</math> is the number of actions, and bound this probability for all the actions, i.e., <math display="inline">\forall a,t>0</math>

<math display="inline">Pr\left(|\hat{\theta}_{t-1,a}^{T}x(s_t)-\theta_{a}^{*T}x(s_t)|\leq\sqrt{\beta_{t-1}(\tilde\delta)x(s_t)^T\bar{V}_{t - 1,a}^{-1}x(s_t)}\right) \geq 1-\delta</math>

Recall that <math display="inline">E[e_t(s_t,a)]=\theta_a^{*T}x(s_t)\leq l</math> if a is a valid action. Then we can eliminate action a at state <math display="inline">s_t</math> if it satisfies:

<math display="inline">\hat{\theta}_{t-1,a}^{T}x(s_t)-\sqrt{\beta_{t-1}(\tilde\delta)x(s_t)^T\bar{V}_{t-1,a}^{-1}x(s_t)}>l</math>

With probability of at least <math display="inline">1-\delta</math>, this rule never eliminates a valid action. Note that <math display="inline">l, u</math> are not known. In practice, choosing <math display="inline">l</math> to be 0.5 should suffice.
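
The elimination rule can be sketched as a lower-confidence-bound test. The function name <code>should_eliminate</code> and the toy numbers are illustrative assumptions; <code>beta</code> stands for <math display="inline">\beta_{t-1}(\tilde\delta)</math> and is passed in as a plain number rather than computed from the bound.

```python
import numpy as np

def should_eliminate(theta_hat, V_bar, x, beta, l=0.5):
    """Eliminate an action when the lower confidence bound on its
    expected elimination signal exceeds the threshold l."""
    mean = theta_hat @ x                                     # theta_hat^T x(s)
    width = np.sqrt(beta * (x @ np.linalg.solve(V_bar, x)))  # sqrt(beta * x^T V_bar^{-1} x)
    return bool(mean - width > l)

theta_hat = np.array([0.9, 0.0])
x = np.array([1.0, 0.0])

# Few observations: V_bar is small, so the interval is wide and the action is kept.
few = should_eliminate(theta_hat, np.diag([1.0, 1.0]), x, beta=0.5)
# Many observations along x: the interval is narrow and the action is eliminated.
many = should_eliminate(theta_hat, np.diag([100.0, 1.0]), x, beta=0.5)
```

The same point estimate leads to different decisions depending on how much data supports it, which is what protects valid actions from premature elimination.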

==Concurrent Learning==
In fact, Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space.

If the elimination is done based on the concentration bounds of the linear contextual bandits, it can be ensured that Action Elimination Q-learning converges, as shown in Proposition 1.

'''Proposition 1:'''

Assume that all state-action pairs (s,a) are visited infinitely often, unless eliminated according to <math display="inline">\hat{\theta}_{t-1,a}^Tx(s)-\sqrt{\beta_{t-1}(\tilde{\delta})x(s)^T\bar{V}_{t-1,a}^{-1}x(s)}>l</math>. Then, with a probability of at least <math display="inline">1-\delta</math>, action elimination Q-learning converges to the optimal Q-function for any valid state-action pairs. In addition, actions which should be eliminated are visited at most <math display="inline">T_{s,a}(t)\leq 4\beta_t/(u-l)^2+1</math> times.

Notice that when there is no noise in the elimination signal (R=0), we correctly eliminate actions with probability 1, so invalid actions will be sampled only a finite number of times.

=Method=

The assumption that <math display="inline">e_t(s_t,a)=\theta_a^{*T}x(s_t)+\eta_t </math> generally does not hold when using raw features like word2vec, so the paper proposes to use the last layer of a neural network, the action elimination network (AEN), as the feature representation of states. A practical challenge here is that the features must be fixed over time when used by the contextual bandit. The paper therefore uses a batch-update framework (Levine et al., 2017; Riquelme, Tucker, and Snoek, 2018), in which a new contextual bandit model is learned every few steps using the last-layer activations of the AEN as features.

==Architecture of action elimination framework==

[[File:lnottol_fig1b.png|300px|center]]

After taking action <math display="inline">a_t</math>, the agent observes <math display="inline">(r_t,s_{t+1},e_t)</math> and uses it to train two deep neural network function approximators: a DQN and an AEN. The AEN provides an admissible action set <math display="inline">A'</math> to the DQN, which uses this set to decide how to act and learn. The architecture for both the AEN and DQN is an NLP CNN (100 convolutional filters for the AEN and 500 for the DQN, with three different 1D kernels of length (1,2,3)), based on (Kim, 2014). The state is represented as a sequence of words, composed of the game descriptor and the player's inventory. These are truncated or zero-padded to a length of 50 descriptor + 15 inventory words, and each word is embedded into a continuous vector in <math display="inline">R^{300}</math> using word2vec. The features of the last four states are then concatenated so that the final state representation s is in <math display="inline">R^{78000}</math> (65 words × 300 dimensions × 4 states). The AEN is trained to minimize the MSE loss, using the elimination signal as a label. The code, the Zork domain, and the implementation of the elimination signal can be found [https://github.com/TomZahavy/CB_AE_DQN here.]
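
The dimension of the state representation can be checked with a small sketch of the padding and concatenation steps. Random vectors stand in for word2vec embeddings here, and the helper names <code>pad_or_truncate</code> and <code>frame_features</code> are hypothetical; this is not the authors' preprocessing code.

```python
import numpy as np

EMBED_DIM, N_DESC, N_INV, N_FRAMES = 300, 50, 15, 4

def pad_or_truncate(vectors, length):
    """Truncate to `length` word vectors, or zero-pad up to it."""
    out = np.zeros((length, EMBED_DIM))
    n = min(len(vectors), length)
    if n:
        out[:n] = vectors[:n]
    return out

def frame_features(desc_vecs, inv_vecs):
    """Flatten one state: 50 descriptor words followed by 15 inventory words."""
    return np.concatenate([pad_or_truncate(desc_vecs, N_DESC),
                           pad_or_truncate(inv_vecs, N_INV)]).ravel()

rng = np.random.default_rng(0)
frames = [frame_features(rng.normal(size=(60, EMBED_DIM)),   # too long: truncated to 50
                         rng.normal(size=(3, EMBED_DIM)))    # too short: padded to 15
          for _ in range(N_FRAMES)]
state = np.concatenate(frames)   # last four states stacked into one vector
```

Each state contributes (50 + 15) × 300 = 19500 values, and four of them give the 78000-dimensional input.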

==Pseudocode of the Algorithm==

[[File:lnottol_fig2.png|750px|center]]

AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. Every L iterations, the procedure AENUpdate() creates a linear contextual bandit model from E: it uses the activations of the last hidden layer of E as features, solves the resulting contextual linear bandit model, and plugs the solution into the target AEN. The contextual linear bandit model <math display="inline">(E^-,V)</math> is then used to eliminate actions via the ACT() and Targets() functions. ACT() follows an <math display="inline">\epsilon</math>-greedy mechanism on the admissible action set. For exploitation, it selects the action with the highest Q-value by taking an argmax over Q-values among <math display="inline">A'</math>. For exploration, it selects an action uniformly from <math display="inline">A'</math>. The Targets() procedure estimates the value function by taking the max over Q-values only among admissible actions, hence reducing function approximation errors.
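
The action selection and target computation described above can be sketched as follows. This is a simplified illustration that takes the admissible set as a precomputed boolean mask; the function names, signatures, and toy Q-values are assumptions and do not reproduce the paper's pseudocode line by line.

```python
import numpy as np

def act(q_values, admissible, epsilon, rng):
    """Epsilon-greedy action selection restricted to the admissible set A'."""
    candidates = np.flatnonzero(admissible)
    if rng.random() < epsilon:
        return int(rng.choice(candidates))                    # explore uniformly within A'
    return int(candidates[np.argmax(q_values[candidates])])   # exploit within A'

def targets(reward, next_q_values, next_admissible, gamma, done=False):
    """Bootstrapped target that maximizes only over admissible next actions."""
    if done:
        return reward
    return reward + gamma * np.max(next_q_values[next_admissible])

rng = np.random.default_rng(0)
q = np.array([0.2, 3.0, 1.5, -1.0])
mask = np.array([True, False, True, True])   # action 1 has been eliminated
greedy_action = act(q, mask, epsilon=0.0, rng=rng)
y = targets(reward=-1.0, next_q_values=q, next_admissible=mask, gamma=0.8)
```

With action 1 eliminated, the greedy choice falls back to action 2, and the target bootstraps from its Q-value of 1.5 rather than from the eliminated action's 3.0.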

=Experiments=
==Grid Domain==
The authors start by evaluating their algorithm on a small grid world domain with 9 rooms, where they can analyze the effect of action elimination (a visualization can be found in the paper's appendix). In this domain, the agent starts at the center of the grid and needs to navigate to its upper-left corner. On every step, the agent suffers a penalty of (−1), with a terminal reward of 0. Prior to the game, the states are randomly divided into K categories. The environment has 4K navigation actions, 4 for each category, each with a probability of moving in a random direction. If the chosen action belongs to the same category as the state, the action is performed correctly with probability <math display="inline">p_c^T = 0.75</math>. Otherwise, it is performed correctly with probability <math display="inline">p_c^F = 0.5</math>. If the action does not fit the state category, the elimination signal equals 1, and if the action and state belong to the same category, then e = 0. The optimal policy only uses the navigation actions from the same category as the state, and all of the other actions are strictly suboptimal.

A basic comparison between vanilla Q-learning without action elimination (green) and a tabular version of the action elimination Q-learning (blue) can be found in the figure below. In all of the figures, the results are compared to the case with one category (red), i.e., only 4 basic navigation actions, which forms an upper bound on performance with multiple categories. In Figures (a) and (c), the episode length is T = 150, and in Figure (b) it is T = 300, to allow sufficient exploration for the vanilla Q-learning. It is clear from the simulations that action elimination dramatically improves the results in large action spaces. Also, note that the gain from action elimination increases with the grid size since the elimination allows the agent to reach the goal earlier.

[[File:griddomain.png|1200px|thumb|center|Performance of agents in grid world]]
==Zork domain==

The world of Zork presents a rich environment with a large state and action space.
Zork players describe their actions using natural language instructions, for example, "open the mailbox". These actions are processed by a sophisticated natural language parser, and based on the result, the game presents the outcome of the action. The goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case. Points generated by the game's scoring system are given to the agent as the reward. For example, the player earns points when solving puzzles, and placing all treasures in the trophy case yields 350 points. The elimination signal is given in two forms: a "wrong parse" flag and text feedback such as "you cannot take that". These two signals are grouped together into a single binary signal, which is then provided to the algorithm.

[[File:zork_domain.png|1200px|thumb|center|Left:the world of Zork.Right:subdomains of Zork.]]

Experiments begin with two subdomains of the Zork domain: the Egg Quest and the Troll Quest. For these subdomains, an additional reward signal is provided to guide the agent towards solving specific tasks and make the results more visible. A reward of -1 is applied at every time step to encourage the agent to favor short paths. Each trajectory terminates upon completing the quest or after T steps are taken. The discount factor is <math display="inline">\gamma=0.8</math> during training and <math display="inline">\gamma=1</math> during evaluation. Also, <math display="inline">\beta=0.5, l=0.6</math> in all experiments.

===Egg Quest===

The goal for this quest is to find and open the jewel-encrusted egg hidden on a tree in the forest. An egg-splorer goes on an adventure to find a mystical ancient relic with his furry companion. You can have a look at the game at [https://scratch.mit.edu/projects/212838126/ EggQuest]

The agent will get 100 points upon completing this task. The action space consists of 9 fixed actions for navigation and a second subset of <math display="inline">N_{Take}</math> actions for taking possible objects in the game. Two settings were tested separately: <math display="inline">N_{Take}=200</math> (set <math display="inline">A_1</math>) and <math display="inline">N_{Take}=300</math> (set <math display="inline">A_2</math>).
AE-DQN (blue) and a vanilla DQN agent (green) were tested in this quest.

[[File:AEF_zork_comparison.png|1200px|thumb|center|Performance of agents in the egg quest.]]

Figure a) corresponds to the set <math display="inline">A_1</math>, with T=100, b) corresponds to the set <math display="inline">A_2</math>, with T=100, and c) corresponds to the set <math display="inline">A_2</math>, with T=200. Both agents performed well in settings a and c. However, the AE-DQN agent learned much faster than the DQN in setting b, which implies that action elimination is more robust to hyperparameter choices when the action space is large. One important observation is that the three figures have different scales for the cumulative reward. While the AE-DQN outperformed the standard DQN in figure b, both models performed significantly better with the hyperparameter configuration in figure c.

===Troll Quest===

The goal of this quest is to find the troll. To do so, the agent needs to find the way to the house and use a lantern to expose the hidden entrance to the underworld. It will get 100 points upon achieving the goal. This quest is a larger problem than the Egg Quest. The action set <math display="inline">A_1</math> consists of 200 take actions and 15 necessary actions, 215 in total.

[[File:AEF_troll_comparison.png|400px|thumb|center|Results in the Troll Quest.]]

The red line above is an "optimal elimination" baseline which consists of only 35 actions (15 essential and 20 relevant take actions). We can see that AE-DQN still outperforms DQN, and its improvement over DQN is more significant in the Troll Quest than in the Egg Quest. It also achieves performance comparable to the "optimal elimination" baseline.

===Open Zork===

Lastly, the "Open Zork" domain was tested, in which only the environment reward is used. The agents were trained for 1M steps, and each trajectory terminates after T=200 steps. Two action sets were used: <math display="inline">A_3</math>, the "Minimal Zork" action set, which is the minimal set of actions (131) required to solve the game, and <math display="inline">A_4</math>, the "Open Zork" action set (1227), which is composed of {Verb, Object} tuples for all the verbs and objects in the game.

[[File:AEF_open_zork_comparison.png|600px|thumb|center|Results in "Open Zork".]]

The figure above shows the learning curves for both AE-DQN and DQN. We can see that AE-DQN (blue) still outperforms the DQN (green) in terms of speed and cumulative reward.

=Conclusion=
In this paper, the authors proposed a Deep Reinforcement Learning model for eliminating sub-optimal actions while performing Q-learning. Moreover, they showed that eliminating actions using linear contextual bandits, with theoretical guarantees of convergence, reduces the size of the action space, makes exploration more effective, and improves learning when tested on Zork, a text-based game.

For future work the authors aim to investigate more sophisticated architectures and tackle learning shared representations for elimination and control which may boost performance on both tasks.

They also hope to investigate other mechanisms for action elimination, such as eliminating actions that result from low Q-values as in (Even-Dar, Mannor, and Mansour, 2003).

The authors also hope to generate elimination signals in real-world domains and achieve the purpose of eliminating the signal implicitly.

=Critique=
The paper is not a significant algorithmic contribution and it merely adds an extra layer of complexity to the very famous DQN algorithm. All the experimental domains considered in the paper are discrete action problems that have so many actions that it could have been easily extended to a continuous action problem. In continuous action space there are several policy gradient based RL algorithms that have provided stronger performances. The authors should have ideally compared their methods to such algorithms like PPO or DRPO.

Even with the critique above, the paper presents mathematical/theoretical justifications of the methodology. Moreover, since the methodology is built on the standard RL framework, other RL algorithm variants can apply the idea to decrease complexity and increase performance. There is also room for technical variations on the algorithm.

Also, since we are utilizing the system's response to irrelevant actions, an intuitive approach to eliminate such irrelevant actions is to add a huge negative reward for such actions, which would be much easier than the approach suggested by this paper. However, in the experiments, the authors only compare AE-DQN to traditional DQN, not to traditional DQN with negative rewards assigned to irrelevant actions.

After all, the name that the authors have chosen is a good and attractive choice and matches our brain's structure which in so many real-world scenarios detects what not to learn.

=Reference=
1. Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

2. Côté, M.-A.; Kádár, Á.; Yuan, X.; Kybartas, B.; Barnes, T.; Fine, E.; Moore, J.; Hausknecht, M.; Asri, L. E.; Adada, M.; et al. 2018. Textworld: A learning environment for text-based games. arXiv.

3. Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; and Coppin, B. 2015. Deep reinforcement learning in large discrete action spaces. arXiv.

4. He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; and Ostendorf, M. 2015. Deep reinforcement learning with an unbounded action space. CoRR abs/1511.04636.

5. Kim, Y. 2014. Convolutional neural networks for sentence classification. [https://arxiv.org/abs/1408.5882 arXiv preprint].

6. Van Hasselt, H., and Wiering, M. A. 2009. Using continuous action spaces to solve discrete problems. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, 1149–1156. IEEE.

7. Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine learning 8(3-4):279–292.

8. Su, P.-H.; Gasic, M.; Mrksic, N.; Rojas-Barahona, L.; Ultes, S.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. Continuously learning neural dialogue management. arXiv preprint.

9. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint.

10. Yuan, X.; Côté, M.-A.; Sordoni, A.; Laroche, R.; Combes, R. T. d.; Hausknecht, M.; and Trischler, A. 2018. Counting to explore and generalize in text-based games. arXiv preprint arXiv:1806.1152

11. Zahavy, T.; Haroush, M.; Merlis, N.; Mankowitz, D. J.; 2018. Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning. arXiv:1809.02121v1

CapsuleNets

2018-11-29T17:39:11Z

S366chen: /* Critique */

The paper "Dynamic Routing Between Capsules" was written by three researchers at Google Brain: Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. This paper was published and presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach, California. The same three researchers recently published a highly related paper "[https://openreview.net/pdf?id=HJWLfGWRb Matrix Capsules with EM Routing]" for ICLR 2018.

=Motivation=

Ever since AlexNet eclipsed the performance of competing architectures in the 2012 ImageNet challenge, convolutional neural networks have maintained their dominance in computer vision applications. Despite the recent successes and innovations brought about by convolutional neural networks, some assumptions made in these networks are perhaps unwarranted and deficient. Using a novel neural network architecture, the authors create CapsuleNets, a network that they claim is able to learn image representations in a more robust, human-like manner. With only a 3 layer capsule network, they achieved near state-of-the-art results on MNIST.
==Adversarial Examples==

First discussed by Christian Szegedy et al. in late 2013, adversarial examples have been heavily discussed by the deep learning community as a potential security threat to AI learning. Adversarial examples are defined as inputs that an attacker intentionally creates to fool a machine learning model. An example of an adversarial example is shown below:

[[File:adversarial_img_1.png ‎|center]]
To the human eye, the image appears to be a panda both before and after noise is injected into the image, whereas the trained ConvNet model discerns the noisy image as a gibbon with almost 100% certainty. The fact that the network is unable to classify the above image as a panda after the epsilon perturbation leads to many potential security risks in AI-dependent systems such as self-driving vehicles. Although various methods have been suggested to combat adversarial examples, robust defences are hard to construct due to the inherent difficulties in constructing theoretical models for the adversarial example crafting process. However, beyond the fact that these examples may serve as a security threat, they emphasize that these convolutional neural networks do not learn image classification/object detection patterns the same way that a human would. Rather than identifying the core features of a panda such as its eyes, mouth, nose, and the gradient changes in its black/white fur, the convolutional neural network seems to be learning image representations in a completely different manner. Deep learning researchers often attempt to model neural networks after human learning, and it is clear that further steps must be taken to robustify ConvNets against targeted noise perturbations.

==Drawbacks of CNNs==
Hinton claims that the key fault with traditional CNNs lies within the pooling function. Although pooling builds translational invariance into the network, it fails to preserve spatial relationships between objects. When we pool, we effectively reduce a kxk kernel of convolved cells into a scalar input. This results in a desired local invariance without inhibiting the network's ability to detect features, but causes valuable spatial information to be lost.
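
The loss of spatial information under pooling can be seen in a small NumPy sketch. The helper <code>max_pool</code> and the two toy feature maps are made up for illustration; they are not from the paper.

```python
import numpy as np

def max_pool(x, k):
    """Non-overlapping k x k max pooling on a 2-D array whose sides are multiples of k."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

# Two feature maps with the same activations placed at different positions
# inside each 2x2 pooling window.
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 7],
              [0, 0, 0, 0],
              [0, 5, 0, 0]])
b = np.array([[0, 0, 7, 0],
              [0, 9, 0, 0],
              [5, 0, 0, 0],
              [0, 0, 0, 0]])
pooled_a, pooled_b = max_pool(a, 2), max_pool(b, 2)
```

Both maps pool to the same 2x2 output, so later layers cannot tell the two arrangements apart. This is the local invariance that pooling provides, and also the spatial detail it discards.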

In the example below, the network is able to detect the similar features (eyes, mouth, nose, etc) within both images, but fails to recognize that one image is a human face while the other is a Picasso-esque rearrangement of the same features, due to the CNN's inability to encode spatial relationships after multiple pooling layers.

[[File:Equivariance Face.png ‎|center]]

Conversely, we hope that a CNN can recognize that both of the following pictures contain a kitten. Unfortunately, when we feed the two images into a ResNet50 architecture, only the first image is correctly classified, while the second image is predicted to be a guinea pig.

[[File:kitten.jpeg ‎|center]]

[[File:kitten-rotated-180.jpg ‎|center]]

For a more in depth discussion on the problems with ConvNets, please listen to Geoffrey Hinton's talk "What is wrong with convolutional neural nets?" given at MIT during the Brain & Cognitive Sciences - Fall Colloquium Series (December 4, 2014).

==Intuition for Capsules==
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Hinton argues that our brains reason about visual information by deconstructing it into a hierarchical representation which we then match to familiar patterns and relationships from memory. The key difference between this understanding and the functionality of CNNs is that recognition of an object should not depend on the angle from which it is viewed.

To enforce rotational and translational equivariance, Capsule Networks store and preserve hierarchical pose relationships between objects. The core idea behind capsule theory is the explicit numerical representations of relative relationships between different objects within an image. Building these relationships into the Capsule Networks model, the network is able to recognize newly seen objects as a rotated view of a previously seen object. For example, the below image shows the Statue of Liberty under five different angles. If a person had only seen the Statue of Liberty from one angle, they would be able to ascertain that all five pictures below contain the same object (just from a different angle).

[[File:Rotational Invariance.jpeg ‎|center]]

Building on this idea of hierarchical representation of spatial relationships between key entities within an image, the authors introduce Capsule Networks. Unlike traditional CNNs, Capsule Networks are better equipped to classify correctly under rotational invariance. Furthermore, the authors managed to achieve state of the art results on MNIST using a fraction of the training samples that alternative state of the art networks require.

=Background, Notation, and Definitions=

==What is a Capsule==
"Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold."

In essence, capsules store object properties in a vector form; probability of detection is encoded as the vector's length, while spatial properties are encoded as the individual vector components. Thus, when a feature is present but the image captures it under a different angle, the probability of detection remains unchanged.

A brief overview/understanding of capsules can be found in other papers from the author. To quote from [https://openreview.net/pdf?id=HJWLfGWRb this paper]:

<blockquote>
A capsule network consists of several layers of capsules. The set of capsules in layer L is denoted
as <math>\Omega_L</math>. Each capsule has a 4x4 pose matrix, <math>M</math>, and an activation probability, <math>a</math>. These are like the
activities in a standard neural net: they depend on the current input and are not stored. In between
each capsule i in layer L and each capsule j in layer L + 1 is a 4x4 trainable transformation matrix,
<math>W_{ij}</math> . These <math>W_{ij}</math>'s (and two learned biases per capsule) are the only stored parameters and they
are learned discriminatively. The pose matrix of capsule i is transformed by <math>W_{ij}</math> to cast a vote
<math>V_{ij} = M_iW_{ij}</math> for the pose matrix of capsule j. The poses and activations of all the capsules in layer
L + 1 are calculated by using a non-linear routing procedure which gets as input <math>V_{ij}</math> and <math>a_i</math> for all
<math>i \in \Omega_L, j \in \Omega_{L+1}</math>
</blockquote>

==Notation==

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The paper applies a non-linear squashing function so that the vector length falls between 0 and 1: short vectors (entities that are unlikely to be present) are shrunk towards 0, while long vectors are shrunk to a length slightly below 1.

\begin{align} \mathbf{v}_j &= \frac{||\mathbf{s}_j||^2}{1+ ||\mathbf{s}_j||^2} \frac{\mathbf{s}_j}{||\mathbf{s}_j||} \end{align}

where <math>\mathbf{v}_j</math> is the vector output of capsule <math>j</math> and <math>\mathbf{s}_j</math> is its total input.
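
The squashing function above can be sketched in a few lines of plain Python. This is only an illustration: the helper names <code>squash</code> and <code>length</code> and the small constant <code>eps</code> (added to avoid division by zero) are assumptions of this sketch, not part of the paper.

```python
import math

def squash(s, eps=1e-12):
    """Rescale vector s (a list of floats) so that its length lies in [0, 1).

    Implements v = (|s|^2 / (1 + |s|^2)) * (s / |s|); eps is a hypothetical
    safeguard against division by zero for the all-zero vector.
    """
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]

def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

short = squash([0.1, 0.0])    # short input: length is pushed towards 0
long_ = squash([10.0, 0.0])   # long input: length approaches (but stays below) 1
print(length(short), length(long_))
```

The direction of the vector is unchanged; only its length is rescaled, which is what lets the length be read as a probability.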

For all but the first layer of capsules, the total input to a capsule <math>\mathbf{s}_j</math> is a weighted sum over all “prediction vectors” <math>\hat{\mathbf{u}}_{j|i}</math> from the capsules in the layer below. Each prediction vector is produced by multiplying the output <math>\mathbf{u}_i</math> of a capsule in the layer below by a weight matrix <math>\mathbf{W}_{ij}</math>:

\begin{align}
\mathbf{s}_j = \sum_i c_{ij}\hat{\mathbf{u}}_{j|i}, ~\hspace{0.5em} \hat{\mathbf{u}}_{j|i}= \mathbf{W}_{ij}\mathbf{u}_i
\end{align}
where the <math>c_{ij}</math> are coupling coefficients that are determined by the iterative dynamic routing process.

The coupling coefficients between capsule <math>i</math> and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits <math>b_{ij}</math> are the log prior probabilities that capsule <math>i</math> should be coupled to capsule <math>j</math>.

\begin{align}
c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}
\end{align}
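
As a rough sketch of how the two equations above fit together, the following plain-Python code computes the coupling coefficients with a softmax over each lower capsule's logits and then forms the weighted sums <math>\mathbf{s}_j</math>. The nested-list layout (<code>u[i]</code>, <code>W[i][j]</code>, <code>b[i][j]</code>) and the function names are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Routing softmax over the logits b_i1..b_iM of one lower capsule."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, u):
    """Matrix-vector product, W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]

def total_inputs(u, W, b):
    """Return (c, s): coupling coefficients c[i][j] and total inputs s[j].

    u[i]    : output vector of lower capsule i
    W[i][j] : weight matrix between lower capsule i and upper capsule j
    b[i][j] : routing logit for the pair (i, j)
    """
    n_lower, n_upper = len(u), len(b[0])
    c = [softmax(b_i) for b_i in b]
    # Prediction vectors u_hat[i][j] = W_ij u_i
    u_hat = [[matvec(W[i][j], u[i]) for j in range(n_upper)]
             for i in range(n_lower)]
    dim = len(u_hat[0][0])
    # s_j = sum_i c_ij * u_hat_{j|i}
    s = [[sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
          for d in range(dim)]
         for j in range(n_upper)]
    return c, s
```

With all logits equal, every lower capsule spreads its output evenly over the capsules above, which is the starting point of the routing procedure.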

=Network Training and Dynamic Routing=

==Understanding Capsules==
The notation can get somewhat confusing, so I will provide intuition behind the computational steps within a capsule. The following image is taken from naturomic's talk on Capsule Networks.

[[File:CapsuleNets.jpeg|center|800px]]

The above image illustrates the key mathematical operations happening within a capsule (and compares them to the structure of a neuron). Although the operations are rather straightforward, it's crucial to note that the capsule applies an affine transformation to each input vector. The length of each input vector <math>\mathbf{u}_{i}</math> represents the probability of entity <math>i</math> existing in a lower level. This vector is then reoriented with an affine transform using <math>\mathbf{W}_{ij}</math> matrices that encode spatial relationships between entity <math>\mathbf{u}_{i}</math> and other lower level features.

We illustrate the intuition behind vector-to-vector matrix multiplication within capsules using the following example: if vectors <math>\mathbf{u}_{1}</math>, <math>\mathbf{u}_{2}</math>, and <math>\mathbf{u}_{3}</math> represent detection of eyes, nose, and mouth respectively, then after multiplication with trained weight matrices <math>\mathbf{W}_{ij}</math> (where j denotes existence of a face), we should get a rough idea of the location of the higher level feature (face), similar to the image below.

[[File:Predictions.jpeg ‎|center]]

==Dynamic Routing==
A capsule <math>i</math> in a lower-level layer needs to decide how to send its output vector to higher-level capsules <math>j</math>. Its output is distributed among the higher-level capsules in proportion to the coupling coefficients <math>c_{ij}</math>. If there are <math>M</math> capsules in the level that capsule <math>i</math> routes to, then we know the following properties about <math>c_{ij}</math>: <math>\sum_{j=1}^M c_{ij} = 1, c_{ij} \geq 0</math>

In essence, the <math>\{c_{ij}\}_{j=1}^M</math> denotes a discrete probability distribution with respect to capsule <math>i</math>'s output location. Lower level capsules decide which higher level capsules to send vectors into by adjusting the corresponding routing weights <math>\{c_{ij}\}_{j=1}^M</math>. After a few iterations in training, numerous vectors will have already been sent to all higher level capsules. Based on the similarity between the current vector being routed and all vectors already sent into the higher level capsules, we decide which capsule to send the current vector into.
[[File:Dynamic Routing.png|center|900px]]

From the image above, we notice that a cluster of points similar to the current vector has already been routed into capsule K, while most points in capsule J are highly dissimilar. It thus makes more sense to route the current observations into capsule K; we adjust the corresponding weights upward during training.

These weights are determined through the dynamic routing procedure:
[[File:Routing Algo.png‎|900px]]
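
The routing procedure can be sketched in plain Python as follows. This is a minimal illustration, assuming the prediction vectors <math>\hat{\mathbf{u}}_{j|i}</math> have already been computed and are passed in as a nested list <code>u_hat[i][j]</code>; the function names and the <code>eps</code> constant are hypothetical, and the sketch handles a single input example rather than a batch.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def squash(s, eps=1e-12):
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dynamic_routing(u_hat, num_iterations=3):
    """Route prediction vectors u_hat[i][j] from lower capsules i to upper capsules j.

    Returns (v, c): the upper-capsule outputs and the coupling coefficients
    used in the last iteration.
    """
    n_lower, n_upper = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]   # routing logits start at zero
    for _ in range(num_iterations):
        c = [softmax(b_i) for b_i in b]             # coupling coefficients
        s = [[sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
              for d in range(dim)]
             for j in range(n_upper)]               # weighted sums s_j
        v = [squash(s_j) for s_j in s]              # capsule outputs v_j
        for i in range(n_lower):
            for j in range(n_upper):
                # Agreement between prediction and output raises the logit
                b[i][j] += dot(u_hat[i][j], v[j])
    return v, c
```

In this sketch, lower capsules whose predictions agree with an upper capsule's output see their coupling to that capsule grow from one iteration to the next, which is the "routing by agreement" behaviour described above.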

Although dynamic routing is not the only manner in which we can encode relationships between capsules, the premise of the paper is to demonstrate the capabilities of capsules under a simple implementation. Since the paper was released in 2017, numerous alternative routing implementations have been released including an EM matrix routing algorithm by the same authors (ICLR 2018).

=Architecture=
The capsule network architecture given by the authors has 11.36 million trainable parameters. The paper itself is not very detailed on exact implementation of each architectural layer, and hence it leaves some degree of ambiguity on coding various aspects of the original network. The capsule network has 6 overall layers, with the first three layers denoting components of the encoder, and the last 3 denoting components of the decoder.

==Loss Function==
[[File:Loss Function.png‎|900px]]

The cost function looks complicated, but it can be broken down into intuitive components. Recall that the length of a capsule's output vector denotes the probability that the corresponding entity exists. The loss is computed separately for each digit capsule and then summed. The left term applies when the digit is actually present in the image and is zero otherwise: it penalizes the capsule when its vector norm falls below a fixed quantity <math>m^+ := 0.9</math>, using the squared difference between <math>m^+</math> and the norm. The right term applies when the digit is absent and is zero otherwise: it penalizes the network's confidence in an incorrect label, using the squared amount by which the vector norm exceeds <math>m^- := 0.1</math>. In the paper, the right term is down-weighted by a factor <math>\lambda = 0.5</math> so that the initial learning does not shrink the lengths of all the digit capsules.
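
A minimal sketch of this margin loss for a single image with one target digit is given below. The function name and argument layout are assumptions of this sketch; <code>m_plus</code> and <code>m_minus</code> follow the values quoted above, and <code>lam = 0.5</code> is the down-weighting factor the paper reports for absent classes.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over all digit capsules for one image.

    lengths : list of capsule output norms ||v_k||, one per class
    target  : index of the class that is actually present
    """
    total = 0.0
    for k, norm in enumerate(lengths):
        if k == target:
            # Present class: penalize norms below m_plus
            total += max(0.0, m_plus - norm) ** 2
        else:
            # Absent class: penalize norms above m_minus, down-weighted by lam
            total += lam * max(0.0, norm - m_minus) ** 2
    return total

# A confident, correct prediction incurs no loss
print(margin_loss([0.95, 0.05], target=0))
```

Note that the loss is exactly zero once the present class exceeds <code>m_plus</code> and every absent class is below <code>m_minus</code>, so the network is not pushed to drive lengths all the way to 1 or 0.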

A graphical representation of loss function values under varying vector norms is given below.
[[File:Loss function chart.png|900px]]

==Encoder Layers==
The main experiments in this paper were conducted on the MNIST dataset, and thus the architecture is built to classify MNIST digits. On more complex datasets, the results were less promising.

[[File:Architecture.png|center|900px]]

The encoder takes in a 28x28 MNIST image and learns a 16-dimensional representation of instantiation parameters.

'''Layer 1: Convolution''':
This layer is a standard convolution layer. Using kernels of size 9x9x1, a stride of 1, and a ReLU activation function, it detects basic 2D features within the image.

'''Layer 2: PrimaryCaps''':
We represent the low-level features detected during convolution as 32 channels of primary capsules. Each capsule applies eight convolutional kernels with stride 2 to the output of the convolution layer and feeds the corresponding transformed tensors into the DigitCaps layer.

'''Layer 3: DigitCaps''':
This layer contains 10 digit capsules, one for each digit. As explained in the dynamic routing procedure, each input vector from the PrimaryCaps layer has its own corresponding weight matrix <math>W_{ij}</math>. Using the routing coefficients <math>c_{ij}</math> and temporary coefficients <math>b_{ij}</math>, we train the DigitCaps layer to output ten 16-dimensional vectors. The length of the <math>i^{th}</math> vector in this layer corresponds to the probability of detection of digit <math>i</math>.
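
To make the tensor sizes in the encoder concrete, the arithmetic can be sketched as below. The sketch assumes unpadded ("valid") convolutions and a 9x9 kernel in the PrimaryCaps layer as well; these details are taken from the original paper rather than from the summary above, and the helper name <code>conv_out</code> is hypothetical.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded convolution."""
    return (size - kernel) // stride + 1

conv1_size = conv_out(28, 9, 1)            # Layer 1: 28x28 input, 9x9 kernel, stride 1
primary_size = conv_out(conv1_size, 9, 2)  # Layer 2: 9x9 kernel (assumed), stride 2

primary_channels = 32   # channels of primary capsules
primary_dim = 8         # each primary capsule is an 8D vector
digit_caps = 10         # one capsule per digit
digit_dim = 16          # each digit capsule is a 16D vector

num_primary_caps = primary_size * primary_size * primary_channels
# One 8x16 matrix W_ij per (primary capsule, digit capsule) pair
routing_weights = num_primary_caps * digit_caps * primary_dim * digit_dim

print(conv1_size, primary_size, num_primary_caps, routing_weights)
```

Under these assumptions the DigitCaps layer receives 1152 eight-dimensional input vectors, each with its own weight matrix per digit capsule.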

==Decoder Layers==
The decoder aims to train the capsules to extract meaningful features for image detection/classification. During training, it takes the 16-dimensional instantiation vector of the correct (not the predicted) digit capsule in the DigitCaps layer and attempts to recreate the 28x28 MNIST image as closely as possible. Using reconstruction error (the Euclidean distance between the reconstructed image and the original image) as the loss, we tune the capsules to encode features that are meaningful within the actual image.

[[File:Decoder.png|center|900px]]

The decoder consists of three fully connected layers, and transforms a 16x1 vector from the encoder into a 28x28 image.

In addition to the margin loss on the DigitCaps outputs, we add reconstruction error as a form of regularization. We minimize the sum of squared differences between the outputs of the decoder's logistic units (the reconstructed image) and the pixel intensities of the original image. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated below, reconstructions from the 16D output of the CapsNet are robust while keeping only important details.
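
The way the two loss terms are combined can be sketched as follows. The function names are hypothetical, images are represented as flat lists of pixel intensities, and the margin loss value is assumed to have been computed separately.

```python
def reconstruction_loss(original, reconstructed):
    """Sum of squared differences between two flattened images."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))

def total_loss(margin, original, reconstructed, scale=0.0005):
    """Margin loss plus the scaled-down reconstruction regularizer."""
    return margin + scale * reconstruction_loss(original, reconstructed)

# Tiny 4-pixel "image" for illustration only
print(total_loss(0.24, [0.0, 1.0, 1.0, 0.0], [0.0, 0.5, 1.0, 0.5]))
```

Because a 28x28 image has 784 pixels, the unscaled reconstruction term could easily be far larger than the margin loss, which is why the small scale factor matters.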

[[File:Reconstruction.png|center|900px]]

=MNIST Experimental Results=

==Accuracy==
The paper tests on the MNIST dataset, with 60K training examples and 10K test examples. Wan et al. [2013] achieve 0.21% test error by ensembling and augmenting the data with rotation and scaling, and 0.39% without them. As shown in Table 1, the authors achieve 0.25% test error with only a 3-layer network; the previous state of the art only beat this number with very deep networks. This result shows the importance of routing and the reconstruction regularizer, both of which boost performance. Moreover, although the accuracy is very high, the number of parameters is much smaller than in the baseline model.

[[File:Accuracies.png|center|900px]]

==What Capsules Represent for MNIST==
The following figure shows the digit representation under capsules. Each row shows the reconstruction when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 in the range [−0.25, 0.25]. By tweaking the values, we notice how the reconstruction changes, and thus get a sense of what each dimension represents. The authors found that some dimensions represent global properties of the digits, while others represent localized properties.
[[File:CapsuleReps.png|center|900px]]

One example the authors provide: different dimensions are used for the length of the ascender of a 6 and the size of its loop. The variations include stroke thickness, skew and width, as well as digit-specific variations. The authors visualize what each dimension represents by feeding a perturbed vector into the decoder network.
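
The perturbation scheme can be sketched as below: one dimension of a 16-dimensional capsule vector is shifted in steps of 0.05 across [−0.25, 0.25], giving 11 variants. The function name is hypothetical, and feeding each variant through a trained decoder is left out because it depends on the trained network.

```python
def perturbations(vector, dim, low=-0.25, high=0.25, step=0.05):
    """Return copies of `vector` with entry `dim` shifted by low, low+step, ..., high."""
    n_steps = round((high - low) / step) + 1
    variants = []
    for k in range(n_steps):
        delta = low + k * step
        perturbed = list(vector)       # copy so the original is untouched
        perturbed[dim] += delta
        variants.append(perturbed)
    return variants

capsule_output = [0.1] * 16            # placeholder 16D DigitCaps vector
row = perturbations(capsule_output, dim=3)
print(len(row))
```

Each variant would then be decoded into an image, and one such row of images per dimension gives the figure above.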

==Robustness of CapsNet==
The authors conclude that DigitCaps capsules learn more robust representations for each digit class than traditional CNNs. The trained CapsNet becomes moderately robust to small affine transformations in the test data.

To compare the robustness of CapsNet to affine transformations against traditional CNNs, both models (CapsNet and a traditional CNN with MaxPooling and DropOut) were trained on a padded and translated MNIST training set, in which each example is an MNIST digit placed randomly on a black background of 40 × 40 pixels. The networks were then tested on the [http://www.cs.toronto.edu/~tijmen/affNIST/ affNIST] dataset (MNIST digits with random affine transformation). An under-trained CapsNet which achieved 99.23% accuracy on the MNIST test set achieved a corresponding 79% accuracy on the affNIST test set. A traditional CNN achieved similar accuracy (99.22%) on the MNIST test set, but only 66% on the affNIST test set.

=MultiMNIST & Other Experiments=

==MultiMNIST==
To evaluate the performance of the model on highly overlapping digits, the authors generate a 'MultiMNIST' dataset. In MultiMNIST, each image consists of two overlaid MNIST digits from the same set (train or test) but of different classes. The results indicate a classification error rate of 5%. Additionally, CapsNet can be used to segment the image into the two digits that compose it. Moreover, the model is able to deal with the overlaps and reconstruct digits correctly, since each digit capsule can learn the style from the votes of the PrimaryCaps layer (Figure 5).

There are some additional steps to generating the MultiMNIST dataset.

1. Both images are shifted by up to 4 pixels in each direction resulting in a 36 × 36 image. Bounding boxes of digits in MNIST overlap by approximately 80%, so this is used to make both digits identifiable (since there is no RGB difference learnable by the network to separate the digits)

2. The label becomes a vector of two numbers, representing the original digit and the randomly generated (and overlaid) digit.
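
The generation steps above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical, images are nested lists of intensities in [0, 1], and the clipping at 1 follows the note in the Figure 5 caption.

```python
import random

def place(digit, canvas_size, dx, dy):
    """Copy a square image onto a zero canvas at offset (dx, dy)."""
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    for r, row in enumerate(digit):
        for c, val in enumerate(row):
            canvas[r + dy][c + dx] = val
    return canvas

def make_multimnist(digit_a, label_a, digit_b, label_b, max_shift=4, rng=random):
    """Overlay two digits of different classes on a larger canvas.

    Each digit is shifted by up to max_shift pixels in each direction,
    so 28x28 inputs produce a 36x36 output.
    """
    if label_a == label_b:
        raise ValueError("the two digits must come from different classes")
    size = len(digit_a) + 2 * max_shift
    canvases = []
    for digit in (digit_a, digit_b):
        dx = rng.randint(0, 2 * max_shift)   # offset 0..8 = shift of -4..+4
        dy = rng.randint(0, 2 * max_shift)
        canvases.append(place(digit, size, dx, dy))
    # Add the two canvases and clip pixel values at 1
    merged = [[min(1.0, a + b) for a, b in zip(row_a, row_b)]
              for row_a, row_b in zip(*canvases)]
    return merged, (label_a, label_b)
```

Passing a seeded <code>random.Random</code> instance as <code>rng</code> makes the generated shifts reproducible.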

[[File:CapsuleNets MultiMNIST.PNG|600px|thumb|center|Figure 5: Sample reconstructions of a CapsNet with 3 routing iterations on MultiMNIST test dataset.
The two reconstructed digits are overlayed in green and red as the lower image. The upper image
shows the input image. L:(l1; l2) represents the label for the two digits in the image and R:(r1; r2)
represents the two digits used for reconstruction. The two right most columns show two examples
with wrong classification reconstructed from the label and from the prediction (P). In the (2; 8)
example the model confuses 8 with a 7 and in (4; 9) it confuses 9 with 0. The other columns have
correct classifications and show that the model accounts for all the pixels while being able to assign
one pixel to two digits in extremely difficult scenarios (column 1 − 4). Note that in dataset generation
the pixel values are clipped at 1. The two columns with the (*) mark show reconstructions from a
digit that is neither the label nor the prediction. These columns suggest that the model is not just
finding the best fit for all the digits in the image including the ones that do not exist. Therefore in case
of (5; 0) it cannot reconstruct a 7 because it knows that there is a 5 and 0 that fit best and account for
all the pixels. Also, in the case of (8; 1) the loop of 8 has not triggered 0 because it is already accounted
for by 8. Therefore it will not assign one pixel to two digits if one of them does not have any other
support.]]

==Other datasets==
The authors also tested the proposed capsule model on the CIFAR10 dataset and achieved an error rate of 10.6%. The model tested was an ensemble of 7 models. Each of the models in the ensemble had the same architecture as the model used for MNIST (apart from three color channels and 64 different types of primary capsules being used). These 7 models were trained on 24x24 patches of the training images with 3 routing iterations. During experimentation, the authors also found that adding an additional none-of-the-above category helped improve the overall performance. The error rate achieved is comparable to the error rate achieved by a standard CNN model. According to the authors, one of the reasons for the relatively low performance is that the backgrounds in CIFAR-10 images are too varied to be adequately modeled by a reasonably sized capsule net.

The proposed model was also evaluated using a small subset of the SVHN dataset. The network trained was much smaller and was trained using only 73257 training images. The network still managed to achieve an error rate of 4.3% on the test set.

=Critique=
Although the network performs very favorably in the authors' experiments, it has a long way to go on more complex datasets. On CIFAR-10, the network achieved subpar results, and the experimental results seem to worsen as the problem becomes more complex. This is to be expected, since these networks are still at an early stage; later innovations may come in the upcoming years or decades. It would also be wise to apply the model to other, larger datasets to make the case for its capabilities more convincing. MNIST has simple patterns, and even if the model were to be demonstrated on only one dataset, MNIST is a poor choice, especially since the focus is on human-like vision and digits are not representative of the images encountered in real life.

Hinton talks about CapsuleNets revolutionizing areas such as self-driving, but such groundbreaking innovations are far away from CIFAR10, and even further from MNIST. Only time can tell if CapsNets will live up to their hype.

Capsules inherently segment images and learn a lower dimensional embedding in a new manner, which makes them likely to perform well on segmentation and computer vision tasks once further research is done.

Additionally, these networks are more interpretable than CNNs, and have strong theoretical reasoning for why they could work. Naturally, it would be hard for a new architecture to beat the heavily researched/modified CNNs.

* ([https://openreview.net/forum?id=HJWLfGWRb]) It is not fully clear how effective the approach is or how well it scales. Evaluation is performed on a small dataset for shape recognition. The approach will need to be tested on larger, more challenging datasets.

=Future Work=
The same authors [Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst] presented another paper, "Matrix Capsules with EM Routing", at ICLR 2018, which achieved better results than the work presented in this paper. They presented a new multi-layered capsule network architecture, implemented an EM routing procedure, and introduced "Coordinate Addition". This new type of capsule network reduced the number of test errors by 45% on the smallNORB benchmark, and performed better than a standard CNN on white box adversarial attacks. Capsule architectures are gaining interest because of their ability to achieve equivariance of parts, and because they employ a new form of pooling called "routing" (as opposed to max pooling), which groups parts that make similar predictions of the whole to which they belong, rather than relying on spatial co-locality.
Moreover, the authors hint at changing the curvature and sensitivity to various factors by introducing a new form of loss function. This may improve the model's performance on more complicated datasets, which is currently one of the model's drawbacks.

Moreover, as mentioned in the critique, a good direction for future work would be to make the model more robust across datasets and to achieve acceptable performance on datasets containing images more commonly seen in real life.

=References=
#Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
#S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017.
#Hinton, G. E., Krizhevsky, A. and Wang, S. D. (2011), Transforming Auto-encoders.
#Geoffrey Hinton's talk: What is wrong with convolutional neural nets? - Talk given at MIT. Brain & Cognitive Sciences - Fall Colloquium Series. [https://www.youtube.com/watch?v=rTawFwUvnLE ]
#Understanding Hinton’s Capsule Networks - Max Pechyonkin's series [https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b]
#Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
#Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
#Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583, 2015.
#Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
#Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.

CapsuleNets

2018-11-29T17:36:55Z

S366chen: /* Dynamic Routing */

The paper "Dynamic Routing Between Capsules" was written by three researchers at Google Brain: Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. This paper was published and presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach, California. The same three researchers recently published a highly related paper "[https://openreview.net/pdf?id=HJWLfGWRb Matrix Capsules with EM Routing]" for ICLR 2018.

=Motivation=

Ever since AlexNet eclipsed the performance of competing architectures in the 2012 ImageNet challenge, convolutional neural networks have maintained their dominance in computer vision applications. Despite the recent successes and innovations brought about by convolutional neural networks, some assumptions made in these networks are perhaps unwarranted and deficient. Using a novel neural network architecture, the authors create CapsuleNets, a network that they claim is able to learn image representations in a more robust, human-like manner. With only a 3 layer capsule network, they achieved near state-of-the-art results on MNIST.
==Adversarial Examples==

First discussed by Christian Szegedy et al. in late 2013, adversarial examples have been heavily discussed by the deep learning community as a potential security threat to AI learning. Adversarial examples are defined as inputs that an attacker intentionally creates to fool a machine learning model. An adversarial example is shown below:

[[File:adversarial_img_1.png ‎|center]]
To the human eye, the image appears to be a panda both before and after noise is injected into the image, whereas the trained ConvNet model classifies the noisy image as a gibbon with almost 100% certainty. The fact that the network is unable to classify the above image as a panda after the epsilon perturbation leads to many potential security risks in AI-dependent systems such as self-driving vehicles. Although various methods have been suggested to combat adversarial examples, robust defences are hard to construct due to the inherent difficulties in constructing theoretical models for the adversarial example crafting process. However, beyond the fact that these examples may serve as a security threat, they emphasize that these convolutional neural networks do not learn image classification/object detection patterns the same way that a human would. Rather than identifying the core features of a panda, such as its eyes, mouth, nose, and the gradient changes in its black/white fur, the convolutional neural network seems to be learning image representations in a completely different manner. Deep learning researchers often attempt to model neural networks after human learning, and it is clear that further steps must be taken to robustify ConvNets against targeted noise perturbations.

==Drawbacks of CNNs==
Hinton claims that the key fault with traditional CNNs lies within the pooling function. Although pooling builds translational invariance into the network, it fails to preserve spatial relationships between objects. When we pool, we effectively reduce a kxk kernel of convolved cells into a scalar input. This results in a desired local invariance without inhibiting the network's ability to detect features, but causes valuable spatial information to be lost.
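
The loss of spatial information under pooling can be illustrated with a small sketch. The <code>max_pool</code> helper below is a hypothetical, non-overlapping pooling function written for this illustration; it assumes the image dimensions are divisible by the window size.

```python
def max_pool(image, k):
    """Non-overlapping k x k max pooling over a 2D list of numbers."""
    height, width = len(image), len(image[0])
    return [[max(image[r + dr][c + dc] for dr in range(k) for dc in range(k))
             for c in range(0, width, k)]
            for r in range(0, height, k)]

# Two 4x4 inputs with the same features in different positions
image_a = [[1, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 1]]
image_b = [[0, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 0]]

print(max_pool(image_a, 2) == max_pool(image_b, 2))
```

Both inputs pool to the same 2x2 output even though the active cells sit in different places inside each window, so a later layer can no longer tell the two arrangements apart.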

In the example below, the network is able to detect similar features (eyes, mouth, nose, etc.) within both images, but fails to recognize that one image is a human face while the other is a Picasso-esque rearrangement of the same features, due to the CNN's inability to encode spatial relationships after multiple pooling layers.

[[File:Equivariance Face.png ‎|center]]

Conversely, we hope that a CNN can recognize that both of the following pictures contain a kitten. Unfortunately, when we feed the two images into a ResNet50 architecture, only the first image is correctly classified, while the second image is predicted to be a guinea pig.

[[File:kitten.jpeg ‎|center]]

[[File:kitten-rotated-180.jpg ‎|center]]

For a more in depth discussion on the problems with ConvNets, please listen to Geoffrey Hinton's talk "What is wrong with convolutional neural nets?" given at MIT during the Brain & Cognitive Sciences - Fall Colloquium Series (December 4, 2014).

==Intuition for Capsules==
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Hinton argues that our brains reason about visual information by deconstructing it into a hierarchical representation which we then match to familiar patterns and relationships from memory. The key difference between this understanding and the functionality of CNNs is that recognition of an object should not depend on the angle from which it is viewed.

To enforce rotational and translational equivariance, Capsule Networks store and preserve hierarchical pose relationships between objects. The core idea behind capsule theory is the explicit numerical representation of relative relationships between different objects within an image. By building these relationships into the model, the network is able to recognize a newly seen object as a rotated view of a previously seen object. For example, the image below shows the Statue of Liberty from five different angles. Even if a person had only seen the Statue of Liberty from one angle, they would still be able to ascertain that all five pictures below contain the same object (just from a different angle).

[[File:Rotational Invariance.jpeg ‎|center]]

Building on this idea of hierarchical representation of spatial relationships between key entities within an image, the authors introduce Capsule Networks. Unlike traditional CNNs, Capsule Networks are better equipped to classify objects correctly under rotation. Furthermore, the authors managed to achieve state of the art results on MNIST using a fraction of the training samples that alternative state of the art networks require.

=Background, Notation, and Definitions=

==What is a Capsule==
"Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold."

In essence, capsules store object properties in a vector form; probability of detection is encoded as the vector's length, while spatial properties are encoded as the individual vector components. Thus, when a feature is present but the image captures it under a different angle, the probability of detection remains unchanged.

A brief overview/understanding of capsules can be found in other papers from the author. To quote from [https://openreview.net/pdf?id=HJWLfGWRb this paper]:

<blockquote>
A capsule network consists of several layers of capsules. The set of capsules in layer L is denoted
as <math>\Omega_L</math>. Each capsule has a 4x4 pose matrix, <math>M</math>, and an activation probability, <math>a</math>. These are like the
activities in a standard neural net: they depend on the current input and are not stored. In between
each capsule i in layer L and each capsule j in layer L + 1 is a 4x4 trainable transformation matrix,
<math>W_{ij}</math> . These <math>W_{ij}</math>'s (and two learned biases per capsule) are the only stored parameters and they
are learned discriminatively. The pose matrix of capsule i is transformed by <math>W_{ij}</math> to cast a vote
<math>V_{ij} = M_iW_{ij}</math> for the pose matrix of capsule j. The poses and activations of all the capsules in layer
L + 1 are calculated by using a non-linear routing procedure which gets as input <math>V_{ij}</math> and <math>a_i</math> for all
<math>i \in \Omega_L, j \in \Omega_{L+1}</math>
</blockquote>

==Notation==

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. The paper applies a non-linear squashing function so that the vector length falls between 0 and 1: short vectors (entities that are unlikely to be present) are shrunk towards 0, while long vectors are shrunk to a length slightly below 1.

\begin{align} \mathbf{v}_j &= \frac{||\mathbf{s}_j||^2}{1+ ||\mathbf{s}_j||^2} \frac{\mathbf{s}_j}{||\mathbf{s}_j||} \end{align}

where <math>\mathbf{v}_j</math> is the vector output of capsule <math>j</math> and <math>\mathbf{s}_j</math> is its total input.

For all but the first layer of capsules, the total input to a capsule <math>\mathbf{s}_j</math> is a weighted sum over all “prediction vectors” <math>\hat{\mathbf{u}}_{j|i}</math> from the capsules in the layer below. Each prediction vector is produced by multiplying the output <math>\mathbf{u}_i</math> of a capsule in the layer below by a weight matrix <math>\mathbf{W}_{ij}</math>:

\begin{align}
\mathbf{s}_j = \sum_i c_{ij}\hat{\mathbf{u}}_{j|i}, ~\hspace{0.5em} \hat{\mathbf{u}}_{j|i}= \mathbf{W}_{ij}\mathbf{u}_i
\end{align}
where the <math>c_{ij}</math> are coupling coefficients that are determined by the iterative dynamic routing process.

The coupling coefficients between capsule <math>i</math> and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits <math>b_{ij}</math> are the log prior probabilities that capsule <math>i</math> should be coupled to capsule <math>j</math>.

\begin{align}
c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}
\end{align}

=Network Training and Dynamic Routing=

==Understanding Capsules==
The notation can get somewhat confusing, so I will provide intuition behind the computational steps within a capsule. The following image is taken from naturomic's talk on Capsule Networks.

[[File:CapsuleNets.jpeg|center|800px]]

The above image illustrates the key mathematical operations happening within a capsule (and compares them to the structure of a neuron). Although the operations are rather straightforward, it's crucial to note that the capsule applies an affine transformation to each input vector. The length of each input vector <math>\mathbf{u}_{i}</math> represents the probability of entity <math>i</math> existing in a lower level. This vector is then reoriented with an affine transform using <math>\mathbf{W}_{ij}</math> matrices that encode spatial relationships between entity <math>\mathbf{u}_{i}</math> and other lower level features.

We illustrate the intuition behind vector-to-vector matrix multiplication within capsules using the following example: if vectors <math>\mathbf{u}_{1}</math>, <math>\mathbf{u}_{2}</math>, and <math>\mathbf{u}_{3}</math> represent detection of eyes, nose, and mouth respectively, then after multiplication with trained weight matrices <math>\mathbf{W}_{ij}</math> (where j denotes existence of a face), we should get a rough idea of the location of the higher level feature (face), similar to the image below.

[[File:Predictions.jpeg ‎|center]]

==Dynamic Routing==
A capsule <math>i</math> in a lower-level layer needs to decide how to send its output vector to higher-level capsules <math>j</math>. Its output is distributed among the higher-level capsules in proportion to the coupling coefficients <math>c_{ij}</math>. If there are <math>M</math> capsules in the level that capsule <math>i</math> routes to, then we know the following properties about <math>c_{ij}</math>: <math>\sum_{j=1}^M c_{ij} = 1, c_{ij} \geq 0</math>

In essence, the <math>\{c_{ij}\}_{j=1}^M</math> denotes a discrete probability distribution with respect to capsule <math>i</math>'s output location. Lower level capsules decide which higher level capsules to send vectors into by adjusting the corresponding routing weights <math>\{c_{ij}\}_{j=1}^M</math>. After a few iterations in training, numerous vectors will have already been sent to all higher level capsules. Based on the similarity between the current vector being routed and all vectors already sent into the higher level capsules, we decide which capsule to send the current vector into.
[[File:Dynamic Routing.png|center|900px]]

From the image above, we notice that a cluster of points similar to the current vector has already been routed into capsule K, while most points in capsule J are highly dissimilar. It thus makes more sense to route the current observations into capsule K; we adjust the corresponding weights upward during training.

These weights are determined through the dynamic routing procedure:
[[File:Routing Algo.png‎|900px]]
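
The routing procedure above is shown only as an image, so the following is a minimal NumPy sketch of routing-by-agreement as described by Sabour et al. (2017). The function names (<code>squash</code>, <code>dynamic_routing</code>), the tensor layout, and the small <code>eps</code> term are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink short vectors toward zero and long vectors toward unit length."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis=-1):
    """Numerically stable softmax, matching the definition of c_ij."""
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: prediction vectors W_ij u_i, shape (num_lower, num_upper, dim)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))            # temporary coefficients b_ij
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                      # routing weights; each row sums to 1
        s = np.einsum("ij,ijd->jd", c, u_hat)       # weighted sum for each upper capsule
        v = squash(s)                               # output vectors v_j
        b = b + np.einsum("ijd,jd->ij", u_hat, v)   # increase b_ij where u_hat agrees with v_j
    return v, c
```

Calling <code>dynamic_routing</code> on a random array of shape (6, 3, 4) returns three 4-dimensional output vectors, each with length below 1, together with a 6x3 matrix of routing weights.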

Although dynamic routing is not the only manner in which we can encode relationships between capsules, the premise of the paper is to demonstrate the capabilities of capsules under a simple implementation. Since the paper was released in 2017, numerous alternative routing implementations have been released including an EM matrix routing algorithm by the same authors (ICLR 2018).

=Architecture=
The capsule network architecture given by the authors has 11.36 million trainable parameters. The paper itself is not very detailed on exact implementation of each architectural layer, and hence it leaves some degree of ambiguity on coding various aspects of the original network. The capsule network has 6 overall layers, with the first three layers denoting components of the encoder, and the last 3 denoting components of the decoder.

==Loss Function==
[[File:Loss Function.png‎|900px]]

The cost function looks complicated, but it can be broken down into intuitive components. Before diving into the equation, remember that the length of a capsule's output vector denotes the probability that the object exists. The first term of the equation is active only for the capsule of the class that is actually present in the image; it becomes zero for all other classes. For the present class, we subtract the vector norm from a fixed quantity <math>m^+ := 0.9</math>, so the capsule is penalized when its length falls below 0.9. The second term is active for the classes that are absent; it penalizes the network's confidence in an incorrect label by subtracting <math>m^- := 0.1</math> from the vector norm, so a capsule is penalized when its length exceeds 0.1.
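
As a rough illustration of the two terms, here is a small NumPy sketch of the margin loss. The down-weighting factor <code>lam = 0.5</code> for absent classes is taken from the original paper rather than from the text above, and the function name and argument layout are hypothetical.

```python
import numpy as np

def margin_loss(v_norms, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_norms: (batch, classes) capsule lengths; targets: one-hot (batch, classes)."""
    # Term for the class that is present: penalize lengths below m_plus.
    present = targets * np.maximum(0.0, m_plus - v_norms) ** 2
    # Term for absent classes: penalize lengths above m_minus, down-weighted by lam.
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norms - m_minus) ** 2
    return np.sum(present + absent, axis=1)
```

For a two-class example with lengths (0.5, 0.5) and the first class present, the loss is 0.4² + 0.5 · 0.4² = 0.24; with lengths (0.95, 0.05) it is zero.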

A graphical representation of loss function values under varying vector norms is given below.
[[File:Loss function chart.png|900px]]

==Encoder Layers==
The main experiments in this paper were conducted on the MNIST dataset, and the architecture is built to classify this dataset. On more complex datasets, the results were less promising.

[[File:Architecture.png|center|900px]]

The encoder layer takes in a 28x28 MNIST image and learns a 16 dimensional representation of instantiation parameters.

'''Layer 1: Convolution''':
This layer is a standard convolution layer. Using kernels of size 9x9x1, a stride of 1, and a ReLU activation function, it detects basic 2D features in the image.

'''Layer 2: PrimaryCaps''':
We represent the low level features detected during convolution as 32 primary capsules. Each capsule applies eight convolutional kernels with stride 2 to the output of the convolution layer and feeds the corresponding transformed tensors into the DigiCaps layer.

'''Layer 3: DigiCaps''':
This layer contains 10 digit capsules, one for each digit. As explained in the dynamic routing procedure, each input vector from the PrimaryCaps layer has its own corresponding weight matrix <math>W_{ij}</math>. Using the routing coefficients <math>c_{ij}</math> and temporary coefficients <math>b_{ij}</math>, we train the DigiCaps layer to output ten 16-dimensional vectors. The length of the <math>i^{th}</math> vector in this layer corresponds to the probability of detection of digit <math>i</math>.
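
The tensor sizes implied by the three encoder layers can be checked with a little arithmetic. The sketch below assumes unpadded convolutions, a 9x9 kernel in PrimaryCaps, 8-dimensional primary capsules, and one 8x16 matrix <math>W_{ij}</math> per capsule pair; these details come from the original paper, not from the summary above, and the helper name is hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1

conv1 = conv_output_size(28, kernel=9, stride=1)        # 20x20 feature maps
primary = conv_output_size(conv1, kernel=9, stride=2)   # 6x6 grid per capsule type
num_primary_capsules = primary * primary * 32           # 32 capsule types on a 6x6 grid
num_routing_weights = num_primary_capsules * 10 * 8 * 16  # one 8x16 W_ij per (i, j) pair

print(conv1, primary, num_primary_capsules, num_routing_weights)
# prints: 20 6 1152 1474560
```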

==Decoder Layers==
The decoder aims to train the capsules to extract meaningful features for image detection/classification. During training, it takes the 16-dimensional instantiation vector of the correct (not predicted) digit capsule and attempts to recreate the 28x28 MNIST image as closely as possible. By setting the loss function to the reconstruction error (Euclidean distance between the reconstructed image and the original image), we tune the capsules to encode features that are meaningful within the actual image.

[[File:Decoder.png|center|900px]]

The layer consists of three fully connected layers, and transforms a 16x1 vector from the encoder layer into a 28x28 image.

In addition to the DigiCaps margin loss, we add the reconstruction error as a form of regularization. We minimize the sum of squared differences between the outputs of the decoder's logistic units (the reconstructed image) and the pixel intensities of the original image. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated below, reconstructions from the 16D output of the CapsNet are robust while keeping only important details.
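
A minimal sketch of how the two losses might be combined is given below. It assumes flattened 784-pixel images with intensities in [0, 1] and a precomputed margin loss per example; the function name is hypothetical.

```python
import numpy as np

def total_loss(margin, original, reconstruction, scale=0.0005):
    """margin: (batch,) margin loss; original/reconstruction: (batch, 784) images."""
    # Sum of squared differences between reconstructed and original pixels.
    recon = np.sum((original - reconstruction) ** 2, axis=1)
    # The reconstruction term is scaled down so the margin loss dominates.
    return margin + scale * recon
```

Even in the worst case, where every one of the 784 pixels is off by 1, the scaled reconstruction term contributes only 0.392, which is why it acts as a regularizer rather than the main training signal.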

[[File:Reconstruction.png|center|900px]]

=MNIST Experimental Results=

==Accuracy==
The paper evaluates on the MNIST dataset, with 60K training examples and 10K test examples. Wan et al. [2013] achieve 0.21% test error with ensembling and data augmentation by rotation and scaling, and 0.39% without them. As shown in Table 1, the authors manage to achieve 0.25% test error with only a 3 layer network; the previous state of the art only beat this number with very deep networks. This result shows the importance of routing and the reconstruction regularizer, which boost performance. Moreover, while the accuracy is very high, the number of parameters is much smaller than in the baseline model.

[[File:Accuracies.png|center|900px]]

==What Capsules Represent for MNIST==
The following figure shows the digit representation under capsules. Each row shows the reconstruction when one of the 16 dimensions in the DigitCaps representation is tweaked by intervals of 0.05 in the range [−0.25, 0.25]. By tweaking the values, we notice how the reconstruction changes, and thus get a sense of what each dimension represents. The authors found that some dimensions represent global properties of the digits, while others represent localized properties.
[[File:CapsuleReps.png|center|900px]]

One example the authors provide is: different dimensions are used for the length of the ascender of a 6 and the size of the loop. The variations include stroke thickness, skew and width, as well as digit-specific variations. The authors are able to show dimension representations using a decoder network by feeding a perturbed vector.

==Robustness of CapsNet==
The authors conclude that DigitCaps capsules learn more robust representations for each digit class than traditional CNNs. The trained CapsNet becomes moderately robust to small affine transformations in the test data.

To compare the robustness of CapsNet to affine transformations against traditional CNNs, both models (CapsNet and a traditional CNN with MaxPooling and DropOut) were trained on a padded and translated MNIST training set, in which each example is an MNIST digit placed randomly on a black background of 40 × 40 pixels. The networks were then tested on the [http://www.cs.toronto.edu/~tijmen/affNIST/ affNIST] dataset (MNIST digits with random affine transformation). An under-trained CapsNet which achieved 99.23% accuracy on the MNIST test set achieved a corresponding 79% accuracy on the affNIST test set. A traditional CNN achieved similar accuracy (99.22%) on the MNIST test set, but only 66% on the affNIST test set.

=MultiMNIST & Other Experiments=

==MultiMNIST==
To evaluate the performance of the model on highly overlapping digits, the authors generate a 'MultiMNIST' dataset. In MultiMNIST, each image consists of two overlaid MNIST digits from the same set (train or test) but of different classes. The results indicate a classification error rate of 5%. Additionally, CapsNet can be used to segment the image into the two digits that compose it. Moreover, the model is able to deal with the overlaps and reconstruct the digits correctly, since each digit capsule can learn the style from the votes of the PrimaryCapsules layer (Figure 5).

There are some additional steps to generating the MultiMNIST dataset.

1. Both images are shifted by up to 4 pixels in each direction resulting in a 36 × 36 image. Bounding boxes of digits in MNIST overlap by approximately 80%, so this is used to make both digits identifiable (since there is no RGB difference learnable by the network to separate the digits)

2. The label becomes a vector of two numbers, representing the original digit and the randomly generated (and overlaid) digit.
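
The two steps above can be sketched roughly as follows. The sketch assumes 28x28 digits with intensities in [0, 1], independent integer shifts of up to 4 pixels per digit, and an overlay by clipped addition (the figure caption notes that pixel values are clipped at 1). The function names are hypothetical and this is not the authors' generation script.

```python
import numpy as np

def shift_into_canvas(digit, dx, dy, canvas=36, max_shift=4):
    """Place a 28x28 digit on a canvas x canvas background, offset by (dx, dy)."""
    out = np.zeros((canvas, canvas), dtype=digit.dtype)
    top, left = max_shift + dy, max_shift + dx
    out[top:top + 28, left:left + 28] = digit
    return out

def make_multimnist(digit_a, label_a, digit_b, label_b, rng, max_shift=4):
    """Overlay two digits of different classes; returns the image and both labels."""
    assert label_a != label_b, "the two digits must come from different classes"
    shifts = rng.integers(-max_shift, max_shift + 1, size=4)
    a = shift_into_canvas(digit_a, shifts[0], shifts[1])
    b = shift_into_canvas(digit_b, shifts[2], shifts[3])
    image = np.clip(a + b, 0.0, 1.0)          # overlapping pixels are clipped at 1
    label = np.array([label_a, label_b])      # original digit and overlaid digit
    return image, label
```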

[[File:CapsuleNets MultiMNIST.PNG|600px|thumb|center|Figure 5: Sample reconstructions of a CapsNet with 3 routing iterations on MultiMNIST test dataset.
The two reconstructed digits are overlayed in green and red as the lower image. The upper image
shows the input image. L:(l1; l2) represents the label for the two digits in the image and R:(r1; r2)
represents the two digits used for reconstruction. The two right most columns show two examples
with wrong classification reconstructed from the label and from the prediction (P). In the (2; 8)
example the model confuses 8 with a 7 and in (4; 9) it confuses 9 with 0. The other columns have
correct classifications and show that the model accounts for all the pixels while being able to assign
one pixel to two digits in extremely difficult scenarios (column 1 − 4). Note that in dataset generation
the pixel values are clipped at 1. The two columns with the (*) mark show reconstructions from a
digit that is neither the label nor the prediction. These columns suggest that the model is not just
finding the best fit for all the digits in the image including the ones that do not exist. Therefore in case
of (5; 0) it cannot reconstruct a 7 because it knows that there is a 5 and 0 that fit best and account for
all the pixels. Also, in the case of (8; 1) the loop of 8 has not triggered 0 because it is already accounted
for by 8. Therefore it will not assign one pixel to two digits if one of them does not have any other
support.]]

==Other datasets==
The authors also tested the proposed capsule model on the CIFAR10 dataset and achieved an error rate of 10.6%. The model tested was an ensemble of 7 models. Each of the models in the ensemble had the same architecture as the model used for MNIST (apart from 3 input channels and 64 different types of primary capsules being used). These 7 models were trained on 24x24 patches of the training images with 3 routing iterations. During experimentation, the authors also found that adding a none-of-the-above category helped improve the overall performance. The error rate achieved is comparable to that of a standard CNN model. According to the authors, one of the reasons for the low performance is that the backgrounds in CIFAR-10 images are too varied to be adequately modeled by a reasonably sized capsule net.

The proposed model was also evaluated on a small subset of the SVHN dataset. The network was much smaller and was trained using only 73257 training images. It still managed to achieve an error rate of 4.3% on the test set.

=Critique=
Although the network performs very favorably in the authors' experiments, it has a long way to go on more complex datasets. On CIFAR 10, the network achieved subpar results, and the experimental results seem to get worse as the problem becomes more complex. This is to be expected, since these networks are still at an early stage; further innovations may come in the upcoming years or decades. It would also be wise to apply the model to other, larger datasets to make the results more convincing. MNIST has simple patterns, and if the model were to be presented on only one dataset, MNIST would not be the best choice, especially since the focus is on human-like visual recognition and real-life images are far less regular than handwritten digits.

Hinton talks about CapsuleNets revolutionizing areas such as self-driving, but such groundbreaking innovations are far away from CIFAR10, and even further from MNIST. Only time can tell if CapsNets will live up to their hype.

Capsules inherently segment images and learn a lower dimensional embedding in a new manner, which makes them likely to perform well on segmentation and computer vision tasks once further research is done.

Additionally, these networks are more interpretable than CNNs, and have strong theoretical reasoning for why they could work. Naturally, it would be hard for a new architecture to beat the heavily researched/modified CNNs.

* ([https://openreview.net/forum?id=HJWLfGWRb]) It is not fully clear how efficiently the method can be run or how scalable it is. The evaluation is performed on a small dataset for shape recognition. The approach will need to be tested on larger, more challenging datasets.

=Future Work=
The same authors (Hinton, Sabour, and Frosst) presented another paper, "Matrix Capsules with EM Routing", at ICLR 2018, which achieved better results than the work presented in this paper. They presented a new multi-layered capsule network architecture, implemented an EM routing procedure, and introduced "Coordinate Addition". This new type of capsule network reduced the number of errors by 45% and performed better than a standard CNN on white box adversarial attacks. Capsule architectures are gaining interest because of their ability to achieve equivariance of parts, and because they employ a new form of pooling called "routing" (as opposed to max pooling), which groups parts that make similar predictions of the whole to which they belong, rather than relying on spatial co-locality.
Moreover, the authors hint at changing the curvature of the loss and its sensitivity to various factors by introducing a new form of loss function. This may improve the performance of the model on more complicated datasets, which is one of the model's drawbacks.

Moreover, as mentioned in the critique, a good direction for future work would be to make the model more robust across datasets and to achieve acceptable performance on datasets that contain images more commonly seen in real life.

=References=
#G. E. Hinton, S. Sabour, and N. Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
#S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017
# Hinton, G. E., Krizhevsky, A. and Wang, S. D. (2011), Transforming Auto-encoders
#Geoffrey Hinton's talk: What is wrong with convolutional neural nets? - Talk given at MIT. Brain & Cognitive Sciences - Fall Colloquium Series. [https://www.youtube.com/watch?v=rTawFwUvnLE ]
#Understanding Hinton’s Capsule Networks - Max Pechyonkin's series [https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b]
#Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
#Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
#Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583, 2015.
#Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
#Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.

ShakeDrop Regularization

2018-11-29T17:10:21Z

S366chen: /* Proposed Method */

=Introduction=
Current state of the art techniques for object classification are deep neural networks based on the residual block, first published by He et al. (2016). This technique has been the foundation of several improved networks, including Wide ResNet (Zagoruyko & Komodakis, 2016), PyramidNet (Han et al., 2017) and ResNeXt (Xie et al., 2017). They have been further improved by regularization methods, such as Stochastic Depth (ResDrop) (Huang et al., 2016) and Shake-Shake (Gastaldi, 2017), which can mitigate problems such as vanishing gradients. Shake-Shake applied to ResNeXt has achieved one of the lowest error rates on the CIFAR-10 and CIFAR-100 datasets. However, it is only applicable to multi-branch architectures and is not memory efficient, since it requires two branches of residual blocks. To address this problem, the authors propose ShakeDrop regularization, which can realize a disturbance similar to Shake-Shake on a single residual block. Moreover, they use ResDrop to stabilize the learning process. This paper seeks to formulate a general expansion of Shake-Shake that can be applied to any residual block based network.

=Existing Methods=

'''Deep Approaches'''

'''ResNet''' was the first use of residual blocks, a foundational feature in many modern state of the art convolutional neural networks. A residual block can be formulated as <math>G(x) = x + F(x)</math>, where <math>x</math> and <math>G(x)</math> are the input and output of the residual block, and <math>F(x)</math> is the output of the residual branch of the block. A residual block typically performs a convolution operation and then passes the result plus its input on to the next block.

Intuition behind Residual blocks:
If the identity mapping is optimal, it is easier to push the residual to zero (F(x) = 0) than to fit an identity mapping (output = input) with a stack of non-linear layers. In other words, it is much easier for a stack of non-linear CNN layers to learn F(x) = 0 than to learn the identity function directly. This function F(x) is what the authors call the residual function ([https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624 Reference]).

[[File:ResidualBlock.png|580px|centre|thumb|An example of a simple residual block from Deep Residual Learning for Image Recognition by He et al., 2016]]

ResNet is constructed out of a large number of these residual blocks sequentially stacked. It is interesting to note that having too many layers can cause overfitting, as pointed out by He et al. (2016) with the high error rates for the 1,202-layer ResNet on CIFAR datasets. Another paper (Veit et al., 2016) empirically showed that the cause of the high error rates can be mostly attributed to specific residual blocks whose channels increase greatly.

'''PyramidNet''' is an important iteration that built on ResNet and WideResNet by gradually increasing channels on each residual block. The residual block is similar to those used in ResNet. It has been used to generate some of the first successful convolution neural networks with very large depth, at 272 layers. Amongst unmodified residual network architectures, it performs the best on the CIFAR datasets.

[[File:ResidualBlockComparison.png|980px|centre|thumb|A simple illustration of different residual blocks from Deep Pyramidal Residual Networks by Han et al., 2017. The width of a block reflects the number of channels used in that layer.]]

'''Non-Deep Approaches'''

'''Wide ResNet''' modified ResNet by increasing channels in each layer, having a wider and shallower structure. Similarly to PyramidNet, this architecture avoids some of the pitfalls in the original formulation of ResNet.

'''ResNeXt''' achieved performance beyond that of Wide ResNet with only a small increase in the number of parameters. It can be formulated as <math>G(x) = x + F_1(x)+F_2(x)</math>. In this case, <math>F_1(x)</math> and <math>F_2(x)</math> are the outputs of two paired convolution operations in a single residual block. The number of branches is not limited to 2; it is a hyperparameter that affects the performance of the network.

[[File:SimplifiedResNeXt.png|600px|centre|thumb|Simplified ResNeXt Convolution Block. Yamada et al., 2018]]

'''Regularization Methods'''

'''Stochastic Depth''' helped address the issue of vanishing gradients in ResNet. It works by randomly dropping residual blocks. On the <math>l^{th}</math> residual block the Stochastic Depth process is given as <math>G(x)=x+b_lF(x)</math>, where <math>b_l \in \{0,1\}</math> is a Bernoulli random variable that equals 1 with probability <math>p_l</math>. Using a constant value for <math>p_l</math> didn't work well, so instead a linear decay rule <math>p_l = 1 - \frac{l}{L}(1-p_L)</math> was used. In this equation, <math>L</math> is the total number of residual blocks, and <math>p_L</math> is the probability assigned to the final block, which is set as a hyperparameter.
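
To make the linear decay rule concrete, the short sketch below evaluates <math>p_l</math> for every block. The function name and the default value <code>p_final=0.5</code> are assumptions chosen for illustration.

```python
def survival_probabilities(num_blocks, p_final=0.5):
    """p_l = 1 - (l / L) * (1 - p_L) for l = 1..L."""
    return [1.0 - (l / num_blocks) * (1.0 - p_final)
            for l in range(1, num_blocks + 1)]

print(survival_probabilities(4))
# prints: [0.875, 0.75, 0.625, 0.5]
```

Early blocks are almost always kept, while the last block is kept with probability <math>p_L</math>.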

'''Shake-Shake''' is a regularization method that specifically improves the ResNeXt architecture. It can be given as <math>G(x)=x+\alpha F_1(x)+(1-\alpha)F_2(x)</math>, where <math>\alpha \in [0,1]</math> is a random coefficient. <math>\alpha</math> is used during the forward pass, and another identically distributed random parameter <math>\beta</math> is used in the backward pass. This randomly reweights the two paired convolution operations, in the extreme case dropping one of them, and further improved ResNeXt.

[[File:Paper 32.jpg|600px|centre|thumb| Shake-Shake (ResNeXt + Shake-Shake) (Gastaldi, 2017), in which some processing layers omitted for conciseness.]]

=Proposed Method=
The authors give an intuitive interpretation of the forward pass of Shake-Shake regularization. To the best of their knowledge, no such interpretation had been given before, while the phenomenon in the backward pass was experimentally investigated by Gastaldi (2017). In the forward pass, Shake-Shake interpolates the outputs of two residual branches with a random variable <math>\alpha</math> that controls the degree of interpolation. Since DeVries & Taylor (2017a) demonstrated that interpolating two data points in feature space can synthesize reasonable augmented data, the interpolation of the two residual branches of Shake-Shake in the forward pass can be interpreted as synthesizing data. Using a random variable <math>\alpha</math> generates many different augmented data points. In the backward pass, on the other hand, a different random variable <math>\beta</math> is used to disturb learning so that the network can keep learning over a long training period. Gastaldi (2017) demonstrated how the difference between <math>\alpha</math> and <math>\beta</math> affects the result.

The regularization mechanism of Shake-Shake relies on two or more residual branches, so it can be applied only to multi-branch network architectures. In addition, 2-branch network architectures consume more memory than 1-branch network architectures. One may think that the number of learnable parameters of ResNeXt can be kept the same in 1-branch and 2-branch network architectures by controlling its cardinality and the number of channels (filters). For example, a 1-branch network (e.g., ResNeXt 1-64d) and its corresponding 2-branch network (e.g., ResNeXt 2-40d) have almost the same number of learnable parameters. Even so, the 2-branch network increases memory consumption due to the overhead of keeping the inputs of residual blocks and so on. Comparing ResNeXt 1-64d and 2-40d, the latter requires more memory than the former by 8% in theory (for one layer) and by 11% in measured values (for 152 layers).

This paper seeks to generalize the method proposed in Shake-Shake so that it can be applied to any residual network structure. The initial formulation, 1-branch Shake, is <math>G(x) = x + \alpha F(x)</math>. In this case, <math>\alpha</math> is a coefficient that disturbs the forward pass, but is not necessarily constrained to be in [0,1]. Another corresponding coefficient <math>\beta</math> is used in the backward pass. Applying this simple adaptation of Shake-Shake to a 110-layer version of PyramidNet with <math>\alpha \in [0,1]</math> and <math>\beta \in [0,1]</math> performs abysmally, with an error rate of 77.99%.

This failure is a result of the setup causing too much perturbation. A trick is needed to promote learning under large perturbations while preserving the regularization effect. The authors' idea is to borrow from ResDrop and combine it with Shake-Shake. This works by randomly deciding whether to apply 1-branch Shake. In effect, this creates two networks: the original network without a regularization component, and a regularized network. When mixing the two networks, the authors expected the following effects: when the non-regularized network is selected, learning is promoted; when the perturbed network is selected, learning is disturbed. Achieving good performance requires a balance between the two.

'''ShakeDrop''' is given as

<div align="center">
<math>G(x) = x + (b_l + \alpha - b_l \alpha)F(x)</math>,
</div>

where <math>b_l</math> is a Bernoulli random variable following the linear decay rule used in Stochastic Depth. An alternative presentation is

<div align="center">
<math>
G(x) = \begin{cases}
x + F(x) ~~ \text{if } b_l = 1 \\
x + \alpha F(x) ~~ \text{otherwise}
\end{cases}
</math>
</div>

If <math>b_l = 1</math> then ShakeDrop is equivalent to the original network, otherwise it is the network + 1-branch Shake. The authors also found that the linear decay rule of ResDrop works well, compared with the uniform rule. Regardless of the value of <math>\beta</math> on the backwards pass, network weights will be updated.
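
The following NumPy sketch illustrates how the scaling factor for one residual block could be sampled during training. It assumes the ranges <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>, and it assumes that the backward pass uses the same expression with <math>\beta</math> in place of <math>\alpha</math>, as in the ShakeDrop paper. The function names are hypothetical, and a real implementation would need a custom gradient in a deep learning framework.

```python
import numpy as np

def shakedrop_coefficients(p_l, rng, alpha_range=(-1.0, 1.0), beta_range=(0.0, 1.0)):
    """Sample the forward and backward scaling factors for one residual block."""
    b = float(rng.random() < p_l)        # Bernoulli gate b_l, equal to 1 with probability p_l
    alpha = rng.uniform(*alpha_range)
    beta = rng.uniform(*beta_range)
    forward = b + alpha - b * alpha      # 1 if b_l == 1, otherwise alpha
    backward = b + beta - b * beta       # assumed: 1 if b_l == 1, otherwise beta
    return forward, backward

def shakedrop_forward(x, residual, forward_coeff):
    """G(x) = x + (b_l + alpha - b_l * alpha) * F(x)."""
    return x + forward_coeff * residual
```

With <code>p_l = 1</code> the block always behaves like the original network, and with <code>p_l = 0</code> it always applies 1-branch Shake.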

=Experiments=

'''Parameter Search'''

The authors' experiments began with a hyperparameter search using ShakeDrop on pyramidal networks. The PyramidNet used was made up of a total of 110 layers, which included a convolutional layer and a final fully connected layer. It had 54 additive pyramidal residual blocks, and the final residual block had 286 channels. The results of this search are presented below.

[[File:ShakeDropHyperParameterSearch.png|600px|centre|thumb|Average Top-1 errors (%) of “PyramidNet + ShakeDrop” with several ranges of parameters of 4 runs at the final (300th) epoch on CIFAR-100 dataset in the “Batch” level. In some settings, it is equivalent to PyramidNet and PyramidDrop. Borrowed from ShakeDrop Regularization by Yamada et al., 2018.]]

The settings used throughout the rest of the experiments are then <math>\alpha \in [-1,1]</math> and <math>\beta \in [0,1]</math>. Cases H and F outperform PyramidNet, suggesting that the strong perturbations imposed by ShakeDrop are functioning as intended. However, fully applying the perturbations in the backward pass appears to destabilize the network, resulting in performance that is worse than standard PyramidNet.

[[File:ParameterUpdateShakeDrop.png|400px|centre]]

Following this initial parameter decision, the authors tested 4 different strategies for sampling the scaling coefficients: "Batch" (same coefficients for all images in a minibatch for each residual block), "Image" (same scaling coefficients for each image for each residual block), "Channel" (same scaling coefficients for each channel for each residual block), and "Pixel" (separate scaling coefficients for each element for each residual block). While Pixel was the best in terms of error rate, it is not very memory efficient, so Image was selected, as it had the second best performance without the memory drawback.
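
One way to picture the four levels is through the shape of the coefficient tensor that is broadcast against a feature map of shape (batch, channels, height, width). The sketch below is an interpretation of the descriptions above, with hypothetical names, not code from the paper.

```python
import numpy as np

def coefficient_shape(level, batch, channels, height, width):
    """Shape of the alpha/beta tensor, broadcast against (batch, channels, height, width)."""
    shapes = {
        "Batch": (1, 1, 1, 1),                        # one coefficient for the whole minibatch
        "Image": (batch, 1, 1, 1),                    # one coefficient per image
        "Channel": (batch, channels, 1, 1),           # one coefficient per channel of each image
        "Pixel": (batch, channels, height, width),    # one coefficient per element
    }
    return shapes[level]

rng = np.random.default_rng(0)
features = np.ones((8, 16, 4, 4))
alpha = rng.uniform(-1.0, 1.0, size=coefficient_shape("Image", *features.shape))
scaled = alpha * features   # every element of an image shares that image's coefficient
```

The memory cost grows with the number of sampled coefficients, which is consistent with Pixel being the most expensive option.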

'''Comparison with Regularization Methods'''

For these experiments, a few modifications were made to assist with training. For ResNeXt, the EraseReLU formulation is used so that each residual block ends in batch normalization. Wide ResNet is also compared in two versions, with and without batch normalization at the end of each residual block. Batch normalization keeps the outputs of residual blocks in a certain range; otherwise <math>\alpha</math> and <math>\beta</math> could cause perturbations that are too large, causing learning to diverge. There is also a comparison of two placements for ResDrop/ShakeDrop: Type A (where the regularization unit is inserted before the add unit for a residual branch) and the alternative (where the regularization unit is inserted after the add unit for a residual branch).

These experiments are performed on the CIFAR-100 dataset.

[[File:ShakeDropArchitectureComparison1.png|800px|centre|thumb|]]

[[File:ShakeDropArchitectureComparison2.png|800px|centre|thumb|]]

[[File:ShakeDropArchitectureComparison3.png|800px|centre|thumb|]]

For a final round of testing, the training setup was modified to incorporate other techniques used in state of the art methods. For most of the tests, the learning rate for the 300 epoch version started at 0.1 and decayed by a factor of 0.1 at 1/2 and 3/4 of the way through training. The alternative was cosine annealing, based on the presentation by Loshchilov and Hutter in their paper SGDR: Stochastic Gradient Descent with Warm Restarts. This is indicated in the Cos column, with a check indicating cosine annealing.
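
The two learning rate schedules can be compared with a short sketch. It assumes a single cosine cycle over all 300 epochs with a minimum rate of 0 (no warm restarts), which the text above does not specify; the function names are hypothetical.

```python
import math

def step_lr(epoch, total_epochs=300, base_lr=0.1, factor=0.1):
    """Multiply the rate by `factor` at 1/2 and 3/4 of training."""
    lr = base_lr
    if epoch >= total_epochs // 2:
        lr *= factor
    if epoch >= (3 * total_epochs) // 4:
        lr *= factor
    return lr

def cosine_lr(epoch, total_epochs=300, base_lr=0.1, min_lr=0.0):
    """Cosine annealing over a single cycle (assumed: no restarts)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)
```

Under these assumptions the step schedule drops to 0.01 at epoch 150 and 0.001 at epoch 225, while the cosine schedule passes smoothly through 0.05 at epoch 150 and reaches 0 at epoch 300.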

[[File:CosineAnnealing.png|400px|centre|thumb|]]

The Reg column indicates the regularization method used, either none, ResDrop (RD), Shake-Shake (SS), or ShakeDrop (SD). Finally, the Fil column indicates the type of data augmentation used, either none, Cutout (CO) (DeVries & Taylor, 2017b), or Random Erasing (RE) (Zhong et al., 2017).

[[File:ShakeDropComparison.png|800px|centre|thumb|Top-1 Errors (%) at final epoch on CIFAR-10/100 datasets]]

'''State-of-the-Art Comparisons'''

A direct comparison with state of the art methods is favorable for this new method.

# Fair comparison of ResNeXt + Shake-Shake with PyramidNet + ShakeDrop gives an improvement of 0.19% on CIFAR-10 and 1.86% on CIFAR-100. Under these conditions, the final error rate is then 2.67% for CIFAR-10 and 13.99% for CIFAR-100.
# Fair comparison of ResNeXt + Shake-Shake + Cutout with PyramidNet + ShakeDrop + Random Erasing gives an improvement of 0.25% on CIFAR-10 and 3.01% on CIFAR 100. Under these conditions, the final error rate is then 2.31% for CIFAR-10 and 12.19% for CIFAR-100.
# Compared with the state of the art, PyramidNet + ShakeDrop gives an improvement of 0.25% on CIFAR-10 over ResNeXt + Shake-Shake + Cutout, and an improvement of 2.85% on CIFAR-100 over Coupled Ensemble.

=Implementation details=

'''CIFAR-10/100 datasets'''

All the images in these datasets were color normalized and then horizontally flipped with a probability of 50%. All of them were then zero padded to a size of 40 by 40 pixels.
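
These preprocessing steps might look roughly like the following NumPy sketch. It assumes 32x32x3 input images, per-channel mean and standard deviation computed elsewhere, and 4 pixels of zero padding on each side to reach 40x40; the function name is hypothetical.

```python
import numpy as np

def preprocess(image, mean, std, rng, pad=4):
    """image: (32, 32, 3) array; mean/std: per-channel statistics of shape (3,)."""
    out = (image - mean) / std                  # color normalization
    if rng.random() < 0.5:                      # horizontal flip with probability 50%
        out = out[:, ::-1, :]
    # Zero padding on the two spatial axes only: 32 + 2 * 4 = 40.
    return np.pad(out, ((pad, pad), (pad, pad), (0, 0)))
```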

=Conclusion=

This paper proposed a new stochastic regularization method, ShakeDrop, which outperforms previous state of the art methods while maintaining similar memory efficiency. It demonstrates that heavily perturbing a network can help to overcome issues with overfitting. It is also an effective way to regularize residual networks for image classification. The method was tested on the CIFAR-10/100 and Tiny ImageNet datasets and showed strong performance.

=Critique=

The novelty of this paper is low, as pointed out by the reviewers. The proposed ShakeDrop regularization is essentially a combination of the PyramidDrop and Shake-Shake regularization. The most surprising part is that the forward weight can be negative, thus inverting the output of a convolution. The mathematical justification for ShakeDrop regularization is limited, relying on intuition and empirical evidence instead.
As pointed out above, the method relies heavily on intuition. This means that its performance may not extend beyond the CIFAR datasets and may vary considerably depending on the characteristics of the dataset it is applied to. However, the performance is still impressive, since it is better than that of known algorithms.

=References=
[Yamada et al., 2018] Yamada Y, Iwamura M, Kise K. ShakeDrop regularization. arXiv preprint arXiv:1802.02375. 2018 Feb 7.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

[Zagoruyko & Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proc. BMVC, 2016.

[Han et al., 2017] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Proc. CVPR, 2017.

[Xie et al., 2017] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, 2017.

[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382v3, 2016.

[Gastaldi, 2017] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485v2, 2017.

[Loshilov & Hutter, 2016] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[DeVries & Taylor, 2017b] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.

[Zhong et al., 2017] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.

[Dutt et al., 2017] Anuvabh Dutt, Denis Pellerin, and Georges Quénot. Coupled ensembles of neural networks. arXiv preprint arXiv:1709.06053v1, 2017.

[Veit et al., 2016] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems 29, 2016.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-29T17:01:15Z

S366chen: /* Results */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series with multivariate and noisy signals, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained asynchronously from different sources (e.g. financial news, an investment bank, a financial analyst), and each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than one with fewer clients, which means the significance of each past observation may depend on other factors that change over time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance suggests combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The proposed model is also inspired by the gating mechanism that has been successful in RNNs and Highway Networks.

The time series forecasting problem can be expressed in terms of the conditional probability distribution below:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
Thus, we focus on modeling the predictors of future values of time series given their past values.

Financial time series are particularly challenging for the following reasons:
* Low signal-to-noise ratio and heavy-tailed distributions.
* They are observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments in time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial datasets remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that the proposed model, which combines linear models with deep learning models, can perform better than purely deep learning models such as CNNs, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
Recent proceedings of the main machine learning venues (e.g. ICML, NIPS, AISTATS, UAI) show that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Although the econometric and machine learning approaches are still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to obtain such a combined model.

Although deep neural networks have been applied in many fields with satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], which used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, a household electricity consumption dataset, and real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a model of order p (relating the current state to the last p states), the equation is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.
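The AR(p) equation above can be illustrated with a short NumPy sketch. The function names <code>simulate_ar</code> and <code>ar_predict</code>, the Gaussian noise with standard deviation <code>sigma</code>, and the zero initial values are assumptions made for illustration only.

```python
import numpy as np

def simulate_ar(phi, c, sigma, n_steps, rng):
    """Simulate X_t = c + sum_i phi_i * X_{t-i} + eps_t, starting from zeros."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    x = np.zeros(n_steps + p)                      # first p entries are initial values
    for t in range(p, n_steps + p):
        past = x[t - p:t][::-1]                    # X_{t-1}, ..., X_{t-p}
        x[t] = c + phi @ past + sigma * rng.standard_normal()
    return x[p:]

def ar_predict(phi, c, history):
    """One-step prediction (the noise term has zero mean and is dropped)."""
    phi = np.asarray(phi, dtype=float)
    past = np.asarray(history, dtype=float)[-len(phi):][::-1]
    return c + phi @ past

rng = np.random.default_rng(0)
series = simulate_ar([0.5, 0.25], c=1.0, sigma=0.1, n_steps=200, rng=rng)
noiseless = simulate_ar([0.5], c=1.0, sigma=0.0, n_steps=3, rng=rng)
prediction = ar_predict([0.5, 0.25], 1.0, [2.0, 4.0])   # 1 + 0.5*4 + 0.25*2
```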

===Gating and weighting mechanisms===
Gating mechanisms for neural networks have the ability to overcome the problem of vanishing gradients, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. Different compositions of functions of the type described above have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks. It is most closely related to the softmax gating used in MuFuRu (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRu) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.

This idea is also successfully used in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. One difference is that these functions are modeled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.
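The softmax gating formula above can be sketched in NumPy as follows. This is only an illustration of the weighting step: the names <code>softmax_gate</code>, <code>candidates</code> and <code>scores</code> are hypothetical, and the candidate outputs and scores are passed in as plain arrays rather than produced by learned layers.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_gate(candidates, scores):
    """f(x) = sum_l p^l(x) * f^l(x), with p = softmax over the L candidates.

    candidates: array of shape (L, n) holding the candidate outputs f^l(x)
    scores:     array of shape (L, n) holding the pre-softmax values p-hat^l(x)
    """
    p = softmax(scores, axis=0)          # weights sum to 1 over l for each element
    return (p * candidates).sum(axis=0)

candidates = np.array([[1.0, 2.0], [3.0, 4.0]])
gated = softmax_gate(candidates, np.zeros((2, 2)))   # equal scores give a plain average
```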

=Motivation=
The authors state five main motivations in the paper:
#The forecasting problem addressed in this paper has been studied almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics tends to focus on explaining variables rather than improving out-of-sample prediction power. Econometric models tend to 'over-fit' on financial time series: their parameters are unstable and their out-of-sample performance is poor.
#It is difficult for learning algorithms to deal with time series whose observations were made irregularly. Although Gaussian processes provide a useful theoretical framework that can handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For AR time series of higher order with more past observations, the expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general do not admit a closed-form expression.
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents the separate dimensions as a single series with dimension and duration indicators as additional features (Figure 2(b)).
#Consider a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One might expect an LSTM to memorize the input values at each step and weight them at the output according to the durations, but this approach may lead to an imbalance between the need for memory and the need for linearity. The weights assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while the past observations themselves might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]
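The representation in Figure 2(b) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' preprocessing code: the function name <code>to_async_representation</code>, the <code>(time, series_id, value)</code> input format, and the choice of a zero duration for the first observation are all hypothetical.

```python
import numpy as np

def to_async_representation(observations, n_series):
    """Merge asynchronous series into one feature matrix.

    observations: iterable of (time, series_id, value) tuples, in any order
    Each output row is [value, one-hot series indicator..., duration since previous row].
    """
    obs = sorted(observations)                 # order all observations by time
    rows = []
    prev_time = obs[0][0]                      # assumption: first duration is 0
    for time, series_id, value in obs:
        indicator = np.zeros(n_series)
        indicator[series_id] = 1.0             # which series this value came from
        rows.append(np.concatenate(([value], indicator, [time - prev_time])))
        prev_time = time
    return np.stack(rows)

observations = [(0.0, 0, 1.5), (2.0, 1, 0.7), (0.5, 1, 0.9)]
features = to_async_representation(observations, n_series=2)
print(features)
```

Each row has 1 + <code>n_series</code> + 1 columns, which matches the structure of the quotes dataset described later (price, source indicators, duration), apart from the direction indicator.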

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional expectation of future values of a subset of elements of <math>x_n</math>:
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m} ,</math></div>
that is, the estimate is the sum of the columns of the matrix in brackets. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network, i.e. it is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.
#<math>\sigma</math> is a normalized activation function applied independently to each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as the Hadamard product).
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix A.

Writing out the m-th column of each factor (with the normalization in <math>\sigma</math> still taken over all <math>M</math> columns), we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S_{\cdot,m}(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network; <math>\text{off}</math> and <math>S</math> in the equation correspond to Offset and Significance in the name, respectively.
Figure 3 shows the scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> separates three components: the temporal dependence (captured in the weights <math>W_m</math>); the local significance of observations <math>S</math>, which as a convolutional network is determined by its filters, which capture local dependencies and are independent of the relative position in time; and the predictors <math>\text{off}(x_{n-m})</math>, which are completely independent of position in time. Each past observation provides an adjusted single regressor for the target variable through the offset network. Since in asynchronous sampling consecutive values of x come from different signals and might be heterogeneous, this adjustment by the offset network is important. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.
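The weighting step of the estimator can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the outputs of the offset and significance networks are passed in as plain arrays, the function name <code>socnn_combine</code> is hypothetical, and a row-wise softmax is assumed as one possible choice of the normalized activation <math>\sigma</math>.

```python
import numpy as np

def row_softmax(a):
    """Normalized activation: each row sums to 1 over the M past steps."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def socnn_combine(W, offsets, x_target, significance):
    """Combine offset and significance outputs into the estimate y_hat.

    All inputs have shape (d_I, M):
      W            temporal weights
      offsets      off(x_{n-m}) for m = 1..M (stand-in for the offset network output)
      x_target     the target features x^I_{n-m} of the past observations
      significance S(x) before normalization (stand-in for the significance network)
    """
    regressors = offsets + x_target            # one adjusted regressor per past step
    weights = row_softmax(significance)        # data-dependent weights
    return (W * regressors * weights).sum(axis=1)   # sum over the M columns

W = np.ones((2, 3))
offsets = np.zeros((2, 3))
x_target = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y_hat = socnn_combine(W, offsets, x_target, np.zeros((2, 3)))
```

With unit weights, zero offsets and equal significance, the estimate reduces to the plain average of the past target values, which makes the autoregressive structure easy to see.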

===Relation to asynchronous data===
One common problem with time series is that the durations between consecutive observations vary. The paper states two ways to solve this problem:
#Data preprocessing: aligning the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach tends to lose information and to enlarge the dataset and the model complexity.
#Adding features: treating the duration or time of the observations as additional features. This is the core of SOCNN and is shown in Figure 2(b).

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason, an auxiliary loss function is used, equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\hat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\hat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.
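The total loss can be written out for a single sample as in the sketch below. The function name <code>total_loss</code> and the array layout (one column per past step) are assumptions made for illustration.

```python
import numpy as np

def total_loss(y_hat, offsets, x_target, y, alpha):
    """L_tot = ||y_hat - y||^2 + alpha * mean_m ||off(x_{n-m}) + x^I_{n-m} - y||^2.

    y_hat, y:          arrays of shape (d_I,)
    offsets, x_target: arrays of shape (d_I, M), one column per past step
    """
    l2 = np.sum((y_hat - y) ** 2)
    residuals = offsets + x_target - y[:, None]        # intermediate prediction errors
    aux = np.mean(np.sum(residuals ** 2, axis=0))      # average over the M steps
    return l2 + alpha * aux

y = np.array([1.0, 2.0])
y_hat = np.array([1.0, 3.0])
offsets = np.zeros((2, 2))
x_target = np.array([[1.0, 2.0], [2.0, 2.0]])
loss = total_loss(y_hat, offsets, x_target, y, alpha=0.1)   # 1.0 + 0.1 * 0.5
```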

=Experiments=
The paper evaluated the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance was compared with a simple CNN, single- and multi-layer LSTMs and a 25-layer ResNet. Apart from the evaluation of the SOCNN architecture, the paper also discussed the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
The authors applied a grid search over some hyperparameters in order to assess the significance of the model's components. The hyperparameters include the depth of the offset sub-network and the auxiliary weight <math>\alpha</math>. For the depth of the offset sub-network, they used 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared the values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.
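For reference, the LeakyReLU activation defined above can be written in NumPy as follows; the function name <code>leaky_relu</code> and the <code>slope</code> parameter name are illustrative choices.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Return x where x >= 0 and slope * x otherwise (slope 0.1 as in the summary)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, slope * x)

activated = leaky_relu([-2.0, 0.0, 3.0])
```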

Table 1 presents the configuration of network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio of 3 to 1. The remaining 20% of data was used as a test set.
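The split described above can be sketched as follows. This is one plausible reading of the description, not the authors' code: the function name <code>split_indices</code> and its parameters are hypothetical, and indices are assumed to be ordered by time.

```python
import numpy as np

def split_indices(n_samples, rng, train_frac=0.8, train_val_ratio=3):
    """Random train/validation split of the first 80% of timesteps; last 20% is the test set."""
    cutoff = int(n_samples * train_frac)
    head = rng.permutation(cutoff)                       # shuffle only the early part
    n_train = cutoff * train_val_ratio // (train_val_ratio + 1)
    train, val = head[:n_train], head[n_train:]          # 3 to 1 ratio
    test = np.arange(cutoff, n_samples)                  # strictly later in time
    return train, val, test

rng = np.random.default_rng(0)
train, val, test = split_indices(1000, rng)
print(len(train), len(val), len(test))
```

Keeping the test set strictly after the training and validation period avoids evaluating on timesteps that precede the training data.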

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments using TensorFlow and Keras, and used different GPUs to optimize the model for the different datasets.

==Results==
Table 2 shows the results on all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other models on all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results as LSTM. Phased LSTM and ResNet performed very poorly on the artificial asynchronous datasets and the quotes dataset, respectively. Notice that having more than one layer in the offset network has a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous datasets; see Table 3. However, for the other datasets, its impact was negligible.
[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, the authors test the robustness of the proposed SOCNN model by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
In Figure 6, the purple lines and green lines seem to stay at the same position in the training and testing processes. SOCNN and the single-layer LSTM are the most robust and the least prone to overfitting compared to the other networks.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. This new architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

The SOCNN can be extended further by adding intermediate weighting layers of the same type to the network structure. Another possible extension, which needs further empirical study, is to consider more than just <math>1 \times 1</math> convolutional kernels in the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.

=Critiques=
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines in the asynchronous time series.
#The quote data cannot be accessed because they are proprietary. Also, only two of the datasets are available.
#The 'Significance' network was described as critical to the model in the paper, but the authors did not show how the performance of SOCNN depends on the significance network.
#The transform of the original data to asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that train and test data were split is unclear. This could be important in the case of the financial data set.
#Although the auxiliary loss function was mentioned as an important part, its advantages were not made very clear in the paper. It would be better if the paper described its effectiveness in more detail.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its testing data set. Instead of comparing neural network models such as CNN which is known to work badly on time series data, it would be much better if the author compared to well-known econometric time series models such as GARCH and VAR.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using Gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-29T17:00:26Z

S366chen: /* Results */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series with multivariate and noisy signals, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained asynchronously from different sources (e.g. financial news, an investment bank, a financial analyst), and each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than one with fewer clients, which means the significance of each past observation may depend on other factors that change over time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance suggests combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The proposed model is also inspired by the gating mechanism that has been successful in RNNs and Highway Networks.

The time series forecasting problem can be expressed in terms of the conditional probability distribution below:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
Thus, we focus on modeling the predictors of future values of time series given their past values.

Financial time series are particularly challenging for the following reasons:
* Low signal-to-noise ratio and heavy-tailed distributions.
* They are observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments in time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial datasets remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that their model, which combines linear models with deep learning models, can perform better than purely deep learning models such as CNNs, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
Recent proceedings of the main machine learning venues (e.g. ICML, NIPS, AISTATS, UAI) show that time series are often forecast using Gaussian processes[3,4], especially irregularly sampled time series[5]. Although the econometrics and machine learning approaches are still largely independent, combined models have started to appear, for example the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to obtain such a combined model.

Although deep neural networks have been applied in many fields and have produced satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, on a household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time-series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a p-th order (relating the current state to the p last states), the equation of the model is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.
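
The AR(p) equation above can be illustrated with a minimal NumPy sketch that simulates a series and recovers its coefficients by ordinary least squares. This is not code from the paper: the function names <code>simulate_ar</code> and <code>fit_ar</code>, the Gaussian noise, the zero initial conditions and the chosen coefficients are all assumptions made for illustration.

```python
import numpy as np

def simulate_ar(phi, c, n, noise_std=1.0, seed=0):
    """Simulate X_t = c + sum_i phi[i-1] * X_{t-i} + eps_t (Gaussian eps, assumed)."""
    rng = np.random.default_rng(seed)
    p = len(phi)
    x = np.zeros(n + p)  # the first p entries act as zero initial conditions
    for t in range(p, n + p):
        # x[t-p:t][::-1] is (X_{t-1}, ..., X_{t-p})
        x[t] = c + np.dot(phi, x[t - p:t][::-1]) + rng.normal(0.0, noise_std)
    return x[p:]

def fit_ar(x, p):
    """Estimate (c, phi) by ordinary least squares on the lagged values."""
    n = len(x)
    # column i holds X_{t-i-1} for t = p, ..., n-1
    lags = np.column_stack([x[p - i - 1:n - i - 1] for i in range(p)])
    design = np.column_stack([np.ones(n - p), lags])
    coef, *_ = np.linalg.lstsq(design, x[p:], rcond=None)
    return coef[0], coef[1:]

# Hypothetical stationary AR(2) process
series = simulate_ar(phi=np.array([0.6, -0.3]), c=0.5, n=5000)
c_hat, phi_hat = fit_ar(series, p=2)
```

With a few thousand samples the estimates should land close to the coefficients used in the simulation; the fit assumes regularly sampled data, which is exactly the assumption that asynchronous series violate.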

===Gating and weighting mechanisms===
Gating mechanisms for neural networks are able to overcome the vanishing gradient problem, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. Different compositions of functions of this type have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks. It is most closely related to the softmax gating used in MuFuRu (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRu) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.
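
The two gating forms above can be sketched in a few lines of NumPy. This is only an illustration of the formulas, not of any particular architecture: the candidate outputs and gate logits are random placeholders, and the helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid_gate(candidate, gate_logits):
    """f(x) = c(x) * sigma(x), element-wise."""
    return candidate * sigmoid(gate_logits)

def softmax_gate(candidates, logits):
    """f(x) = sum_l p^l(x) * f^l(x), with p = softmax over the L candidates (axis 0)."""
    p = softmax(logits, axis=0)
    return (p * candidates).sum(axis=0)

rng = np.random.default_rng(0)
candidates = rng.normal(size=(3, 4))  # L = 3 candidate outputs of dimension 4 (placeholders)
logits = rng.normal(size=(3, 4))      # stand-ins for the linear functions p-hat

gated = sigmoid_gate(candidates[0], logits[0])
mixed = softmax_gate(candidates, logits)
```

The sigmoid gate can only shrink each coordinate of the candidate, while the softmax gate returns, coordinate by coordinate, a convex combination of the candidates.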

This idea is also used successfully in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. One difference is that these functions are modeled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
The authors state five main motivations:
#Time series forecasting has been studied almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics is more concerned with explaining variables than with improving out-of-sample prediction power. Econometric models tend to 'over-fit' on financial time series: their parameters are unstable and they perform poorly on out-of-sample prediction.
#It is difficult for learning algorithms to deal with time series whose observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that can handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For AR time series of higher order with more past observations, the conditional expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general do not admit a closed-form expression.
#In practice, the dimensions of a multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents the separate dimensions as a single series, with dimension and duration indicators as additional features (Figure 2(b)).
#Consider a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One might expect an LSTM to memorize the input values at each step and weight them at the output according to the durations, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to many more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional future values of a subset of the elements of <math>x_n</math>:
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m</math></div>
The estimate is the sum of the columns of the matrix in brackets. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network which is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.

#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as Hadamard matrix multiplication).
#<math>A.,_m</math> denotes the m-th column of a matrix A.

Since the m-th column of an element-wise product is the element-wise product of the m-th columns of its factors, the sum over columns can be written out term by term, and we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network (SOCNN); <math>\text{off}</math> and <math>S</math> in the equation correspond to the Offset and Significance in the name, respectively.
Figure 3 shows the scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> ensures the separation of three components: the temporal dependence (captured by the weights <math>W_m</math>), the local significance of observations (captured by <math>S</math>, which as a convolutional network is determined by its filters; these capture local dependencies and are independent of the relative position in time), and the predictors <math>\text{off}(x_{n-m})</math>, which are completely independent of position in time. Through the offset network, each past observation provides an adjusted single regressor for the target variable. This adjustment is important because, under asynchronous sampling, consecutive values of x come from different signals and might be heterogeneous. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.
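
The estimator can be traced step by step with a small NumPy sketch. It is an illustration under strong assumptions rather than the authors' implementation: the offset and significance networks are replaced by hypothetical single linear maps (<code>A_off</code>, <code>A_sig</code>), and a row-wise softmax is assumed as the normalized activation <math>\sigma</math>.

```python
import numpy as np

def softmax_rows(a):
    """Row-wise softmax: each row sums to one over the M time steps."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def socnn_estimate(x_past, target_idx, W, off_fn, sig_fn):
    """
    x_past:     (d, M) matrix whose m-th column is x_{n-m}
    target_idx: indices I of the predicted features (length d_I)
    W:          (d_I, M) position weights
    off_fn:     maps (d, M) -> (d_I, M), stand-in for the offset network
    sig_fn:     maps (d, M) -> (d_I, M), stand-in for the significance network
    """
    regressors = off_fn(x_past) + x_past[target_idx, :]  # off(x_{n-m}) + x_{n-m}^I
    weights = softmax_rows(sig_fn(x_past))               # sigma(S(x)), normalized per row
    return (W * regressors * weights).sum(axis=1)        # sum over the columns m

d, M, target_idx = 3, 5, [0]
rng = np.random.default_rng(0)
x_past = rng.normal(size=(d, M))
A_off = rng.normal(scale=0.1, size=(1, d))  # hypothetical linear stand-in for off
A_sig = rng.normal(size=(1, d))             # hypothetical linear stand-in for S
W = np.ones((1, M))

y_hat = socnn_estimate(x_past, target_idx, W,
                       off_fn=lambda x: A_off @ x,
                       sig_fn=lambda x: A_sig @ x)

# Sanity check: zero offsets and constant significance reduce to a plain average
y_mean = socnn_estimate(x_past, target_idx, W,
                        off_fn=lambda x: np.zeros((1, M)),
                        sig_fn=lambda x: np.zeros((1, M)))
```

With <code>W</code> set to ones, zero offsets and constant significance, the estimate collapses to the mean of the past target values, which makes the autoregressive structure of the model visible.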

===Relation to asynchronous data===
A common problem with time series is that the durations between consecutive observations vary. The paper describes two ways to deal with this problem:
#Data preprocessing: aligning the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as noted in the figure, this approach tends to lose information or to enlarge the dataset and the model complexity.
#Adding features: treating the duration or time of the observations as additional features. This is the core of SOCNN and is shown in Figure 2(b).
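
The second option can be sketched as follows. The layout used here (one row per observation with the value, a one-hot source indicator and the duration since the previous observation) follows the description of Figure 2(b); the function name, the toy events and the choice of a zero duration for the first observation are assumptions.

```python
import numpy as np

def to_single_series(events, n_series):
    """
    events: non-empty list of (timestamp, series_id, value), possibly unordered.
    Returns one row per observation with columns
    [value, indicator_0, ..., indicator_{n_series-1}, duration].
    """
    events = sorted(events, key=lambda e: e[0])
    rows = np.zeros((len(events), n_series + 2))
    prev_time = events[0][0]  # assumption: the first duration is set to zero
    for k, (t, series_id, value) in enumerate(events):
        rows[k, 0] = value
        rows[k, 1 + series_id] = 1.0   # which series the value came from
        rows[k, -1] = t - prev_time    # time since the previous observation
        prev_time = t
    return rows

# Two hypothetical sources quoting the same quantity at irregular times
events = [(0.0, 0, 10.0), (1.5, 1, 10.2), (2.0, 0, 10.1), (4.0, 1, 10.4)]
features = to_single_series(events, n_series=2)
```

Unlike resampling to a fixed grid, this representation keeps exactly one row per observation, so the dataset is neither enlarged nor thinned out.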

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason an auxiliary loss function is used, equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.
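
As a worked example, the total loss can be computed for a single sample with NumPy. The numbers below are made up for illustration, and <code>total_loss</code> is a hypothetical helper that takes the intermediate predictions <math>\text{off}(x_{n-m}) + x_{n-m}^I</math> as a precomputed matrix.

```python
import numpy as np

def total_loss(y_hat, y, offset_preds, alpha=0.1):
    """
    y_hat:        (d_I,) final prediction
    y:            (d_I,) target y_n
    offset_preds: (d_I, M) matrix whose m-th column is off(x_{n-m}) + x_{n-m}^I
    """
    l2 = np.sum((y_hat - y) ** 2)  # squared L2 error of the final prediction
    # mean over the M steps of the squared error of each intermediate predictor
    l_aux = np.mean(np.sum((offset_preds - y[:, None]) ** 2, axis=0))
    return l2 + alpha * l_aux

# Toy sample with d_I = 1 and M = 3 (values are illustrative only)
y = np.array([1.0])
y_hat = np.array([1.5])
offset_preds = np.array([[1.0, 2.0, 0.0]])
loss = total_loss(y_hat, y, offset_preds, alpha=0.1)
```

Here the L2 term is 0.25 and the auxiliary term is (0 + 1 + 1)/3, so setting <code>alpha=0</code> recovers the plain L2 loss.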

=Experiments=
The paper evaluates the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance is compared with that of a simple CNN, single- and multi-layer LSTMs and a 25-layer ResNet. Apart from the evaluation of the SOCNN architecture, the paper also discusses the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that each observation contains the value of only one of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
They applied grid search over some hyperparameters in order to assess the significance of the model's components. The hyperparameters include the depth of the offset sub-network and the auxiliary weight <math>\alpha</math>. For the depth of the offset sub-network, they use 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.

Table 1 presents the configuration of network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio of 3 to 1. The remaining 20% of data was used as a test set.
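
One way to read this split is sketched below. The paper's exact sampling procedure is not given in this summary, so the index-based shuffling, the function name <code>split_indices</code> and the seed are assumptions.

```python
import numpy as np

def split_indices(n_steps, seed=0):
    """Return (train, val, test) index arrays for a series of n_steps timesteps."""
    cutoff = int(0.8 * n_steps)           # first 80% of timesteps
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(cutoff)    # random assignment within the first 80%
    n_train = (3 * cutoff) // 4           # 3:1 ratio of training to validation
    train = np.sort(shuffled[:n_train])
    val = np.sort(shuffled[n_train:])
    test = np.arange(cutoff, n_steps)     # last 20%, kept in temporal order
    return train, val, test

train, val, test = split_indices(1000)
```

Because training and validation samples are interleaved in time while the test set lies strictly after both, validation error may be an optimistic proxy for test error on non-stationary series.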

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments using TensorFlow and Keras, and used different GPUs to optimize the model for different datasets.

==Results==
Table 2 shows the results obtained on all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other networks on all asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results as LSTM. Phased LSTM and ResNet performed very poorly on the artificial asynchronous dataset and the quotes dataset respectively. Note that having more than one layer in the offset network had a negative impact on the results. Also, a higher weight of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous dataset (see Table 3). However, for the other datasets its impact was negligible.
[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, the robustness of the proposed SOCNN model is tested by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
In Figure 6, the purple and green lines seem to stay at the same position in both the training and testing processes. SOCNN and the single-layer LSTM are the most robust of the compared networks, and the least prone to overfitting.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. The architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

SOCNN can be extended further by adding intermediate weighting layers of the same type to the network structure. Another possible extension, which needs further empirical study, is to consider more than just <math>1 \times 1</math> convolutional kernels in the offset sub-network. In the future, the architecture might also be tested on other real-life datasets with relevant characteristics, especially econometric datasets, and more generally on time series (stochastic process) regression.

=Critiques=
#The paper is best viewed as an application paper; the proposed architecture shows improved performance over baselines on asynchronous time series.
#The quotes data are proprietary and cannot be accessed, so only two of the datasets are available.
#The 'Significance' network is described as critical to the model, but the paper does not show how the performance of SOCNN depends on it.
#The transformation of the original data into asynchronous data is not clearly described.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way the training and test data were split is unclear. This could be important in the case of the financial dataset.
#Although the auxiliary loss function is mentioned as an important part of the model, its advantages are not made clear in the paper. The paper would benefit from a more detailed account of its effectiveness.
#The paper does not state clearly whether model training was done on a rolling basis for time series forecasting.
#The noise term used in Section 5's model robustness analysis is evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing only against neural network models such as CNNs, which are known to work badly on time series data, it would be much better if the authors compared against well-known econometric time series models such as GARCH and VAR.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using Gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-29T16:57:52Z

S366chen: /* Experiments */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Bińkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on multivariate, noisy time series, especially financial data. Financial time series are challenging to predict because of their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained from different sources (e.g. financial news, an investment bank, a financial analyst) asynchronously, and each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than one with fewer clients, which means the significance of each past observation may depend on other factors that change over time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance motivates combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The resulting model is inspired by the gating mechanism that has been successful in RNNs and Highway Networks.

The time series forecasting problem can be expressed as a conditional probability distribution below,
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
Thus, we focus on modeling the predictors of future values of time series given their past values.

The reasons that financial time series are particularly challenging:
* Low signal-to-noise ratio and heavy-tailed distributions.
* Being observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments in time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial data remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that their model, which combines linear models with deep learning models, can perform better than purely deep learning models such as CNNs, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
Recent proceedings of the main machine learning venues (e.g. ICML, NIPS, AISTATS, UAI) show that time series are often forecast using Gaussian processes[3,4], especially irregularly sampled time series[5]. Although the econometrics and machine learning approaches are still largely independent, combined models have started to appear, for example the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to obtain such a combined model.

Although deep neural networks have been applied in many fields and have produced satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, on a household electricity consumption dataset, and on real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time-series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a p-th order (relating the current state to the p last states), the equation of the model is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.

===Gating and weighting mechanisms===
Gating mechanisms for neural networks are able to overcome the vanishing gradient problem, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. Different compositions of functions of this type have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks. It is most closely related to the softmax gating used in MuFuRu (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRu) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.

This idea is also used successfully in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. One difference is that these functions are modeled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
The authors state five main motivations:
#Time series forecasting has been studied almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics is more concerned with explaining variables than with improving out-of-sample prediction power. Econometric models tend to 'over-fit' on financial time series: their parameters are unstable and they perform poorly on out-of-sample prediction.
#It is difficult for learning algorithms to deal with time series whose observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that can handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For AR time series of higher order with more past observations, the conditional expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general do not admit a closed-form expression.
#In practice, the dimensions of a multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents the separate dimensions as a single series, with dimension and duration indicators as additional features (Figure 2(b)).
#Consider a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One might expect an LSTM to memorize the input values at each step and weight them at the output according to the durations, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional future values of a subset of elements of <math>x_n</math>,
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...,i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m},</math></div>
The estimate is the sum of the columns of the matrix in brackets. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network, i.e. it is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.
#<math>\sigma</math> is a normalized activation function applied independently to each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as the Hadamard product).
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix A.

Since <math>\sum_{m=1}^M W_{\cdot,m}=W\cdot(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S_{\cdot,m}=S\cdot(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S_{\cdot,m}(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network (SOCNN); <math>\text{off}</math> and <math>S</math> in the equation correspond to Offset and Significance in the name, respectively.
Figure 3 shows a scheme of the network.
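
The weighting step in the expression for <math>\hat{y}_n</math> can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the outputs of the offset network and of the significance network are passed in as precomputed matrices, <math>\sigma</math> is assumed to be a row-wise softmax (one normalized activation that satisfies the sum-to-one condition), and the name <code>socnn_combine</code> is hypothetical.

```python
import math


def softmax(row):
    """Normalize one row so that its entries are positive and sum to 1."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def socnn_combine(weights, offsets, past_values, significance):
    """Sum over lags m of W[:, m] * (off(x_{n-m}) + x^I_{n-m}) * sigma(S)[:, m].

    Every argument is a d_I x M nested list (rows: output features,
    columns: lags). offsets and significance stand in for the outputs
    of the offset and significance networks.
    """
    estimate = []
    for w_row, off_row, x_row, s_row in zip(weights, offsets, past_values, significance):
        sig = softmax(s_row)  # assumed sigma: each row sums to 1 over the M lags
        estimate.append(
            sum(w * (o + x) * s for w, o, x, s in zip(w_row, off_row, x_row, sig))
        )
    return estimate


# One output feature, three lags; every adjusted regressor off + x equals 2.0.
weights = [[1.0, 1.0, 1.0]]
offsets = [[0.5, -0.5, 0.0]]
past_values = [[1.5, 2.5, 2.0]]
significance = [[0.0, 0.0, 0.0]]
estimate = socnn_combine(weights, offsets, past_values, significance)
```

With unit weights and equal significance scores, the estimate is the plain average of the adjusted regressors, which shows the autoregressive character of the combination.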

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> ensures the separation of the temporal dependence (captured in the weights <math>W_m</math>). <math>S</math>, which represents the local significance of observations, is determined by its filters, which capture local dependencies and are independent of the relative position in time, and the predictors <math>\text{off}(x_{n-m})</math> are completely independent of position in time. Each past observation provides an adjusted single regressor for the target variable through the offset network. Because consecutive values of <math>x</math> in an asynchronous sampling procedure come from different signals and might be heterogeneous, the adjustment made by the offset network is important. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.

===Relation to asynchronous data===
One common problem with time series is that the durations between consecutive observations vary. The paper states two ways to address this problem:
#Data preprocessing: aligning the observations at some fixed frequency, e.g. duplicating and interpolating observations as shown in Figure 2(a). However, as noted in the figure, this approach tends to lose information and to enlarge the dataset and the model complexity.
#Adding features: treating the duration or time of the observations as additional features. This is the core of SOCNN, as shown in Figure 2(b).
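
The second approach can be sketched as follows. This is a minimal illustration of the representation in Figure 2(b), not the authors' preprocessing code: the function name <code>to_async_rows</code>, the input tuple layout and the choice of a zero duration for the first observation are all assumptions.

```python
def to_async_rows(observations, num_series):
    """Merge separately observed series into one value series.

    observations: iterable of (timestamp, series_index, value) tuples.
    Each output row is [value, indicator_0, ..., indicator_{K-1}, duration],
    where duration is the time elapsed since the previous observation of
    any series (assumed to be 0.0 for the very first row).
    """
    rows = []
    previous_time = None
    for timestamp, series_index, value in sorted(observations):
        duration = 0.0 if previous_time is None else timestamp - previous_time
        indicators = [1.0 if k == series_index else 0.0 for k in range(num_series)]
        rows.append([value] + indicators + [duration])
        previous_time = timestamp
    return rows


# Two sources quoting at irregular times, given out of order.
observations = [(0.0, 0, 101.2), (2.5, 1, 99.8), (1.0, 1, 100.0)]
rows = to_async_rows(observations, num_series=2)
```

No observation is duplicated or dropped, so the dataset keeps its original size; the information about which series produced each value moves into the indicator columns.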

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason, the authors use an auxiliary loss function equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.
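
As a worked illustration of the total loss, the sketch below evaluates <math>L^{tot}</math> for a single sample. The function names are hypothetical, and the adjusted regressors <math>\text{off}(x_{n-m}) + x_{n-m}^I</math> are passed in directly instead of being produced by a network.

```python
def squared_norm(u, v):
    """Squared Euclidean distance ||u - v||^2 between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def total_loss(prediction, regressors, target, alpha):
    """L^tot = L^2(prediction, target) + alpha * L^aux.

    regressors holds the M intermediate predictions
    off(x_{n-m}) + x^I_{n-m}, each a vector of length d_I.
    """
    l2 = squared_norm(prediction, target)
    aux = sum(squared_norm(r, target) for r in regressors) / len(regressors)
    return l2 + alpha * aux


# The final prediction is exact, but each intermediate regressor is off by 1.
loss = total_loss(prediction=[2.0], regressors=[[1.0], [3.0]], target=[2.0], alpha=0.1)
```

Here the L2 term is zero and the auxiliary term is 1, so the total loss equals <math>\alpha</math>. The auxiliary term penalizes poor individual regressors even when their weighted combination is accurate.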

=Experiments=
The paper evaluated the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance was compared with that of a simple CNN, single- and multi-layer LSTMs, and a 25-layer ResNet. Apart from evaluating the SOCNN architecture, the paper also discussed the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that each observation contains the value of only one of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
The authors applied a grid search over some hyperparameters in order to assess the significance of the model's components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they use 1, 10 and 1 for the artificial, electricity and quotes datasets respectively; for <math>\alpha</math>, they compared the values {0, 0.1, 0.01}.

They chose LeakyReLU as the activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, the same stride and a similar kernel size structure in each CNN. In each trained CNN, they applied max pooling with a pool size of 2 after every 2 convolutional layers.
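
For reference, the activation defined above is a one-liner. The sketch below uses a hypothetical function name and exposes the slope as a parameter, with the default of 0.1 taken from the definition in the text.

```python
def leaky_relu(x, slope=0.1):
    """LeakyReLU: identity for x >= 0, a small linear slope otherwise."""
    return x if x >= 0 else slope * x


activations = [leaky_relu(v) for v in (2.0, 0.0, -3.0)]
```

Unlike the plain ReLU, negative inputs keep a small non-zero gradient.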

Table 1 presents the configuration of network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with ratio of 3 to 1. The remaining 20% of data was used as a test set.
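
The split described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name and the seed are hypothetical, and indices stand in for timesteps of a single series.

```python
import random


def split_indices(num_steps, seed=0):
    """Split timestep indices into train, validation and test sets.

    The last 20% of timesteps form the test set. The first 80% are
    shuffled and divided between training and validation at a 3:1 ratio.
    """
    cutoff = int(num_steps * 0.8)
    head = list(range(cutoff))
    random.Random(seed).shuffle(head)  # random assignment within the first 80%
    num_train = (3 * cutoff) // 4
    return head[:num_train], head[num_train:], list(range(cutoff, num_steps))


train, val, test = split_indices(1000)
```

For 1000 timesteps this gives 600 training, 200 validation and 200 test indices, and every test index lies after all training and validation indices in time.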

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments using TensorFlow and Keras, and used different GPUs to optimize the model for different datasets.

==Results==
Table 2 shows the results on all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other models on the asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results. Phased LSTM and ResNet performed very poorly on the artificial asynchronous dataset and the quotes dataset respectively. Notice that having more than one layer in the offset network has a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous dataset; see Table 3. For the other datasets, however, its impact was negligible.
[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for quotes dataset. This effect can be seen in the learning curves for Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, the authors test the robustness of the proposed SOCNN model by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
In Figure 6, the purple line and the green line appear to stay at the same position in both the training and testing processes. SOCNN and single-layer LSTM are the most robust of the compared networks and the least prone to overfitting.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. This architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

The SOCNN can be extended further by adding intermediate weighting layers of the same type to the network structure. Another possible extension, which needs further empirical study, is to consider convolutional kernels other than <math>1 \times 1</math> in the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.

=Critiques=
#The paper is best read as an application paper; the proposed architecture shows improved performance over the baselines on asynchronous time series.
#The quotes data cannot be accessed because it is proprietary, so only two datasets are available.
#The 'Significance' network is described as critical to the model, but the paper does not show how the performance of SOCNN depends on the significance network.
#The transformation of the original data into asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that the train and test data were split is unclear. This could be important in the case of the financial dataset.
#Although the auxiliary loss function is mentioned as an important part, its advantages are not made clear in the paper. It would be better if the paper described its effectiveness in more detail.
#The paper does not state clearly whether the model training was done on a rolling basis for time series forecasting.
#The model robustness analysis in section 5 uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing against neural network models such as CNN, which is known to work badly on time series data, it would be much better if the authors compared against well-known econometric time series models such as GARCH and VAR.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances In Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-29T16:55:10Z

S366chen: /* Model Architecture */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series that have multivariate and noisy signals, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained from different sources (e.g. financial news, an investment bank, a financial analyst etc.) asynchronously. Each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than an investment bank with fewer clients, which means the significance of each past observation may depend on other factors that change in time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance could allow us to combine such linear econometric models with deep neural networks that can learn highly nonlinear relationships. This model is inspired by the gating mechanism which is successful in RNNs and Highway Networks.

The time series forecasting problem can be expressed in terms of the conditional probability distribution below:
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
Thus, we focus on modeling the predictors of future values of time series given their past values.

Financial time series are particularly challenging for the following reasons:
* They have a low signal-to-noise ratio and heavy-tailed distributions.
* They are observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments in time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might depend on other factors that can change in time. Hence, traditional econometric models such as AR, VAR and VARMA might not be sufficient.

The predictability of financial datasets remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that its model, which combines linear models with deep learning models, can perform better than purely deep learning models such as CNNs, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
From recent proceedings of the main machine learning venues, i.e. ICML, NIPS, AISTATS and UAI, we can notice that time series are often forecast using Gaussian processes[3,4], especially irregularly sampled time series[5]. Though the two fields are still largely independent, combined models have started to appear, for example the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models and neural networks to obtain such a combined model.

Although deep neural networks have been applied in many fields and produced satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, a household electricity consumption dataset, and real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a model of order p (relating the current state to the last p states), the equation is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. The model can be extended to vector form to create the VAR model mentioned in the paper.
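
The AR recursion can be simulated in a few lines. This is a minimal sketch: the function name, the zero initial state and the Gaussian noise are assumptions made for illustration, not details from the paper.

```python
import random


def simulate_ar(coefficients, constant, noise_std, num_steps, seed=0):
    """Simulate X_t = c + sum_i phi_i * X_{t-i} + eps_t for an AR(p) model.

    coefficients: [phi_1, ..., phi_p]. The series is assumed to start
    from p zeros, and eps_t is drawn from a normal distribution with
    mean 0 and standard deviation noise_std.
    """
    rng = random.Random(seed)
    p = len(coefficients)
    series = [0.0] * p  # assumed initial state
    for _ in range(num_steps):
        lagged = sum(phi * series[-i] for i, phi in enumerate(coefficients, start=1))
        series.append(constant + lagged + rng.gauss(0.0, noise_std))
    return series[p:]


# Noise-free AR(2) so that the recursion can be checked by hand.
path = simulate_ar([0.5, 0.25], constant=1.0, noise_std=0.0, num_steps=3)
```

Without noise the first three values are 1.0, 1.5 and 2.0, since 1 + 0.5 · 1.5 + 0.25 · 1.0 = 2.0.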

===Gating and weighting mechanisms===
Gating mechanisms for neural networks can mitigate the vanishing gradient problem and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. Different compositions of functions of this type have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].
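
A minimal sketch of such a gate is given below. The function names are hypothetical, and both the candidate output and the gate pre-activations are supplied directly instead of being computed from <math>x</math> by learned layers.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_output(candidate, gate_logits):
    """f(x) = c(x) * sigmoid(gate), applied element-wise."""
    return [c * sigmoid(g) for c, g in zip(candidate, gate_logits)]


# A gate pre-activation of 0 passes exactly half of each candidate value.
out = gated_output([2.0, -4.0], [0.0, 0.0])
```

Large positive gate values pass the candidate almost unchanged, while large negative values suppress it.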

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks. It is most closely related to the softmax gating used in MuFuRu (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRu) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.

This idea has also been used successfully in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. One difference is that these functions are modeled using multi-layer CNNs. Another is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
The authors state five main motivations in the paper:
#The forecasting problem has been studied almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics tends to focus on explaining variables rather than improving out-of-sample prediction power. Econometric models tend to 'over-fit' on financial time series: their parameters are unstable and they perform poorly on out-of-sample prediction.
#It is difficult for learning algorithms to deal with time series data where the observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that is able to handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For higher-order AR time series with more past observations, the expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general may not allow a closed-form expression.
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents separate dimensions as a single one with dimension and duration indicators as additional features (Figure 2(b)).
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>, one may expect that an LSTM would memorize the input values in each step and weight them at the output according to the duration, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional future values of a subset of elements of <math>x_n</math>,
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...,i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m},</math></div>
The estimate is the sum of the columns of the matrix in brackets. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network, i.e. it is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.
#<math>\sigma</math> is a normalized activation function applied independently to each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as the Hadamard product).
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix A.

Since <math>\sum_{m=1}^M W_{\cdot,m}=W\cdot(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S_{\cdot,m}=S\cdot(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S_{\cdot,m}(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network (SOCNN); <math>\text{off}</math> and <math>S</math> in the equation correspond to Offset and Significance in the name, respectively.
Figure 3 shows a scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> ensures the separation of the temporal dependence (captured in the weights <math>W_m</math>). <math>S</math>, which represents the local significance of observations, is determined by its filters, which capture local dependencies and are independent of the relative position in time, and the predictors <math>\text{off}(x_{n-m})</math> are completely independent of position in time. Each past observation provides an adjusted single regressor for the target variable through the offset network. Because consecutive values of <math>x</math> in an asynchronous sampling procedure come from different signals and might be heterogeneous, the adjustment made by the offset network is important. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.

===Relation to asynchronous data===
A common problem with time series is that the durations between consecutive observations vary. The paper describes two ways to handle this:
#Data preprocessing: align the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as noted in the figure, this approach tends to lose information and to enlarge both the dataset and the model complexity.
#Additional features: treat the duration or time of the observations as additional features, as shown in Figure 2(b). This is the core of SOCNN.

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason the authors add an auxiliary loss function, equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.
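
The two loss terms can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names are hypothetical, and <code>intermediate</code> is assumed to already hold the M vectors <math>\text{off}(x_{n-m}) + x_{n-m}^I</math>.

```python
def squared_l2(u, v):
    """Squared Euclidean distance ||u - v||^2 between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def aux_loss(intermediate, y):
    """Mean over the M past steps of ||off(x_{n-m}) + x_{n-m}^I - y_n||^2.

    intermediate: list of M vectors, each off(x_{n-m}) + x_{n-m}^I
    y:            target vector y_n
    """
    return sum(squared_l2(p, y) for p in intermediate) / len(intermediate)


def total_loss(y_hat, intermediate, y, alpha):
    """L^tot = L^2(y_hat, y) + alpha * L^aux, with alpha >= 0."""
    return squared_l2(y_hat, y) + alpha * aux_loss(intermediate, y)
```

Setting <code>alpha</code> to 0 recovers the plain L2 loss, so the auxiliary term only adds a penalty on the individual intermediate predictions.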

=Experiments=
The paper evaluates the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance is compared with a simple CNN, single- and multi-layer LSTMs and a 25-layer ResNet. Apart from evaluating the SOCNN architecture, the paper also discusses the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: The authors generated 4 artificial series <math> X_{K \times N}</math> with <math>K \in \{16,64\} </math>: one synchronous and one asynchronous series for each value of K.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features are global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that each observation contains the value of only one of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
The authors applied a grid search over some hyperparameters to assess the significance of the model's components. These hyperparameters include the depth of the offset sub-network and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they used 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as the activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = \begin{cases} x & \text{if } x\geq 0 \\ 0.1x & \text{otherwise} \end{cases}</math></div>
In the CNN, they used the same number of layers, the same stride and a similar kernel size structure. In each trained CNN, they applied max pooling with a pool size of 2 every 2 convolutional layers.
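
For concreteness, the activation above amounts to the following one-line function. This is a sketch only: the name <code>leaky_relu</code> and the <code>slope</code> argument are illustrative, with the default taken from the 0.1 in the formula.

```python
def leaky_relu(x, slope=0.1):
    """LeakyReLU: identity for x >= 0, a small linear slope otherwise."""
    return x if x >= 0 else slope * x
```

Unlike the plain ReLU, negative inputs keep a non-zero gradient (equal to <code>slope</code>).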

Table 1 presents the configuration of the network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with a ratio of 3 to 1. The remaining 20% of the data was used as a test set.

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]

The authors carried out the experiments with TensorFlow and Keras and used different GPUs to optimize the model for different datasets.

==Results==
Table 2 shows the results for all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other networks on the asynchronous artificial, electricity and quotes datasets. On synchronous data, LSTM may be slightly better, but SOCNN achieves almost the same results. Phased LSTM and ResNet performed very poorly on the artificial asynchronous dataset and the quotes dataset, respectively. Notice that having more than one layer in the offset network has a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous dataset (see Table 3). For the other datasets, however, its impact was negligible.
[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for the quotes dataset. This effect can be seen in the learning curves for the Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, the robustness of the proposed SOCNN model is tested by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The result is shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
In Figure 6, the purple and green lines seem to stay at the same position in both the training and testing process. SOCNN and single-layer LSTM are the most robust of the compared networks and the least prone to overfitting.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. The architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

SOCNN can be extended further by adding intermediate weighting layers of the same type to the network structure. Another possible extension, which needs further empirical study, is to use convolutional kernels other than <math>1 \times 1</math> in the offset sub-network. This new architecture might also be tested on other real-life datasets with relevant characteristics in the future, especially econometric datasets, and more generally applied to time series (stochastic process) regression.

=Critiques=
#The paper is essentially an application paper; the proposed architecture shows improved performance over baselines on asynchronous time series.
#The quote data cannot be accessed because it is proprietary. Also, only two datasets are available.
#The 'Significance' network is described as critical to the model in the paper, but the authors did not show how the performance of SOCNN depends on the significance network.
#The transform of the original data to asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that train and test data were split is unclear. This could be important in the case of the financial data set.
#Although the auxiliary loss function is mentioned as an important part, its advantages are not made very clear in the paper. The paper would benefit from describing its effectiveness in more detail.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in section 5's model robustness analysis uses evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing with neural network models such as CNN, which is known to work badly on time series data, it would be much better if the authors compared with well-known econometric time series models such as GARCH and VAR.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-29T16:54:28Z

S366chen: /* Model Architecture */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Binkowski, Gautier Marti, Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series with multivariate and noisy signals, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained asynchronously from different sources (e.g. financial news, an investment bank, a financial analyst), and each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than one with fewer clients, which means the significance of each past observation may depend on other factors that change in time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance suggests combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The proposed model is also inspired by the gating mechanism that has been successful in RNNs and Highway Networks.

The time series forecasting problem can be expressed as a conditional probability distribution below,
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
Thus, we focus on modeling the predictors of future values of time series given their past values.

The reasons that financial time series are particularly challenging:
* Low signal-to-noise ratio and heavy-tailed distributions.
* Being observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments of time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial data remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that the proposed model, which combines linear models with deep learning models, can perform better than pure deep learning models such as CNNs, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
Recent proceedings of the main machine learning venues (ICML, NIPS, AISTATS, UAI) show that time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though the econometrics and machine learning approaches are still largely independent, combined models have started to appear, for example the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to obtain such a combined model.

Although deep neural networks have been applied in many fields with satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, a household electricity consumption dataset, and real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a model of order p (relating the current state to the last p states), the equation is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.
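
To make the recursion concrete, here is a minimal simulation sketch of an AR(p) process using only the Python standard library. The function name, the zero initial values and the choice of Gaussian noise are assumptions made for illustration; the definition above does not fix them.

```python
import random


def simulate_ar(phi, c, sigma, n_steps, seed=0):
    """Simulate X_t = c + sum_i phi[i-1] * X_{t-i} + eps_t for n_steps steps.

    phi:   coefficients, phi[0] multiplies the most recent value X_{t-1}
    c:     constant term
    sigma: standard deviation of the Gaussian noise eps_t (an assumption)
    """
    rng = random.Random(seed)
    p = len(phi)
    x = [0.0] * p  # initial values, assumed to be zero
    for _ in range(n_steps):
        past = x[-1:-p - 1:-1]  # the last p values, most recent first
        mean = c + sum(coef * value for coef, value in zip(phi, past))
        x.append(mean + rng.gauss(0.0, sigma))
    return x[p:]  # drop the artificial initial values
```

With <code>sigma = 0</code> the process is deterministic, which makes the recursion easy to check by hand: <code>simulate_ar([0.5], 1.0, 0.0, 3)</code> yields 1.0, 1.5 and 1.75.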

===Gating and weighting mechanisms===
Gating mechanisms for neural networks can help overcome the vanishing gradient problem. They can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. Different compositions of functions of this type have proven to be an essential ingredient of popular recurrent architectures such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks. It is most closely related to the softmax gating used in MuFuRU (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRU) and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.
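
The softmax gating formula can be sketched in a few lines of pure Python. This is an illustration rather than MuFuRU's implementation: it assumes the softmax is taken across the L candidates separately for each coordinate, and the function name is hypothetical.

```python
import math


def softmax_gate(candidates, scores):
    """Combine L candidate outputs with softmax weights.

    candidates: list of L vectors f^l(x), all of the same length
    scores:     list of L vectors p_hat^l(x), same shape as candidates
    Returns sum_l p^l(x) * f^l(x), where for each coordinate the
    weights p^l are the softmax of the scores over l.
    """
    dim = len(candidates[0])
    out = []
    for j in range(dim):
        column = [s[j] for s in scores]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        out.append(sum(e / total * cand[j]
                       for e, cand in zip(exps, candidates)))
    return out
```

When all scores are equal, every candidate receives weight 1/L and the output is the plain average of the candidates.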

This idea is also used successfully in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. One difference is that these functions are modeled using multi-layer CNNs. Another is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.

=Motivation=
The authors state five main motivations in the paper:
#The forecasting problem addressed in this paper has been studied almost independently by the econometrics and machine learning communities. Unlike machine learning, research in econometrics tends to focus on explaining variables rather than on improving out-of-sample prediction power. Such models tend to 'over-fit' on financial time series: their parameters are unstable and their out-of-sample performance is poor.
#It is difficult for learning algorithms to deal with time series data in which the observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that can handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For AR time series of higher order with more past observations, the expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general may not allow a closed-form expression.
#In practice, the dimensions of a multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents the separate dimensions as a single series, with dimension and duration indicators as additional features (Figure 2(b)).
#Consider a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>. One may expect an LSTM to memorize the input values at each step and weight them at the output according to the durations, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while the past observations themselves might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]
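
The representation in Figure 2(b) can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name, the input format (a dict of timestamped values per source) and the zero duration for the first row are assumptions.

```python
def to_async_rows(series):
    """Merge several asynchronously observed series into one.

    series: dict mapping a series name to a list of (timestamp, value)
    Returns one row per observation, ordered by time:
        [value, one-hot indicator of the source series..., duration]
    where duration is the time since the previous observation of any
    series (0.0 for the first row, by assumption).
    """
    names = sorted(series)
    events = sorted((t, name, v)
                    for name in names
                    for t, v in series[name])
    rows = []
    prev_t = None
    for t, name, v in events:
        duration = 0.0 if prev_t is None else t - prev_t
        indicator = [1.0 if name == n else 0.0 for n in names]
        rows.append([v] + indicator + [duration])
        prev_t = t
    return rows
```

Each row has a fixed width regardless of how many series are observed at a given moment, so no observation needs to be duplicated or interpolated onto a fixed time grid.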

=Model Architecture=
Suppose there is a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>, and we want to predict the conditional future values of a subset of elements of <math>x_n</math>:
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\tilde{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))].,_m ,</math></div>
The estimate is the sum of the columns of the matrix in brackets. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network which is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.

#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as Hadamard matrix multiplication).
#<math>A.,_m</math> denotes the m-th column of a matrix A.

Since <math>\sum_{m=1}^M W.,_m=W\cdot(1,1,...,1)^T</math> and <math>\sum_{m=1}^M S.,_m=S\cdot(1,1,...,1)^T</math>, we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W.,_m \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S.,_m(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network (SOCNN). The terms <math>\text{off}</math> and <math>S</math> in the equation correspond to "Offset" and "Significance" in the name, respectively.
Figure 3 shows the scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> separates three things: the temporal dependence (captured in the weights <math>W_m</math>), the local significance of observations <math>S</math>, and the predictors <math>\text{off}(x_{n-m})</math>. <math>S</math> is determined by its filters, which capture local dependencies and are independent of the relative position in time, while the predictors <math>\text{off}(x_{n-m})</math> are completely independent of position in time. Through the offset network, each past observation provides an adjusted single regressor for the target variable. This adjustment is important because, under asynchronous sampling, consecutive values of <math>x</math> come from different signals and might be heterogeneous. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed in an autoregressive manner.
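
The final weighting step can be sketched in pure Python as below. This is not the authors' implementation: it takes the outputs of the offset and significance networks as given, assumes softmax as the row-normalized activation <math>\sigma</math> (one function that satisfies the normalization condition), and uses hypothetical function names.

```python
import math


def row_softmax(row):
    """Normalize one row so its entries are positive and sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def socnn_combine(W, offsets, x_I, S):
    """Compute y_hat as a weighted sum over the M past observations.

    All arguments are d_I x M nested lists (rows = target features,
    columns = past steps m = 1..M):
      W       - position weights
      offsets - off(x_{n-m}), outputs of the offset network
      x_I     - observed target features x_{n-m}^I
      S       - raw significance scores, before normalization
    """
    y_hat = []
    for w_row, off_row, x_row, s_row in zip(W, offsets, x_I, S):
        sig = row_softmax(s_row)
        y_hat.append(sum(w * (o + x) * s
                         for w, o, x, s in zip(w_row, off_row, x_row, sig)))
    return y_hat
```

With unit weights, zero offsets and equal significance scores, the result reduces to the plain average of the past observations, which shows how the model generalizes a simple autoregressive average.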

===Relation to asynchronous data===
A common problem with time series is that the durations between consecutive observations vary. The paper describes two ways to handle this:
#Data preprocessing: align the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as noted in the figure, this approach tends to lose information and to enlarge both the dataset and the model complexity.
#Additional features: treat the duration or time of the observations as additional features, as shown in Figure 2(b). This is the core of SOCNN.

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason the authors add an auxiliary loss function, equal to the mean squared error of these intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.

=Experiments=
The paper evaluates the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and a financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance is compared with a simple CNN, single- and multi-layer LSTMs and a 25-layer ResNet. Apart from evaluating the SOCNN architecture, the paper also discusses the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: The authors generated 4 artificial series <math> X_{K \times N}</math> with <math>K \in \{16,64\} </math>: one synchronous and one asynchronous series for each value of K.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features are global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that each observation contains the value of only one of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
The authors applied a grid search over some hyperparameters to assess the significance of the model's components. These hyperparameters include the depth of the offset sub-network and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they used 1, 10 and 1 for the artificial, electricity and quotes datasets respectively, and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as the activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = \begin{cases} x & \text{if } x\geq 0 \\ 0.1x & \text{otherwise} \end{cases}</math></div>
In the CNN, they used the same number of layers, the same stride and a similar kernel size structure. In each trained CNN, they applied max pooling with a pool size of 2 every 2 convolutional layers.

Table 1 presents the configuration of the network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with a ratio of 3 to 1. The remaining 20% of the data was used as a test set.
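
The split described above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name, the fixed seed and the integer rounding of the 80% and 3:1 boundaries are assumptions.

```python
import random


def split_indices(n, seed=0):
    """Split timestep indices 0..n-1 into train, validation and test sets.

    The last 20% of timesteps form the test set. The first 80% are
    shuffled and divided 3:1 between training and validation.
    """
    cut = (4 * n) // 5  # boundary between the first 80% and the rest
    head = list(range(cut))
    random.Random(seed).shuffle(head)
    n_train = (3 * cut) // 4
    train = sorted(head[:n_train])
    val = sorted(head[n_train:])
    test = list(range(cut, n))
    return train, val, test
```

Keeping the test set as the final contiguous block means the model is always evaluated on timesteps that come after everything seen during training and validation.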

All models were trained using Adam optimizer because the authors found that its rate of convergence was much faster than standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for artificial and electricity data, and 256 for quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]
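
The normalized uniform initialization of Glorot & Bengio draws each weight uniformly from <math>[-l, l]</math> with <math>l = \sqrt{6/(n_{in}+n_{out})}</math>, where <math>n_{in}</math> and <math>n_{out}</math> are the numbers of input and output units of the layer. A minimal standard-library sketch for a dense weight matrix follows; the function name and the seed argument are illustrative, and in practice a framework initializer would be used instead.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, seed=0):
    """Return a fan_in x fan_out weight matrix with Glorot uniform entries."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

The limit shrinks as the layer gets wider, which keeps the variance of activations and gradients roughly constant across layers.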

The authors carried out the experiments with TensorFlow and Keras and used different GPUs to optimize the model for different datasets.

==Results==
Table 2 shows the results on all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other models on all the asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results as LSTM. Phased LSTM and ResNet performed very poorly on the artificial asynchronous datasets and the quotes dataset respectively. Notice that having more than one layer in the offset network has a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous datasets; see Table 3. However, for the other datasets, its impact was negligible.
[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for the quotes dataset. This effect can be seen in the learning curves for the Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, the robustness of the proposed SOCNN model is tested by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The results are shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
In Figure 6, the purple and green lines appear to stay at almost the same level for both training and testing. SOCNN and single-layer LSTM are the most robust of the considered networks and the least prone to overfitting.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. This new architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension, which needs further empirical study, is to consider more than just <math>1 \times 1</math> convolutional kernels in the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.

=Critiques=
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines on asynchronous time series.
#The quote data cannot be accessed because they are proprietary. Also, only two datasets are available.
#The 'Significance' network was described as critical to the model in the paper, but the authors did not show how the performance of SOCNN depends on the significance network.
#The transformation of the original data into asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that the train and test data were split is unclear. This could be important in the case of the financial dataset.
#Although the auxiliary loss function was mentioned as an important part, its advantages were not made very clear in the paper. It would be better if the paper described its effectiveness in more detail.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in Section 5's model robustness analysis is evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing only with neural network models such as CNNs, which are known to work badly on time series data, it would be much better if the authors compared with well-known econometric time series models such as GARCH and VAR.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using Gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

stat946F18/Autoregressive Convolutional Neural Networks for Asynchronous Time Series

2018-11-29T16:52:08Z

S366chen: /* Introduction */

This page is a summary of the paper "[http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf Autoregressive Convolutional Neural Networks for Asynchronous Time Series]" by Mikołaj Bińkowski, Gautier Marti, and Philippe Donnat. It was published at ICML in 2018. The code for this paper is provided [https://github.com/mbinkowski/nntimeseries here].

=Introduction=
In this paper, the authors propose a deep convolutional network architecture called Significance-Offset Convolutional Neural Network for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating systems used in recurrent neural networks. The model is evaluated on various time series data including:
# Hedge fund proprietary dataset of over 2 million quotes for a credit derivative index,
# An artificially generated noisy auto-regressive series,
# A UCI household electricity consumption dataset.

This paper focuses on time series that have multivariate and noisy signals, especially financial data. Financial time series are challenging to predict due to their low signal-to-noise ratio and heavy-tailed distributions. For example, the same signal (e.g. the price of a stock) is obtained from different sources (e.g. financial news, an investment bank, a financial analyst etc.) asynchronously, and each source may have a different bias or noise ([[Media: Junyi1.png|Figure 1]]). An investment bank with more clients can update its information more precisely than an investment bank with fewer clients, which means that the significance of each past observation may depend on other factors that change in time. Therefore, traditional econometric models such as AR, VAR (Vector Autoregressive Model) and VARMA (Vector Autoregressive Moving Average Model) [1] might not be sufficient. However, their relatively good performance suggests combining such linear econometric models with deep neural networks that can learn highly nonlinear relationships. The proposed model is also inspired by the gating mechanism, which has been successful in RNNs and Highway Networks.

The time series forecasting problem can be expressed as a conditional probability distribution below,
<div style="text-align: center;"><math>p(X_{t+d}|X_t,X_{t-1},...) = f(X_t,X_{t-1},...)</math></div>
Thus, we focus on modeling the predictors of future values of time series given their past values.

Financial time series are particularly challenging for the following reasons:
* Low signal-to-noise ratio and heavy-tailed distributions.
* They are observed from different sources (e.g. financial news, analysts, portfolio managers in hedge funds, market-makers in investment banks) at asynchronous moments of time. Each of these sources may have a different bias and noise with respect to the original signal that needs to be recovered.
* Data sources are usually strongly correlated and lead-lag relationships are possible (e.g. a market-maker with more clients can update its view more frequently and precisely than one with fewer clients).
* The significance of each of the available past observations might be dependent on some other factors that can change in time. Hence, the traditional econometric models such as AR, VAR, VARMA might not be sufficient.

The predictability of financial datasets remains an open problem and is discussed in various publications [2].

[[File:Junyi1.png | 500px|thumb|center|Figure 1: Quotes from four different market participants (sources) for the same credit default swaps (CDS) throughout one day. Each trader displays from time to time the prices for which he offers to buy (bid) and sell (ask) the underlying CDS. The filled area marks the difference between the best sell and buy offers (spread) at each time.]]

The paper also provides empirical evidence that their model, which combines linear models with deep learning models, can perform better than purely deep learning models such as CNNs, LSTMs and Phased LSTMs.

=Related Work=
===Time series forecasting===
Judging from recent proceedings of the main machine learning venues, i.e. ICML, NIPS, AISTATS and UAI, time series are often forecast using Gaussian processes[3,4], especially for irregularly sampled time series[5]. Though the machine learning and econometrics communities are still largely independent, combined models have started to appear, for example, the Gaussian Copula Process Volatility model[6]. In this paper, the authors couple AR models with neural networks to obtain such a combined model.

Although deep neural networks have been applied in many fields and produced satisfactory results, there is still little literature on deep learning for time series forecasting. Recent papers include Sirignano (2016)[7], who used 4-layer perceptrons to model price change distributions in Limit Order Books, and Borovykh et al. (2017)[8], who applied the more recent WaveNet architecture to several short univariate and bivariate time series (including financial ones). Heaton et al. (2016)[9] claimed to use autoencoders with a single hidden layer to compress multivariate financial data. Neil et al. (2016)[10] presented an augmentation of the LSTM architecture suitable for asynchronous series, which stimulates learning dependencies of different frequencies through a time gate.

In this paper, the authors examine the capabilities of several architectures (CNN, residual network, multi-layer LSTM, and Phased LSTM) on AR-like artificial asynchronous and noisy time series, a household electricity consumption dataset, and real financial data from the credit default swap market with some inefficiencies.

====AR Model====

An autoregressive (AR) model describes the next value in a time series as a combination of previous values, scaling factors, a bias, and noise [https://onlinecourses.science.psu.edu/stat501/node/358/ (source)]. For a model of order p (relating the current state to the last p states), the equation of the model is:

<math> X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t \,</math> [https://en.wikipedia.org/wiki/Autoregressive_model#Definition (equation source)]

Here <math>\varphi_i</math> are the parameters/coefficients, <math>c</math> is a constant, and <math>\varepsilon_t</math> is noise. This can be extended to vector form to create the VAR model mentioned in the paper.
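
The AR(p) equation above can be illustrated with a short NumPy sketch. This is only an illustration and not code from the paper: the function names, the Gaussian noise and the zero initial conditions are assumptions made for the example.

```python
import numpy as np

def simulate_ar(phi, c=0.0, sigma=1.0, n=200, seed=0):
    """Simulate X_t = c + sum_i phi[i-1] * X_{t-i} + eps_t.

    Assumption for this sketch: eps_t is Gaussian with standard deviation sigma,
    and the first p values of the series are initialised to zero.
    """
    rng = np.random.default_rng(seed)
    p = len(phi)
    x = np.zeros(n + p)
    for t in range(p, n + p):
        # x[t - p:t][::-1] is (X_{t-1}, ..., X_{t-p})
        x[t] = c + np.dot(phi, x[t - p:t][::-1]) + sigma * rng.standard_normal()
    return x[p:]

def predict_next(x, phi, c=0.0):
    """One-step-ahead conditional mean given the last p observations."""
    p = len(phi)
    return c + np.dot(phi, x[-p:][::-1])

phi = np.array([0.5, -0.2])                  # hypothetical AR(2) coefficients
series = simulate_ar(phi, c=1.0, sigma=0.1)
forecast = predict_next(series, phi, c=1.0)
```

The forecast is simply the deterministic part of the equation, since the noise term has zero mean.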

===Gating and weighting mechanisms===
Gating mechanisms for neural networks have the ability to overcome the problem of vanishing gradients, and can be expressed as <math display="inline">f(x)=c(x) \otimes \sigma(x)</math>, where <math>f</math> is the output function, <math>c</math> is a "candidate output" (a nonlinear function of <math>x</math>), <math>\otimes</math> is an element-wise matrix product, and <math>\sigma : \mathbb{R} \rightarrow [0,1] </math> is a sigmoid nonlinearity that controls the amount of output passed to the next layer. Different compositions of functions of the type described above have proven to be an essential ingredient in popular recurrent architectures such as LSTM and GRU[11].

The main purpose of the proposed gating system is to weight the outputs of the intermediate layers within neural networks; it is most closely related to the softmax gating used in MuFuRU (Multi-Function Recurrent Unit)[12], i.e.
<math display="inline"> f(x) = \sum_{l=1}^L p^l(x) \otimes f^l(x)\text{,}\ p(x)=\text{softmax}(\widehat{p}(x)) </math>, where <math>(f^l)_{l=1}^L </math> are candidate outputs (composition operators in MuFuRU), and <math>(\widehat{p}^l)_{l=1}^L </math> are linear functions of the inputs.

This idea is also successfully used in attention networks[13] for tasks such as image captioning and machine translation. The proposed method is similar in that the separate inputs (time series steps in this case) are weighted in accordance with learned functions of these inputs. The difference is that these functions are modeled using multi-layer CNNs. Another difference is that the proposed method does not use recurrent layers, which in attention networks enable the network to remember the parts of the sentence/image already translated/described.
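
The two gating forms described in this section, the sigmoid gate <math>f(x)=c(x) \otimes \sigma(x)</math> and the softmax gate over <math>L</math> candidate outputs, can be sketched as follows. This is a minimal illustration: the function names and array shapes are assumptions, and the candidates and logits are passed in directly rather than computed by learned layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid_gate(candidate, gate_logits):
    """f(x) = c(x) * sigmoid(.): each candidate entry is scaled by a value in (0, 1)."""
    return candidate * sigmoid(gate_logits)

def softmax_gate(candidates, logits):
    """f(x) = sum_l p^l(x) * f^l(x), with p = softmax over the L candidates.

    candidates, logits: arrays of shape (L, k).
    """
    p = softmax(logits, axis=0)
    return (p * candidates).sum(axis=0)

candidates = np.array([[1.0, 2.0], [3.0, 4.0]])   # L = 2 candidate outputs of size 2
mixed = softmax_gate(candidates, np.zeros((2, 2)))  # equal logits give a plain average
```

With equal logits the softmax weights are uniform, so the output is the average of the candidates; learned logits shift the weight towards particular candidates.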

=Motivation=
The authors state five main motivations in the paper:
#The forecasting problem addressed in this paper has been studied almost independently by the econometrics and machine learning communities. Unlike in machine learning, research in econometrics focuses more on explaining variables than on improving out-of-sample prediction power. Such models tend to 'over-fit' on financial time series: their parameters are unstable and their out-of-sample performance is poor.
#It is difficult for learning algorithms to deal with time series data where the observations have been made irregularly. Although Gaussian processes provide a useful theoretical framework that is able to handle asynchronous data, they are not suitable for financial datasets, which often follow heavy-tailed distributions.
#Predictions of autoregressive time series may involve highly nonlinear functions if the series is sampled irregularly. For AR time series of higher order with more past observations, the expectation <math display="inline">\mathbb{E}[X(t)|\{X(t-m), m=1,...,M\}]</math> may involve more complicated functions that in general may not allow a closed-form expression.
#In practice, the dimensions of multivariate time series are often observed separately and asynchronously. Aligning such series at a fixed frequency may lose information or enlarge the dataset, as shown in Figure 2(a). Therefore, the core of the proposed architecture, SOCNN, represents the separate dimensions as a single series with dimension and duration indicators as additional features (Figure 2(b)).
#Given a series of pairs of consecutive input values and corresponding durations, <math display="inline"> x_n = (X(t_n),t_n-t_{n-1}) </math>, one may expect that an LSTM would memorize the input values in each step and weight them at the output according to the duration, but this approach may lead to an imbalance between the needs for memory and for linearity. The weights that are assigned to the memorized observations potentially require several layers of nonlinearity to be computed properly, while past observations might just need to be memorized as they are.

[[File:Junyi2.png | 550px|thumb|center|Figure 2: (a) Fixed sampling frequency and its drawbacks; keeping all available information leads to much more datapoints. (b) Proposed data representation for the asynchronous series. Consecutive observations are stored together as a single value series, regardless of which series they belong to; this information, however, is stored in indicator features, alongside durations between observations.]]

=Model Architecture=
Suppose there exists a multivariate time series <math display="inline">(x_n)_{n=0}^{\infty} \subset \mathbb{R}^d </math>. We want to predict the conditional future values of a subset of elements of <math>x_n</math>:
<div style="text-align: center;"><math>y_n = \mathbb{E} [x_n^I | \{x_{n-m}, m=1,2,...\}], </math></div>
where <math> I=\{i_1,i_2,...i_{d_I}\} \subset \{1,2,...,d\} </math> is a subset of features of <math>x_n</math>.

Let <math> \textbf{x}_n^{-M} = (x_{n-m})_{m=1}^M </math>.

The estimator of <math>y_n</math> can be expressed as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M [F(\textbf{x}_n^{-M}) \otimes \sigma(S(\textbf{x}_n^{-M}))]_{\cdot,m},</math></div>
i.e. the estimate is the sum of the columns of the matrix in brackets. Here
#<math>F,S : \mathbb{R}^{d \times M} \rightarrow \mathbb{R}^{d_I \times M}</math> are neural networks.
#* <math>S</math> is a fully convolutional network which is composed of convolutional layers only.
#* <math display="inline">F(\textbf{x}_n^{-M}) = W \otimes [\text{off}(x_{n-m}) + x_{n-m}^I]_{m=1}^M </math>
#** <math> W \in \mathbb{R}^{d_I \times M}</math>
#** <math> \text{off}: \mathbb{R}^d \rightarrow \mathbb{R}^{d_I} </math> is a multilayer perceptron.

#<math>\sigma</math> is a normalized activation function independent at each row, i.e. <math display="inline"> \sigma ((a_1^T, ..., a_{d_I}^T)^T)=(\sigma(a_1)^T,..., \sigma(a_{d_I})^T)^T </math>
#* for any <math>a_{i} \in \mathbb{R}^{M}</math>
#* and <math>\sigma </math> is defined such that <math>\sigma(a)^{T} \mathbf{1}_{M}=1</math> for any <math>a \in \mathbb{R}^M</math>.
# <math>\otimes</math> is element-wise matrix multiplication (also known as the Hadamard product).
#<math>A_{\cdot,m}</math> denotes the m-th column of a matrix A.

Since the Hadamard product acts column by column, the m-th column of <math>F(\textbf{x}_n^{-M})</math> is <math>W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I)</math>, and we can express <math>\hat{y}_n</math> as:
<div style="text-align: center;"><math>\hat{y}_n = \sum_{m=1}^M W_{\cdot,m} \otimes (\text{off}(x_{n-m}) + x_{n-m}^I) \otimes \sigma(S_{\cdot,m}(\textbf{x}_n^{-M}))</math></div>
This is the proposed network, the Significance-Offset Convolutional Neural Network; <math>\text{off}</math> and <math>S</math> in the equation correspond to Offset and Significance in the name, respectively.
Figure 3 shows the scheme of the network.

[[File:Junyi3.png | 600px|thumb|center|Figure 3: A scheme of the proposed SOCNN architecture. The network preserves the time-dimension up to the top layer, while the number of features per timestep (filters) in the hidden layers is custom. The last convolutional layer, however, has the number of filters equal to dimension of the output. The Weighting frame shows how outputs from offset and significance networks are combined in accordance with Eq. of <math>\hat{y}_n</math>.]]

The form of <math>\hat{y}_n</math> ensures the separation of the temporal dependence (captured in the weights <math>W_m</math>) from the local significance of observations and from the predictors. <math>S</math>, which represents the local significance of observations, is determined by its filters, which capture local dependencies and are independent of the relative position in time, and the predictors <math>\text{off}(x_{n-m})</math> are completely independent of position in time. Each past observation provides an adjusted single regressor for the target variable through the offset network. Since in an asynchronous sampling procedure consecutive values of x come from different signals and might be heterogeneous, the adjustment by the offset network is important. In addition, the significance network provides a data-dependent weight for each regressor, and the weighted regressors are summed up in an autoregressive manner.
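
The estimator <math>\hat{y}_n</math> described in this section can be sketched in NumPy as below. This is a hedged illustration, not the authors' implementation: the offset and significance networks are passed in as plain callables (in the paper they are a multilayer perceptron and a convolutional network), softmax over the time axis is assumed as the normalized activation <math>\sigma</math>, and the function and argument names are hypothetical.

```python
import numpy as np

def softmax_rows(a):
    """Normalise each row over the time axis so that it sums to 1."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def socnn_forward(x, target_idx, W, off, S):
    """Combine offset and significance outputs into the prediction.

    x          : (d, M) window of past observations, one column per time step
    target_idx : indices I of the features to predict (length d_I)
    W          : (d_I, M) position-dependent weights
    off        : callable mapping a (d,) observation to a (d_I,) offset
    S          : callable mapping the (d, M) window to (d_I, M) significance scores
    """
    M = x.shape[1]
    # One adjusted regressor off(x_{n-m}) + x_{n-m}^I per past observation.
    regressors = np.stack(
        [off(x[:, m]) + x[target_idx, m] for m in range(M)], axis=1
    )                                   # (d_I, M)
    weights = softmax_rows(S(x))        # (d_I, M), each row sums to 1
    return (W * regressors * weights).sum(axis=1)

# Trivial stand-ins: zero offset, constant significance, unit weights.
x = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])
y_hat = socnn_forward(
    x, [0], np.ones((1, 3)),
    off=lambda obs: np.zeros(1),
    S=lambda window: np.zeros((1, 3)),
)
```

With these stand-ins the prediction reduces to the plain average of the past values of the target feature, which shows how the model falls back to a simple AR-like weighting when the sub-networks are uninformative.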

===Relation to asynchronous data===
One common problem with time series is that the durations between consecutive observations vary. The paper states two ways to address this problem:
#Data preprocessing: aligning the observations at some fixed frequency, e.g. by duplicating and interpolating observations as shown in Figure 2(a). However, as mentioned in the figure, this approach tends to lose information and to enlarge the dataset and model complexity.
#Additional features: treating the duration or time of the observations as additional features. This is the core of SOCNN and is shown in Figure 2(b).
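
The second approach can be sketched as follows: asynchronous observations from several series are merged into a single series in which each row holds the value, a one-hot indicator of the originating series, and the duration since the previous observation. The function name, the event format and the choice of duration 0 for the first observation are assumptions made for this illustration.

```python
import numpy as np

def to_single_series(events, n_series):
    """Merge asynchronous observations into one feature matrix.

    events   : iterable of (timestamp, series_index, value)
    n_series : number of separate series (dimensions)
    Returns an array of shape (len(events), n_series + 2) with columns
    [value, one-hot series indicator..., duration since previous observation].
    """
    events = sorted(events, key=lambda e: e[0])
    rows = np.zeros((len(events), n_series + 2))
    prev_t = events[0][0]  # assumption: the first observation gets duration 0
    for k, (t, idx, value) in enumerate(events):
        rows[k, 0] = value
        rows[k, 1 + idx] = 1.0
        rows[k, -1] = t - prev_t
        prev_t = t
    return rows

# Two hypothetical sources quoting at irregular times.
events = [(0.0, 0, 101.2), (0.4, 1, 99.8), (1.5, 0, 101.0)]
X = to_single_series(events, n_series=2)
```

No observation is duplicated or interpolated, so the number of rows equals the number of original observations.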

===Loss function===
The L2 error is a natural loss function for the estimators of expected value: <math>L^2(y,y')=||y-y'||^2</math>

The output of the offset network is a series of separate predictors of the changes between the corresponding observations <math>x_{n-m}^I</math> and the target value <math>y_n</math>. For this reason an auxiliary loss function is used, which equals the mean squared error of such intermediate predictions:
<div style="text-align: center;"><math>L^{aux}(\textbf{x}_n^{-M}, y_n)=\frac{1}{M} \sum_{m=1}^M ||\text{off}(x_{n-m}) + x_{n-m}^I -y_n||^2 </math></div>
The total loss for the sample <math> (\textbf{x}_n^{-M},y_n) </math> is then given by:
<div style="text-align: center;"><math>L^{tot}(\textbf{x}_n^{-M}, y_n)=L^2(\widehat{y}_n, y_n)+\alpha L^{aux}(\textbf{x}_n^{-M}, y_n)</math></div>
where <math>\widehat{y}_n</math> is the estimator defined above and <math>\alpha \geq 0</math> is a constant.
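
The total loss for a single sample can be written out directly. In this sketch the intermediate predictions <math>\text{off}(x_{n-m}) + x_{n-m}^I</math> are assumed to be already collected as the columns of an array named <code>regressors</code>; that name and the function name are chosen for the example only.

```python
import numpy as np

def total_loss(y_hat, regressors, y, alpha):
    """L_tot = ||y_hat - y||^2 + alpha * (1/M) * sum_m ||regressors[:, m] - y||^2.

    y_hat      : (d_I,) final prediction
    regressors : (d_I, M) intermediate predictions, one column per past observation
    y          : (d_I,) target
    alpha      : non-negative weight of the auxiliary loss
    """
    l2 = np.sum((y_hat - y) ** 2)
    aux = np.mean(np.sum((regressors - y[:, None]) ** 2, axis=0))
    return l2 + alpha * aux

loss = total_loss(
    y_hat=np.array([2.0]),
    regressors=np.array([[1.0, 3.0]]),
    y=np.array([1.0]),
    alpha=0.1,
)
```

Setting <code>alpha=0</code> recovers the plain L2 loss, which corresponds to one of the values compared in the experiments.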

=Experiments=
The paper evaluated the SOCNN architecture on three datasets: artificially generated datasets, the [https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption household electric power consumption dataset], and the financial dataset of bid/ask quotes provided by several market participants active in the credit derivatives market. Its performance was compared with a simple CNN, single- and multi-layer LSTMs, and a 25-layer ResNet. Apart from the evaluation of the SOCNN architecture, the paper also discusses the impact of network components such as the auxiliary loss and the depth of the offset sub-network. The code and datasets are available [https://github.com/mbinkowski/nntimeseries here].

==Datasets==
Artificial data: They generated 4 artificial series, <math> X_{K \times N}</math>, where <math>K \in \{16,64\} </math>. Therefore there is a synchronous and an asynchronous series for each K value.

Electricity data: This UCI dataset contains 7 different features excluding date and time. The features include global active power, global reactive power, voltage, global intensity, sub-metering 1, sub-metering 2 and sub-metering 3, recorded every minute for 47 months. The data has been altered so that one observation contains only one value of the 7 features, while the durations between consecutive observations range from 1 to 7 minutes. The goal is to predict all 7 features for the next time step.

Non-anonymous quotes: The dataset contains 2.1 million quotes from 28 different sources from different market participants such as analysts, banks etc. Each quote is characterized by 31 features: the offered price, 28 indicators of the quoting source, the direction indicator (the quote refers to either a buy or a sell offer) and duration from the previous quote. For each source and direction, we want to predict the next quoted price from this given source and direction considering the last 60 quotes.

==Training details==
They applied grid search over some hyperparameters in order to assess the impact of the model's components. The hyperparameters include the offset sub-network's depth and the auxiliary weight <math>\alpha</math>. For the offset sub-network's depth, they use 1, 10, and 1 for the artificial, electricity and quotes datasets respectively; and they compared values of <math>\alpha</math> in {0, 0.1, 0.01}.

They chose LeakyReLU as activation function for all networks:
<div style="text-align: center;"><math>\sigma^{LeakyReLU}(x) = x</math> if <math>x\geq 0</math>, and <math>0.1x</math> otherwise </div>
They use the same number of layers, same stride and similar kernel size structure in CNN. In each trained CNN, they applied max pooling with the pool size of 2 every 2 convolutional layers.
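
The LeakyReLU activation defined above, with a negative-side slope of 0.1, can be written in one line of NumPy. The function name and the keyword argument are chosen for this sketch.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Return x where x >= 0 and slope * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, slope * x)

activated = leaky_relu([-2.0, 0.0, 3.0])
```

Unlike the plain ReLU, negative inputs are scaled rather than set to zero, so their gradient does not vanish.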

Table 1 presents the configuration of network hyperparameters used in the comparison.

[[File:Junyi4.png | 520px|center|]]

===Network Training===
The training and validation data were sampled randomly from the first 80% of timesteps in each series, with a ratio of 3 to 1. The remaining 20% of the data was used as a test set.
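
One possible reading of this split is sketched below: the last 20% of timesteps form the test set, and the first 80% are randomly divided into training and validation indices with a 3 to 1 ratio. The function name and the seed handling are assumptions, and the paper's exact sampling procedure may differ.

```python
import numpy as np

def split_indices(n, seed=0):
    """Split timestep indices 0..n-1 into train, validation and test sets."""
    rng = np.random.default_rng(seed)
    cut = (4 * n) // 5                  # first 80% of timesteps
    head = rng.permutation(cut)         # random order within the first 80%
    n_train = (3 * cut) // 4            # 3 to 1 train/validation ratio
    train = np.sort(head[:n_train])
    val = np.sort(head[n_train:])
    test = np.arange(cut, n)            # remaining 20%, kept in time order
    return train, val, test

train, val, test = split_indices(100)
```

Because the test indices all come after the training and validation indices, the test error measures performance on timesteps the model has not seen.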

All models were trained using the Adam optimizer because the authors found that its rate of convergence was much faster than that of standard Stochastic Gradient Descent in early tests.

They used a batch size of 128 for the artificial and electricity data, and 256 for the quotes dataset, and applied batch normalization between each convolution and the following activation.

At the beginning of each epoch, the training samples were randomly sampled. To prevent overfitting, they applied dropout and early stopping.

Weights were initialized using the normalized uniform procedure proposed by Glorot & Bengio (2010).[14]
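
The normalized uniform initialization of Glorot & Bengio draws each weight from a uniform distribution on <math>[-l, l]</math> with <math>l = \sqrt{6/(n_{in}+n_{out})}</math>, where <math>n_{in}</math> and <math>n_{out}</math> are the numbers of input and output units of the layer. A minimal sketch for a dense weight matrix, with a hypothetical function name, is:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U[-limit, limit]."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
weights = glorot_uniform(3, 3, rng)   # limit is sqrt(6 / 6) = 1 here
```

Scaling the range by the layer sizes keeps the variance of activations and gradients roughly constant across layers at the start of training.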

The authors carried out the experiments using TensorFlow and Keras, and used different GPUs to optimize the model for different datasets.

==Results==
Table 2 shows the results on all datasets.
[[File:Junyi5.png | 800px|center|]]
We can see that SOCNN outperforms the other models on all the asynchronous artificial, electricity and quotes datasets. For synchronous data, LSTM might be slightly better, but SOCNN achieves almost the same results as LSTM. Phased LSTM and ResNet performed very poorly on the artificial asynchronous datasets and the quotes dataset respectively. Notice that having more than one layer in the offset network has a negative impact on the results. Also, higher weights of the auxiliary loss (<math>\alpha</math>) considerably improved the test error on the asynchronous datasets; see Table 3. However, for the other datasets, its impact was negligible.
[[File:Junyi6.png | 480px|center|]]
In general, SOCNN has significantly lower variance of the test and validation errors, especially in the early stage of the training process and for the quotes dataset. This effect can be seen in the learning curves for the Asynchronous 64 artificial dataset presented in Figure 5.
[[File:Junyi7.png | 500px|thumb|center|Figure 5: Learning curves with different auxiliary weights for SOCNN model trained on Asynchronous 64 dataset. The solid lines indicate the test error while the dashed lines indicate the training error.]]

Finally, the robustness of the proposed SOCNN model is tested by adding noise terms to the Asynchronous 16 dataset and checking how the networks perform. The results are shown in Figure 6.
[[File:Junyi8.png | 600px|thumb|center|Figure 6: Experiment comparing robustness of the considered networks for Asynchronous 16 dataset. The plots show how the error would change if an additional noise term was added to the input series. The dotted curves show the total significance and average absolute offset (not to scale) outputs for the noisy observations. Interestingly, the significance of the noisy observations increases with the magnitude of noise; i.e. noisy observations are far from being discarded by SOCNN.]]
In Figure 6, the purple and green lines appear to stay at almost the same level for both training and testing. SOCNN and single-layer LSTM are the most robust of the considered networks and the least prone to overfitting.

=Conclusion and Discussion=
In this paper, the authors have proposed a new architecture called the Significance-Offset Convolutional Neural Network, which combines an AR-like weighting mechanism with a convolutional neural network. This new architecture is designed for high-noise asynchronous time series and outperforms popular convolutional and recurrent networks in forecasting several asynchronous time series.

The SOCNN can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible extension, which needs further empirical study, is to consider more than just <math>1 \times 1</math> convolutional kernels in the offset sub-network. Also, this new architecture might be tested on other real-life datasets with relevant characteristics in the future, especially on econometric datasets and more generally for time series (stochastic processes) regression.

=Critiques=
#The paper is most likely an application paper, and the proposed new architecture shows improved performance over baselines on asynchronous time series.
#The quote data cannot be accessed because they are proprietary. Also, only two datasets are available.
#The 'Significance' network was described as critical to the model in the paper, but the authors did not show how the performance of SOCNN depends on the significance network.
#The transformation of the original data into asynchronous data is not clear.
#The experiments on the main application are not reproducible because the data is proprietary.
#The way that the train and test data were split is unclear. This could be important in the case of the financial dataset.
#Although the auxiliary loss function was mentioned as an important part, its advantages were not made very clear in the paper. It would be better if the paper described its effectiveness in more detail.
#It was not mentioned clearly in the paper whether the model training was done on a rolling basis for time series forecasting.
#The noise term used in Section 5's model robustness analysis is evenly distributed noise (see Appendix B). While the analysis is a good start, analysis with different noise distributions would make the findings more generalizable.
#The paper uses financial/economic data as one of its test datasets. Instead of comparing only with neural network models such as CNNs, which are known to work badly on time series data, it would be much better if the authors compared with well-known econometric time series models such as GARCH and VAR.

=References=
[1] Hamilton, J. D. Time series analysis, volume 2. Princeton university press Princeton, 1994.

[2] Fama, E. F. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[3] Petelin, D., Šindelář, J., Přikryl, J., and Kocijan, J. Financial modeling using Gaussian process models. In Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2011 IEEE 6th International Conference on, volume 2, pp. 672–677. IEEE, 2011.

[4] Tobar, F., Bui, T. D., and Turner, R. E. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pp. 3501–3509, 2015.

[5] Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[6] Wilson, A. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2010.

[7] Sirignano, J. Extended abstract: Neural networks for limit order books, February 2016.

[8] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks, March 2017.

[9] Heaton, J. B., Polson, N. G., and Witte, J. H. Deep learning in finance, February 2016.

[10] Neil, D., Pfeiffer, M., and Liu, S.-C. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances In Neural Information Processing Systems, pp. 3882–3890, 2016.

[11] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014.

[12] Weissenborn, D. and Rocktäschel, T. MuFuRU: The Multi-Function recurrent unit, June 2016.

[13] Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based Encoder–Decoder networks. IEEE Transactions on Multimedia, 17(11): 1875–1886, July 2015. ISSN 1520-9210.

[14] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

policy optimization with demonstrations

2018-11-27T09:56:50Z

S366chen: /* Problem Definition */

= Introduction =

Reinforcement learning (RL) has made significant progress in a variety of applications, but exploration, that is, how to gain experience from novel policies in order to improve long-term performance, remains a challenge, especially in environments where reward signals are sparse. There are currently two ways to address exploration problems in RL: 1) guide the agent to explore states it has never seen; 2) guide the agent to imitate demonstration trajectories sampled from an expert policy. Within the second approach there are again two methods: putting the demonstrations directly into the replay memory [1] [2] [3], or using the demonstration trajectories to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method from demonstration (POfD) is proposed. It takes full advantage of the demonstrations and does not require the expert policy to be optimal. To summarize, the authors develop this idea through the following techniques:
# A demonstration-guided exploration term, which measures the divergence between the current policy and the expert policy, is added to the policy optimization objective to encourage expert-like exploration.
# For better learning from demonstrations and to obtain an optimization-friendly lower bound, the proposed objective is defined on the occupancy measure, as in [14].
# Finally, they show that the objective can be optimized through the derived lower bound using generative adversarial training.
The authors also evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experimental results show that POfD greatly outperforms several strong baselines and is even comparable to the policy gradient method trained in dense-reward environments.

==Intuition==
The agent should imitate the demonstrated behavior when rewards are sparse and then explore new states on its own after acquiring sufficient skills. This amounts to a dynamic intrinsic reward mechanism that reshapes the native rewards in RL. At present, a widely used exploration strategy in reinforcement learning is still epsilon-greedy, which simply takes random actions a small fraction of the time in order to try unexplored moves. This is naive and is one of the main reasons for the high sample complexity of RL. If, on the other hand, an expert demonstrator is available to guide exploration, the agent can make more directed and accurate exploratory moves.

=Related Work =
Related work on overcoming exploration difficulties includes learning from demonstration [6] and imitation learning in RL.

For learning from demonstration (LfD):
# Most LfD methods adopt value-based RL algorithms, such as DQfD (Deep Q-learning from Demonstrations) [2], which applies to discrete action spaces, and DDPGfD (Deep Deterministic Policy Gradient from Demonstrations) [3], which extends this idea to continuous spaces. However, both of them under-utilize the demonstration data.
# Some methods are based on policy iteration [7] [8] and shape the value function using demonstration data. However, they perform poorly when the demonstration data is imperfect.
# A hybrid framework [9] learns a policy that maximizes the probability of taking the demonstrated actions, and considers the case of less demonstration data.
# A reward reshaping mechanism [10] encourages taking actions close to the demonstrated ones. It is similar to the method in this paper, but differs in that the shaping is defined as a potential function based on a multivariate Gaussian that models the distribution of state-action pairs.
All of the above methods require a large number of perfect demonstrations to achieve satisfactory performance, which distinguishes them from POfD.

For imitation learning:
# Inverse Reinforcement Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. However, this approach does not scale to large problems.
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair comes from the expert or from the learned policy, and it can be applied to high-dimensional continuous control problems.

Both of the above methods are effective for imitation learning, but they cannot leverage the valuable feedback given by the environment and usually perform poorly when the expert data is imperfect. This distinguishes them from POfD.

Another line of work has the agent learn from a hybrid of imitation learning and reinforcement learning rewards [23, 24]. However, unlike this paper, those works do not provide theoretical support for their methods and offer only intuitive explanations.

=Background=

==Preliminaries==
A Markov Decision Process (MDP) [15] is defined by a tuple <math>\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma\rangle </math>, where <math>\mathcal{S}</math> is the state space, <math>\mathcal{A} </math> is the action space, <math>\mathcal{P}(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math> is the reward function, and <math> \gamma </math> is the discount factor between 0 and 1. A policy <math> \pi(a|s) </math> is a mapping from states to action probabilities. The performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>:
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(\cdot,\cdot)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(\cdot,\cdot)|s_0=s,a_0=a] </math>, and the advantage function, which reflects the expected additional reward after taking action <math>a</math> at state <math>s</math>, is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.
The authors then define the occupancy measure, which characterizes the distribution of states <math>s</math> and state-action pairs <math>(s,a)</math> visited when executing a given policy.
[[File:def1.png|500px|center]]
Then the performance of <math> \pi </math> can be rewritten as:
[[File:equ2.png|500px|center]]
At the same time, the authors propose a lemma:
[[File:lemma1.png|500px|center]]
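
The occupancy measure and the rewritten form of <math>\eta(\pi)</math> can be illustrated on a small tabular MDP. The sketch below assumes the definition used in [14], namely <math>\rho_{\pi}(s)=\sum_{t}\gamma^{t}P(s_t=s|\pi)</math> and <math>\rho_{\pi}(s,a)=\rho_{\pi}(s)\pi(a|s)</math>. The two-state MDP, its numbers, and all function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
GAMMA = 0.9
P = {  # P[s][a] = probabilities of the next state s'
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.5}}  # r(s, a)
PI = {0: [0.3, 0.7], 1: [0.6, 0.4]}             # pi(a | s)
MU0 = [1.0, 0.0]                                 # initial state distribution


def occupancy(pi, horizon=2000):
    """rho(s) = sum_t gamma^t P(s_t = s | pi), truncated at `horizon` steps."""
    n = len(MU0)
    d = list(MU0)          # state distribution at time t
    rho = [0.0] * n
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            rho[s] += discount * d[s]
        nxt = [0.0] * n
        for s in range(n):
            for a, p_a in enumerate(pi[s]):
                for s2 in range(n):
                    nxt[s2] += d[s] * p_a * P[s][a][s2]
        d = nxt
        discount *= GAMMA
    return rho


def return_from_occupancy(pi):
    """eta(pi) = sum_{s,a} rho(s) pi(a|s) r(s,a)."""
    rho = occupancy(pi)
    return sum(rho[s] * pi[s][a] * R[s][a]
               for s in range(len(rho)) for a in range(len(pi[s])))


def return_from_values(pi, sweeps=2000):
    """eta(pi) via iterative policy evaluation of V_pi."""
    n = len(MU0)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2]
                                                   for s2 in range(n)))
                 for a in range(len(pi[s])))
             for s in range(n)]
    return sum(MU0[s] * v[s] for s in range(n))
```

Under these assumptions the two ways of computing <math>\eta(\pi)</math> agree, and the total mass of <math>\rho_{\pi}(s)</math> is <math>1/(1-\gamma)</math>, so the occupancy measure is an unnormalized distribution.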

==Problem Definition==
Generally, RL tasks and environments do not provide a comprehensive reward and instead rely on sparse feedback indicating whether the goal is reached.

In this paper, the authors aim to develop a method that boosts exploration by effectively leveraging the demonstrations <math>D^E </math> from the expert policy <math> \pi_E </math> and maximizes <math> \eta(\pi) </math> in sparse-reward environments. The demonstrations are defined as <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where the i-th trajectory <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the unknown expert policy <math>\pi_E </math>.

Throughout the paper, <math>\pi_E </math> denotes the expert policy, which gives a relatively good <math>\eta(\pi_E) </math>, and <math>\hat{\mathbb{E}}_D </math> denotes the empirical expectation estimated from the demonstrated trajectories <math>D^E </math>. The authors make the following assumption on the quality of the expert policy <math>\pi_E </math>:
[[File:asp1.png|500px|center]]

Moreover, the expert policy does not need to be advantageous over all other policies, because POfD can learn a policy better than the expert policy by exploring on its own in later learning stages.

=Method=

==Policy Optimization with Demonstration (POfD)==
[[File:ff1.png|500px|center]]
This method optimizes the policy by encouraging it to explore in the region near the expert policy, which is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig. 1), in order to avoid slow convergence or failure when the environment feedback is sparse. In other words, the authors encourage the policy <math>\pi</math> to explore by "following" the demonstrations <math>D^E </math>. This gives a new learning objective:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is the Jensen-Shannon divergence between the current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math>, <math>\lambda_1</math> is a trade-off parameter, and <math>\theta</math> is the policy parameter. Based on Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> in place of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective becomes:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]
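
As a rough illustration of the regularized objective, the sketch below evaluates the Jensen-Shannon divergence between two discrete distributions over state-action pairs and plugs it into <math>\mathcal{L}</math>. It assumes the occupancy measures have been normalized to sum to 1 so that they can be treated as probability vectors; the function names and the example numbers are hypothetical, and this is not the estimator the authors actually use (they optimize a lower bound adversarially, as described in the Optimization section).

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js(p, q):
    """Jensen-Shannon divergence: 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pofd_loss(eta, rho_theta, rho_expert, lam1):
    """L = -eta(pi_theta) + lam1 * D_JS(rho_theta, rho_expert)."""
    return -eta + lam1 * js(rho_theta, rho_expert)


# Hypothetical normalized occupancy measures over four (s, a) pairs.
rho_theta = [0.4, 0.3, 0.2, 0.1]
rho_expert = [0.1, 0.2, 0.3, 0.4]
loss = pofd_loss(eta=1.0, rho_theta=rho_theta, rho_expert=rho_expert, lam1=0.1)
```

The divergence is zero when the two measures coincide and is bounded above by <math>\log 2</math>, so the penalty term vanishes exactly when the policy reproduces the expert's state-action distribution.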

==Benefits of Exploration with Demonstrations==
The authors then analyze the benefits of POfD. First, consider the expression for the expected return used in policy gradient methods [16]:
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s_t,a_t)]\]
The second term is the expected advantage of <math>\pi</math> over the policy <math>\pi_{old}</math> from the previous iteration, so the expression can be rewritten as
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
Because the dependency of <math>\rho_{\pi}(s)</math> on <math> \pi </math> is complex, policy gradient methods usually optimize a local first-order approximation to <math>\eta(\pi)</math> as the surrogate learning objective:
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
Policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. POfD imposes an additional regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> between <math>\pi_\theta</math> and <math>\pi_{E}</math> in order to encourage exploration around regions demonstrated by the expert policy. Theorem 1 shows this benefit:
[[File:them1.png|500px|center]]

In fact, POfD brings another factor, <math>D_{J S}^{max}(\pi_{i}, \pi_{E})</math>, that makes full use of the advantage <math>{\hat \delta}</math> and adds an improvement margin over pure policy gradient methods.
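
The expected-return identity and the surrogate <math>J_{\pi_{old}}(\pi)</math> used in this section can be checked numerically. The sketch below does so on a hypothetical two-state MDP, assuming the unnormalized occupancy measure <math>\rho_{\pi}(s)=\sum_t\gamma^t P(s_t=s|\pi)</math>; the MDP, the two policies, and all names are illustrative assumptions rather than anything from the paper.

```python
GAMMA = 0.9
N_S, N_A, T = 2, 2, 2000
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.0, 1.0]]]  # P[s][a][s']
R = [[0.0, 0.0], [1.0, 0.5]]                               # r(s, a)
MU0 = [1.0, 0.0]                                           # initial states


def q_value(v, s, a):
    """Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s,a) V(s')."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S))


def values(pi):
    """V_pi by iterative policy evaluation (T sweeps)."""
    v = [0.0] * N_S
    for _ in range(T):
        v = [sum(pi[s][a] * q_value(v, s, a) for a in range(N_A))
             for s in range(N_S)]
    return v


def eta(pi):
    """Expected discounted return from the initial state distribution."""
    v = values(pi)
    return sum(MU0[s] * v[s] for s in range(N_S))


def advantage(pi):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s)."""
    v = values(pi)
    return [[q_value(v, s, a) - v[s] for a in range(N_A)] for s in range(N_S)]


def occupancy(pi):
    """rho_pi(s) = sum_t gamma^t P(s_t = s | pi), truncated at T steps."""
    d, rho, disc = list(MU0), [0.0] * N_S, 1.0
    for _ in range(T):
        rho = [rho[s] + disc * d[s] for s in range(N_S)]
        d = [sum(d[s] * pi[s][a] * P[s][a][s2]
                 for s in range(N_S) for a in range(N_A))
             for s2 in range(N_S)]
        disc *= GAMMA
    return rho


def improvement_term(rho, pi, adv):
    """sum_s rho(s) sum_a pi(a|s) A_old(s, a)."""
    return sum(rho[s] * pi[s][a] * adv[s][a]
               for s in range(N_S) for a in range(N_A))


PI_OLD = [[0.5, 0.5], [0.5, 0.5]]
PI_NEW = [[0.3, 0.7], [0.6, 0.4]]
ADV_OLD = advantage(PI_OLD)
# Exact identity: uses the occupancy measure of the new policy.
exact = eta(PI_OLD) + improvement_term(occupancy(PI_NEW), PI_NEW, ADV_OLD)
# Surrogate J: uses the occupancy measure of the old policy instead.
surrogate = eta(PI_OLD) + improvement_term(occupancy(PI_OLD), PI_NEW, ADV_OLD)
```

Here `exact` reproduces <math>\eta(\pi)</math> for the new policy, while `surrogate` is only a local approximation that coincides with <math>\eta(\pi_{old})</math> when the two policies are equal. This is why the update step has to stay small.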

==Optimization==

For POfD, the authors choose to optimize a lower bound of the Jensen-Shannon divergence instead of directly optimizing the divergence itself, which is difficult. This optimization method is compatible with any policy gradient method. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:
[[File:them2.png|450px|center]]
Thus, the occupancy measure matching objective can be written as:
[[File:eqnlm.png|450px|center]]
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: \mathcal{S}\times \mathcal{A} \rightarrow (0,1)</math> is an arbitrary mapping function <math>U(s,a)</math> followed by a sigmoid activation used for scaling. The supremum ranges over all such functions, and <math>D</math> acts like a discriminator that distinguishes whether a state-action pair comes from the current policy or from the expert policy.
To avoid overfitting, the authors add the causal entropy <math>-H (\pi_{\theta}) </math> as a regularization term. Thus, the learning objective is:
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{D\in(0,1)^{\mathcal{S}\times \mathcal{A}}} \left(\mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\right)\]
At this point, the problem closely resembles the minimax problem of Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained but the expert policy of POfD is not optimal. Suppose D is parameterized by w. Under the objective above, maximizing over w pushes <math>D_w</math> toward 1 on state-action pairs generated by the current policy and toward 0 on pairs from the expert demonstrations. Thus, the minimax learning objective is:
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]
where <math> r'(s,a)=r(s,a)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.
The above objective can be optimized efficiently by alternately updating the policy parameters θ and the discriminator parameters w. The gradient with respect to w is:
\[\mathbb{E}_{\pi}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.

Finally, Algorithm 1 gives the detailed procedure.
[[File:pofd.png|450px|center]]
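
The discriminator update and the reshaped reward can be sketched in a few lines. The example below is only an illustration under strong assumptions: the score <math>U(s,a)</math> is taken to be linear in a hypothetical feature vector, the "policy" and "expert" samples are made-up numbers, and plain gradient ascent replaces whatever optimizer the authors use. It follows the signs of the minimax objective above, so <math>D_w</math> is driven up on policy samples and down on expert samples, and <math>r'(s,a)=r(s,a)-\lambda_1\log D_w(s,a)</math> is therefore larger for expert-like pairs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator(w, feat):
    """D_w(s, a) = sigmoid(w . phi(s, a)); the linear score is an assumption."""
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feat)))


def discriminator_step(w, policy_feats, expert_feats, lr=0.5):
    """One gradient-ascent step on E_pi[log D_w] + E_E[log(1 - D_w)]."""
    grad = [0.0] * len(w)
    for feat in policy_feats:       # d/dw log D_w = (1 - D_w) * phi
        d = discriminator(w, feat)
        for i, fi in enumerate(feat):
            grad[i] += (1.0 - d) * fi / len(policy_feats)
    for feat in expert_feats:       # d/dw log(1 - D_w) = -D_w * phi
        d = discriminator(w, feat)
        for i, fi in enumerate(feat):
            grad[i] -= d * fi / len(expert_feats)
    return [wi + lr * gi for wi, gi in zip(w, grad)]


def reshaped_reward(r, w, feat, lam1=0.1):
    """r'(s, a) = r(s, a) - lam1 * log D_w(s, a)."""
    return r - lam1 * math.log(discriminator(w, feat))


# Hypothetical 2-d features phi(s, a); the second component is a bias term.
policy_feats = [[1.0, 1.0], [0.8, 1.0], [1.2, 1.0]]
expert_feats = [[-1.0, 1.0], [-0.9, 1.0], [-1.1, 1.0]]
w = [0.0, 0.0]
for _ in range(200):
    w = discriminator_step(w, policy_feats, expert_feats)

bonus_expert_like = reshaped_reward(0.0, w, [-1.0, 1.0])
bonus_policy_like = reshaped_reward(0.0, w, [1.0, 1.0])
```

Even with a zero environment reward, the expert-like pair receives the larger reshaped reward, which is how the demonstrations guide exploration when the native reward is sparse. In the full algorithm this discriminator step alternates with a policy gradient step on <math>r'</math>.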

=Discussion on Existing LfD Methods=

==DQfD==
DQfD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQfD is:
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.
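
The two-term structure of this objective can be made concrete with a small sketch. It assumes each batch is a list of hypothetical (n-step return, Q-value) pairs that have already been computed; the function names and numbers are illustrative only and are not taken from the DQfD implementation.

```python
def mean_squared_error(batch):
    """Empirical mean of (R_t(n) - Q_w(s_t, a_t))^2 over (return, q) pairs."""
    return sum((ret - q) ** 2 for ret, q in batch) / len(batch)


def dqfd_objective(agent_batch, demo_batch, alpha):
    """J = E_D[(R - Q)^2] + alpha * E_{D^E}[(R - Q)^2]."""
    return mean_squared_error(agent_batch) + alpha * mean_squared_error(demo_batch)


# Hypothetical samples from the agent's replay memory D and from D^E.
agent_batch = [(1.0, 0.0), (0.0, 0.0)]
demo_batch = [(2.0, 0.0)]
j = dqfd_objective(agent_batch, demo_batch, alpha=0.5)
```

The demonstrations enter only through the second term, weighted by <math>\alpha</math>, which is the term the authors reinterpret as an occupancy measure matching regularization.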

==DDPGfD==
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as that of DQfD. Its policy gradient is:
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]
From this equation, the policy is updated based on the learned Q-network <math>Q_w </math> rather than on the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, so both methods leverage demonstrations in the same way: the demonstrations in DQfD and DDPGfD induce an occupancy measure matching regularization.

=Experiments=

==Goal==
The authors aim to investigate 1) whether POfD can aid exploration by leveraging a few demonstrations, even when the demonstrations are imperfect, and 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare.

==Settings==
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional to high-dimensional spaces and including naturally sparse environments, based on OpenAI Gym [20] and Mujoco (Multi-Joint dynamics with Contact) [5]. Gym is a toolkit for developing and comparing reinforcement learning algorithms; it supports teaching agents everything from walking to playing games like Pong or Pinball. MuJoCo is a physics engine that aims to facilitate research and development in robotics, biomechanics, graphics and animation, and other areas where fast and accurate simulation is needed. To get familiar with the OpenAI Gym and Mujoco environments, you can watch these videos, respectively: [http://www.mujoco.org/image/home/mujocodemo.mp4 Mujoco], [https://gym.openai.com/v2018-02-21/videos/SpaceInvaders-v0-4184afb3-1223-4ac6-b52b-8e863cbe24a5/original.mp4 OpenAI Gym].

Due to the uniqueness of the environments, the authors introduce 4 ways to sparsify their built-in dense rewards:
* TYPE1: a reward of +1 is given when the agent reaches the terminal state, and 0 otherwise.
* TYPE2: a reward of +1 is given when the agent survives for a while.
* TYPE3: a reward of +1 is given every time the agent moves forward by a specific number of units in Mujoco environments.
* TYPE4: specially designed for InvertedDoublePendulum; a reward of +1 is given when the second pole stays above a specific height of 0.89.
The details are shown in Table 1. Moreover, only a single imperfect trajectory is used as the demonstration in this paper. The authors collect the demonstrations by training an agent insufficiently with TRPO (Trust Region Policy Optimization) in the corresponding dense environment.
[[File:pofdt1.png|900px|center]]
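
The sparsification schemes can be sketched as small reward functions. The code below is a hypothetical illustration that does not depend on Gym or MuJoCo: the class and function names are made up, the forward position `x` and the pole height are assumed to be read from the simulator by the caller, and awarding several +1 rewards when one step crosses several distance marks is an assumption about TYPE3 rather than something stated in the paper.

```python
def type1_reward(reached_terminal):
    """TYPE1: +1 when the agent reaches the terminal state, otherwise 0."""
    return 1.0 if reached_terminal else 0.0


def type4_reward(second_pole_height, threshold=0.89):
    """TYPE4 (InvertedDoublePendulum): +1 while the second pole is above 0.89."""
    return 1.0 if second_pole_height > threshold else 0.0


class Type3Sparsifier:
    """TYPE3: +1 each time the agent has moved forward by another `unit`."""

    def __init__(self, unit, start_x=0.0):
        self.unit = unit
        self.next_mark = start_x + unit   # next position that earns a reward

    def reward(self, x):
        """Sparse reward for the agent's current forward position x."""
        r = 0.0
        while x >= self.next_mark:        # a long step may cross several marks
            r += 1.0
            self.next_mark += self.unit
        return r


# Usage with a made-up sequence of forward positions.
sparsifier = Type3Sparsifier(unit=1.0)
rewards = [sparsifier.reward(x) for x in [0.2, 0.9, 1.1, 1.5, 3.2]]
```

In this usage the agent receives no signal at all until it first passes one unit of distance, which is the kind of sparse feedback that makes unguided exploration difficult.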

==Baselines==
The authors compare POfD against 5 strong baselines:
* training the policy with TRPO [17] in dense environments, which is called expert
* training the policy with TRPO [17] in sparse environments
* applying GAIL [14] to learn the policy from demonstrations
* DQfD [2]
* DDPGfD [3]

==Results==
First, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned under dense environments. From Figure 2, only POfD succeeds in exploring sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQfD [2] fail to explore, and GAIL [14] converges to the imperfect demonstration in MountainCar [22].

[[File:pofdf2.png|500px|center]]

Then, the authors test the performance of POfD in sparse environments with continuous action spaces. From Figure 3, POfD achieves expert-level performance in terms of accumulated rewards and surpasses the other strong baselines, including the policy trained with TRPO. Observing the learning process of the different methods, we can see that TRPO consistently fails to explore the environments when the feedback is sparse, except for HalfCheetah. This may be because there is no terminal state in HalfCheetah, so a random agent can perform reasonably well as long as the time horizon is sufficiently long. This is shown in Figure 3, where the improvement of TRPO begins to show after 400 iterations. DDPGfD and GAIL have a common drawback: during training, they both converge to the imperfect demonstration data. For HalfCheetah, GAIL fails to converge and DDPGfD converges to an even worse point. This is expected because the policy and value networks tend to over-fit when data is scarce, so the training of GAIL and DDPGfD is severely biased by the imperfect data. Finally, POfD can effectively explore the environment with the help of demonstration-based intrinsic reward reshaping, and it succeeds consistently across different tasks in terms of both learning stability and convergence speed.
[[File:pofdf3.png|900px|center]]

The authors also implement a locomotion task, Humanoid, which teaches a human-like robot to walk. The state space has dimension 376, which makes the task very hard. Nevertheless, POfD still outperformed all three baseline methods, which failed to learn policies in such a sparse-reward environment.

The Reacher environment is a task in which the goal is to control a robot arm to touch an object. The location of the object is random for each instantiation. The environment reward is sparse: every time the arm reaches the ball and holds for a while (e.g., 5 time steps), it receives a reward of +1; otherwise it gets zero reward. The authors select 15 random trajectories as demonstration data, and the performance of POfD is much better than that of the expert, while all other baseline methods fail.

=Conclusion=
This paper proposes POfD, a method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback. It is compatible with any policy gradient method. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experimental results show the validity and effectiveness of POfD in encouraging the agent to explore the region near the expert policy and to learn better policies. The key contribution is that POfD helps the agent work with few and imperfect demonstrations in an environment with sparse rewards.

=Critique=
# A novel demonstration-based policy optimization method is proposed. During policy optimization, POfD reshapes the reward function. The new reward function guides the agent to imitate the expert behaviour when the reward is sparse and to explore on its own once reward values can be obtained. This takes full advantage of the demonstration data and does not require the expert policy to be optimal.
# POfD can be combined with any policy gradient method. Its performance surpasses five strong baselines and is comparable to that of agents trained in the dense-reward environment.
# The paper is well structured and the flow of ideas is easy to follow. For related work, the authors clearly explain the similarities and differences among the related methods.
# The scalability of the method is demonstrated. The experimental environments range from low-dimensional to high-dimensional spaces and from discrete to continuous action spaces. For future work, can it be realized in the real world?
# It is doubtful whether a trajectory from an agent that was insufficiently trained in the dense-reward environment is an appropriate choice of imperfect demonstration.
# In this paper, performance is judged only by the cumulative reward. Could other evaluation criteria be considered, for example, the convergence rate?
# The performance of this algorithm hinges on the assumption that expert demonstrations are near optimal in the action space. As seen in Figure 3, there appears to be an upper bound to performance near (or just above) the expert accuracy, which may indicate a performance ceiling. In games where near-optimal policies can differ greatly (e.g., offensive or defensive strategies in chess), the success of the model will depend on selecting expert demonstrations that are closest to a truly optimal policy (i.e., just because a policy is the current expert, it does not mean it resembles the true optimal policy).

=References=
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.

[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.

[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.

[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.

[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 549–564. Springer, 2014.

[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Reinforcement learning with few expert demonstrations. In NIPS workshop, 2016.

[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Taylor, M. E., and Nowé, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.

[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.

[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pp. 1449–1456, 2008.

[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.

[14] Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.

[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.

[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.

[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.

[23] Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramar, J., Hadsell, R., de Freitas, N., et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564, 2018.

[24] Li, Y., Song, J., and Ermon, S. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3815–3825, 2017.

policy optimization with demonstrations

2018-11-27T09:56:10Z

S366chen: /* Problem Definition */

= Introduction =

Reinforcement learning (RL) has made significant progress in a variety of applications, but exploration, that is, how to gain experience from novel policies in order to improve long-term performance, remains a challenge, especially in environments where reward signals are sparse. There are currently two ways to address exploration problems in RL: 1) guide the agent to explore states it has never seen; 2) guide the agent to imitate demonstration trajectories sampled from an expert policy. Within the second approach there are again two methods: putting the demonstrations directly into the replay memory [1] [2] [3], or using the demonstration trajectories to pre-train the policy in a supervised manner [4]. However, neither of these methods takes full advantage of the demonstration data. To address this problem, a novel policy optimization method from demonstration (POfD) is proposed. It takes full advantage of the demonstrations and does not require the expert policy to be optimal. To summarize, the authors develop this idea through the following techniques:
# A demonstration-guided exploration term, which measures the divergence between the current policy and the expert policy, is added to the policy optimization objective to encourage expert-like exploration.
# For better learning from demonstrations and to obtain an optimization-friendly lower bound, the proposed objective is defined on the occupancy measure, as in [14].
# Finally, they show that the objective can be optimized through the derived lower bound using generative adversarial training.
The authors also evaluate the performance of POfD on Mujoco [5] in sparse-reward environments. The experimental results show that POfD greatly outperforms several strong baselines and is even comparable to the policy gradient method trained in dense-reward environments.

==Intuition==
The agent should imitate the demonstrated behavior when rewards are sparse and then explore new states on its own after acquiring sufficient skills. This amounts to a dynamic intrinsic reward mechanism that reshapes the native rewards in RL. At present, a widely used exploration strategy in reinforcement learning is still epsilon-greedy, which simply takes random actions a small fraction of the time in order to try unexplored moves. This is naive and is one of the main reasons for the high sample complexity of RL. If, on the other hand, an expert demonstrator is available to guide exploration, the agent can make more directed and accurate exploratory moves.

=Related Work =
Related work on overcoming exploration difficulties includes learning from demonstration [6] and imitation learning in RL.

For learning from demonstration (LfD):
# Most LfD methods adopt value-based RL algorithms, such as DQfD (Deep Q-learning from Demonstrations) [2], which applies to discrete action spaces, and DDPGfD (Deep Deterministic Policy Gradient from Demonstrations) [3], which extends this idea to continuous spaces. However, both of them under-utilize the demonstration data.
# Some methods are based on policy iteration [7] [8] and shape the value function using demonstration data. However, they perform poorly when the demonstration data is imperfect.
# A hybrid framework [9] learns a policy that maximizes the probability of taking the demonstrated actions, and considers the case of less demonstration data.
# A reward reshaping mechanism [10] encourages taking actions close to the demonstrated ones. It is similar to the method in this paper, but differs in that the shaping is defined as a potential function based on a multivariate Gaussian that models the distribution of state-action pairs.
All of the above methods require a large number of perfect demonstrations to achieve satisfactory performance, which distinguishes them from POfD.

For imitation learning,
# Inverse Reinforcement Learning [11] problems are solved by alternating between fitting the reward function and selecting the policy [12] [13]. However, this approach does not scale to large problems.
# Generative Adversarial Imitation Learning (GAIL) [14] uses a discriminator to distinguish whether a state-action pair comes from the expert or from the learned policy, and it can be applied to high-dimensional continuous control problems.

Both of the above methods are effective for imitation learning, but they cannot leverage the valuable feedback given by the environment and usually perform poorly when the expert data is imperfect. POfD differs in both respects.

Another line of work trains the agent with a hybrid of imitation learning and reinforcement learning rewards [23, 24]. However, unlike this paper, those works offer only intuitive explanations and no theoretical support for their methods.

=Background=

==Preliminaries==
A Markov Decision Process (MDP) [15] is defined by a tuple <math>\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma\rangle </math>, where <math>\mathcal{S}</math> is the state space, <math>\mathcal{A} </math> is the action space, <math>\mathcal{P}(s'|s,a)</math> is the transition distribution of taking action <math> a </math> at state <math>s </math>, <math> r(s,a) </math> is the reward function, and <math> \gamma </math> is the discount factor between 0 and 1. A policy <math> \pi(a|s) </math> is a mapping from states to action probabilities. The performance of <math> \pi </math> is usually evaluated by its expected discounted reward <math> \eta(\pi) </math>:
\[\eta(\pi)=\mathbb{E}_{\pi}[r(s,a)]=\mathbb{E}_{(s_0,a_0,s_1,...)}[\sum_{t=0}^\infty\gamma^{t}r(s_t,a_t)] \]
The value function is <math> V_{\pi}(s) =\mathbb{E}_{\pi}[r(\cdot,\cdot)|s_0=s] </math>, the action value function is <math> Q_{\pi}(s,a) =\mathbb{E}_{\pi}[r(\cdot,\cdot)|s_0=s,a_0=a] </math>, and the advantage function, which reflects the expected additional reward after taking action <math>a</math> at state <math>s</math>, is <math> A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)</math>.
The authors then define the occupancy measure, which estimates the probability that a state <math>s</math> or a state-action pair <math>(s,a)</math> is visited when executing a certain policy.
[[File:def1.png|500px|center]]
Then the performance of <math> \pi </math> can be rewritten to:
[[File:equ2.png|500px|center]]
At the same time, the authors propose a lemma:
[[File:lemma1.png|500px|center]]
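
The relation between the discounted return and the occupancy measure can be checked numerically on a toy trajectory. The sketch below assumes the common unnormalized definition, in which the occupancy of a state-action pair is its expected discounted visitation count <math>\sum_t \gamma^t P(s_t=s, a_t=a)</math>; the definition used in the paper is given in the figure above and may differ in normalization. The trajectory, the reward function, and all function names are made up for illustration.

```python
from collections import defaultdict


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t along one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def empirical_occupancy(trajectories, gamma):
    """Average discounted visitation count of each (state, action) pair."""
    rho = defaultdict(float)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            rho[(s, a)] += (gamma ** t) / len(trajectories)
    return dict(rho)


def toy_reward(s, a):
    """Hypothetical sparse reward: +1 only in state 's1'."""
    return 1.0 if s == "s1" else 0.0


gamma = 0.5
traj = [("s0", "right"), ("s1", "right"), ("s0", "right")]

rho = empirical_occupancy([traj], gamma)
# Return computed from the occupancy measure: sum over pairs of rho(s,a) * r(s,a).
eta_from_rho = sum(rho[sa] * toy_reward(*sa) for sa in rho)
# Return computed directly along the trajectory.
eta_direct = discounted_return([toy_reward(s, a) for s, a in traj], gamma)
```

Both quantities evaluate to 0.5 here, which illustrates why the performance of a policy can be rewritten as an expectation over its occupancy measure.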

==Problem Definition==
Generally, RL tasks and environments do not provide a comprehensive reward and instead rely on sparse feedback indicating whether the goal is reached.

In this paper, the authors aim to develop a method that boosts exploration by effectively leveraging the demonstrations <math>D^E </math> from the expert policy <math> \pi_E </math> and maximizes <math> \eta(\pi) </math> in the sparse-reward environment. The authors define the demonstrations <math>D^E=\{\tau_1,\tau_2,...,\tau_N\} </math>, where the i-th trajectory <math>\tau_i=\{(s_0^i,a_0^i),(s_1^i,a_1^i),...,(s_T^i,a_T^i)\} </math> is generated from the unknown expert policy <math>\pi_E </math>. In addition, there is an assumption on the quality of the expert policy:
[[File:asp1.png|500px|center]]

Throughout the paper, <math>\pi_E </math> denotes the expert policy, which achieves a relatively good <math>\eta(\pi) </math>, and <math>\hat{\mathbb{E}}_D </math> denotes the empirical expectation estimated from the demonstrated trajectories <math>D^E </math>. The authors regard the assumption above on the quality of the expert policy <math>\pi_E </math> as reasonable and necessary.

Moreover, it is not necessary to ensure that the expert policy is advantageous over all policies, because POfD can learn a better policy than the expert policy by exploring on its own in later learning stages.

=Method=

==Policy Optimization with Demonstration (POfD)==
[[File:ff1.png|500px|center]]
This method optimizes the policy by forcing the policy to explore in the nearby region of the expert policy that is specified by several demonstrated trajectories <math>D^E </math> (as shown in Fig.1) in order to avoid causing slow convergence or failure when the environment feedback is sparse. In addition, the authors encourage the policy π to explore by "following" the demonstrations <math>D^E </math>. Thus, a new learning objective is given:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\pi_{\theta},\pi_{E})\]
where <math>D_{JS}(\pi_{\theta},\pi_{E})</math> is the Jensen-Shannon divergence between the current policy <math>\pi_{\theta}</math> and the expert policy <math>\pi_{E}</math>, <math>\lambda_1</math> is a trade-off parameter, and <math>\theta</math> is the policy parameter. According to Lemma 1, the authors use <math>D_{JS}(\rho_{\theta},\rho_{E})</math> in place of <math>D_{JS}(\pi_{\theta},\pi_{E})</math>, because it is easier to optimize through adversarial training on demonstrations. The learning objective is:
\[ \mathcal{L}(\pi_{\theta})=-\eta(\pi_{\theta})+\lambda_{1}D_{JS}(\rho_{\theta},\rho_{E})\]
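
To make the regularizer concrete, the following sketch computes the Jensen-Shannon divergence between two discrete distributions. It assumes the two occupancy measures have been normalized into probability vectors over a small, finite set of state-action pairs, which is a simplification made for this example; the paper never evaluates the divergence explicitly and instead optimizes a lower bound.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Identical distributions give 0; disjoint ones give the maximum value log(2).
same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Because the divergence is bounded by <math>\log 2</math>, the penalty stays finite even when the current policy visits state-action pairs that the expert never does.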

==Benefits of Exploration with Demonstrations==
To explain the benefits of POfD, the authors first consider the expression for the expected return in policy gradient methods [16].
\[ \eta(\pi)=\eta(\pi_{old})+\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^\infty\gamma^{t}A_{\pi_{old}}(s,a)]\]
Here <math>A_{\pi_{old}}(s,a)</math> is the advantage over the policy <math>\pi_{old}</math> from the previous iteration, so the expression can be rewritten as
\[ \eta(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
Because of the complex dependency of <math>\rho_{\pi}(s)</math> on <math> \pi </math>, policy gradient methods usually optimize the first-order local approximation to <math>\eta(\pi)</math> as a surrogate learning objective:
\[ J_{\pi_{old}}(\pi)=\eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)\]
Policy gradient methods improve <math>\eta(\pi)</math> monotonically by optimizing the above <math>J_{\pi_{old}}(\pi)</math> with a sufficiently small update step from <math>\pi_{old}</math> to <math>\pi</math> such that <math>D_{KL}^{max}(\pi, \pi_{old})</math> is bounded [16] [17] [18]. POfD imposes an additional regularization <math>D_{JS}(\pi_{\theta}, \pi_{E})</math> between <math>\pi_\theta</math> and <math>\pi_{E}</math> in order to encourage exploration around regions demonstrated by the expert policy. Theorem 1 quantifies this benefit:
[[File:them1.png|500px|center]]

In fact, POfD brings in another factor, <math>D_{J S}^{max}(\pi_{i}, \pi_{E})</math>, that makes full use of the advantage <math>{\hat \delta}</math> and adds an improvement margin over pure policy gradient methods.

==Optimization==

For POfD, the authors choose to optimize a lower bound of the Jensen-Shannon divergence rather than the divergence itself, which is difficult to optimize directly. This optimization method is compatible with any policy gradient method. Theorem 2 gives the lower bound of <math>D_{JS}(\rho_{\theta}, \rho_{E})</math>:
[[File:them2.png|450px|center]]
Thus, the occupancy measure matching objective can be written as:
[[File:eqnlm.png|450px|center]]
where <math> D(s,a)=\frac{1}{1+e^{-U(s,a)}}: \mathcal{S}\times \mathcal{A} \rightarrow (0,1)</math> is an arbitrary mapping function <math>U</math> followed by a sigmoid activation used for scaling. The function attaining the supremum acts as a discriminator that distinguishes whether a state-action pair comes from the current policy or from the expert policy.
To avoid overfitting, the authors add the causal entropy <math>-H (\pi_{\theta}) </math> as a regularization term. Thus, the learning objective is:
\[\min_{\theta}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H(\pi_{\theta})+\lambda_{1} \sup_{D\in(0,1)^{\mathcal{S}\times \mathcal{A}}} \left(\mathbb{E}_{\pi_{\theta}}[\log(D(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D(s,a))]\right)\]
At this point, the problem closely resembles the minimax problem of Generative Adversarial Networks (GANs) [19]. The difference is that the discriminative model D of GANs is well-trained, whereas the expert policy of POfD is not optimal. Suppose D is parameterized by w. Under the objective below, the maximization over w drives <math>D_w</math> toward 1 on state-action pairs generated by the current policy and toward 0 on pairs from the expert demonstrations. Thus, the minimax learning objective is:
\[\min_{\theta}\max_{w}\mathcal{L}=-\eta(\pi_{\theta})-\lambda_{2}H (\pi_{\theta})+\lambda_{1}( \mathbb{E}_{\pi_{\theta}}[\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))])\]
The minimax learning objective can be rewritten by substituting the expression of <math> \eta(\pi) </math>:
\[\min_{\theta}\max_{w}-\mathbb{E}_{\pi_{\theta}}[r'(s,a)]-\lambda_{2}H (\pi_{\theta})+\lambda_{1}\mathbb{E}_{\pi_{E}}[\log(1-D_{w}(s,a))]\]
where <math> r'(s,a)=r(s,a)-\lambda_{1}\log(D_{w}(s,a))</math> is the reshaped reward function.
The above objective can be optimized efficiently by alternately updating the policy parameters θ and the discriminator parameters w. The gradient with respect to w is given by:
\[\mathbb{E}_{\pi_{\theta}}[\nabla_{w}\log(D_{w}(s,a))]+\mathbb{E}_{\pi_{E}}[\nabla_{w}\log(1-D_{w}(s,a))]\]
Then, fixing the discriminator <math>D_w</math>, the reshaped policy gradient is:
\[\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[r'(s,a)]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q'(s,a)]\]
where <math>Q'(\bar{s},\bar{a})=\mathbb{E}_{\pi_{\theta}}[r'(s,a)|s_0=\bar{s},a_0=\bar{a}]</math>.

At the end, Algorithm 1 gives the detailed process.
[[File:pofd.png|450px|center]]
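
The two quantities that drive the alternating updates can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the discriminator outputs are passed in as plain numbers in (0, 1), the function names are hypothetical, and the expectations are replaced by sample averages.

```python
import math


def sigmoid(u):
    """Squash an arbitrary score U(s, a) into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def reshaped_reward(r, d, lam1):
    """r'(s, a) = r(s, a) - lam1 * log D_w(s, a), with d = D_w(s, a)."""
    return r - lam1 * math.log(d)


def discriminator_objective(d_policy, d_expert):
    """Sample estimate of E_pi[log D_w] + E_piE[log(1 - D_w)].

    d_policy: D_w outputs on pairs sampled from the current policy.
    d_expert: D_w outputs on pairs taken from the demonstrations.
    The discriminator parameters w are updated to increase this value.
    """
    term_policy = sum(math.log(d) for d in d_policy) / len(d_policy)
    term_expert = sum(math.log(1.0 - d) for d in d_expert) / len(d_expert)
    return term_policy + term_expert


# With zero environment reward, a pair that looks expert-like to the
# discriminator (small D_w under this objective) earns a larger bonus.
bonus_expert_like = reshaped_reward(0.0, d=0.1, lam1=1.0)
bonus_policy_like = reshaped_reward(0.0, d=0.9, lam1=1.0)
```

In a sparse-reward environment <math>r(s,a)</math> is zero almost everywhere, so the term <math>-\lambda_1 \log D_w(s,a)</math> supplies most of the learning signal early on, which matches the intuition of following the demonstrations first and exploring later.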

=Discussion on Existing LfD Methods=

==DQfD==
DQfD [2] puts the demonstrations into a replay memory D and keeps them throughout the Q-learning process. The objective for DQfD is:
\[J_{DQfD}={\hat{\mathbb{E}}}_{D}[(R_t(n)-Q_w(s_t,a_t))^2]+\alpha{\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]\]
The second term can be rewritten as <math> {\hat{\mathbb{E}}}_{D^E}[(R_t(n)-Q_w(s_t,a_t))^2]={\hat{\mathbb{E}}}_{D^E}[(\hat{\rho}_E(s,a)-\rho_{\pi}(s,a))^{2}r^2(s,a)]</math>, which can be regarded as a regularization forcing current policy's occupancy measure to match the expert's empirical occupancy measure, weighted by the potential reward.

==DDPGfD==
DDPGfD [3] also puts the demonstrations into a replay memory D, but it is based on an actor-critic framework [21]. The objective for DDPGfD is the same as that of DQfD. Its policy gradient is:
\[\nabla_{\theta}J_{DDPGfD}\approx \mathbb{E}_{s,a}[\nabla_{a}Q_w(s,a)\nabla_{\theta}\pi_{\theta}(s)], a=\pi_{\theta}(s) \]
This equation shows that the policy is updated using the learned Q-network <math>Q_w </math> rather than the demonstrations <math>D^{E} </math>. DDPGfD shares the same objective function for <math>Q_w </math> as DQfD, so the two methods leverage demonstrations in the same way: in both, the demonstrations induce an occupancy measure matching regularization.

=Experiments=

==Goal==
The authors investigate 1) whether POfD can aid exploration by leveraging a few demonstrations, even when the demonstrations are imperfect, and 2) whether POfD can succeed and achieve high empirical return, especially in environments where reward signals are sparse and rare.

==Settings==
The authors conduct the experiments on 8 physical control tasks, ranging from low-dimensional to high-dimensional spaces and including naturally sparse environments, based on OpenAI Gym [20] and MuJoCo (Multi-Joint dynamics with Contact) [5]. Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball. MuJoCo is a physics engine aiming to facilitate research and development in robotics, biomechanics, graphics and animation, and other areas where fast and accurate simulation is needed. To get familiar with the MuJoCo and OpenAI Gym environments, you can watch these videos: [http://www.mujoco.org/image/home/mujocodemo.mp4 Mujoco], [https://gym.openai.com/v2018-02-21/videos/SpaceInvaders-v0-4184afb3-1223-4ac6-b52b-8e863cbe24a5/original.mp4 OpenAI Gym].

Due to the uniqueness of the environments, the authors introduce 4 ways to sparsify their built-in dense rewards:
* TYPE1: a reward of +1 is given when the agent reaches the terminal state, and 0 otherwise.
* TYPE2: a reward of +1 is given when the agent survives for a while.
* TYPE3: a reward of +1 is given every time the agent moves forward over a specific number of units in MuJoCo environments.
* TYPE4: specially designed for InvertedDoublePendulum, a reward of +1 is given when the second pole stays above a specific height of 0.89.
The details are shown in Table 1. Moreover, only a single imperfect trajectory is used as the demonstration in this paper. The authors collect the demonstrations by insufficiently training an agent with TRPO (Trust Region Policy Optimization) in the corresponding dense environment.
[[File:pofdt1.png|900px|center]]
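
The sparsification schemes can be pictured as small reward functions that replace the built-in dense reward. The sketch below is only an interpretation of the textual descriptions: the function names and arguments are hypothetical, TYPE2 is omitted because the survival duration is not specified here, and TYPE3 is read as "+1 for each multiple of <code>unit</code> crossed while moving forward", which is an assumption.

```python
import math


def type1_reward(reached_terminal):
    """TYPE1: +1 when the agent reaches the terminal state, 0 otherwise."""
    return 1.0 if reached_terminal else 0.0


def type3_reward(prev_x, x, unit):
    """TYPE3 (assumed reading): +1 per multiple of `unit` crossed going forward.

    prev_x, x: forward position before and after the step.
    """
    crossed = math.floor(x / unit) - math.floor(prev_x / unit)
    return float(max(0, crossed))


def type4_reward(second_pole_height, threshold=0.89):
    """TYPE4: +1 while the second pole stays above the height threshold."""
    return 1.0 if second_pole_height > threshold else 0.0


# A step from x = 0.9 to x = 2.5 with unit = 1.0 crosses the marks at 1 and 2.
step_reward = type3_reward(0.9, 2.5, unit=1.0)
```

In each case the agent receives no gradient of progress between reward events, which is what makes exploration hard for the baselines.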

==Baselines==
The authors compare POfD against 5 strong baselines:
* training the policy with TRPO [17] in dense environments, which is called expert
* training the policy with TRPO [17] in sparse environments
* applying GAIL [14] to learn the policy from demonstrations
* DQfD [2]
* DDPGfD [3]

==Results==
First, the authors test the performance of POfD in sparse control environments with discrete actions. From Table 1, POfD achieves performance comparable with the policy learned in dense environments. From Figure 2, only POfD succeeds in exploring sufficiently and achieves great performance in both sparse environments. TRPO [17] and DQfD [2] fail to explore, and GAIL [14] converges to the imperfect demonstration in MountainCar [22].

[[File:pofdf2.png|500px|center]]

Then, the authors test the performance of POfD in sparse environments with continuous action spaces. From Figure 3, POfD achieves expert-level performance in terms of accumulated rewards and surpasses other strong baselines, including the policy trained with TRPO. By watching the learning process of the different methods, we can see that TRPO consistently fails to explore the environments when the feedback is sparse, except for HalfCheetah. This may be because there is no terminal state in HalfCheetah, so a random agent can perform reasonably well as long as the time horizon is sufficiently long. This is shown in Figure 3, where the improvement of TRPO begins to show after 400 iterations. DDPGfD and GAIL have a common drawback: during training, they both converge to the imperfect demonstration data. For HalfCheetah, GAIL fails to converge and DDPGfD converges to an even worse point. This is expected because the policy and value networks tend to over-fit when given little data, so the training process of GAIL and DDPGfD is severely biased by the imperfect data. Finally, POfD can effectively explore the environment with the help of demonstration-based intrinsic reward reshaping, and it succeeds consistently across different tasks in terms of both learning stability and convergence speed.
[[File:pofdf3.png|900px|center]]

The authors also implement a locomotion task, <math>Humanoid</math>, which teaches a human-like robot to walk. The state space has dimension 376, which is very hard to render. POfD still outperformed all three baseline methods, which failed to learn policies in such a sparse-reward environment.

In the Reacher environment, the task is to control a robot arm to touch an object. The location of the object is random for each instantiation. The environment reward is sparse: every time the arm reaches the ball and holds for a while (e.g., 5 time steps), it receives a reward of +1; otherwise it gets zero reward. The authors select 15 random trajectories as demonstration data, and the performance of POfD is much better than that of the expert, while all other baseline methods fail.

=Conclusion=
This paper proposes POfD, a method that can acquire knowledge from a limited amount of imperfect demonstration data to aid exploration in environments with sparse feedback. It is compatible with any policy gradient method. POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Moreover, the experimental results show the validity and effectiveness of POfD in encouraging the agent to explore around the nearby region of the expert policy and learn better policies. The key contribution is that POfD helps the agent work with few and imperfect demonstrations in an environment with sparse rewards.

=Critique=
# A novel demonstration-based policy optimization method is proposed. During policy optimization, POfD reshapes the reward function. The new reward function guides the agent to imitate the expert behaviour when the reward is sparse and to explore on its own once reward can be obtained. This takes full advantage of the demonstration data and removes the need for the expert policy to be optimal.
# POfD can be combined with any policy gradient method. Its performance surpasses five strong baselines and is comparable to that of agents trained in the dense-reward environment.
# The paper is well structured and the flow of ideas is easy to follow. In the related work section, the authors clearly explain the similarities and differences among the related works.
# The paper demonstrates scalability. The experimental environments range from low-dimensional to high-dimensional spaces and from discrete to continuous action spaces. For future work, can the method be realized in the real world?
# It is doubtful whether a trajectory from an agent trained insufficiently in the dense-reward environment is an appropriate imperfect demonstration.
# In this paper, performance is judged only by the cumulative reward. Could other evaluation criteria be considered, for example the convergence rate?
# The performance of this algorithm hinges on the assumption that expert demonstrations are near optimal in the action space. As seen in Figure 3, there appears to be an upper bound to performance near (or just above) the expert accuracy -- this may be an indication of a performance ceiling. In games where near-optimal policies can differ greatly (e.g., offensive or defensive strategies in chess), the success of the model will depend on the selection of expert demonstrations that are closest to a truly optimal policy (i.e., just because a policy is the current expert, it does not mean it resembles the true optimal policy).

=References=
[1] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.

[2] Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.

[3] Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

[4] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

[5] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

[6] Schaal, S. Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046, 1997.

[7] Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.

[8] Piot, B., Geist, M., and Pietquin, O. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 549–564. Springer, 2014.

[9] Aravind S. Lakshminarayanan, Sherjil Ozair, Y. B. Reinforcement learning with few expert demonstrations. In NIPS workshop, 2016.

[10] Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Taylor, M. E., and Nowé, A. Reinforcement learning from demonstration through shaping. In IJCAI, pp. 3352–3358, 2015.

[11] Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, pp. 663–670, 2000.

[12] Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pp. 1449–1456, 2008.

[13] Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. ACM, 2008.

[14] Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

[15] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[16] Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.

[17] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.

[18] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

[20] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.

[21] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[22] Moore, A. W. Efficient memory-based learning for robot control. 1990.

[23] Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramar, J., Hadsell, R., de Freitas, N., et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564, 2018.

[24] Li, Y., Song, J., and Ermon, S. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3815–3825, 2017.

Visual Reinforcement Learning with Imagined Goals

2018-11-27T08:06:06Z

S366chen: /* Algorithm */

Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent can also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

The algorithm proposed by the authors is summarised below. A Variational Autoencoder (VAE) (left) is trained to learn a latent representation of images gathered during training time. These latent variables can then be used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).

[[File: WF_Sec_11Nov25_01.png | 800px]]

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some previous works, such as Levine et al. [11], proposed time-varying models which require episodic setups. Other works, such as Pinto et al. [12], proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training but do not learn goal-conditioned skills. There are currently no examples that use model-free reinforcement learning to train policies on real-world robotic systems without ground-truth information.

As background, model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment. Appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state ([https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly]).

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting them to states seen along the sampled trajectory as was done in earlier works. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state given to the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math>, that when given a state <math>s_t</math> and goal <math>g</math>, can dictate the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is defined as an action which maximizes the expected return denoted by <math>R_t</math> and defined as <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> and <math>\gamma</math> is a discount factor. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization

<div align="center">
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math>
</div>

which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
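
For a finite action set, extracting the policy from a goal-conditioned Q-function is a one-line argmax. The sketch below uses a hand-written toy Q-function on a number line in place of a trained network; the names <code>toy_q</code> and <code>greedy_policy</code> are assumptions made for illustration.

```python
def toy_q(state, action, goal):
    """Hypothetical Q-function: higher when the next position is nearer the goal."""
    return -abs((state + action) - goal)


def greedy_policy(q, state, goal, actions):
    """Return the action with the largest Q(state, action, goal)."""
    return max(actions, key=lambda a: q(state, a, goal))


# From position 2 with goal 5, moving right (+1) scores best.
best_action = greedy_policy(toy_q, state=2, goal=5, actions=[-1, 0, 1])
```

Changing only the <code>goal</code> argument changes the chosen action, which is how a single Q-function can serve many goals.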

Q-learning is popular because it can be trained in an off-policy manner. The only things a Q-function needs are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:

<div align="center">
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math>
</div>

where <math>\overline{w}</math> is treated as some constant.
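
The Bellman error for a single sample can be computed as below, assuming a finite action set so that the maximum over next actions is an explicit loop. The two Q-functions are passed in as plain callables; in practice they would be a network and a periodically copied target network, which this sketch does not model.

```python
def bellman_error(q_w, q_target, s, a, s_next, g, r, gamma, actions):
    """0.5 * (Q_w(s,a,g) - (r + gamma * max_a' Q_target(s',a',g)))^2.

    q_target plays the role of the frozen parameters (w-bar): it is used to
    build the regression target and is treated as a constant.
    """
    target = r + gamma * max(q_target(s_next, a2, g) for a2 in actions)
    return 0.5 * (q_w(s, a, g) - target) ** 2


# Constant toy Q-functions: target = 0.5 + 0.9 * 2.0 = 2.3, error = 0.5 * 1.3^2.
err = bellman_error(
    q_w=lambda s, a, g: 1.0,
    q_target=lambda s, a, g: 2.0,
    s=0, a=1, s_next=1, g=5, r=0.5, gamma=0.9, actions=[-1, 0, 1],
)
```

Keeping the target network fixed while <math>w</math> is updated turns each step into an ordinary regression problem.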

The main drawback in this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data are available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data, <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated given a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified like so:

[[File:qlr.png|center|600px]]
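
Goal relabelling itself is a small data transformation on stored transitions. The sketch below is an illustration under stated assumptions: the goal sampler and reward function are supplied by the caller, the reward is evaluated on the next state (the text writes <math>r(s,g)</math> without fixing which state is meant), and the helper name <code>relabel</code> is hypothetical.

```python
import random


def relabel(transitions, sample_goal, reward_fn, rng=random):
    """Replace each transition's goal with a freshly sampled one.

    transitions: list of (s, a, s_next, g, r) tuples from the replay buffer.
    sample_goal: callable rng -> goal, standing in for the distribution p(g).
    reward_fn:   callable (state, goal) -> reward, recomputed for the new goal.
    """
    relabeled = []
    for s, a, s_next, _old_goal, _old_reward in transitions:
        new_goal = sample_goal(rng)
        relabeled.append((s, a, s_next, new_goal, reward_fn(s_next, new_goal)))
    return relabeled


# One stored transition on a number line, relabeled with goal 3.
buffer = [(0, 1, 1, 5, -4)]
new_buffer = relabel(
    buffer,
    sample_goal=lambda rng: 3,
    reward_fn=lambda state, goal: -abs(state - goal),
)
```

The environment is never queried again: the same <math>(s,a,s')</math> triple yields a new training sample for every goal drawn.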

This goal resampling makes it possible to learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. But images are noisy and a large amount of information in an image may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder=
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and uses distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from the generative model, and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE prior is trained by real images, meaningful latent goals can be sampled from the latent variable prior. This will help the agent set its own goals and practice towards them if no goal is provided at test time.

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]

The authors summarize the purpose of the latent variable representation of images as follows: it (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows alternative exploration policies to be used, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and fine-tune it over the course of training. The VAE latent space is used to condition the policy on a goal sampled from the latent model. The VAE model is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample goals from the prior and compute rewards in the latent space using the equation <math display="inline"> r(s, g) = - \| z - z_g \|_A \propto \sqrt{\log e_{\Phi}(z_g \mid s)} </math>

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

The figure below shows the performance of different algorithms on this task. This involved a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocity. The plots show the distance to the goal state as a function of simulation steps. The oracle, as a baseline, was given true object location information, as opposed to visual pixel information.

[[File:WF_Sec_11Nov_25_02.png|1000px]]

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies were also varied while fixing other components of the algorithm to study the effect of the relabeling strategy. In this experiment, RIG, which uses an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.

Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true position during training, they used test episode returns computed with the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. a procedure that is not only goal-oriented, but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to create general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A recent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them, abbreviated ARC for actionable representations for control.

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the robotic arm's position is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images in the latent space are too blurry to be used as goal images. It would be better if this could be investigated in the future. It would also be worthwhile to investigate a method with multiple data sources, in which the agent is trained to choose the source that has more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with the relative positions of the two pucks in the goal images. The same situation is also observed in the variable-object experiment. We may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric
Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by
Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems
(NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan
Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International
Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International
Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew
Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement
learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning
Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey
Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888,
2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted
Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

Visual Reinforcement Learning with Imagined Goals

2018-11-27T08:05:47Z

S366chen: /* Algorithm */

Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent can also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

The algorithm proposed by the authors is summarised below. A variational autoencoder (VAE) (left) is trained to learn a latent representation of images gathered during training time (center). These latent variables can then be used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).

[[File: WF_Sec_11Nov25_01.png | 800px]]

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some previous works such as Levine et al. [11] proposed time-varying models which require episodic setups. There are also other works such as Pinto et al. [12] that proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training, but do not learn goal-conditioned skills. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment. Appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state; see [https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly].) There are currently no examples of model-free reinforcement learning being used to train policies on real-world robotic systems without ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting them to states seen along the sampled trajectory as was done in earlier works. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state given to the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math>, that when given a state <math>s_t</math> and goal <math>g</math>, can dictate the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is defined as an action which maximizes the expected return denoted by <math>R_t</math> and defined as <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> and <math>\gamma</math> is a discount factor. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.
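As a small illustration of the return defined above, the sketch below computes the discounted sum <math>\sum_{i=t}^{T}\gamma^{(i-t)} r_i</math> for a single sampled trajectory. The function name <code>discounted_return</code> and the toy reward list are assumptions made for this example; the expectation in <math>R_t</math> would be approximated by averaging this quantity over many trajectories.

```python
def discounted_return(rewards, gamma, t=0):
    """Discounted sum of rewards from step t to the end of one trajectory.

    rewards[i] plays the role of r_i; gamma is the discount factor.
    """
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)


# Toy usage: three steps with reward 1 and a discount factor of 0.5.
toy_rewards = [1.0, 1.0, 1.0]
return_from_start = discounted_return(toy_rewards, gamma=0.5)       # 1 + 0.5 + 0.25
return_from_step_1 = discounted_return(toy_rewards, gamma=0.5, t=1)  # 1 + 0.5
```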

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization

<div align="center">
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math>
</div>

which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
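The greedy action selection described above can be sketched for a small discrete action set. This is only an illustration: <code>toy_q</code> is a made-up Q-function on a 1-D line (not a learned network), and the paper's tasks use continuous actions, where the maximization is handled differently.

```python
def greedy_policy(q, state, goal, actions):
    """Return the action with the highest Q-value for this state and goal."""
    return max(actions, key=lambda a: q(state, a, goal))


def toy_q(state, action, goal):
    # Hypothetical Q-function: the negative distance to the goal after moving.
    return -abs((state + action) - goal)


ACTIONS = (-1, 0, 1)  # move left, stay, move right

# The agent at position 2 with goal 5 should prefer moving right.
chosen = greedy_policy(toy_q, state=2, goal=5, actions=ACTIONS)
```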

Q-learning is popular because it can be trained in an off-policy manner. The only things a Q-function needs are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:

<div align="center">
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math>
</div>

where <math>\overline{w}</math> is treated as some constant.
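For a single transition, the Bellman error above can be evaluated as in the following sketch. It assumes a finite set of candidate actions for the <math>\max_{a'}</math> term, and it represents <math>Q_w</math> and the frozen target <math>Q_{\overline{w}}</math> as plain Python callables; the names <code>q_w</code> and <code>q_target</code> are illustrative only.

```python
def bellman_error(q_w, q_target, s, a, s_next, g, r, actions, gamma=0.99):
    """Squared Bellman error for one (s, a, s', g, r) tuple.

    q_target stands for the Q-function with parameters held constant,
    so it is only evaluated here and would not be differentiated.
    """
    target = r + gamma * max(q_target(s_next, a_next, g) for a_next in actions)
    return 0.5 * (q_w(s, a, g) - target) ** 2


# Toy usage with hand-written Q-functions (purely illustrative).
current_q = lambda s, a, g: 1.0
frozen_q = lambda s, a, g: float(a)
error = bellman_error(current_q, frozen_q, s=0, a=0, s_next=1, g=3,
                      r=0.5, actions=(0, 1, 2), gamma=0.5)
```

In practice the error would be averaged over a minibatch drawn from the replay buffer and minimized with respect to <math>w</math> by gradient descent.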

The main drawback in this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data are available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data, <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated given a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified like so:

[[File:qlr.png|center|600px]]
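The relabeling step can be sketched as a function that keeps <math>(s,a,s')</math>, swaps in a newly sampled goal, and recomputes the reward. The helper names below are hypothetical, and the choice to evaluate the reward on the next state <math>s'</math> is an assumption of this sketch rather than something stated in the text.

```python
def relabel(transition, sample_goal, reward_fn):
    """Return a copy of the transition with a new goal and recomputed reward."""
    s, a, s_next, _old_goal, _old_reward = transition
    new_goal = sample_goal()
    # Assumption: the reward is computed from the state that was reached.
    return (s, a, s_next, new_goal, reward_fn(s_next, new_goal))


def negative_distance(state, goal):
    # Toy reward on a 1-D line: closer to the goal means higher reward.
    return -abs(state - goal)


# One stored transition produces additional training tuples for other goals.
stored = (0, 1, 1, 5, -4)  # (s, a, s', g, r)
relabeled = [relabel(stored, lambda g=g: g, negative_distance) for g in (1, 3)]
```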

This goal resampling makes it possible to learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) access to a reward function and (2) access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. But images are noisy and a large amount of information in an image may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder=
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and uses distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from the generative model, and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE prior is trained by real images, meaningful latent goals can be sampled from the latent variable prior. This will help the agent set its own goals and practice towards them if no goal is provided at test time.
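Sampling an imagined goal from the prior can be sketched as below. The sketch assumes the prior is a standard Gaussian <math>\mathcal{N}(0, I)</math>, which is the usual choice for a VAE but is not spelled out in this summary; the function name and the latent dimension are also illustrative.

```python
import random


def sample_latent_goal(latent_dim, rng=random):
    """Draw a latent goal z_g from an assumed standard normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


# Seeding a private generator makes the imagined goal reproducible.
z_g = sample_latent_goal(latent_dim=4, rng=random.Random(0))
```

The same sampler can serve both uses mentioned above: proposing a goal for the agent to practice on, and providing replacement goals during relabeling.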

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]

The authors summarize the purpose of the latent variable representation of images as follows: it (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows alternative exploration policies to be used, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and fine-tune it over the course of training. The VAE latent space is used to condition the policy on a goal sampled from the latent model. The VAE model is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample goals from the prior and compute rewards in the latent space using the following equation:

<math display="inline"> r(s, g) = - \| z - z_g \|_A \propto \sqrt{\log e_{\Phi}(z_g \mid s)} </math>
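The reward above can be computed directly from two latent vectors. The sketch below assumes <math>\| d \|_A</math> denotes the matrix-weighted norm <math>\sqrt{d^\top A d}</math> and takes <math>A</math> as a nested list; with <math>A</math> omitted it falls back to the identity, i.e. plain Euclidean distance. How <math>A</math> is obtained from the encoder is not covered here.

```python
import math


def weighted_norm(d, A):
    """Compute sqrt(d^T A d) for a vector d and a square matrix A."""
    n = len(d)
    Ad = [sum(A[i][j] * d[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(d[i] * Ad[i] for i in range(n)))


def latent_reward(z, z_g, A=None):
    """Negative weighted distance between the latent state and latent goal."""
    d = [zi - gi for zi, gi in zip(z, z_g)]
    if A is None:
        # Assumption for this sketch: default to the identity matrix.
        A = [[1.0 if i == j else 0.0 for j in range(len(d))] for i in range(len(d))]
    return -weighted_norm(d, A)


# The reward is largest (zero) when the latent state matches the latent goal.
r_far = latent_reward([3.0, 4.0], [0.0, 0.0])
r_at_goal = latent_reward([1.0, 2.0], [1.0, 2.0])
```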

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

The figure below shows the performance of different algorithms on this task. This involved a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocity. The plots show the distance to the goal state as a function of simulation steps. The oracle, as a baseline, was given true object location information, as opposed to visual pixel information.

[[File:WF_Sec_11Nov_25_02.png|1000px]]

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies were also varied while fixing other components of the algorithm to study the effect of the relabeling strategy. In this experiment, RIG, which uses an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.
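The equal mixture of relabeling strategies can be sketched as follows. This is an illustration under assumptions: "VAE" sampling is taken to mean drawing from a standard normal prior, "Future" sampling is taken to mean picking the latent encoding of a later state in the same trajectory, and the function and parameter names are hypothetical.

```python
import random


def sample_relabel_goal(trajectory_latents, t, latent_dim, rng, p_prior=0.5):
    """Pick a relabeling goal for the transition at time step t.

    With probability p_prior the goal comes from the assumed N(0, I) prior;
    otherwise it is the latent of a state visited after step t.
    """
    future = trajectory_latents[t + 1:]
    if not future or rng.random() < p_prior:
        return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return list(rng.choice(future))


# Toy trajectory of 2-D latents; p_prior=0.5 gives the equal mixture.
rng = random.Random(0)
latents = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
goal = sample_relabel_goal(latents, t=0, latent_dim=2, rng=rng)
```

At the last step of a trajectory there are no future states, so this sketch falls back to the prior.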

Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true position during training, they used test episode returns computed with the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. a procedure that is not only goal-oriented, but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to create general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A recent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them, abbreviated ARC for actionable representations for control.

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the robotic arm's position is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images in the latent space are too blurry to be used as goal images. This should be investigated in future work. It would also be worthwhile to investigate a method that uses multiple data sources, with the agent trained to choose the source that carries more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with their relative positions in the goal images. The same situation is also observed in the variable-object experiment. We may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric
Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by
Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems
(NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan
Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International
Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International
Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew
Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement
learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning
Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey
Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888,
2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted
Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

Visual Reinforcement Learning with Imagined Goals

2018-11-27T08:05:30Z

S366chen: /* Algorithm */

Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent can also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

The algorithm proposed by the authors is summarised below. A variational autoencoder (VAE) (left) is trained to learn a latent representation of images gathered during training time. These latent variables can then be used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).

[[File: WF_Sec_11Nov25_01.png | 800px]]

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some previous works such as Levine et al. [11] proposed time-varying models which require episodic setups. Other works such as Pinto et al. [12] proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training, but do not learn goal-conditioned skills. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment; appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to directly learn one or both of two simpler quantities (state/action values or policies), which can achieve the same optimal behavior without estimating or using a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state [https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly].) There are currently no examples of model-free reinforcement learning being used to train policies on real-world robotic systems without ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting them to states seen along the sampled trajectory as was done in earlier works. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state in the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math>, that when given a state <math>s_t</math> and goal <math>g</math>, can dictate the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is defined as an action which maximizes the expected return denoted by <math>R_t</math> and defined as <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> and <math>\gamma</math> is a discount factor. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.
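As a small illustration of the return defined above, the following sketch computes the discounted sum for one observed reward sequence (a single sample of the quantity inside the expectation). The function name and the example rewards are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(i - t) * r_i over one reward sequence, with t as index 0."""
    total = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards for three time steps.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With these example values the result is 1 + 0.5 + 0.25 = 1.75.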

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization

<div align="center">
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math>
</div>

which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
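A minimal sketch of this action selection is given below, assuming a small discrete set of candidate actions so that the maximization can be done by enumeration. The names <code>greedy_action</code> and <code>toy_q</code> are hypothetical, and the toy Q-function is made up for illustration; with continuous actions the maximization would have to be approximated instead.

```python
import numpy as np


def greedy_action(q_fn, state, goal, actions):
    """Return the candidate action with the highest Q(s, a, g)."""
    values = [q_fn(state, a, goal) for a in actions]
    return actions[int(np.argmax(values))]


def toy_q(state, action, goal):
    """Hypothetical Q-function on a 1-D state: prefer actions that land near the goal."""
    return -abs((state + action) - goal)


best = greedy_action(toy_q, state=0.0, goal=1.0, actions=[-1.0, 0.0, 1.0])
```

Here the action <code>1.0</code> is selected, because it moves the state exactly onto the goal.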

Q-learning is popular because it can be trained in an off-policy manner. The only things a Q-function needs are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:

<div align="center">
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math>
</div>

where <math>\overline{w}</math> is treated as some constant.
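The Bellman error above can be sketched for a single transition as follows. This assumes a discrete candidate action set for the inner maximization and represents <math>Q_w</math> and the frozen <math>Q_{\overline{w}}</math> as plain Python callables; both choices, and all names, are illustrative assumptions rather than the authors' implementation.

```python
def bellman_error(q, q_target, s, a, s_next, g, r, gamma, actions):
    """0.5 * (Q_w(s, a, g) - (r + gamma * max_a' Q_wbar(s', a', g)))^2."""
    # The target network q_target is held fixed, so no gradient would flow through it.
    target = r + gamma * max(q_target(s_next, a2, g) for a2 in actions)
    return 0.5 * (q(s, a, g) - target) ** 2


# Hypothetical constant Q-functions, used only to make the arithmetic easy to follow.
q = lambda s, a, g: 0.0
q_target = lambda s, a, g: 1.0

err = bellman_error(q, q_target, s=0.0, a=1.0, s_next=1.0, g=2.0,
                    r=-1.0, gamma=0.5, actions=[-1.0, 1.0])
```

In this toy case the target is -1 + 0.5 * 1 = -0.5, so the error is 0.5 * (0 - (-0.5))^2 = 0.125.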

The main drawback of this training procedure is collecting data. In theory, one could learn to solve various tasks without even interacting with the world if more data were available. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data, <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated from a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified as follows:

[[File:qlr.png|center|600px]]

This goal resampling makes it possible to simultaneously learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.
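The relabeling step can be sketched as below, under the two assumptions just listed: a reward function and a goal sampler are available as callables. Evaluating the reward on the next state is an assumption of this sketch, and the function and variable names are hypothetical.

```python
def relabel_transitions(transitions, sample_goal, reward_fn, num_goals):
    """Expand each (s, a, s_next) tuple into num_goals tuples (s, a, s_next, g, r)."""
    relabeled = []
    for s, a, s_next in transitions:
        for _ in range(num_goals):
            g = sample_goal()
            # Assumption: the reward is recomputed from the next state and the new goal.
            relabeled.append((s, a, s_next, g, reward_fn(s_next, g)))
    return relabeled


# Toy 1-D example: two fixed goals stand in for draws from p(g).
goals = iter([1.0, 3.0])
batch = relabel_transitions(
    transitions=[(0.0, 1.0, 1.0)],
    sample_goal=lambda: next(goals),
    reward_fn=lambda state, goal: -abs(state - goal),
    num_goals=2,
)
```

A single environment transition yields two training tuples here, one with reward 0 (the goal was reached) and one with reward -2.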

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. But images are noisy and a large amount of information in an image may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder=
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and use distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from the generative model, and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE prior is trained by real images, meaningful latent goals can be sampled from the latent variable prior. This will help the agent set its own goals and practice towards them if no goal is provided at test time.
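Sampling imagined goals from the prior can be sketched as follows. The sketch assumes the prior is a standard unit Gaussian, which is the usual VAE choice but is not stated in this summary; the latent dimension and the function name are likewise illustrative.

```python
import numpy as np


def sample_latent_goals(num_goals, latent_dim, rng):
    """Draw latent goals z_g from an assumed unit-Gaussian prior N(0, I)."""
    return rng.standard_normal((num_goals, latent_dim))


rng = np.random.default_rng(0)
latent_goals = sample_latent_goals(num_goals=4, latent_dim=16, rng=rng)
```

Each row of <code>latent_goals</code> can serve either as an exploration goal for the policy or as a replacement goal when relabeling stored transitions.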

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]

The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows for alternate exploration policies to be used which include off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and finetune it over the course of training. VAE latent space modeling is used to allow the conditioning of policy on the goal which is sampled from the latent model. The VAE model is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample prior goals and compute rewards in the latent space using the following equation :

<math display="inline"> r(s, g) = - \| z - z_g \|_A \propto \sqrt{\log e_{\Phi}(z_g \mid s)} </math>
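The distance term of this reward can be sketched as below, reading <math>\| x \|_A</math> as the Mahalanobis norm <math>\sqrt{x^T A x}</math>. The default of an identity matrix for <math>A</math> (which gives the plain Euclidean distance) and the function name are assumptions of this sketch.

```python
import numpy as np


def latent_reward(z, z_g, A=None):
    """Negative Mahalanobis distance -sqrt((z - z_g)^T A (z - z_g)) between latents."""
    diff = np.asarray(z, dtype=float) - np.asarray(z_g, dtype=float)
    if A is None:
        # Assumption: with no weighting matrix given, fall back to Euclidean distance.
        A = np.eye(diff.shape[0])
    return -float(np.sqrt(diff @ A @ diff))


r_euclid = latent_reward([0.0, 0.0], [3.0, 4.0])
r_scaled = latent_reward([0.0, 0.0], [3.0, 4.0], A=4.0 * np.eye(2))
```

The reward is zero when the encoded state coincides with the latent goal and becomes more negative as the two move apart.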

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

The figure below shows the performance of different algorithms on this task. This involved a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocity. The plots show the distance to the goal state as a function of simulation steps. The oracle, as a baseline, was given true object location information, as opposed to visual pixel information.

[[File:WF_Sec_11Nov_25_02.png|1000px]]

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies were also varied, with the other components of the algorithm held fixed, to study the effect of the relabeling strategy. In this experiment, RIG, which uses an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.

Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is
pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated, and the method requires about 25,000 samples. Because the authors do not have access to the true position during training, they report test episode returns computed with the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that goals could instead be specified with other representations, such as language and demonstrations. Also, while the paper provides a mechanism to sample goals for autonomous exploration, the proposed method can be combined with existing work that chooses these goals in a more principled way, i.e. with a procedure that is not only goal-oriented but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to creating general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A new paper [10], published last week, builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them; the approach is abbreviated ARC, for actionable representations for control.

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the robotic arm's position is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images in the latent space are too blurry to be used as goal images. This should be investigated in future work. It would also be worthwhile to investigate a method that uses multiple data sources, with the agent trained to choose the source that carries more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with their relative positions in the goal images. The same situation is also observed in the variable-object experiment. We may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric
Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by
Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems
(NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan
Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International
Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International
Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew
Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement
learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning
Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey
Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888,
2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted
Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

Visual Reinforcement Learning with Imagined Goals

2018-11-27T08:04:04Z

S366chen: /* Algorithm */

Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent can also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

The algorithm proposed by the authors is summarised below. A variational autoencoder (VAE) (left) is trained to learn a latent representation of images gathered during training time. These latent variables can then be used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).

[[File: WF_Sec_11Nov25_01.png | 800px]]

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some previous works such as Levine et al. [11] proposed time-varying models which require episodic setups. Other works such as Pinto et al. [12] proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training, but do not learn goal-conditioned skills. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment; appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to directly learn one or both of two simpler quantities (state/action values or policies), which can achieve the same optimal behavior without estimating or using a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state [https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly].) There are currently no examples of model-free reinforcement learning being used to train policies on real-world robotic systems without ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting them to states seen along the sampled trajectory as was done in earlier works. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state in the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math>, that when given a state <math>s_t</math> and goal <math>g</math>, can dictate the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is defined as an action which maximizes the expected return denoted by <math>R_t</math> and defined as <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> and <math>\gamma</math> is a discount factor. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization

<div align="center">
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math>
</div>

which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
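The action-selection rule above can be sketched as follows, assuming a small finite action set so that the maximization is a simple enumeration (with continuous actions it would have to be approximated). The toy Q-function <code>toy_q</code> is hypothetical and only stands in for a trained one.

```python
def greedy_action(q_function, state, goal, actions):
    """Pick the action with the highest Q(s, a, g) from a finite action set."""
    return max(actions, key=lambda a: q_function(state, a, goal))


def toy_q(state, action, goal):
    """Hypothetical Q-function on a 1-D line: prefers moves that end closer to the goal."""
    return -abs((state + action) - goal)


actions = [-1, 0, 1]
print(greedy_action(toy_q, state=2, goal=5, actions=actions))  # 1 (move toward the goal)
```
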

Q-learning is popular because it can be trained in an off-policy manner: the only things a Q-function needs are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. A preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:

<div align="center">
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math>
</div>

where <math>\overline{w}</math> is treated as some constant.
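To make the Bellman error concrete, here is a minimal sketch that evaluates it for one transition, again assuming a finite action set. The two toy Q-functions are hypothetical: <code>q</code> plays the role of <math>Q_w</math> and <code>q_target</code> the role of the frozen <math>Q_{\overline{w}}</math>.

```python
def bellman_error(q, q_target, transition, actions, gamma):
    """0.5 * (Q_w(s,a,g) - (r + gamma * max_a' Q_wbar(s',a',g)))^2 for one sample."""
    s, a, s_next, g, r = transition
    # q_target is held fixed, so the target is a constant with respect to w.
    target = r + gamma * max(q_target(s_next, a2, g) for a2 in actions)
    return 0.5 * (q(s, a, g) - target) ** 2


def q(s, a, g):
    """Hypothetical untrained Q-function that predicts zero everywhere."""
    return 0.0


def q_target(s, a, g):
    """Hypothetical frozen target Q-function on a 1-D line."""
    return -abs((s + a) - g)


transition = (2, 1, 3, 5, -2.0)  # (s, a, s', g, r)
print(bellman_error(q, q_target, transition, actions=[-1, 0, 1], gamma=0.5))  # 3.125
```
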

The main bottleneck in this training procedure is collecting data. In theory, if more data could be generated artificially, one could learn to solve various tasks without even interacting with the world. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated from a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified like so:

[[File:qlr.png|center|600px]]
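The relabeling step can be sketched as below. This is only an illustration under stated assumptions: states and goals are plain numbers, the goal distribution <math>p(g)</math> is a uniform distribution chosen for the example, and the reward is recomputed on the next state, which is one possible reading of <math>r(s,g)</math>.

```python
import random


def relabel(transition, sample_goal, reward_fn):
    """Replace the goal of a stored transition and recompute its reward."""
    s, a, s_next, _old_goal, _old_reward = transition
    new_goal = sample_goal()
    # Assumption: the reward is evaluated on the state reached after the action.
    return (s, a, s_next, new_goal, reward_fn(s_next, new_goal))


def reward_fn(state, goal):
    """Negative distance between a state and a goal."""
    return -abs(state - goal)


rng = random.Random(0)


def sample_goal():
    """Hypothetical stand-in for sampling from p(g)."""
    return rng.uniform(-1.0, 1.0)


stored = (0.2, 0.1, 0.3, 0.9, -0.6)  # one (s, a, s', g, r) tuple from the buffer
relabeled = [relabel(stored, sample_goal, reward_fn) for _ in range(3)]
for t in relabeled:
    print(t)
```

Each relabeled tuple shares the same <math>(s,a,s')</math> but carries a different goal and reward, so a single environment step yields several training samples.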

This goal resampling makes it possible to simultaneously learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. But images are noisy and a large amount of information in an image may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder=
A variational autoencoder (VAE) is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q-functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and uses distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.
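A minimal sketch of a latent-distance reward is given below. The encoder here is a hypothetical fixed linear map standing in for the trained VAE encoder, and the plain Euclidean distance is an assumption made for simplicity.

```python
import math


def encode(image, weights):
    """Hypothetical stand-in for the VAE encoder: a fixed linear projection."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def latent_reward(image, goal_image, weights):
    """r = -||z - z_g||: negative Euclidean distance between latent codes."""
    z = encode(image, weights)
    z_g = encode(goal_image, weights)
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_g)))


# The toy encoder keeps the first two "pixels" and ignores the last two,
# mimicking a representation that discards task-irrelevant image content.
weights = [[1, 0, 0, 0],
           [0, 1, 0, 0]]
image = [0, 0, 5, 7]
goal_image = [3, 4, 1, 2]
print(latent_reward(image, goal_image, weights))  # -5.0
```

In this toy case the last two pixels differ between the images but do not affect the reward, which is the behaviour one hopes a learned latent space provides for nuisance details.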

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons. First, it provides a mechanism for an agent to set its own goals: the agent simply samples a value for the latent variable from the generative model and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE is trained on real images, meaningful latent goals can be sampled from the latent variable prior. This helps the agent set its own goals and practice towards them when no goal is provided.

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]
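Sampling an imagined goal from the prior can be sketched as follows, assuming the prior is a unit Gaussian <math>N(0, I)</math>, which is the common choice for VAEs. The latent dimension of 4 is an arbitrary value for the example.

```python
import random


def sample_latent_goal(latent_dim, rng):
    """Draw z_g from a standard normal prior N(0, I), one coordinate at a time."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


rng = random.Random(0)  # seeded so the example is reproducible
z_g = sample_latent_goal(latent_dim=4, rng=rng)
print(z_g)
```
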

The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows alternative exploration policies to be used, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and fine-tune it over the course of training. The VAE latent space is used to condition the policy on a goal sampled from the latent model. The VAE is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample goals from the prior and compute rewards in the latent space using the following equation:

<math display="inline">r(s, g) = - \| z - z_g \|_A</math>

where <math>z</math> and <math>z_g</math> are the latent encodings of the state <math>s</math> and the goal <math>g</math>, and <math>\|\cdot\|_A</math> denotes a norm weighted by the matrix <math>A</math>.

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

The figure below shows the performance of different algorithms on this task. This involved a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocity. The plots show the distance to the goal state as a function of simulation steps. The oracle, as a baseline, was given true object location information, as opposed to visual pixel information.

[[File:WF_Sec_11Nov_25_02.png|1000px]]

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error as a reward. The resampling strategies are also varied while fixing other components of the algorithm to study the effect of the relabeling strategy. In this experiment, RIG, which uses an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.
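The equal mixture of relabeling strategies mentioned above could be sketched as follows. This is an illustration only: it assumes that "VAE" means sampling a latent goal from a unit Gaussian prior and that "Future" means picking the latent encoding of a later state from the same trajectory; the function name and toy latents are hypothetical.

```python
import random


def choose_relabel_goal(future_latents, latent_dim, rng, p_prior=0.5):
    """Pick a replay goal: a prior sample with probability p_prior, else a future latent."""
    if not future_latents or rng.random() < p_prior:
        # "VAE" strategy (assumed): sample the goal from the unit Gaussian prior.
        return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # "Future" strategy (assumed): reuse the latent of a later state in the trajectory.
    return rng.choice(future_latents)


rng = random.Random(0)
future = [[0.1, 0.2], [0.3, 0.4]]  # hypothetical latents of later states
goals = [choose_relabel_goal(future, latent_dim=2, rng=rng) for _ in range(1000)]
from_future = sum(g in future for g in goals)
print(from_future)  # roughly half of the 1000 goals
```
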

Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated, and the method requires about 25,000 samples. Since the authors do not have access to the true object position during training, they report test episode returns under the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. with a procedure that is not only goal-oriented but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to creating general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A recent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them, abbreviated ARC (actionable representations for control).

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the position of the robotic arm is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images decoded from the latent space are too blurry to be used as goal images. It would be worthwhile to investigate this in future work, along with a method that draws on multiple data sources and trains the agent to choose the source with the most complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with their relative positions in the goal images. The same behaviour is also observed in the variable-object experiment. One may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

Visual Reinforcement Learning with Imagined Goals

2018-11-27T08:02:35Z

S366chen: /* Algorithm */

Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]

=Introduction and Motivation=

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent can also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to achieve a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

The algorithm proposed by the authors is summarised below. A Variational Autoencoder (VAE) (left) is trained to learn a latent representation of images gathered during training time (center). These latent variables can then be used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).

[[File: WF_Sec_11Nov25_01.png | 800px]]

=Related Work =

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviours such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some previous works such as Levine et al. [11] proposed time-varying models which require episodic setups. There are also other works such as Pinto et al. [12] that proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training but do not learn goal-conditioned skills. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment; appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state [https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly].) There are currently no examples of model-free reinforcement learning being used to train policies on real-world robotic systems without ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting them to states seen along the sampled trajectory as was done in earlier works. Specifically, the authors use a model-free Q-learning method that operates on raw state observations and actions.

Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state as input to the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

=Goal-Conditioned Reinforcement Learning=

The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math>, that when given a state <math>s_t</math> and goal <math>g</math>, can dictate the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is defined as an action which maximizes the expected return denoted by <math>R_t</math> and defined as <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> and <math>\gamma</math> is a discount factor. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.

[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.

In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization

<div align="center">
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math>
</div>

which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.

Q-learning is popular because it can be trained in an off-policy manner: the only things a Q-function needs are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. A preliminary goal-conditioned Q-learning algorithm looks like this:

[[File:ql.png|center|600px]]

From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:

<div align="center">
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math>
</div>

where <math>\overline{w}</math> is treated as some constant.

The main bottleneck in this training procedure is collecting data. In theory, if more data could be generated artificially, one could learn to solve various tasks without even interacting with the world. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated from a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified like so:

[[File:qlr.png|center|600px]]

This goal resampling makes it possible to simultaneously learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.

For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. But images are noisy and a large amount of information in an image may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.

=Variational Autoencoder=
A variational autoencoder (VAE) is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q-functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and uses distances in that latent space for reward. [9]]]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons. First, it provides a mechanism for an agent to set its own goals: the agent simply samples a value for the latent variable from the generative model and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE is trained on real images, meaningful latent goals can be sampled from the latent variable prior. This helps the agent set its own goals and practice towards them when no goal is provided.

[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]

The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]

Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows alternative exploration policies to be used, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and fine-tune it over the course of training. The VAE latent space is used to condition the policy on a goal sampled from the latent model. The VAE is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample goals from the prior and compute rewards in the latent space using Equation (3) of the paper.

=Experiments=

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

The figure below shows the performance of different algorithms on this task. This involved a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocity. The plots show the distance to the goal state as a function of simulation steps. The oracle, as a baseline, was given true object location information, as opposed to visual pixel information.

[[File:WF_Sec_11Nov_25_02.png|1000px]]

They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error as a reward. The resampling strategies are also varied while fixing other components of the algorithm to study the effect of the relabeling strategy. In this experiment, RIG, which uses an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.

Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]

The method for reaching only needs 10,000 samples and an hour of real-world interactions.

They also used RIG to train a policy to push objects to target locations:

[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]

The pushing task is more complicated and the method requires about 25,000 samples. Since the authors do not have the true position during training, they measured test episode returns using the VAE latent distance reward.

=Conclusion & Future Work=

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that one could instead use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. a procedure that is not only goal-oriented, but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to create general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A recent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them, abbreviated ARC for actionable representations for control.

=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the position of the robotic arm is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images in the latent space are too blurry to be used as goal images. It would be better if this were investigated in the future. It would also be worth investigating a method that uses multiple data sources, where the agent is trained to choose the source that has more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with the relative positions of the two pucks in the goal images. The same situation is also observed in the variable-object experiment. We may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to solving real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario, and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

Unsupervised Neural Machine Translation

2018-11-27T05:54:51Z

S366chen: /* Experiments and Results */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. An open-source implementation of this paper is available [https://github.com/artetxem/undreamt here].

= Introduction =
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses monolingual corpora (single language texts) only. This contrasts with the usual supervised NMT approach, which relies on parallel corpora (aligned text) from the source and target languages being available for training. This problem is important because parallel corpora do not exist for the majority of language pairs, e.g. German-Russian.

Other authors have recently tried to address this problem using semi-supervised approaches (small set of parallel corpora). However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information altogether and relies solely on monolingual data. The proposed method builds upon the work done recently on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.
# Align the 2 sets of word embeddings into a single cross lingual (language independent) embedding.
Then iteratively perform:
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and different decoders for each language. The encoder uses cross lingual word embedding.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus to vector embeddings. Mikolov et al. improve the continuous Skip-gram model for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. These embeddings have been shown to contain contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner and without parallel data. Without going into the details, the general approach of that paper is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and iteratively learn the mapping between the 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. This is in contrast to earlier work, which used dictionaries of a few thousand words.
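The self-learning loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation of [Artetxe, 2017]: it assumes length-normalized embeddings, an orthogonal mapping obtained from the SVD-based Procrustes solution, and dictionary induction by nearest neighbor under the dot product. The function names are hypothetical.

```python
import numpy as np

def learn_mapping(x, z, pairs):
    """Orthogonal W minimizing ||x[src] @ W - z[tgt]|| (Procrustes via SVD)."""
    src, tgt = zip(*pairs)
    u, _, vt = np.linalg.svd(x[list(src)].T @ z[list(tgt)])
    return u @ vt

def induce_dictionary(x, z, w):
    """Pair each source word with its nearest target word after mapping."""
    similarities = (x @ w) @ z.T
    return [(i, int(j)) for i, j in enumerate(similarities.argmax(axis=1))]

def self_learning(x, z, seed_pairs, iterations=5):
    """Alternate between fitting the mapping and re-inducing the dictionary."""
    pairs = list(seed_pairs)
    for _ in range(iterations):
        w = learn_mapping(x, z, pairs)
        pairs = induce_dictionary(x, z, w)
    return w, pairs
```

Here <code>x</code> and <code>z</code> are matrices with one embedding per row, and <code>seed_pairs</code> plays the role of the numeral dictionary. A fixed iteration count is used only for simplicity; a convergence criterion would be a natural replacement.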

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering techniques (decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts) to develop a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext (encrypted or encoded information because it contains a form of the original plaintext that is unreadable by a human or computer without the proper cipher for decoding) and model the generation process of the ciphertext as a two-stage process, which includes the generation of the original English sequence and the probabilistic replacement of the words in it. This approach takes advantage of the incorporation of syntactic knowledge of the languages. The use of word embeddings has also shown improvements in statistical decipherment.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do NMT. Some use a third intermediate language that is well connected to the source and target languages independently. For example, if we want to translate German into Russian, we can use English as an intermediate language (German-English and then English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well for language pairs for which no parallel data was used during training. Firat et al. (2016) and Chen et al. (2017) showed that advanced models such as the teacher-student framework can improve over the baseline of translating through a third intermediate language.

Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]: monolingual sentences in the target language are automatically translated into the source language, and the resulting synthetic sentence pairs are added to the parallel training data.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start (about 1.2 million sentences), while this paper does not use parallel data.

= Methodology =

The corpora are first preprocessed in a standard way to tokenize and truecase the words. The authors also experimented with an alternative way of tokenizing words by using Byte-Pair Encoding (BPE) [Sennrich, 2016] (byte pair encoding, or digram coding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data). BPE has been shown to improve embeddings of rare words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).

The tokens were then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target language, while the decoder is different for each language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT systems are usually built for one translation direction, English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps these fixed throughout training, so the encoder receives language-independent representations of the words. This approach ensures that the encoder only learns how to compose the language-independent representations to build representations of larger phrases. This requires existing unsupervised methods to create embeddings using monolingual corpora, as discussed in the background. In the proposed method, even though the embeddings used are cross-lingual, the vocabulary used for each language is different. This way, a word which occurs in two different languages but has a different meaning in those languages would get a different vector in each of these languages despite being in the same vector space.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===
Random noise is added to the input sentences in order to allow the model to learn some structure of languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, and then be decoded by the language dependent decoder.

Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, noisy version of <math>x</math>. In the proposed model, <math>C(x)</math> is constructed by randomly swapping contiguous words. If the length of the input sequence <math>x</math> is <math>N</math>, then a total of <math>\frac{N}{2}</math> such swaps are made.
# Input <math>C(x)</math> into the current iteration of the shared encoder and use decoder for L1 to get reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function performs <math>N/2</math> random swaps of contiguous words, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce the reliance of the model on the order of words in a sentence, which may be different in the source and target languages. The system also needs to correctly learn the internal structure of a language to decode the sentence into the correct order.
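The noise function <math>C(x)</math> can be sketched in a few lines. This is a minimal illustration that assumes sentences are already tokenized into lists of words and that swap positions are drawn uniformly at random; the name <code>add_swap_noise</code> is hypothetical and the details may differ from the authors' implementation.

```python
import random

def add_swap_noise(tokens, rng=None):
    """Return a noisy copy of tokens after N // 2 random swaps of adjacent words."""
    rng = rng or random.Random()
    noisy = list(tokens)
    n = len(noisy)
    for _ in range(n // 2):
        i = rng.randrange(n - 1)  # position of the left word in the swapped pair
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy
```

The output always contains the same words as the input, only in a locally perturbed order, so the decoder must rely on what it has learned about the language to restore the original sentence.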

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding it again using the source decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of shared encoder and the decoder in L2 to construct translation <math>y</math> in L2,
# Construct <math>C(y)</math>, noisy version of translation <math>y</math>,
# Input <math>C(y)</math> into the current iteration of shared encoder and the decoder in L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.
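
As a rough way to compare the two objectives side by side, let <math>P_{L1}(\cdot \mid \cdot)</math> denote the distribution over L1 sentences produced by the shared encoder followed by the L1 decoder. This notation is introduced here for illustration and is not taken from the paper. For a sentence <math>x</math> in L1 with pseudo-translation <math>y</math> in L2, the cross entropy losses could then be written as:

<math>\mathcal{L}_{denoise}(x) = -\log P_{L1}(x \mid C(x)), \qquad \mathcal{L}_{backtrans}(x) = -\log P_{L1}(x \mid C(y))</math>

The only difference is the input to the shared encoder: a noisy sentence in the same language for denoising, and a noisy pseudo-translation in the other language for back-translation. The losses for sentences in L2 are obtained by exchanging the roles of L1 and L2.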

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration would perform one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
Greedy decoding was used at training time for back-translation, while inference at test time used beam search with a beam size of 12.
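
The alternation between the two objectives can be sketched as follows. The model interface (<code>translate</code> and <code>train_step</code>) and the <code>RecordingModel</code> stand-in are hypothetical and exist only to make the sketch runnable; a real system would replace them with the shared encoder and the two decoders.

```python
class RecordingModel:
    """Stand-in for the real network: records calls instead of learning."""

    def __init__(self):
        self.calls = []

    def translate(self, sentence, src, tgt):
        return list(sentence)  # placeholder for greedy decoding into tgt

    def train_step(self, inputs, targets, src, tgt):
        self.calls.append((src, tgt))
        return 0.0  # placeholder loss

def training_iteration(model, batch_l1, batch_l2, noise, l1="L1", l2="L2"):
    """One iteration: denoising in both languages, then back-translation both ways."""
    losses = []
    for batch, lang in ((batch_l1, l1), (batch_l2, l2)):
        # Denoising: reconstruct x from C(x) within the same language.
        noisy = [noise(x) for x in batch]
        losses.append(model.train_step(noisy, batch, src=lang, tgt=lang))
    for batch, src, tgt in ((batch_l1, l1, l2), (batch_l2, l2, l1)):
        # Back-translation: translate C(x) with the current model, then
        # train the opposite direction to recover x from C(y).
        pseudo = [model.translate(noise(x), src=src, tgt=tgt) for x in batch]
        noisy_pseudo = [noise(y) for y in pseudo]
        losses.append(model.train_step(noisy_pseudo, batch, src=tgt, tgt=src))
    return losses
```

Because the pseudo-translations are generated by the current model inside each iteration, they improve as training progresses, which is the difference from back-translating the whole corpus once with an independent model.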

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model was evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.
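
To make the metric concrete, here is a simplified sentence-level sketch of BLEU with a single reference, clipped n-gram precisions up to 4-grams, and a brevity penalty. It is only an illustration: reported BLEU scores are normally computed at the corpus level with standard tooling, and the function names here are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped precisions times brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to the reference scores 1, and a candidate sharing no words with the reference scores 0. Without smoothing, any sentence with no matching 4-gram also scores 0, which is one reason corpus-level scores are preferred in practice.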

The authors trained the translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

The results show that backtranslation is essential for the proposed system to work properly. The denoising technique alone is below the baseline while big improvements appear when introducing backtranslation.

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements between 40% and 140%. The results also show that back-translation is essential. Denoising alone does not show a big improvement; however, it is required for back-translation, because otherwise back-translation would translate nonsensical sentences. The addition of back-translation, however, does show a large improvement in all tested cases.

For the BPE experiment, results show that it helps for some language pairs but hurts for others. This is because while BPE helped to translate some rare words, it increased the error rates for other words. It also did not perform well when translating named entities which occur infrequently.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that the semi-supervised model in row 6 outperforms the supervised model in row 7. One possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set covers corpora from all domains.

===Supervised===

This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs plus the UN and the Gigaword corpus for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments in order to compare with the semi-supervised scenario.

The Comparable NMT was trained using the same proposed model except that it does not use monolingual corpora, and consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest using larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations. They show that the proposed system can produce high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able not only to go beyond a literal word-by-word substitution but also to model structural differences between the languages (e.g., it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty with odd sentence structures, which shows that the proposed system has limitations. Specifically, the authors point out that the proposed model has difficulty preserving some concrete details from source sentences. The results also show that the proposed model's translation quality often lags behind that of a standard supervised NMT system, and there are some cases where both fluency and adequacy problems severely hinder understanding of the original message from the proposed translation, suggesting that there is still room for improvement and possible future work.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system trained with denoising and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character-level information into the model, which might help address some of the adequacy issues observed in the authors' manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

As pointed out, in order to critically analyze the effect of the algorithm, we need to formulate the algorithm in terms of mathematics.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used create a limitation and a ceiling for performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages (English and French), and the model does much worse for English-German, even though English and German are also closely related (but less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis just checks whether the proposed unsupervised NMT generates a sensible translation. It is limited, and a more detailed analysis of the characteristics and properties of the translations generated by unsupervised NMT is needed.

* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW]) Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson, et al., "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction"

Unsupervised Neural Machine Translation

2018-11-27T05:54:10Z

S366chen: /* Methodology */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. An open-source implementation of this paper is available [https://github.com/artetxem/undreamt here].

= Introduction =
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses monolingual corpora (single language texts) only. This contrasts with the usual supervised NMT approach, which relies on parallel corpora (aligned text) from the source and target languages being available for training. This problem is important because parallel corpora do not exist for the majority of language pairs, e.g. German-Russian.

Other authors have recently tried to address this problem using semi-supervised approaches (small set of parallel corpora). However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information altogether and relies solely on monolingual data. The proposed method builds upon the work done recently on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.
# Align the 2 sets of word embeddings into a single cross lingual (language independent) embedding.
Then iteratively perform:
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and different decoders for each language. The encoder uses cross lingual word embedding.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus to vector embeddings. Mikolov et al. improve the continuous Skip-gram model for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. These embeddings have been shown to contain contextual and syntactic features independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner and without parallel data. Without going into the details, the general approach is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and iteratively learn the mapping between the 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. This is in contrast to earlier work, which used dictionaries of a few thousand words.
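
The iterative procedure described above can be illustrated with a minimal sketch. It assumes length-normalised embedding matrices, an orthogonal mapping solved in closed form by SVD, and nearest-neighbour dictionary induction by dot product. The function names are hypothetical, and the sketch is not claimed to match the exact procedure of [Artetxe, 2017].

```python
import numpy as np

def learn_orthogonal_map(X, Z, pairs):
    """Find the orthogonal W that minimises ||X[src] W - Z[tgt]||_F (Procrustes)."""
    src = [i for i, _ in pairs]
    tgt = [j for _, j in pairs]
    # The SVD of the cross-covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(X[src].T @ Z[tgt])
    return U @ Vt

def induce_dictionary(X, Z, W):
    """Pair every source word with its most similar target word after mapping."""
    sims = (X @ W) @ Z.T  # dot-product similarity; rows assumed to have unit norm
    return [(i, int(j)) for i, j in enumerate(sims.argmax(axis=1))]

def self_learning(X, Z, seed_pairs, n_iter=5):
    """Alternate between learning the mapping and re-inducing the dictionary."""
    pairs = list(seed_pairs)
    for _ in range(n_iter):
        W = learn_orthogonal_map(X, Z, pairs)
        pairs = induce_dictionary(X, Z, W)
    return W, pairs
```

Here <code>X</code> and <code>Z</code> are the source and target embedding matrices (one row per word), and <code>seed_pairs</code> is a list of (source index, target index) tuples such as the numeral pairings. A fixed number of iterations is used for simplicity; a convergence criterion would be a natural replacement.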

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering techniques (decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts) to develop a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext (encrypted or encoded information because it contains a form of the original plaintext that is unreadable by a human or computer without the proper cipher for decoding) and model the generation process of the ciphertext as a two-stage process, which includes the generation of the original English sequence and the probabilistic replacement of the words in it. This approach takes advantage of the incorporation of syntactic knowledge of the languages. The use of word embeddings has also shown improvements in statistical decipherment.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do NMT. Some use a third intermediate language that is well connected to the source and target languages independently. For example, if we want to translate German into Russian, we can use English as an intermediate language (German-English and then English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs for which no parallel data was used during training. Firat et al. (2016) and Chen et al. (2017) showed that advanced models such as the teacher-student framework can improve over the baseline of translating through a third intermediate language.

Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]. First, a synthetic parallel corpus in the target language is created. The translated sentences are then back-translated to the source language and compared with the original sentences.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start (about 1.2 million sentences), while this paper does not use parallel data.

= Methodology =

The corpora are first preprocessed in a standard way to tokenize and case the words. The authors also experimented with an alternative way of tokenizing words by using Byte-Pair Encoding (BPE) [Sennrich, 2016]. (Byte pair encoding, or digram coding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data.) BPE has been shown to improve embeddings of rare words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).

The tokens were then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target languages, while the decoder is different for each language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT systems are usually built for translation in one direction, English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps them fixed throughout training, so the encoder receives language-independent representations of the words. This approach ensures that the encoder only learns how to compose the language-independent representations to build representations of larger phrases. This requires existing unsupervised methods to create embeddings using monolingual corpora, as discussed in the Background section. In the proposed method, even though the embeddings used are cross-lingual, the vocabulary used for each language is different. This way, a word which occurs in two different languages but has a different meaning in those languages gets a different vector in each language, despite both vectors being in the same vector space.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===
Random noise is added to the input sentences in order to allow the model to learn some structure of the languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, so that its output can then be decoded by the language-dependent decoder.

Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, a noisy version of <math>x</math>. In the proposed model, <math>C(x)</math> is constructed by randomly swapping contiguous words. If the length of the input sequence <math>x</math> is <math>N</math>, then a total of <math>\frac{N}{2}</math> such swaps are made.
# Input <math>C(x)</math> into the current iteration of the shared encoder and use the decoder for L1 to get the reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.
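
Written as a formula, the denoising objective for a sentence <math>x = (x_1, \dots, x_N)</math> in L1 can be sketched as follows. The symbols <math>E</math> for the shared encoder and <math>D_1</math> for the L1 decoder are notation assumed here for illustration and are not taken from the paper.

<math>\mathcal{L}_{\mathrm{denoise}}(x) = -\log P_{D_1}\left(x \mid E(C(x))\right) = -\sum_{t=1}^{N} \log P_{D_1}\left(x_t \mid x_{1:t-1}, E(C(x))\right)</math>

The second form assumes the usual left-to-right factorization of an RNN decoder, so the cross entropy is accumulated word by word over the original, noise-free sentence.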

The proposed noise function performs <math>N/2</math> random swaps of contiguous words, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce the reliance of the model on the order of words in a sentence, which may be different in the source and target languages. The system also needs to correctly learn the structure of a language in order to decode the sentence into the correct order.
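
A minimal sketch of such a noise function is shown below. It assumes that <math>N/2</math> is rounded down and that swap positions are drawn uniformly and independently, neither of which is specified in this summary. The name <code>add_swap_noise</code> is hypothetical.

```python
import random

def add_swap_noise(tokens, rng=random):
    """Return a copy of `tokens` after len(tokens) // 2 random swaps of adjacent words."""
    noisy = list(tokens)  # work on a copy so the clean sentence stays available as the target
    n = len(noisy)
    if n < 2:
        return noisy  # nothing to swap
    for _ in range(n // 2):
        i = rng.randrange(n - 1)  # position that has a right-hand neighbour
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy
```

Passing a seeded <code>random.Random</code> instance as <code>rng</code> makes the corruption reproducible. Because positions are drawn independently, the same pair can be swapped twice, so some sentences end up less corrupted than others.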

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding it again using the source decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of the shared encoder and the decoder for L2 to construct a translation <math>y</math> in L2.
# Construct <math>C(y)</math>, a noisy version of the translation <math>y</math>.
# Input <math>C(y)</math> into the current iteration of the shared encoder and the decoder for L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.
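
The back-translation steps above can be summarised in a formula. The symbols <math>E</math> for the shared encoder and <math>D_1</math>, <math>D_2</math> for the L1 and L2 decoders are notation assumed here for illustration.

<math>y = D_2\left(E(C(x))\right), \qquad \mathcal{L}_{\mathrm{bt}}(x) = -\log P_{D_1}\left(x \mid E(C(y))\right)</math>

This sketch assumes that the pseudo-translation <math>y</math> is treated as a fixed input once it has been generated, so that only the reconstruction of <math>x</math> contributes to the gradient. The summary does not state this explicitly.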

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration performs one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
Greedy decoding was used at training time for back-translation, while inference at test time was done using beam search with a beam size of 12.
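
The alternation between objectives can be sketched as a simple schedule. The names <code>SCHEDULE</code>, <code>training_steps</code> and <code>train</code> are hypothetical, <code>step_fn</code> stands in for the actual mini-batch update (not shown), and a fixed iteration count replaces the convergence check.

```python
from itertools import cycle, islice

# One training iteration consists of four mini-batch updates, in this order.
SCHEDULE = [
    ("denoising", "L1"),
    ("denoising", "L2"),
    ("back-translation", "L1 to L2"),
    ("back-translation", "L2 to L1"),
]

def training_steps(n_iterations):
    """Yield (objective, direction) for every mini-batch update."""
    return islice(cycle(SCHEDULE), n_iterations * len(SCHEDULE))

def train(n_iterations, step_fn):
    """Run the alternating schedule.

    `step_fn` is a hypothetical callback that would sample a mini-batch and
    apply one gradient update for the given objective and direction.
    """
    for objective, direction in training_steps(n_iterations):
        step_fn(objective, direction)
```

For example, <code>train(2, print)</code> prints the eight updates of two iterations in order.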

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.
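
To illustrate what the score measures, the following is a simplified sketch of BLEU for a single tokenised sentence pair with one reference and no smoothing. Reported BLEU scores are normally computed at corpus level by aggregating n-gram counts over the whole test set, so this function is illustrative only and is not the evaluation script used in the paper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU for one tokenised sentence pair."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(matched / total))
    # The brevity penalty punishes candidates that are shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

The function returns a value between 0 and 1, whereas results tables usually report the score multiplied by 100.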

The paper trains the translation model under 3 different settings to compare performance (Table 1). All training and testing data came from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

The results show that backtranslation is essential for the proposed system to work properly. The denoising technique alone is below the baseline while big improvements appear when introducing backtranslation.

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise during evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements between 40% and 140%. Results also show that back-translation is essential. Denoising alone doesn't show a big improvement; however, it is required for back-translation, because otherwise back-translation would translate nonsensical sentences. The addition of back-translation, however, does show a large improvement in all tested cases.

For the BPE experiment, results show that it helps in some language pairs but detracts in others. This is because, while BPE helped to translate some rare words, it increased the error rates for other words. It also did not perform well when translating named entities which occur infrequently.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. It is surprising that the semi-supervised model in row 6 outperforms the supervised model in row 7. One possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set covers corpora from all domains.

===Supervised===

This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs, plus the UN and the Gigaword corpus for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments in order to compare with the semi-supervised scenario.

The comparable NMT was trained using the same proposed model, except that it does not use monolingual corpora and, consequently, was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT in row 10, which suggests that adding the additional constraints to enable unsupervised learning also limits the potential performance. To improve these results, the authors also suggest using larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations. They show that the proposed system produces high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able not only to go beyond a literal word-by-word substitution but also to model structural differences between the languages (e.g., it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty comprehending odd sentence structures, which shows that the proposed system has limitations. In particular, the authors point out that the proposed model has difficulty preserving some concrete details from source sentences. The results also show that the proposed model's translation quality often lags behind that of a standard supervised NMT system, and there are some cases with both fluency and adequacy problems that severely hinder understanding the original message from the proposed translation, suggesting that there is still room for improvement and possible future work.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system trained with denoising and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training.
*Progressively reduce the noise level.
*Incorporate character-level information into the model, which might help address some of the adequacy issues observed in the authors' manual analysis.
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

As pointed out, in order to critically analyze the effect of the algorithm, we need to formulate the algorithm in terms of mathematics.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used create a limitation and a ceiling for performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages (English and French), and the model does much worse for English-German, even though English and German are also closely related (but less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis just checks whether the proposed unsupervised NMT generates a sensible translation. It is limited, and a more detailed analysis of the characteristics and properties of the translations generated by unsupervised NMT is needed.

* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW]) Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data".
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson et al., "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction."

Unsupervised Neural Machine Translation

2018-11-27T05:53:48Z

S366chen: /* Methodology */

This paper was published in ICLR 2018, authored by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. An open-source implementation of this paper is available [https://github.com/artetxem/undreamt here].

= Introduction =
The paper presents an unsupervised Neural Machine Translation (NMT) method that uses monolingual corpora (single language texts) only. This contrasts with the usual supervised NMT approach, which relies on parallel corpora (aligned text) from the source and target languages being available for training. This problem is important because parallel corpora do not exist for the majority of language pairs, e.g. German-Russian.

Other authors have recently tried to address this problem using semi-supervised approaches that rely on a small set of parallel corpora. However, these methods still require a strong cross-lingual signal. The proposed method eliminates the need for cross-lingual information altogether and relies solely on monolingual data. It builds upon recent work on unsupervised cross-lingual embeddings by Artetxe et al., 2017 and Zhang et al., 2017.

The general approach of the methodology is to:

# Use monolingual corpora in the source and target languages to learn single language word embeddings for both languages separately.
# Align the 2 sets of word embeddings into a single cross lingual (language independent) embedding.
Then iteratively perform:
# Train an encoder-decoder model to reconstruct noisy versions of sentences in both source and target languages separately. The model uses a single encoder and different decoders for each language. The encoder uses cross lingual word embedding.
# Tune the decoder in each language by back-translating between the source and target language.

= Background =

===Word Embedding Alignment===

The paper uses word2vec [Mikolov, 2013] to convert each monolingual corpus to vector embeddings. Mikolov et al. improve the continuous Skip-gram model for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. These embeddings have been shown to contain contextual and syntactic features that are independent of language, and so, in theory, there could exist a linear map that maps the embeddings from language L1 to language L2.

Figure 1 shows an example of aligning the word embeddings in English and French.

[[File:Figure1_lwali.png|frame|400px|center|Figure 1: the word embeddings in English and French (a & b), and (c) shows the aligned word embeddings after some linear transformation.[Gouws,2016]]]

Most cross-lingual word embedding methods use bilingual signals in the form of parallel corpora. Usually, the embedding mapping methods train the embeddings in different languages using monolingual corpora, then use a linear transformation to map them into a shared space based on a bilingual dictionary.

The paper uses the methodology proposed by [Artetxe, 2017] to align cross-lingual embeddings in an unsupervised manner and without parallel data. Without going into the details, the general approach is to start from a seed dictionary of numeral pairings (e.g. 1-1, 2-2, etc.) and iteratively learn the mapping between the 2 language embeddings, while concurrently improving the dictionary with the learned mapping at each iteration. This is in contrast to earlier work, which used dictionaries of a few thousand words.

===Other related work and inspirations===
====Statistical Decipherment for Machine Translation====
There has been significant work in statistical deciphering techniques (decipherment is the discovery of the meaning of texts written in ancient or obscure languages or scripts) to develop a machine translation model from monolingual data (Ravi & Knight, 2011; Dou & Knight, 2012). These techniques treat the source language as ciphertext (encrypted or encoded information because it contains a form of the original plaintext that is unreadable by a human or computer without the proper cipher for decoding) and model the generation process of the ciphertext as a two-stage process, which includes the generation of the original English sequence and the probabilistic replacement of the words in it. This approach takes advantage of the incorporation of syntactic knowledge of the languages. The use of word embeddings has also shown improvements in statistical decipherment.

====Low-Resource Neural Machine Translation====
There are also proposals that use techniques other than direct parallel corpora to do NMT. Some use a third intermediate language that is well connected to the source and target languages independently. For example, if we want to translate German into Russian, we can use English as an intermediate language (German-English and then English-Russian) since there are plenty of resources to connect English and other languages. Johnson et al. (2017) show that a multilingual extension of a standard NMT architecture performs reasonably well even for language pairs for which no parallel data was used during training. Firat et al. (2016) and Chen et al. (2017) showed that advanced models such as the teacher-student framework can improve over the baseline of translating through a third intermediate language.

Other works use monolingual data in combination with scarce parallel corpora. A simple but effective technique is back-translation [Sennrich et al, 2016]. First, a synthetic parallel corpus in the target language is created. The translated sentences are then back-translated to the source language and compared with the original sentences.

The most important contribution to the problem of training an NMT model with monolingual data was from [He, 2016], which trains two agents to translate in opposite directions (e.g. French → English and English → French) and teach each other through reinforcement learning. However, this approach still required a large parallel corpus for a warm start (about 1.2 million sentences), while this paper does not use parallel data.

= Methodology =

The corpora are first preprocessed in a standard way to tokenize and case the words. The authors also experimented with an alternative way of tokenizing words by using Byte-Pair Encoding (BPE) [Sennrich, 2016]. (Byte pair encoding, or digram coding, is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data.) BPE has been shown to improve embeddings of rare words. The vocabulary was limited to the most frequent 50,000 tokens (BPE tokens or words).

The tokens are then converted to word embeddings using word2vec with 300 dimensions and then aligned between languages using the method proposed by [Artetxe, 2017]. The alignment method proposed by [Artetxe, 2017] is also used as a baseline to evaluate this model as discussed later in Results.

The translation model uses a standard encoder-decoder model with attention. The encoder is a 2-layer bidirectional RNN, and the decoder is a 2-layer RNN. All RNNs use GRU cells with 600 hidden units. The encoder is shared by the source and target languages, while the decoder is different for each language.

Although the architecture uses standard models, the proposed system differs from the standard NMT through 3 aspects:

#Dual structure: NMT systems are usually built for translation in one direction, English<math>\rightarrow</math>French or French<math>\rightarrow</math>English, whereas the proposed model trains both directions at the same time, translating English<math>\leftrightarrow</math>French.
#Shared encoder: one encoder is shared for both source and target languages in order to produce a representation in the latent space independent of language, and each decoder learns to transform the representation back to its corresponding language.
#Fixed embeddings in the encoder: Most NMT systems initialize the embeddings and update them during training, whereas the proposed system trains the embeddings in the beginning and keeps them fixed throughout training, so the encoder receives language-independent representations of the words. This approach ensures that the encoder only learns how to compose the language-independent representations to build representations of larger phrases. This requires existing unsupervised methods to create embeddings using monolingual corpora, as discussed in the Background section. In the proposed method, even though the embeddings used are cross-lingual, the vocabulary used for each language is different. This way, a word which occurs in two different languages but has a different meaning in those languages gets a different vector in each language, despite both vectors being in the same vector space.

[[File:Figure2_lwali.png|600px|center]]

The translation model iteratively improves the encoder and decoder by performing 2 tasks: Denoising, and Back-translation.

===Denoising===
Random noise is added to the input sentences in order to allow the model to learn some structure of the languages. Without noise, the model would simply learn to copy the input word by word. Noise also allows the shared encoder to compose the embeddings of both languages in a language-independent fashion, so that its output can then be decoded by the language-dependent decoder.

Denoising works by reconstructing a noisy version of a sentence back into the original sentence in the same language. In mathematical form, if <math>x</math> is a sentence in language L1:

# Construct <math>C(x)</math>, a noisy version of <math>x</math>. In the proposed model, <math>C(x)</math> is constructed by randomly swapping contiguous words. If the length of the input sequence <math>x</math> is <math>N</math>, then a total of <math>\frac{N}{2}</math> such swaps are made.
# Input <math>C(x)</math> into the current iteration of the shared encoder and use the decoder for L1 to get the reconstructed <math>\hat{x}</math>.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

In other words, the whole system is optimized to take an input sentence in a given language, encode it using the shared encoder, and reconstruct the original sentence using the decoder of that language.

The proposed noise function performs <math>N/2</math> random swaps of contiguous words, where <math>N</math> is the number of words in the sentence. This noise model also helps reduce the reliance of the model on the order of words in a sentence, which may be different in the source and target languages. The system also needs to correctly learn the structure of a language in order to decode the sentence into the correct order.

===Back-Translation===

With only denoising, the system doesn't have a goal to improve the actual translation. Back-translation works by using the decoder of the target language to create a translation, then encoding this translation and decoding it again using the source decoder to reconstruct the original sentence. In mathematical form, if <math>C(x)</math> is a noisy version of sentence <math>x</math> in language L1:

# Input <math>C(x)</math> into the current iteration of the shared encoder and the decoder for L2 to construct a translation <math>y</math> in L2.
# Construct <math>C(y)</math>, a noisy version of the translation <math>y</math>.
# Input <math>C(y)</math> into the current iteration of the shared encoder and the decoder for L1 to reconstruct <math>\hat{x}</math> in L1.

The training objective is to minimize the cross entropy loss between <math>{x}</math> and <math>\hat{x}</math>.

Contrary to standard back-translation that uses an independent model to back-translate the entire corpus at one time, the system uses mini-batches and the dual architecture to generate pseudo-translations and then train the model with the translation, improving the model iteratively as the training progresses.

===Training===

Training is done by alternating these 2 objectives from mini-batch to mini-batch. Each iteration performs one mini-batch of denoising for L1, another one for L2, one mini-batch of back-translation from L1 to L2, and another one from L2 to L1. The procedure is repeated until convergence.
Greedy decoding was used at training time for back-translation, while inference at test time was done using beam search with a beam size of 12.

Optimizer choice and other hyperparameters can be found in the paper.

=Experiments and Results=

The model is evaluated using the Bilingual Evaluation Understudy (BLEU) Score, which is typically used to evaluate the quality of the translation, using a reference (ground-truth) translation.

The paper trains the translation model under 3 different settings to compare the performance (Table 1). All training and testing data used was from a standard NMT dataset, WMT'14.

[[File:Table1_lwali.png|600px|center]]

The results show that backtranslation is essential for the proposed system to work properly. The denoising technique alone is below the baseline while big improvements appear when introducing backtranslation.

===Unsupervised===

The model only has access to monolingual corpora, using the News Crawl corpus with articles from 2007 to 2013. The baseline for unsupervised is the method proposed by [Artetxe, 2017], which was the unsupervised word vector alignment method discussed in the Background section.

The paper adds each component piece-wise when doing an evaluation to test the impact each piece has on the final score. As shown in Table 1, the unsupervised results are strong compared to the word-by-word baseline, with improvements between 40% and 140%. Results also show that back-translation is essential. Denoising alone does not show a big improvement; however, it is required for back-translation, because otherwise back-translation would translate nonsensical sentences. The addition of back-translation, however, does show a large improvement in all tested cases.

For the BPE experiment, results show that it helps in some language pairs but detracts in others. This is because while BPE helped to translate some rare words, it increased the error rates for other words. It also did not perform well when translating named entities, which occur infrequently.

===Semi-supervised===

Since there is often some small parallel data but not enough to train a Neural Machine Translation system, the authors test a semi-supervised setting with the same monolingual data from the unsupervised settings together with either 10,000 or 100,000 random sentence pairs from the News Commentary parallel corpus. The supervision is included to improve the model during the back-translation stage to directly predict sentences that are in the parallel corpus.

Table 1 shows that the model can greatly benefit from the addition of a small parallel corpus to the monolingual corpora. Surprisingly, the semi-supervised system in row 6 outperforms the supervised system in row 7. One possible explanation is that both the semi-supervised training set and the test set belong to the news domain, whereas the supervised training set draws on corpora from all domains.

===Supervised===

This setting provides an upper bound for the proposed unsupervised system. The data used was the combination of all parallel corpora provided at WMT 2014, which includes Europarl, Common Crawl and News Commentary for both language pairs, plus the UN and Gigaword corpora for French-English. Moreover, the authors use the same subsets of News Commentary alone to run separate experiments in order to compare with the semi-supervised scenario.

The Comparable NMT was trained using the same proposed model except that it does not use monolingual corpora and, consequently, it was trained without denoising and back-translation. The proposed model under a supervised setting does much worse than the state-of-the-art NMT in row 10, which suggests that the additional constraints that enable unsupervised learning also limit the potential performance. To improve these results, the authors suggest using larger models, longer training times, and incorporating several well-known NMT techniques.

===Qualitative Analysis===

[[File:Table2_lwali.png|600px|center]]

Table 2 shows 4 examples of French to English translations, which show that the proposed system produces high-quality translations and adequately models non-trivial translation relations. Examples 1 and 2 show that the model is able to go beyond a literal word-by-word substitution and model structural differences between the languages (e.g., it correctly translates "l’aeroport international de Los Angeles" as "Los Angeles International Airport"), and that it is capable of producing high-quality translations of long and more complex sentences. However, in Examples 3 and 4, the system fails to translate the months and numbers correctly and has difficulty with odd sentence structures, which shows that the proposed system has limitations. In particular, the authors point out that the proposed model has difficulty preserving some concrete details from source sentences. Results also show that the proposed model's translation quality often lags behind that of a standard supervised NMT system, and there are some cases where both fluency and adequacy problems severely hinder understanding of the original message from the proposed translation, suggesting that there is still room for improvement and possible future work.

=Conclusions and Future Work=

The paper presented an unsupervised model to perform translations with monolingual corpora by using an attention-based encoder-decoder system trained with denoising and back-translation.

Although experimental results show that the proposed model is effective as an unsupervised approach, there is significant room for improvement when using the model in a supervised way, suggesting the model is limited by the architectural modifications. Some ideas for future improvement include:
*Instead of using fixed cross-lingual word embeddings at the beginning which forces the encoder to learn a common representation for both languages, progressively update the weight of the embeddings as training progresses.
*Decouple the shared encoder into 2 independent encoders at some point during training
*Progressively reduce the noise level
*Incorporate character level information into the model, which might help address some of the adequacy issues observed in the authors' manual analysis
*Use other noise/denoising techniques, and analyze their effect in relation to the typological divergences of different language pairs.

= Critique =

While the idea is interesting and the results are impressive for an unsupervised approach, much of the model had actually already been proposed by other papers that are referenced. The paper doesn't add a lot of new ideas but only builds on existing techniques and combines them in a different way to achieve good experimental results. The paper is not a significant algorithmic contribution.

As pointed out, in order to critically analyze the effect of the algorithm, we need to formulate the algorithm in terms of mathematics.

The results showed that the proposed system performed far worse than the state of the art when used in a supervised setting, which is concerning and shows that the techniques used create a limitation and a ceiling for performance.

Additionally, there was no rigorous hyperparameter exploration/optimization for the model. As a result, it is difficult to conclude whether the performance limit observed in the constrained supervised model is the absolute limit, or whether this could be overcome in both supervised/unsupervised models with the right constraints to achieve more competitive results.

The best results shown are between two very closely related languages (English and French), and the model does much worse for English-German, even though English and German are also closely related (but less so than English and French). This suggests that the model may not be successful at translating between distant language pairs. More testing would be interesting to see.

The results comparison could have shown how the semi-supervised version of the model scores compared to other semi-supervised approaches as touched on in the other works section.

Their qualitative analysis just checks whether their proposed unsupervised NMT generates a sensible translation. It is limited and it needs further detailed analysis regarding the characteristics and properties of translation which is generated by unsupervised NMT.

* (As pointed out by an anonymous reviewer [https://openreview.net/forum?id=Sy2ogebAW]) Future work is vague: “we would like to detect and mitigate the specific causes…” “We also think that a better handling of rare words…” That’s great, but how will you do these things? Do you have specific reasons to think this, or ideas on how to approach them? Otherwise, this is just hand-waving.

= References =
#'''[Mikolov, 2013]''' Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality."
#'''[Artetxe, 2017]''' Mikel Artetxe, Gorka Labaka, Eneko Agirre, "Learning bilingual word embeddings with (almost) no bilingual data."
#'''[Gouws,2016]''' Stephan Gouws, Yoshua Bengio, Greg Corrado, "BilBOWA: Fast Bilingual Distributed Representations without Word Alignments."
#'''[He, 2016]''' Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. "Dual learning for machine translation."
#'''[Sennrich,2016]''' Rico Sennrich and Barry Haddow and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units."
#'''[Ravi & Knight, 2011]''' Sujith Ravi and Kevin Knight, "Deciphering foreign language."
#'''[Dou & Knight, 2012]''' Qing Dou and Kevin Knight, "Large scale decipherment for out-of-domain machine translation."
#'''[Johnson et al. 2017]''' Melvin Johnson, et al., "Google’s multilingual neural machine translation system: Enabling zero-shot translation."
#'''[Zhang et al. 2017]''' Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. "Adversarial training for unsupervised bilingual lexicon induction."

conditional neural process

2018-11-20T23:54:17Z

S366chen: /* Experimental Result IV: Classification */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0} ^ {n-1}</math>, and the test set be <math display="inline"> T = \{x_i, y_i\}_{i = n} ^ {n + m - 1}</math>.

Let P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, P defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As with stochastic processes, we assume <math display="inline">Q_\theta</math> is invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>, and in this work, we generally enforce permutation invariance with respect to <math display="inline">T</math> by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>

In detail, we use the following architecture:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math> and updated in constant time when a new observation arrives, this architecture supports streaming observations with minimal overhead.
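The encode, aggregate, decode structure can be illustrated with a toy sketch. The functions <code>h</code> and <code>g</code> below are hypothetical hand-written stand-ins for the learned networks <math display="inline">h_\theta</math> and <math display="inline">g_\theta</math>, and the element-wise mean is one possible choice of commutative operation.

```python
def h(x, y):
    """Toy encoder: maps one observation (x, y) to a vector in R^2 (hypothetical)."""
    return [x * y, y]

def aggregate(reps):
    """Commutative aggregation: element-wise mean of the r_i."""
    n = len(reps)
    return [sum(col) / n for col in zip(*reps)]

def g(x, r):
    """Toy decoder: maps (x, r) to distribution parameters (mean, variance)."""
    mean = r[1] + 0.1 * x * r[0]
    variance = 1.0
    return mean, variance

def cnp_predict(observations, targets):
    r = aggregate([h(x, y) for x, y in observations])   # one pass over n observations
    return [g(x, r) for x in targets]                   # one pass over m targets

obs = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0)]
preds = cnp_predict(obs, [0.5, 1.5])
preds_reversed = cnp_predict(obs[::-1], [0.5, 1.5])
```

Because the aggregation is commutative, reordering the observations leaves the predictions unchanged, and the total cost is one encoder call per observation plus one decoder call per target.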

We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution P given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, we take Monte Carlo estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the size of the conditioning subset. This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.
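One Monte Carlo sample of this training loss can be sketched as follows. Several things here are assumptions made for illustration: the output distribution is taken to be Gaussian, the observations are assumed to be in random order so that a prefix is a random subset, and <code>toy_predict</code> is a hypothetical stand-in for the model.

```python
import math
import random

def gaussian_nll(y, mean, variance):
    """Negative log-density of y under a normal distribution N(mean, variance)."""
    return 0.5 * (math.log(2 * math.pi * variance) + (y - mean) ** 2 / variance)

def cnp_loss_sample(predict, points, rng):
    """One Monte Carlo sample of the loss for the observations of a single function.

    `predict(context, xs)` is a hypothetical model returning (mean, variance) per x.
    """
    n = len(points)
    num_context = rng.randrange(n)        # N drawn uniformly from {0, ..., n-1}
    context = points[:num_context]        # condition on a random-size subset
    xs = [x for x, _ in points]
    params = predict(context, xs)         # score ALL n points, observed or not
    return sum(gaussian_nll(y, m, v) for (_, y), (m, v) in zip(points, params))

def toy_predict(context, xs):
    """Toy model (hypothetical): predicts the context mean with unit variance."""
    mean = sum(y for _, y in context) / len(context) if context else 0.0
    return [(mean, 1.0) for _ in xs]

points = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
loss = cnp_loss_sample(toy_predict, points, random.Random(0))
```

Averaging this quantity over many sampled functions and subset sizes gives the estimate whose gradient is used for training.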

In summary,

# A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions <math display="inline">f \sim P</math>.
# A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.
# A CNP is scalable, achieving a running time complexity of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math> observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is used as a common baseline for GPs. They generated two different datasets that consisted of functions generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with different kernel parameters. At every training step they sampled a curve from the GP, selected a subset of n points as observations, and a subset of t points as target points. Using the model, the observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the model's performance. Although the prediction generated by the GP is smoother than the CNP's prediction for both the mean and the variance, the model is able to learn to regress from a few context points for both the fixed kernels and the switching kernels. As the number of context points grows, the accuracy of the model improves and the approximated uncertainty of the model decreases. Crucially, the model learns to estimate its own uncertainty given the observations very accurately. Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for CNPs only the dataset used for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, because there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
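The uncertainty-driven selection rule can be sketched as picking the unobserved pixel with the highest predicted variance. The function name and the example variances are hypothetical; in practice the variances would come from the CNP's current prediction.

```python
def next_pixel_to_observe(variances, observed):
    """Pick the index of the unobserved pixel with the largest predicted variance."""
    candidates = [i for i in range(len(variances)) if i not in observed]
    return max(candidates, key=lambda i: variances[i])

variances = [0.2, 0.9, 0.4, 0.7]     # hypothetical per-pixel predictive variances
choice = next_pixel_to_observe(variances, observed={1})
```

After each new observation the model would be queried again so that the variances reflect the updated context.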

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
their flexibility not only in the number of observations and
targets they receive but also with regards to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regards to the target
input values. This means, e.g., we can query the model
at resolutions it has not seen during training. We take a
model that has only been trained using pixel coordinates of
a specific resolution, and predict at test time subpixel values
for targets between the original coordinates. As shown in
the figure above, with one forward pass we can query the model at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.
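Querying at a finer resolution amounts to passing target coordinates that lie between the training coordinates. The sketch below assumes pixel centres normalised to the unit square, which is an illustrative choice and not necessarily the paper's coordinate convention.

```python
def pixel_coordinates(height, width):
    """Centres of a height x width pixel grid, normalised to the unit square."""
    return [((row + 0.5) / height, (col + 0.5) / width)
            for row in range(height) for col in range(width)]

train_coords = pixel_coordinates(2, 2)    # coordinates seen during training
query_coords = pixel_coordinates(4, 4)    # 2x finer grid queried at test time
```

Since the decoder takes a continuous coordinate as input, the finer grid can be evaluated in the same forward pass as any other set of targets.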

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
the latter when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNN will perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as their training set and the remainder as their testing set. In addition, they augmented the dataset following the protocol described in Santoro et al. (2016). This includes cropping the image from 32 × 32 to 28 × 28, applying small random translations and rotations to the inputs, and also increasing the number of classes by rotating every character by 90 degrees and defining that to be a new class. They generated the labels for an N-way classification task by choosing N random classes at each training step and arbitrarily assigning the labels 0, ..., N − 1 to each.
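The label generation for an N-way task can be sketched as below. The function name is hypothetical and integer IDs stand in for the Omniglot character classes.

```python
import random

def sample_episode_labels(class_ids, n_way, rng):
    """Choose n_way random classes and arbitrarily assign them labels 0..n_way-1."""
    chosen = rng.sample(class_ids, n_way)
    return {class_id: label for label, class_id in enumerate(chosen)}

training_classes = list(range(1200))      # stand-in IDs for the 1,200 training classes
label_map = sample_episode_labels(training_classes, 5, random.Random(0))
```

Because the mapping is redrawn at every training step, a label such as 0 carries no fixed meaning across steps and the model has to rely on the context examples.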

[[File:008.jpg|400px|center]]

Given that the input points are images, they modified the architecture of the encoder h to include convolution layers as mentioned in section 2 of the paper. In addition, they only aggregated over inputs of the same class by using the information provided by the input label. The aggregated class-specific representations are then concatenated to form the final representation. Given that both the size of the class-specific representations and the number of classes are constant, the size of the final representation is still constant and thus the O(n + m) runtime still holds. The results of the classification are summarized in the following table. CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, they reached those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of O(n + m) at test time as opposed to O(nm).
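The class-wise aggregation can be sketched as follows. Averaging within each class and using zeros for a class without examples are assumptions made for illustration; the encoder outputs are hypothetical two-dimensional vectors.

```python
def classwise_representation(reps, labels, num_classes):
    """Average the r_i within each class, then concatenate the per-class means."""
    dim = len(reps[0])
    final = []
    for c in range(num_classes):
        members = [r for r, label in zip(reps, labels) if label == c]
        if members:
            final.extend(sum(col) / len(members) for col in zip(*members))
        else:
            final.extend([0.0] * dim)     # assumption: zeros for a class with no examples
    return final

reps = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # hypothetical encoder outputs, d = 2
labels = [0, 1, 0]
final = classwise_representation(reps, labels, num_classes=2)
```

The final representation always has length <code>num_classes * d</code>, independent of how many context images are supplied, which is why the runtime argument carries over.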

== Conclusion ==

In this paper the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

conditional neural process

2018-11-20T23:54:10Z

S366chen: /* Experimental Result IV: Classification */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised learning problem, together with an end-to-end training approach that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). After training, CNPs can make accurate predictions from only a handful of observed data points, while retaining the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, this routine is evaluated on a finite set of observations.

Let the training set (the observations) be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and the test set (the targets) be <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> f : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T) | O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As with any stochastic process, <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is enforced by assuming a factored structure, that is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \ldots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>
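The three steps above (encode each observation, aggregate, decode at each target) can be sketched in a few lines of NumPy. This is only an illustrative sketch with untrained random weights: the layer sizes, the choice of the mean as the commutative operation, and the Gaussian output (mean and standard deviation) are assumptions made here, and the names <code>init_mlp</code>, <code>mlp</code> and <code>cnp_predict</code> are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small MLP; sizes = [in, hidden..., out]."""
    return [(rng.normal(0, 0.5, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, z):
    """Apply the MLP with ReLU on every layer except the last."""
    for i, (W, b) in enumerate(params):
        z = z @ W + b
        if i < len(params) - 1:
            z = np.maximum(z, 0.0)
    return z

d = 8                                # representation size (hypothetical)
h_params = init_mlp([2, 16, d])      # encoder h: (x, y) -> r_i
g_params = init_mlp([1 + d, 16, 2])  # decoder g: (x, r) -> (mean, raw std)

def cnp_predict(x_obs, y_obs, x_tgt):
    """Return the predictive mean and std at x_tgt given the observations."""
    r_i = mlp(h_params, np.stack([x_obs, y_obs], axis=1))    # shape (n, d)
    r = r_i.mean(axis=0)                                     # commutative aggregate (assumed: mean)
    g_in = np.concatenate([x_tgt[:, None], np.tile(r, (len(x_tgt), 1))], axis=1)
    phi = mlp(g_params, g_in)                                # shape (m, 2)
    mean = phi[:, 0]
    std = 1e-3 + np.log1p(np.exp(phi[:, 1]))                 # softplus keeps std positive
    return mean, std

x_obs = np.array([-1.0, 0.0, 2.0])
y_obs = np.sin(x_obs)
x_tgt = np.linspace(-2, 2, 5)
mean, std = cnp_predict(x_obs, y_obs, x_tgt)

# Shuffling the observations should not change the prediction.
perm = np.array([2, 0, 1])
mean_p, std_p = cnp_predict(x_obs[perm], y_obs[perm], x_tgt)
```

Because the aggregate is a mean, reordering the observations leaves <code>r</code>, and hence the prediction, unchanged up to floating-point rounding.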

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \ldots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated in <math display="inline">O(1)</math> when a new observation arrives, this architecture supports streaming observations with minimal overhead.
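The streaming property can be illustrated with a running mean, assuming the commutative operation is the mean (as in the regression experiment). The class name <code>RunningMean</code> is hypothetical, and random vectors stand in for the encoder outputs <math display="inline">r_i</math>.

```python
import numpy as np

class RunningMean:
    """Keeps the mean of the r_i seen so far; each update touches only d numbers."""

    def __init__(self, d):
        self.count = 0
        self.r = np.zeros(d)

    def update(self, r_new):
        # Incremental mean: no need to revisit earlier observations.
        self.count += 1
        self.r += (r_new - self.r) / self.count
        return self.r

rng = np.random.default_rng(1)
reps = rng.normal(size=(50, 4))     # stand-ins for encoder outputs r_i
agg = RunningMean(d=4)
for r_i in reps:
    agg.update(r_i)
```

After the loop, <code>agg.r</code> matches the mean over all 50 vectors, so a new observation can be folded in without re-encoding the previous ones.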

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution <math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the number of observations in the conditioning subset.

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the authors emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.
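One Monte Carlo sample of this training objective can be sketched as follows. The sketch assumes a factored Gaussian output, so the negative log-likelihood is a sum over points, and it draws the context size from 1 to n-1 so the context is never empty. <code>toy_predict</code> is a hypothetical stand-in for <math display="inline">Q_\theta</math>, not the authors' model.

```python
import numpy as np

def gaussian_nll(y, mean, std):
    """Negative log-likelihood of y under independent Gaussians, summed over points."""
    return np.sum(0.5 * np.log(2 * np.pi * std**2) + (y - mean)**2 / (2 * std**2))

def cnp_loss(predict, x, y, rng):
    """One Monte Carlo sample of the loss for a single sampled function f."""
    n = len(x)
    N = rng.integers(1, n)             # context size, assumed to lie in 1..n-1 here
    idx = rng.permutation(n)[:N]       # randomly chosen conditioning subset of O
    # The model is scored on ALL of O, i.e. on observed and unobserved points.
    mean, std = predict(x[idx], y[idx], x)
    return gaussian_nll(y, mean, std)

def toy_predict(x_ctx, y_ctx, x_tgt):
    """Hypothetical stand-in for Q_theta: constant mean, unit std."""
    return np.full(len(x_tgt), y_ctx.mean()), np.ones(len(x_tgt))

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 10)
y = np.sin(x)
loss = cnp_loss(toy_predict, x, y, rng)
```

In actual training, this quantity would be averaged over many sampled functions and context sizes and minimized with respect to the network parameters by gradient descent.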

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is used as a common baseline for GPs.
The authors generated two different datasets that consisted of functions generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with different kernel parameters. At every training step they sampled a curve from the GP, selected a subset of n points as observations, and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer MLP.
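The per-step data generation for the fixed-kernel dataset can be sketched as below. The kernel form (squared exponential), its length scale, the input range, and the numbers of points are all assumptions chosen for illustration; the summary does not state the values the authors used. The function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=0.4, sigma=1.0):
    """Squared-exponential kernel matrix (an assumed kernel form)."""
    diff = xa[:, None] - xb[None, :]
    return sigma**2 * np.exp(-0.5 * (diff / length_scale)**2)

def sample_task(rng, num_points=50, n=5, t=20):
    """Sample one curve from the GP prior and split it into observations and targets."""
    x = np.sort(rng.uniform(-2, 2, num_points))
    K = sq_exp_kernel(x, x) + 1e-6 * np.eye(num_points)   # jitter for numerical stability
    L = np.linalg.cholesky(K)
    y = L @ rng.standard_normal(num_points)               # y ~ N(0, K)
    idx = rng.permutation(num_points)
    obs, tgt = idx[:n], idx[n:n + t]
    return (x[obs], y[obs]), (x[tgt], y[tgt])

(x_obs, y_obs), (x_tgt, y_tgt) = sample_task(np.random.default_rng(0))
```

For the switching-kernel dataset, one would sample two such curves with different kernel parameters and join them at a random change point.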

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the model's performance. Although the prediction generated by the GP is smoother than the CNP's prediction both for the mean and variance, the model is able to learn to regress from a few context points for both the fixed kernels and switching kernels. As the number of context points grows, the accuracy of the model improves and the approximated uncertainty of the model decreases. Crucially, the model learns to estimate its own uncertainty given the observations very accurately.
Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for CNPs only the dataset used for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, because there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as more observations are added, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by comparing the predictions of CNP when the observations are chosen according to uncertainty (i.e. the pixel with the highest variance at each step) versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
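The uncertainty-based selection can be sketched as follows, assuming "chosen according to uncertainty" means picking the not-yet-observed pixel with the largest predicted variance. The function name <code>next_pixel</code> and the toy variance map are hypothetical.

```python
import numpy as np

def next_pixel(variance, observed_mask):
    """(row, col) of the unobserved pixel with the largest predicted variance."""
    masked = np.where(observed_mask, -np.inf, variance)   # never re-pick observed pixels
    return np.unravel_index(np.argmax(masked), variance.shape)

variance = np.array([[0.1, 0.9, 0.3],
                     [0.8, 0.2, 0.4]])
observed = np.zeros_like(variance, dtype=bool)

first = next_pixel(variance, observed)
observed[first] = True
second = next_pixel(variance, observed)
```

In a full loop the model would be re-queried after every new observation, so the variance map changes between steps; here it is held fixed to keep the example small.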

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
their flexibility not only in the number of observations and
targets they receive but also with regards to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regards to the target input values. This means, e.g., that the model can be queried at resolutions it has not seen during training. The authors take a model that has only been trained using pixel coordinates of a specific resolution, and predict at test time subpixel values for targets between the original coordinates. As shown in the above figure, with one forward pass the model can be queried at different resolutions. While GPs also exhibit this type of flexibility, it is not the case for trained generative models, which can only predict values for the pixel coordinates on which they were trained. In this sense, CNPs capture the best of both worlds: they are flexible in regards to the conditioning and prediction task, and have the capacity to extract domain knowledge from a training set.
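Querying at a new resolution only requires a new set of target coordinates, since the decoder takes a continuous input. A minimal sketch of building such targets is shown below, assuming pixel centres are mapped to the unit square; the function name <code>pixel_grid</code> and the resolutions are hypothetical.

```python
import numpy as np

def pixel_grid(height, width):
    """Pixel-centre coordinates scaled to the unit square, shape (height * width, 2)."""
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_y.ravel(), grid_x.ravel()], axis=1)

train_targets = pixel_grid(32, 32)   # resolution seen during training (assumed)
query_targets = pixel_grid(64, 64)   # finer grid queried only at test time
```

The finer grid's coordinates fall between the training coordinates, which is exactly the subpixel case described above.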

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
both when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context exact
methods like GPs and kNN will perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as their training set and the remainder as their testing data set. They also augmented the dataset following the protocol of Vinyals et al. (2016). This includes cropping the image from 32 × 32 to 28 × 28, applying small random translations and rotations to the inputs, and also increasing the number of classes by rotating every character by 90 degrees and defining that to be a new class. They generated the labels for an N-way classification task by choosing N random classes at each training step and arbitrarily assigning the labels 0, ..., N − 1 to each.
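The label generation for one N-way training step can be sketched as below. The function name <code>sample_episode</code> is hypothetical, and integer ids stand in for the Omniglot character classes.

```python
import numpy as np

def sample_episode(class_ids, n_way, rng):
    """Pick n_way distinct classes and map each to an arbitrary label in 0..n_way-1."""
    chosen = rng.choice(class_ids, size=n_way, replace=False)
    labels = rng.permutation(n_way)        # arbitrary assignment of labels
    return {int(c): int(l) for c, l in zip(chosen, labels)}

rng = np.random.default_rng(0)
train_classes = np.arange(1200)            # stand-ins for the 1,200 training classes
episode = sample_episode(train_classes, n_way=5, rng=rng)
```

Because the mapping is redrawn at every step, a label such as 0 carries no fixed meaning across steps; the model has to infer the classes from the context examples.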

[[File:008.jpg|500px|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
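The class-wise aggregation can be sketched as follows, assuming the mean is used within each class (the summary only says the representations are aggregated per class and then concatenated). Random vectors stand in for the convolutional encoder outputs, and the function name is hypothetical.

```python
import numpy as np

def classwise_representation(r_i, labels, n_way):
    """Mean of r_i within each class, concatenated in label order: shape (n_way * d,)."""
    parts = [r_i[labels == c].mean(axis=0) for c in range(n_way)]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
n_way, d = 5, 8

# One-shot case: a single context example per class.
labels_1 = np.arange(n_way)
r_1 = rng.normal(size=(n_way, d))
rep_1 = classwise_representation(r_1, labels_1, n_way)

# Five-shot case: more context examples, same output size.
labels_5 = np.repeat(np.arange(n_way), 5)
r_5 = rng.normal(size=(n_way * 5, d))
rep_5 = classwise_representation(r_5, labels_5, n_way)
```

Both representations have length <code>n_way * d</code>, which illustrates why the size of the final representation does not grow with the number of context examples.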
The results of the classification are summarized in the above table.
CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, they reach those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of O(n + m) at test time as opposed to O(nm).

== Conclusion ==

In this paper the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and Rezende, D. J.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In International Conference on Machine Learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

Rezende, D. J., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
Rezende, D. J., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

conditional neural process

2018-11-20T23:54:00Z

S366chen: /* Experimental Result IV: Classification */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised learning problem, together with an end-to-end training approach that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). After training, CNPs can make accurate predictions from only a handful of observed data points, while retaining the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, this routine is evaluated on a finite set of observations.

Let the training set (the observations) be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and the test set (the targets) be <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> f : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T) | O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As with any stochastic process, <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is enforced by assuming a factored structure, that is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \ldots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \ldots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated in <math display="inline">O(1)</math> when a new observation arrives, this architecture supports streaming observations with minimal overhead.

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution <math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the number of observations in the conditioning subset.

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the authors emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is used as a common baseline for GPs.
The authors generated two different datasets that consisted of functions generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with different kernel parameters. At every training step they sampled a curve from the GP, selected a subset of n points as observations, and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the model's performance. Although the prediction generated by the GP is smoother than the CNP's prediction both for the mean and variance, the model is able to learn to regress from a few context points for both the fixed kernels and switching kernels. As the number of context points grows, the accuracy of the model improves and the approximated uncertainty of the model decreases. Crucially, the model learns to estimate its own uncertainty given the observations very accurately.
Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for CNPs only the dataset used for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, because there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as more observations are added, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by comparing the predictions of CNP when the observations are chosen according to uncertainty (i.e. the pixel with the highest variance at each step) versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
its flexibility not only in the number of observations and
targets it receives but also with regards to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regards to the target
input values. This means, e.g., we can query the model
at resolutions it has not seen during training. We take a
model that has only been trained using pixel coordinates of
a specific resolution, and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5, with one forward pass we can query the model at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
the latter when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNN perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as their testing set.
They also augmented the dataset. This includes cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg|300px|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time
as opposed to O(nm).

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and Rezende, D. J.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

Rezende, D. J., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
Rezende, D. J., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

conditional neural process

2018-11-20T23:53:50Z

S366chen: /* Experimental Result IV: Classification */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the loss is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0} ^ {n-1}</math>, and the test set be <math display="inline"> T = \{x_i, y_i\}_{i = n} ^ {n + m - 1}</math>.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. Our task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As with stochastic processes, we assume <math display="inline">Q_\theta</math> is invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>, and in this work, we generally enforce permutation invariance with respect to <math display="inline">T</math> by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, we use the following architecture:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \dots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \dots * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observations with minimal overhead.
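The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the layer sizes, the random initialisation, the use of the mean as the commutative operation, and the Gaussian output parametrisation (a mean and a softplus-transformed scale) are assumptions made for the example, and the helper names <code>init_mlp</code>, <code>mlp</code> and <code>cnp_predict</code> are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small MLP (illustrative initialisation only)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply the MLP with ReLU between layers and a linear last layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def cnp_predict(h_params, g_params, x_obs, y_obs, x_tgt):
    # Encoder h: one representation r_i per observed pair (x_i, y_i).
    r_i = mlp(h_params, np.concatenate([x_obs, y_obs], axis=-1))    # (n, d)
    # Aggregator: the mean is commutative, so the order of O is irrelevant.
    r = r_i.mean(axis=0)                                            # (d,)
    # Decoder g: applied independently to each target input (factored output).
    r_rep = np.broadcast_to(r, (x_tgt.shape[0], r.shape[0]))        # (m, d)
    phi = mlp(g_params, np.concatenate([x_tgt, r_rep], axis=-1))    # (m, 2)
    mu = phi[:, :1]
    sigma = 1e-3 + np.log1p(np.exp(phi[:, 1:]))   # softplus keeps sigma > 0
    return mu, sigma

d = 8                                   # assumed representation size
h_params = init_mlp([2, 16, d])         # input: scalar x and scalar y
g_params = init_mlp([1 + d, 16, 2])     # output: mean and raw scale

x_obs = rng.uniform(-2, 2, (5, 1))
y_obs = np.sin(x_obs)
x_tgt = np.linspace(-2, 2, 7).reshape(-1, 1)

mu, sigma = cnp_predict(h_params, g_params, x_obs, y_obs, x_tgt)

# Shuffling the observations leaves the prediction unchanged.
perm = rng.permutation(5)
mu_p, sigma_p = cnp_predict(h_params, g_params, x_obs[perm], y_obs[perm], x_tgt)
```

The encoder touches each of the n observations once and the decoder touches each of the m targets once, which is where the <math display="inline">O(n + m)</math> cost comes from.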

We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution <math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, we take Monte Carlo estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> determines the size of the conditioning subset.

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.
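Written out, the training objective described above can be expressed as follows. This is a sketch in the notation of the Model section, with two assumptions that the summary does not spell out: <math display="inline">N</math> is drawn uniformly from <math display="inline">\{0, \dots, n-1\}</math>, and <math display="inline">O_N \subset O</math> denotes the subset of observations that is conditioned on.

<math display="block">\mathcal{L}(\theta) = -\mathbb{E}_{f \sim P} \left[ \mathbb{E}_{N} \left[ \log Q_\theta\left(\{y_i\}_{i=0}^{n-1} \mid O_N, \{x_i\}_{i=0}^{n-1}\right) \right] \right]</math>

Under the factored structure of <math display="inline">Q_\theta</math>, the log-probability inside the expectation is a sum of per-point terms <math display="inline">\log Q_\theta(y_i \mid O_N, x_i)</math>, so a single Monte Carlo sample of the gradient only requires drawing one function <math display="inline">f</math> and one value of <math display="inline">N</math>.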

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, selected
a subset of n points as observations, and a subset of t points as target points. In the model, the observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer
MLP.
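A data-generation step of this kind might look like the sketch below. It assumes a squared-exponential kernel with hypothetical hyperparameters (<code>length_scale=0.4</code>, <code>variance=1.0</code>), inputs drawn uniformly from [-2, 2], and target points that include the observed points. None of these values are taken from the paper, and <code>sample_task</code> is a hypothetical helper name.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=0.4, variance=1.0):
    """Squared-exponential kernel matrix between two 1D input arrays."""
    diff = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_task(rng, num_points=100, n_obs=10, n_tgt=20):
    """Sample one curve from the GP prior and split it into observations and targets."""
    x = np.sort(rng.uniform(-2.0, 2.0, num_points))
    # Small jitter on the diagonal keeps the Cholesky factorisation stable.
    cov = sq_exp_kernel(x, x) + 1e-6 * np.eye(num_points)
    y = np.linalg.cholesky(cov) @ rng.standard_normal(num_points)
    idx = rng.permutation(num_points)
    obs_idx = idx[:n_obs]
    tgt_idx = idx[:n_obs + n_tgt]    # targets contain the observed points too
    return (x[obs_idx], y[obs_idx]), (x[tgt_idx], y[tgt_idx])

rng = np.random.default_rng(0)
(x_o, y_o), (x_t, y_t) = sample_task(rng)
```

Calling <code>sample_task</code> once per training step gives a fresh curve each time, so the model sees many functions from the same family rather than many points from one function.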

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the CNP's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for a CNP one only has to
change the dataset used for training.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, as we have a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
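The uncertainty-driven selection described above can be sketched as a greedy loop: at each step, observe the unobserved pixel with the largest predicted variance. The code below only illustrates that loop. <code>toy_predict</code> is a hypothetical stand-in for a trained CNP (its "variance" is simply the distance to the nearest observed pixel), and the function names are assumptions made for the example.

```python
import numpy as np

def choose_next_pixel(variance, observed_mask):
    """Flat index of the unobserved pixel with the highest predicted variance."""
    masked = np.where(observed_mask, -np.inf, variance)
    return int(np.argmax(masked))

def active_exploration(predict, image, num_steps):
    """Greedily add one observation per step where the model is least certain."""
    observed = np.zeros(image.size, dtype=bool)
    for _ in range(num_steps):
        _, var = predict(image, observed)
        observed[choose_next_pixel(var, observed)] = True
    return observed

def toy_predict(image, observed):
    """Stand-in predictor (NOT a CNP): variance grows with distance to observed pixels."""
    flat = image.ravel()
    idx = np.arange(flat.size)
    if not observed.any():
        return np.full(flat.size, 0.5), np.ones(flat.size)
    dist = np.abs(idx[:, None] - idx[observed][None, :]).min(axis=1)
    mean = np.where(observed, flat, flat[observed].mean())
    return mean, dist.astype(float)

image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
observed = active_exploration(toy_predict, image, num_steps=3)
```

Replacing <code>toy_predict</code> with a model that returns a per-pixel mean and variance gives the comparison described in the text; the random baseline corresponds to drawing the next index uniformly from the unobserved pixels.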

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
its flexibility not only in the number of observations and
targets it receives but also with regards to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regards to the target
input values. This means, e.g., we can query the model
at resolutions it has not seen during training. We take a
model that has only been trained using pixel coordinates of
a specific resolution, and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5, with one forward pass we can query the model at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
the latter when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNN perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as their testing set.
They also augmented the dataset. This includes cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.
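The label-generation step for an N-way task can be sketched as follows. The helper name <code>sample_episode</code> and the integer class identifiers are assumptions made for the example; only the procedure (pick N random classes, assign them the labels 0, ..., N − 1 arbitrarily) comes from the text.

```python
import numpy as np

def sample_episode(rng, class_ids, n_way):
    """Pick n_way distinct classes and map each to an arbitrary label in 0..n_way-1."""
    chosen = rng.choice(class_ids, size=n_way, replace=False)
    return {int(c): label for label, c in enumerate(chosen)}

rng = np.random.default_rng(0)
train_classes = np.arange(1200)      # stand-in ids for the 1,200 training classes
episode = sample_episode(rng, train_classes, n_way=5)
```

Because the mapping is redrawn at every training step, a label such as 0 carries no fixed meaning across steps, so the model has to read the class identity from the context examples rather than memorise it.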

[[File:001.jpg|300px|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time
as opposed to O(nm).
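The class-wise aggregation described in this section can be sketched as below. The mean is assumed as the aggregation operation, a zero vector is used for a class with no context examples, and <code>classwise_representation</code> is a hypothetical helper name; the paper's exact choices may differ.

```python
import numpy as np

def classwise_representation(r_i, labels, n_way):
    """Aggregate encoder outputs per class, then concatenate the class summaries."""
    d = r_i.shape[1]
    parts = []
    for c in range(n_way):
        members = r_i[labels == c]
        # Assumed fallback: a class without context examples contributes zeros.
        parts.append(members.mean(axis=0) if len(members) else np.zeros(d))
    return np.concatenate(parts)     # fixed size n_way * d

r_i = np.arange(12, dtype=float).reshape(6, 2)   # six encoded inputs, d = 2
labels = np.array([0, 0, 1, 1, 2, 2])
r = classwise_representation(r_i, labels, n_way=3)
```

The output length depends only on the number of classes and the representation size, not on how many context examples are supplied, which is why the final representation stays constant in size.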

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and Rezende, D. J.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

Rezende, D. J., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
Rezende, D. J., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

conditional neural process

2018-11-20T23:53:22Z

S366chen: /* Reference */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, this routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0}^{n - 1}</math>, and the test set be <math display="inline"> T = \{x_i, y_i\}_{i = n}^{n + m - 1}</math>.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T) | O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As with stochastic processes, we assume <math display="inline">Q_\theta</math> is invariant to permutations, and in this work, we generally enforce permutation invariance with respect to <math display="inline">T</math> by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, we use the following architecture:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \dots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \dots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated incrementally as each new observation arrives, this architecture supports streaming observations with minimal overhead.
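The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch with untrained random weights, not the authors' implementation: the layer sizes, the representation size <code>d = 8</code>, the choice of the mean as the commutative operation, and the Gaussian (mean, log-std) output are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small MLP; sizes = [in, hidden, ..., out]."""
    return [(rng.normal(0.0, 0.5, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply the MLP with ReLU on every layer except the last."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

d = 8                                 # representation size (assumed)
h_params = init_mlp([2, 16, d])       # encoder h: (x_i, y_i) -> r_i
g_params = init_mlp([1 + d, 16, 2])   # decoder g: (x_t, r) -> (mean, log-std)

def cnp_predict(x_ctx, y_ctx, x_tgt):
    """Return predictive mean and std at x_tgt given the context pairs."""
    r_i = mlp(h_params, np.stack([x_ctx, y_ctx], axis=1))   # shape (n, d)
    r = r_i.mean(axis=0)                                     # commutative aggregation
    inp = np.concatenate([x_tgt[:, None], np.tile(r, (len(x_tgt), 1))], axis=1)
    out = mlp(g_params, inp)                                 # shape (m, 2)
    return out[:, 0], np.exp(out[:, 1])

x_ctx = np.array([-1.0, 0.0, 2.0])
y_ctx = np.sin(x_ctx)
x_tgt = np.linspace(-2, 2, 5)
mu, sigma = cnp_predict(x_ctx, y_ctx, x_tgt)

# Reordering the observations leaves the prediction unchanged.
perm = np.array([2, 0, 1])
mu_p, sigma_p = cnp_predict(x_ctx[perm], y_ctx[perm], x_tgt)
```

Because the context enters the decoder only through the aggregated <code>r</code>, the cost is one encoder pass per observation and one decoder pass per target, which is where the <math display="inline">O(n + m)</math> scaling comes from.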

We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution
<math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed
and unobserved values. In practice, we take Monte Carlo
estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the size of the conditioning subset.

This approach shifts the burden of imposing prior knowledge
from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still, we
emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.
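Written out, the training objective described above is a negative conditional log-likelihood. The formula below is a sketch in the notation of this page, under two assumptions: <math display="inline">N</math> is drawn uniformly from <math display="inline">\{0, \dots, n - 1\}</math>, and <math display="inline">O_N \subset O</math> denotes the conditioning subset of that size.

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{f \sim P}\Big[\,\mathbb{E}_{N}\Big[\log Q_\theta\big(\{y_i\}_{i=0}^{n-1} \,\big|\, O_N,\ \{x_i\}_{i=0}^{n-1}\big)\Big]\Big]
```

The inner log-probability is evaluated on all of <math display="inline">O</math>, not only on the points left out of <math display="inline">O_N</math>, which is why both observed and unobserved values act as targets.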

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, selected
a subset of n points as observations, and a subset of t points as target points. Using the model, the observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer
MLP.
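The data-generation step described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes a squared-exponential kernel (the text says only "exponential kernel"), and the input range, length scale, jitter, and the choice to include the context points among the targets are all assumptions. The function names <code>sample_gp_curve</code> and <code>make_task</code> are hypothetical.

```python
import numpy as np

def sample_gp_curve(x, length_scale=0.4, sigma_f=1.0, jitter=1e-5, rng=None):
    """Draw one function sample from a zero-mean GP at the inputs x."""
    rng = np.random.default_rng() if rng is None else rng
    diff = x[:, None] - x[None, :]
    K = sigma_f**2 * np.exp(-0.5 * (diff / length_scale) ** 2)
    # Jitter on the diagonal keeps the Cholesky factorisation numerically stable.
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))
    return L @ rng.standard_normal(len(x))

def make_task(num_points=100, n_context=5, n_target=20, rng=None):
    """Sample a curve, then split off context (observed) and target points."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.sort(rng.uniform(-2.0, 2.0, num_points))
    y = sample_gp_curve(x, rng=rng)
    idx = rng.permutation(num_points)
    ctx, tgt = idx[:n_context], idx[:n_target]   # targets include the context
    return (x[ctx], y[ctx]), (x[tgt], y[tgt])

(x_c, y_c), (x_t, y_t) = make_task(rng=np.random.default_rng(0))
```

A fresh curve is drawn on every call, so each training step sees a different function from the same family.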

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the CNP's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for the CNP only the
dataset used for training has to change.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point, the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases, the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, as there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty (the pixel with the highest variance at each step), versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
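The selection rule in this active-exploration experiment can be sketched as a single greedy step. The snippet assumes the model exposes a per-pixel predicted variance map; here that map is a hand-written array, and the function name <code>next_pixel_by_uncertainty</code> is hypothetical.

```python
import numpy as np

def next_pixel_by_uncertainty(variance, observed_mask):
    """Pick the not-yet-observed pixel with the highest predicted variance."""
    masked = np.where(observed_mask, -np.inf, variance)
    return np.unravel_index(np.argmax(masked), variance.shape)

# Toy 2x3 variance map standing in for the model's predictive variance.
variance = np.array([[0.1, 0.9, 0.2],
                     [0.4, 0.8, 0.3]])
observed = np.zeros_like(variance, dtype=bool)

first = next_pixel_by_uncertainty(variance, observed)
observed[first] = True
second = next_pixel_by_uncertainty(variance, observed)
```

In an actual run the variance map would be recomputed by the model after every new observation rather than held fixed as it is in this toy example.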

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset, with predictions
conditioned on fewer than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
their flexibility not only in the number of observations and
targets they receive but also with regard to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regard to the target
input values. This means, e.g., that the model can be queried
at resolutions it has not seen during training. They took a
model that had only been trained using pixel coordinates of
a specific resolution, and predicted at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5, with one forward pass the model can be queried at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
them when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNN perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as their testing data set.
They also augmented the dataset, which includes cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.
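The label-generation step for an N-way task can be sketched as below. Class identifiers are stood in for by the integers 0 to 1199, and the function name <code>sample_n_way_episode</code> is hypothetical; only the idea of drawing N distinct classes and assigning them arbitrary labels 0, ..., N − 1 comes from the text.

```python
import numpy as np

def sample_n_way_episode(class_ids, n_way, rng=None):
    """Choose n_way distinct classes and map each to a label in 0..n_way-1."""
    rng = np.random.default_rng() if rng is None else rng
    chosen = rng.choice(class_ids, size=n_way, replace=False)
    # The order of `chosen` is random, so the label assignment is arbitrary.
    return {int(c): label for label, c in enumerate(chosen)}

train_classes = np.arange(1200)   # placeholder ids for the 1,200 training classes
episode = sample_n_way_episode(train_classes, n_way=5,
                               rng=np.random.default_rng(0))
```

Since the mapping is redrawn at every training step, a given character class carries no fixed label, and the model has to rely on the context examples to classify.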

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition, they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time
as opposed to O(nm).
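The class-wise aggregation described in this section can be sketched as follows. The per-example representations are given directly as a small array rather than produced by a convolutional encoder, the mean is assumed as the aggregation operation, and the function name <code>classwise_representation</code> is hypothetical.

```python
import numpy as np

def classwise_representation(r_i, labels, n_classes):
    """Average r_i within each class, then concatenate: (n, d) -> (n_classes * d,).

    Assumes every class has at least one context example.
    """
    d = r_i.shape[1]
    per_class = np.zeros((n_classes, d))
    for c in range(n_classes):
        per_class[c] = r_i[labels == c].mean(axis=0)
    return per_class.reshape(-1)

# Three context examples with d = 2: two from class 0, one from class 1.
r_i = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [10.0, 20.0]])
labels = np.array([0, 0, 1])
r = classwise_representation(r_i, labels, n_classes=2)
```

The output length depends only on the number of classes and on d, not on how many context examples are supplied, which is the property the runtime argument relies on.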

== Conclusion ==

In this paper the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative
models with generative matching networks. arXiv
preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. arXiv preprint
arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and J. Rezende, D.
Variational memory addressing in generative models. In
Advances in Neural Information Processing Systems, pp.
3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes.
In Artificial Intelligence and Statistics, pp. 207–215,
2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and
Kohli, P. Neural program meta-induction. In Advances in
Neural Information Processing Systems, pp. 2077–2085,
2017.

Edwards, H. and Storkey, A. Towards a neural statistician.
2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
for fast adaptation of deep networks. arXiv
preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation:
Representing model uncertainty in deep learning.
In international conference on machine learning, pp.
1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards
deep symbolic reinforcement learning. arXiv preprint
arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. Draw: A recurrent neural network for image
generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The
variational homoencoder: Learning to infer high-capacity
generative models from few examples. 2018.

J. Rezende, D., Danihelka, I., Gregor, K., Wierstra, D.,
et al. One-shot generalization in deep generative models.
In International Conference on Machine Learning, pp.
1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural
networks for one-shot image recognition. In ICML Deep
Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B.
Human-level concept learning through probabilistic program
induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman,
S. J. Building machines that learn and think like
people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based
learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face
attributes in the wild. In Proceedings of International
Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing
flows for variational bayesian neural networks. arXiv
preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression
for deep learning. In Advances in Neural Information
Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes
in machine learning. In Advanced lectures on machine
learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S.,
J. Rezende, D., Vinyals, O., and de Freitas, N. Few-shot
autoregressive density estimation: Towards learning to
learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic
backpropagation and approximate inference in deep generative
models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational
inference for deep gaussian processes. In Advances
in Neural Information Processing Systems, pp.
4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and
Lillicrap, T. One-shot learning with memory-augmented
neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes
using pseudo-inputs. In Advances in neural information
processing systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals,
O., Graves, A., et al. Conditional image generation with
pixelcnn decoders. In Advances in Neural Information
Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
Neural Information Processing Systems, pp. 3630–3638,
2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and
Botvinick, M. Learning to reinforcement learn. arXiv
preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P.
Deep kernel learning. In Artificial Intelligence and Statistics,
pp. 370–378, 2016.

conditional neural process

2018-11-20T23:52:08Z

S366chen: /* Reference */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, this routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{x_i, y_i\}_{i = 0}^{n - 1}</math>, and the test set be <math display="inline"> T = \{x_i, y_i\}_{i = n}^{n + m - 1}</math>.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Thus, <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T) | O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As with stochastic processes, we assume <math display="inline">Q_\theta</math> is invariant to permutations, and in this work, we generally enforce permutation invariance with respect to <math display="inline">T</math> by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, we use the following architecture:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \dots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \dots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated incrementally as each new observation arrives, this architecture supports streaming observations with minimal overhead.

We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution
<math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed
and unobserved values. In practice, we take Monte Carlo
estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the size of the conditioning subset.

This approach shifts the burden of imposing prior knowledge
from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still, we
emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, selected
a subset of n points as observations, and a subset of t points as target points. Using the model, the observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer
MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the CNP's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for the CNP only the
dataset used for training has to change.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point, the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases, the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, as there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as more observations are added, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty with its predictions when the observed pixels are chosen at random. This is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
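
As a rough illustration of uncertainty-driven selection, the sketch below picks the next pixel to observe as the unobserved pixel with the largest predicted variance. The selection rule (arg-max of the variance), the function name and the toy variance map are assumptions made for this example; the summary only states that observations are "chosen according to uncertainty".

```python
def next_pixel_by_uncertainty(variances, observed):
    """Return the (row, col) of the unobserved pixel with the largest variance.

    variances: 2D list of predicted variances, one per pixel.
    observed:  set of (row, col) positions already used as context.
    Returns None when every pixel has been observed.
    """
    best, best_var = None, float("-inf")
    for r, row in enumerate(variances):
        for c, var in enumerate(row):
            if (r, c) not in observed and var > best_var:
                best, best_var = (r, c), var
    return best

# Toy variance map for a 2 x 3 image (hypothetical values).
variances = [[0.1, 0.9, 0.2],
             [0.4, 0.8, 0.3]]
observed = {(0, 1)}
choice = next_pixel_by_uncertainty(variances, observed)
```

In an exploration loop, the chosen pixel would be added to the context, the model queried again, and the procedure repeated.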

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset, with predictions
conditioned on less than 10% of the pixels being
already close to the ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
their flexibility, not only in the number of observations and
targets they receive but also with regard to the input values of those points.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regard to the target
input values. This means, for example, that the model can be queried
at resolutions it has not seen during training. A
model that has only been trained using pixel coordinates of
a specific resolution can predict, at test time, subpixel values
for targets between the original coordinates. As shown in
the above figure, with one forward pass the model can be queried at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.
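
Querying at an unseen resolution only requires feeding the decoder a different set of target coordinates. The sketch below builds such coordinate sets; the normalisation to [0, 1] and the 32 × 32 and 64 × 64 resolutions are hypothetical choices for illustration, not values taken from the paper.

```python
def pixel_grid(height, width):
    """Normalised (row, col) coordinates in [0, 1] for a height x width image."""
    return [(r / (height - 1), c / (width - 1))
            for r in range(height) for c in range(width)]

train_coords = pixel_grid(32, 32)   # coordinates seen during training (assumed size)
query_coords = pixel_grid(64, 64)   # finer grid used only at test time (assumed size)
```

Because the decoder takes a continuous coordinate as input, the finer grid can be passed as the target set in a single forward pass, without retraining.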

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
both when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNNs perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples, so this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
the training set and the remainder as the test set.
They also augmented the dataset. This included cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and increasing
the number of classes by rotating every character by 90
degrees and defining the result to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers, as
mentioned in section 2. In addition, they only aggregated over
inputs of the same class, using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reach those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time,
as opposed to O(nm).
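
The class-wise aggregation described above can be sketched as follows. This is a minimal illustration, assuming a mean aggregator within each class and a fixed label order for the concatenation; the function name and the toy representations are hypothetical.

```python
def classwise_representation(reps, labels, num_classes):
    """Mean-aggregate representations per class, then concatenate in label order.

    reps:   list of d-dimensional representations r_i (one per context image).
    labels: class label in 0..num_classes-1 for each representation.
    Returns a single vector of length num_classes * d.
    """
    dim = len(reps[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for rep, label in zip(reps, labels):
        counts[label] += 1
        for k, value in enumerate(rep):
            sums[label][k] += value
    final = []
    for total, count in zip(sums, counts):
        # A class with no examples contributes zeros (an arbitrary choice here).
        final.extend((v / count if count else 0.0) for v in total)
    return final

reps = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
labels = [0, 0, 1]
final = classwise_representation(reps, labels, num_classes=2)
```

The output length depends only on the number of classes and the representation size, not on the number of context images, which is why the size of the final representation stays constant.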

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work, the authors plan to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Reference ==
Bartunov, S. and Vetrov, D. P. Fast adaptation in generative models with generative matching networks. arXiv preprint arXiv:1612.02192, 2016.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

Bornschein, J., Mnih, A., Zoran, D., and Rezende, D. J. Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pp. 3923–3932, 2017.

Damianou, A. and Lawrence, N. Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.

Devlin, J., Bunel, R. R., Singh, R., Hausknecht, M., and Kohli, P. Neural program meta-induction. In Advances in Neural Information Processing Systems, pp. 2077–2085, 2017.

Edwards, H. and Storkey, A. Towards a neural statistician. 2016.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.

Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

Hewitt, L., Gane, A., Jaakkola, T., and Tenenbaum, J. B. The variational homoencoder: Learning to infer high-capacity generative models from few examples. 2018.

Rezende, D. J., Danihelka, I., Gregor, K., Wierstra, D., et al. One-shot generalization in deep generative models. In International Conference on Machine Learning, pp. 1521–1529, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

Louizos, C. and Welling, M. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3290–3300, 2017.

Rasmussen, C. E. and Williams, C. K. Gaussian processes in machine learning. In Advanced lectures on machine learning, pp. 63–71. Springer, 2004.

Reed, S., Chen, Y., Paine, T., Oord, A. v. d., Eslami, S., Rezende, D. J., Vinyals, O., and de Freitas, N. Few-shot autoregressive density estimation: Towards learning to learn distributions. 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational inference for deep gaussian processes. In Advances in Neural Information Processing Systems, pp. 4591–4602, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4080–4090, 2017.

Snelson, E. and Ghahramani, Z. Sparse gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2006.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016.

conditional neural process

2018-11-20T23:51:50Z

S366chen: /* Conclusion */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{(x_i, y_i)\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and let the targets be the unlabelled inputs <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>.

Let P be a probability distribution over functions <math display="inline"> f : X \to Y</math>, formally known as a stochastic process. Then P defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T) | O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for every <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As with stochastic processes, we assume <math display="inline">Q_\theta</math> is invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>, and in this work we generally enforce permutation invariance with respect to <math display="inline">T</math> by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, we use the following architecture:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \dots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \dots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated in constant time when a new observation arrives, this architecture supports streaming observations with minimal overhead.
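
The three steps above (encode, aggregate, decode) can be sketched as follows. This is only a toy illustration: the single random linear layers stand in for the MLPs <math display="inline">h_\theta</math> and <math display="inline">g_\theta</math>, the mean is used as the commutative operation, the output is assumed to be a Gaussian mean and variance, and all names and sizes are hypothetical.

```python
import math
import random

def make_linear(in_dim, out_dim, rng):
    """Hypothetical stand-in for a trained network: one random linear layer."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_linear(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def encode(h_weights, x, y):
    """r_i = h(x_i, y_i): embed one observation into R^d."""
    return [math.tanh(a) for a in apply_linear(h_weights, [x, y])]

def aggregate(reps):
    """Commutative aggregation; here the element-wise mean."""
    n = len(reps)
    return [sum(col) / n for col in zip(*reps)]

def decode(g_weights, x_target, r):
    """Phi = g(x, r): returns (mean, variance) of a Gaussian over f(x)."""
    mu, raw = apply_linear(g_weights, [x_target] + r)
    sigma2 = math.log1p(math.exp(raw)) + 1e-6  # softplus keeps the variance positive
    return mu, sigma2

def cnp_predict(h_weights, g_weights, observations, targets):
    r = aggregate([encode(h_weights, x, y) for x, y in observations])  # O(n)
    return [decode(g_weights, x, r) for x in targets]                  # O(m)

rng = random.Random(0)
d = 4
h_weights = make_linear(2, d, rng)
g_weights = make_linear(1 + d, 2, rng)
observations = [(0.0, 1.0), (0.5, 0.2), (1.0, -0.7)]
targets = [0.25, 0.75]
predictions = cnp_predict(h_weights, g_weights, observations, targets)
shuffled = cnp_predict(h_weights, g_weights, observations[::-1], targets)
```

Reversing the order of the observations leaves the predictions unchanged up to floating-point rounding, which is the permutation invariance with respect to <math display="inline">O</math>; each target is decoded independently, which gives the factored structure over <math display="inline">T</math>.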

We train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution
P given a set of observations. Thus, the targets <math display="inline">Q_\theta</math> is scored on include both the observed
and unobserved values. The loss is the negative conditional log probability of these targets. In practice, we take Monte Carlo
estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>, the number of observations in the conditioning subset.
This approach shifts the burden of imposing prior knowledge
from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still, we
emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.
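
One Monte Carlo sample of such a training loss could look like the sketch below. It assumes factored Gaussian outputs and draws the context as a random subset of at least one point; `predict` is a hypothetical model interface and `dummy_predict` is a placeholder, not the network from the paper.

```python
import math
import random

def gaussian_nll(y, mu, sigma2):
    """Negative log-density of y under a Gaussian N(mu, sigma2)."""
    return 0.5 * (math.log(2 * math.pi * sigma2) + (y - mu) ** 2 / sigma2)

def cnp_loss(predict, points, rng):
    """One Monte Carlo sample of the loss for a single sampled function.

    predict(context, xs) is a hypothetical model call returning a list of
    (mean, variance) pairs, one per x in xs.
    """
    n_context = rng.randint(1, len(points))   # sample N (at least one point here)
    context = rng.sample(points, n_context)   # randomly chosen subset of O
    xs = [x for x, _ in points]
    preds = predict(context, xs)              # score on ALL points in O
    return sum(gaussian_nll(y, mu, s2)
               for (_, y), (mu, s2) in zip(points, preds))

def dummy_predict(context, xs):
    """Placeholder model: predicts the context mean with unit variance."""
    mean_y = sum(y for _, y in context) / len(context)
    return [(mean_y, 1.0) for _ in xs]

rng = random.Random(0)
points = [(0.1 * i, math.sin(0.1 * i)) for i in range(10)]
loss = cnp_loss(dummy_predict, points, rng)
```

Averaging this quantity over many sampled functions and context sizes, and differentiating it with respect to the model parameters, gives the stochastic gradient estimate mentioned above.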

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is used as a common baseline for GPs.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step they sampled a curve from the GP and selected
a subset of n points as observations and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer
MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the CNP's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction for both the mean
and the variance, the model is able to learn to regress from a few
context points for both the fixed and the switching kernels.
As the number of context points grows, the accuracy
of the model improves and its approximated uncertainty
decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for a CNP only the dataset used for training has to
change.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned on only one non-informative context point, the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases, the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, because there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative is to add
latent variables to the model so that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as more observations are added, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty with its predictions when the observed pixels are chosen at random. This is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset, with predictions
conditioned on less than 10% of the pixels being
already close to the ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
their flexibility, not only in the number of observations and
targets they receive but also with regard to the input values of those points.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regard to the target
input values. This means, for example, that the model can be queried
at resolutions it has not seen during training. A
model that has only been trained using pixel coordinates of
a specific resolution can predict, at test time, subpixel values
for targets between the original coordinates. As shown in
the above figure, with one forward pass the model can be queried at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in the above table, CNPs outperform
both when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNNs perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples, so this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
the training set and the remainder as the test set.
They also augmented the dataset. This included cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and increasing
the number of classes by rotating every character by 90
degrees and defining the result to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers, as
mentioned in section 2. In addition, they only aggregated over
inputs of the same class, using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reach those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time,
as opposed to O(nm).

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work, the authors plan to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

== Reference ==

conditional neural process

2018-11-20T23:50:25Z

S366chen: /* Experimental Result III: Image Completion for Faces */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is a function approximating <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, this is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^ {n-1}</math>, and let the test set be <math display="inline"> T = \{x_i\}_{i = n} ^ {n + m - 1}</math>, a set of inputs whose outputs are to be predicted.

Let P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Then P defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for every <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. Permutation invariance is a property of stochastic processes, so <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.
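The factored structure means the joint log-likelihood of the targets is a sum of independent per-target terms. The following is a minimal sketch of that sum, assuming each per-target distribution is a Gaussian with a predicted mean and standard deviation; the Gaussian choice and the function names are illustrative assumptions, not taken from the text above.

```python
import math

def gaussian_log_pdf(y, mean, std):
    # Log-density of a univariate Gaussian (assumed output distribution).
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((y - mean) / std) ** 2

def factored_log_likelihood(targets, means, stds):
    # log Q(f(T) | O, T) = sum over x in T of log Q(f(x) | O, x).
    return sum(gaussian_log_pdf(y, m, s) for y, m, s in zip(targets, means, stds))

# Hypothetical predictions for three target points.
log_q = factored_log_likelihood([0.1, -0.3, 0.8], [0.0, -0.2, 1.0], [0.5, 0.5, 0.5])
```

Because the terms are summed, reordering the target points leaves the result unchanged, which is how the factorization yields permutation invariance in the targets.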

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, because <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math> and updated incrementally as each new observation arrives, this architecture supports streaming observations with minimal overhead.
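The aggregation step can be illustrated with a small sketch. It assumes the commutative operation is the mean and uses a toy, hand-written <code>encode</code> function as a hypothetical stand-in for the learned encoder; neither is the authors' implementation.

```python
import math

def encode(x, y):
    # Hypothetical stand-in for h_theta: maps one (x, y) pair to a vector in R^3.
    return [x, y, math.tanh(x * y)]

def aggregate(reps):
    # Element-wise mean: one commutative choice for the operation *.
    n = len(reps)
    return [sum(col) / n for col in zip(*reps)]

def update(r, n, new_rep):
    # Fold one more representation into a mean over n items without revisiting them.
    return [(n * a + b) / (n + 1) for a, b in zip(r, new_rep)]

observations = [(0.0, 1.0), (0.5, -0.2), (1.5, 0.7)]
reps = [encode(x, y) for x, y in observations]

r = aggregate(reps)                                  # all observations at once
r_shuffled = aggregate(list(reversed(reps)))         # same observations, other order
r_stream = update(aggregate(reps[:2]), 2, reps[2])   # third observation streamed in
```

All three results agree up to floating-point rounding, which shows both the order independence and the cheap streaming update.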

The authors train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution P given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, they take Monte Carlo estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the number of observations in the conditioning subset.

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the authors emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.
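One way to write the training objective described above is the following. Here <math display="inline">O_N \subset O</math> is notation introduced for this sketch: it denotes the randomly chosen conditioning subset containing <math display="inline">N</math> observations.

<math display="block">\mathcal{L}(\theta) = -\mathbb{E}_{f \sim P} \left[ \mathbb{E}_{N} \left[ \log Q_\theta \left( \{y_i\}_{i=0}^{n-1} \mid O_N, \{x_i\}_{i=0}^{n-1} \right) \right] \right]</math>

The inner term scores the model on all <math display="inline">n</math> points of <math display="inline">O</math>, so both the points inside <math display="inline">O_N</math> and those outside it contribute to the loss.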

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs. They generated two different datasets consisting of functions sampled from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters; in the second dataset the function switched, at some random point on the real line, between two functions each sampled with different kernel parameters. At every training step they sampled a curve from the GP, selected a subset of n points as observations, and selected a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer MLP.
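The data-generation step for one training iteration can be sketched as follows. This is a minimal illustration under stated assumptions: a squared-exponential kernel with made-up hyperparameters, inputs drawn uniformly from [-2, 2], and hypothetical function names; the actual kernel settings and ranges are not given in the summary above.

```python
import math
import random

def sq_exp_kernel(a, b, length_scale=0.4, variance=1.0):
    # Assumed kernel; hyperparameters are illustrative only.
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(matrix):
    # Lower-triangular factor L with L * L^T = matrix (matrix must be positive definite).
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def sample_curve(xs, rng, jitter=1e-5):
    # Draw one function from the GP prior at inputs xs; jitter keeps the factorization stable.
    n = len(xs)
    cov = [[sq_exp_kernel(xs[i], xs[j]) + (jitter if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

def make_task(rng, num_points=20, num_context=5, num_target=10):
    # One training step: sample a curve, then pick observation and target subsets.
    xs = sorted(rng.uniform(-2.0, 2.0) for _ in range(num_points))
    ys = sample_curve(xs, rng)
    context = [(xs[i], ys[i]) for i in rng.sample(range(num_points), num_context)]
    targets = [(xs[i], ys[i]) for i in rng.sample(range(num_points), num_target)]
    return context, targets

context, targets = make_task(random.Random(0))
```

The context pairs would be passed to the encoder and the target inputs to the decoder, with the target outputs used to score the prediction.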

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the CNP's performance. Although the prediction generated by the GP is smoother than the CNP's prediction both for the mean and variance, the model is able to learn to regress from a few context points for both the fixed kernels and switching kernels. As the number of context points grows, the accuracy of the model improves and its approximated uncertainty decreases. Crucially, the model learns to estimate its own uncertainty given the observations very accurately. Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for a CNP only the dataset used for training has to change.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, because there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
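The selection rule behind this kind of active exploration can be sketched in a few lines. The sketch assumes that "chosen according to uncertainty" means greedily picking the unobserved pixel with the highest predicted variance; the function name and the dictionary layout are hypothetical.

```python
def next_pixel(variance, observed):
    # variance: dict mapping (row, col) -> predicted variance for that pixel.
    # observed: set of (row, col) positions already used as context.
    candidates = [p for p in variance if p not in observed]
    return max(candidates, key=lambda p: variance[p])

# Toy 2x2 variance map, as a model might predict after a few observations.
variance = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.5, (1, 1): 0.3}
first = next_pixel(variance, set())
second = next_pixel(variance, {first})
```

In a full loop, the chosen pixel would be added to the context, the model re-run, and the variance map refreshed before the next choice.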

== Experimental Result III: Image Completion for Faces ==

[[File:003.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.

[[File:004.jpg|400px|center]]

An important aspect of CNPs demonstrated in the above figure is
their flexibility not only in the number of observations and
targets they receive but also with regard to their input values.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary but
usually this would not be enough to capture the interesting
properties).

In addition, the model is flexible with regard to the target input values. This means, for example, that the model can be queried at resolutions it has not seen during training. The authors took a model that had only been trained using pixel coordinates of a specific resolution and predicted, at test time, subpixel values for targets between the original coordinates. As shown in the figure above, with one forward pass the model can be queried at different resolutions. While GPs also exhibit this type of flexibility, it is not the case for trained generative models, which can only predict values for the pixel coordinates on which they were trained. In this sense, CNPs capture the best of both worlds: they are flexible with regard to the conditioning and prediction task, and have the capacity to extract domain knowledge from a training set.
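Querying at a new resolution only requires a different set of target inputs. The sketch below assumes pixel positions are represented as normalized centre coordinates in the unit square, which is an illustrative convention rather than something stated in the summary; <code>pixel_grid</code> is a hypothetical helper.

```python
def pixel_grid(height, width):
    # Normalized (row, col) centre coordinates for every pixel of a height x width image.
    return [((r + 0.5) / height, (c + 0.5) / width)
            for r in range(height) for c in range(width)]

train_inputs = pixel_grid(32, 32)   # coordinates seen during training
query_inputs = pixel_grid(64, 64)   # finer grid used only at test time
```

Every coordinate in the finer grid falls between the training coordinates, so passing <code>query_inputs</code> to the decoder as the target set asks the model for subpixel values.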

[[File:010.jpg|400px|center]]

They compared CNPs quantitatively to two related models: kNNs and GPs. As shown in the above table, CNPs outperform both when the number of context points is small (empirically, when half of the image or less is provided as context). When the majority of the image is given as context, exact methods like GPs and kNN perform better. From the table it can also be seen that the order in which the context points are provided is less important for CNPs, since providing the context points in order from top to bottom still results in good performance. Both insights point to the fact that CNPs learn a data-specific ‘prior’ that will generate good samples even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples, and as such this dataset is particularly suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as the training set and the remainder as the testing set. The data was also preprocessed and augmented: this includes cropping the image from 32 × 32 to 28 × 28, applying small random translations and rotations to the inputs, and also increasing the number of classes by rotating every character by 90 degrees and defining that to be a new class. They generated the labels for an N-way classification task by choosing N random classes at each training step and arbitrarily assigning the labels 0, ..., N − 1 to each.
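The label-generation step for one N-way task can be sketched as follows. The function name and the integer class identifiers are hypothetical; only the procedure (pick N random classes, assign labels 0 to N − 1 arbitrarily) comes from the description above.

```python
import random

def make_n_way_labels(class_ids, n_way, rng):
    # Pick n_way distinct classes and give them the task-local labels 0..n_way-1.
    chosen = rng.sample(class_ids, n_way)
    return {cls: label for label, cls in enumerate(chosen)}

training_classes = list(range(1200))   # stand-in identifiers for the training classes
label_map = make_n_way_labels(training_classes, 5, random.Random(0))
```

Because the mapping is redrawn at every step, a label such as 0 carries no fixed meaning across tasks, so the model has to rely on the labelled context examples.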

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture of the encoder h to include convolution layers as mentioned in section 2. In addition, they only aggregated over inputs of the same class by using the information provided by the input label. The aggregated class-specific representations are then concatenated to form the final representation. Given that both the size of the class-specific representations and the number of classes are constant, the size of the final representation is still constant and thus the O(n + m) runtime still holds. The results of the classification are summarized in the table. CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, they reach those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of O(n + m) at test time, as opposed to O(nm).
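The class-wise aggregation described above can be sketched as follows. It assumes the per-class aggregate is a mean and that every class has at least one context example; the function name is hypothetical.

```python
def class_representation(reps, labels, n_way):
    # reps: list of encoder outputs (equal-length lists); labels: class label per output.
    parts = []
    for cls in range(n_way):
        members = [r for r, label in zip(reps, labels) if label == cls]
        # Mean over the representations belonging to this class only.
        parts.extend(sum(col) / len(members) for col in zip(*members))
    # Concatenation of the per-class means: length is n_way * d regardless of len(reps).
    return parts

reps = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = [0, 0, 1]
final = class_representation(reps, labels, n_way=2)
```

The output length depends only on the number of classes and the representation size, which is why the decoder input stays fixed as more context examples are added.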

== Conclusion ==

In this paper the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

File:010.jpg

2018-11-20T23:49:49Z

S366chen:

conditional neural process

2018-11-20T23:40:19Z

S366chen: /* Experimental Result II: Image Completion for Digits */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). Once trained, CNPs can make accurate predictions after observing only a handful of data points, while also having the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is a function approximating <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, this is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0} ^ {n-1}</math>, and let the test set be <math display="inline"> T = \{x_i\}_{i = n} ^ {n + m - 1}</math>, a set of inputs whose outputs are to be predicted.

Let P be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Then P defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0} ^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for every <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. Permutation invariance is a property of stochastic processes, so <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, because <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math> and updated incrementally as each new observation arrives, this architecture supports streaming observations with minimal overhead.

The authors train <math display="inline">Q_\theta</math> by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution P given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, they take Monte Carlo estimates of the gradient of this loss by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the number of observations in the conditioning subset.

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the authors emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs. They generated two different datasets consisting of functions sampled from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters; in the second dataset the function switched, at some random point on the real line, between two functions each sampled with different kernel parameters. At every training step they sampled a curve from the GP, selected a subset of n points as observations, and selected a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the CNP's performance. Although the prediction generated by the GP is smoother than the CNP's prediction both for the mean and variance, the model is able to learn to regress from a few context points for both the fixed kernels and switching kernels. As the number of context points grows, the accuracy of the model improves and its approximated uncertainty decreases. Crucially, the model learns to estimate its own uncertainty given the observations very accurately. Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for a CNP only the dataset used for training has to change.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|600px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, because there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.

== Experimental Result III: Image Completion for Faces ==

[[File:004.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.
An important aspect of CNPs demonstrated in Figure 5 is their flexibility not only in the number of observations and targets they receive but also with regard to their input values. It is interesting to compare this property to GPs on one hand, and to trained generative models (van den Oord et al., 2016; Gregor et al., 2015) on the other hand. The first type of flexibility can be seen when conditioning on subsets that the model has not encountered during training. Consider conditioning the model on one half of the image, for example. This forces the model to not only predict pixel values according to some stationary smoothness property of the images, but also according to global spatial properties, e.g. symmetry and the relative location of different parts of faces. As seen in the first row of the figure, CNPs are able to capture those properties. A GP with a stationary kernel cannot capture this, and in the absence of observations would revert to its mean (the mean itself can be non-stationary but usually this would not be enough to capture the interesting properties).

In addition, the model is flexible with regard to the target input values. This means, for example, that the model can be queried at resolutions it has not seen during training. The authors took a model that had only been trained using pixel coordinates of a specific resolution and predicted, at test time, subpixel values for targets between the original coordinates. As shown in Figure 5, with one forward pass the model can be queried at different resolutions. While GPs also exhibit this type of flexibility, it is not the case for trained generative models, which can only predict values for the pixel coordinates on which they were trained. In this sense, CNPs capture the best of both worlds: they are flexible with regard to the conditioning and prediction task, and have the capacity to extract domain knowledge from a training set.

They compared CNPs quantitatively to two related models: kNNs and GPs. As shown in Table 4.2.3, CNPs outperform both when the number of context points is small (empirically, when half of the image or less is provided as context). When the majority of the image is given as context, exact methods like GPs and kNN perform better. From the table it can also be seen that the order in which the context points are provided is less important for CNPs, since providing the context points in order from top to bottom still results in good performance. Both insights point to the fact that CNPs learn a data-specific ‘prior’ that will generate good samples even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples, and as such this dataset is particularly suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as the training set and the remainder as the testing set. The data was also preprocessed and augmented: this includes cropping the image from 32 × 32 to 28 × 28, applying small random translations and rotations to the inputs, and also increasing the number of classes by rotating every character by 90 degrees and defining that to be a new class. They generated the labels for an N-way classification task by choosing N random classes at each training step and arbitrarily assigning the labels 0, ..., N − 1 to each.

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture of the encoder h to include convolution layers as mentioned in section 2. In addition, they only aggregated over inputs of the same class by using the information provided by the input label. The aggregated class-specific representations are then concatenated to form the final representation. Given that both the size of the class-specific representations and the number of classes are constant, the size of the final representation is still constant and thus the O(n + m) runtime still holds. The results of the classification are summarized in the table. CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, they reach those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of O(n + m) at test time, as opposed to O(nm).

== Conclusion ==

In this paper the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

conditional neural process

2018-11-20T23:40:07Z

S366chen: /* Model */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and the test set be <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>, the inputs whose outputs are to be predicted.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> f : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for every <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|300px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As for stochastic processes, <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is enforced by assuming a factored structure, that is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observations with minimal overhead.
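
The three steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the weights are random rather than trained, the helper names (<code>make_mlp</code>, <code>mlp</code>, <code>aggregate</code>, <code>cnp_predict</code>) are hypothetical, the commutative operation is taken to be the mean, and the decoder output is interpreted as the mean and standard deviation of a Gaussian.

```python
import math
import random

def make_mlp(sizes, rng):
    """Random (untrained) weights for a small fully connected network."""
    return [
        ([[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def mlp(params, v):
    """Apply the network: tanh on hidden layers, linear output layer."""
    for k, (W, b) in enumerate(params):
        v = [sum(w * a for w, a in zip(row, v)) + bj for row, bj in zip(W, b)]
        if k < len(params) - 1:
            v = [math.tanh(a) for a in v]
    return v

def aggregate(reps):
    """Commutative aggregation (assumed here: element-wise mean of the r_i)."""
    n = len(reps)
    return [sum(col) / n for col in zip(*reps)]

def cnp_predict(h, g, observations, targets):
    """Return (mean, std) for each target input, conditioned on the observations."""
    # Encode every observed pair once and aggregate: O(n).
    r = aggregate([mlp(h, [x, y]) for x, y in observations])
    out = []
    # Decode every target independently (factored structure): O(m).
    for x in targets:
        mu, raw = mlp(g, [x] + r)
        sigma = 0.1 + 0.9 * math.log1p(math.exp(raw))  # softplus keeps std positive
        out.append((mu, sigma))
    return out

rng = random.Random(0)
d = 8                                   # size of the representation r
h = make_mlp([2, 16, d], rng)           # encoder h: (x, y) -> R^d
g = make_mlp([1 + d, 16, 2], rng)       # decoder g: (x, r) -> (mean, raw std)
O = [(-1.0, 0.3), (0.0, -0.2), (1.5, 0.8)]
T = [-0.5, 0.5, 2.0]
pred = cnp_predict(h, g, O, T)
pred_shuffled = cnp_predict(h, g, O[::-1], T)
```

Because the observations only enter through the aggregated <code>r</code>, reordering <code>O</code> leaves the predictions unchanged up to floating-point rounding.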

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution
<math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed
and unobserved values. In practice, Monte Carlo
estimates of the gradient of the training loss are taken by sampling the function <math display="inline">f</math> and the size <math display="inline">N</math> of the conditioning subset.
This approach shifts the burden of imposing prior knowledge
from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still,
the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.
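
For reference, the training objective described in this section can be written out explicitly. The following is a sketch in the notation of this summary, assuming (as in the original paper) that <math display="inline">N</math> is drawn uniformly from <math display="inline">\{0, \ldots, n-1\}</math> and that <math display="inline">O_N = \{(x_i, y_i)\}_{i=0}^{N} \subset O</math> is the conditioning subset:

<math>\mathcal{L}(\theta) = -\mathbb{E}_{f \sim P}\left[\mathbb{E}_{N}\left[\log Q_\theta\left(\{y_i\}_{i=0}^{n-1} \mid O_N, \{x_i\}_{i=0}^{n-1}\right)\right]\right]</math>

Under the factored structure, the inner term is a sum over points, <math display="inline">\sum_{i=0}^{n-1} \log Q_\theta(y_i \mid O_N, x_i)</math>, so each sampled pair <math display="inline">(f, N)</math> gives one Monte Carlo estimate of the loss and its gradient.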

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.
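
The streaming property follows directly when the commutative operation is a mean, which is an assumption of this sketch (the architecture only requires some commutative operation). A running mean can absorb each new representation without revisiting earlier observations. The class name <code>RunningMean</code> is hypothetical.

```python
class RunningMean:
    """Incrementally maintained element-wise mean of representation vectors."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim

    def update(self, r_i):
        """Fold one new representation into the mean in O(d) time."""
        self.count += 1
        self.mean = [m + (v - m) / self.count for m, v in zip(self.mean, r_i)]
        return self.mean

stream = RunningMean(dim=2)
for r_i in ([1.0, 4.0], [3.0, 0.0], [5.0, 2.0]):
    r = stream.update(r_i)   # r is always the mean of everything seen so far
```

Each new observation costs a constant amount of work per dimension, so the total cost over n observations stays linear in n.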

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, selected
a subset of n points as observations, and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer
MLP.
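
The data-generation step for one training iteration can be sketched as follows. This is a rough illustration under assumptions that are not taken from the paper: a squared-exponential kernel with a made-up length scale, inputs drawn uniformly from [-2, 2], and hypothetical helper names. The targets include the context points, in line with the training procedure in which both observed and unobserved values are scored.

```python
import math
import random

def sq_exp_kernel(xs, length_scale=0.4, variance=1.0, jitter=1e-6):
    """Covariance matrix of a squared-exponential kernel (assumed kernel form)."""
    n = len(xs)
    return [[variance * math.exp(-0.5 * ((xs[i] - xs[j]) / length_scale) ** 2)
             + (jitter if i == j else 0.0) for j in range(n)] for i in range(n)]

def cholesky(K):
    """Lower-triangular factor L with L L^T = K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_task(rng, n_points=12, n_context=5, n_target=10):
    """Sample one curve from the GP and split it into context and target sets."""
    xs = sorted(rng.uniform(-2.0, 2.0) for _ in range(n_points))
    L = cholesky(sq_exp_kernel(xs))
    z = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    ys = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n_points)]
    idx = list(range(n_points))
    rng.shuffle(idx)
    context = [(xs[i], ys[i]) for i in idx[:n_context]]
    target = [(xs[i], ys[i]) for i in idx[:n_target]]  # superset of the context
    return context, target

context, target = sample_task(random.Random(0))
```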

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the model's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for a CNP only the dataset
used for training has to change.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|700px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point, the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases, the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, because the fixed-size representation acts as a bottleneck.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as more observations are added, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
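
The uncertainty-driven selection described above can be sketched as a greedy loop. This is an illustration under assumptions: the rule "observe the pixel with the highest predicted variance" is one simple reading of choosing observations according to uncertainty, and <code>toy_variance</code> is a hypothetical stand-in for the variance a trained CNP would predict.

```python
def next_pixel(variance, observed):
    """Index of the not-yet-observed pixel with the largest predicted variance."""
    candidates = [i for i in range(len(variance)) if i not in observed]
    return max(candidates, key=lambda i: variance[i])

def toy_variance(n_pixels, observed):
    """Stand-in for a model's variance: grows with distance to the nearest observation."""
    if not observed:
        return [1.0] * n_pixels
    return [min(abs(i - j) for j in observed) / n_pixels for i in range(n_pixels)]

observed = set()
order = []
for _ in range(4):
    var = toy_variance(10, observed)   # a real CNP would be re-queried here
    i = next_pixel(var, observed)
    observed.add(i)
    order.append(i)
```

With a real model, each step would condition the CNP on the pixels observed so far and read the variance from its output.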

== Experimental Result III: Image Completion for Faces ==

[[File:004.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.
An important aspect of CNPs, demonstrated in Figure 5 of the paper, is
their flexibility not only in the number of observations and
targets they receive but also with regard to the input values of those observations and targets.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary, but
usually this would not be enough to capture the interesting
properties).
In addition, the model is flexible with regards to the target
input values. This means, e.g., we can query the model
at resolutions it has not seen during training. We take a
model that has only been trained using pixel coordinates of
a specific resolution, and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5, with one forward pass we can query the model at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.
The authors compare CNPs quantitatively to two related models:
kNNs and GPs. As shown in Table 4.2.3 of the paper, CNPs outperform
the latter when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context exact
methods like GPs and kNN will perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as their testing data set.
They also augmented the dataset. This included cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.
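
The label assignment for one N-way training step can be sketched as below. The function name <code>sample_episode</code> and the use of integer class identifiers are assumptions made for illustration.

```python
import random

def sample_episode(class_ids, n_way, rng):
    """Pick n_way distinct classes and give them arbitrary labels 0..n_way-1."""
    chosen = rng.sample(class_ids, n_way)
    return {cls: label for label, cls in enumerate(chosen)}

rng = random.Random(0)
train_classes = list(range(1200))   # identifiers of the training classes
label_of = sample_episode(train_classes, n_way=5, rng=rng)
```

The mapping is redrawn at every step, so a label carries no meaning beyond the current episode.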

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reach those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time,
as opposed to O(nm).
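
The class-wise aggregation described in this section can be sketched as follows. This is a minimal illustration, assuming mean aggregation within each class and a zero vector for a class without examples. The function name <code>classwise_representation</code> is hypothetical.

```python
def classwise_representation(reps, labels, n_classes):
    """Average representations per class, then concatenate the class means."""
    dim = len(reps[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for r, c in zip(reps, labels):
        counts[c] += 1
        sums[c] = [s + v for s, v in zip(sums[c], r)]
    final = []
    for s, k in zip(sums, counts):
        # A class with no examples contributes zeros (an assumption of this sketch).
        final.extend(v / k if k else 0.0 for v in s)
    return final

reps = [[1.0, 2.0], [10.0, 20.0], [5.0, 6.0]]   # encoder outputs r_i
labels = [0, 1, 0]                              # class label of each input
final = classwise_representation(reps, labels, n_classes=2)
```

The output length is the number of classes times the representation size, independent of how many examples are provided, which is why the linear runtime is preserved.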

== Conclusion ==

The paper introduces Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

The authors demonstrated its ability to perform a variety of tasks,
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

conditional neural process

2018-11-20T23:39:57Z

S366chen: /* Experimental Result I: Function Regression */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and the test set be <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>, the inputs whose outputs are to be predicted.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> f : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T)|O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for every <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|700px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As for stochastic processes, <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is enforced by assuming a factored structure, that is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * ... * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * ... * r_n</math> can be computed in <math display="inline">O(n)</math>, this architecture supports streaming observations with minimal overhead.

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution
<math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed
and unobserved values. In practice, Monte Carlo
estimates of the gradient of the training loss are taken by sampling the function <math display="inline">f</math> and the size <math display="inline">N</math> of the conditioning subset.
This approach shifts the burden of imposing prior knowledge
from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still,
the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, selected
a subset of n points as observations, and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer
MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|300px|center]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the model's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, the model learns
to estimate its own uncertainty given the observations very
accurately.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for a CNP only the dataset
used for training has to change.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|700px|center]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point, the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases, the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, because the fixed-size representation acts as a bottleneck.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as more observations are added, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.

== Experimental Result III: Image Completion for Faces ==

[[File:004.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture
the complex shapes and colours of this dataset with predictions
conditioned on less than 10% of the pixels being
already close to ground truth. As before, given few context
points the model averages over all possible faces, but as
the number of context pairs increases the predictions capture
image-specific details like face orientation and facial
expression. Furthermore, as the number of context points
increases the variance is shifted towards the edges in the
image.
An important aspect of CNPs, demonstrated in Figure 5 of the paper, is
their flexibility not only in the number of observations and
targets they receive but also with regard to the input values of those observations and targets.
It is interesting to compare this property to GPs on one hand,
and to trained generative models (van den Oord et al., 2016;
Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on
subsets that the model has not encountered during training.
Consider conditioning the model on one half of the image,
for example. This forces the model to not only predict pixel
values according to some stationary smoothness property of
the images, but also according to global spatial properties,
e.g. symmetry and the relative location of different parts of
faces. As seen in the first row of the figure, CNPs are able to
capture those properties. A GP with a stationary kernel cannot
capture this, and in the absence of observations would
revert to its mean (the mean itself can be non-stationary, but
usually this would not be enough to capture the interesting
properties).
In addition, the model is flexible with regards to the target
input values. This means, e.g., we can query the model
at resolutions it has not seen during training. We take a
model that has only been trained using pixel coordinates of
a specific resolution, and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5, with one forward pass we can query the model at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.
The authors compare CNPs quantitatively to two related models:
kNNs and GPs. As shown in Table 4.2.3 of the paper, CNPs outperform
the latter when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context exact
methods like GPs and kNN will perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as their testing data set.
They also augmented the dataset. This included cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture of the encoder <math display="inline">h</math> to include convolution layers, as mentioned in section 2 of the paper. In addition, they only aggregated over inputs of the same class by using the information provided by the input label. The aggregated class-specific representations are then concatenated to form the final representation. Given that both the size of the class-specific representations and the number of classes are constant, the size of the final representation is still constant and thus the <math display="inline">O(n + m)</math> runtime still holds.
The results of the classification are summarized in the following table. CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, they reach those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of <math display="inline">O(n + m)</math> at test time, as opposed to <math display="inline">O(nm)</math>.

== Conclusion ==

This paper introduced Conditional Neural Processes, a model that is both flexible at test time and has the capacity to extract prior knowledge from training data.

The authors demonstrated its ability to perform a variety of tasks including regression, classification and image completion. They compared CNPs to Gaussian Processes on one hand, and deep learning methods on the other, and also discussed the relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

conditional neural process

2018-11-20T23:39:49Z

S366chen: /* Experimental Result III: Image Completion for Faces */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set (the observations) be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and the test set (the targets) be <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>, a set of unlabelled inputs.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(x)|O, T)</math>. Our task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|700px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As for stochastic processes, <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \ldots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, because <math display="inline">r = r_1 * r_2 * \ldots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated as each new observation arrives, this architecture supports streaming observations with minimal overhead.
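
The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the helper names (<code>init_mlp</code>, <code>mlp</code>, <code>cnp_forward</code>), the layer sizes, the use of the mean as the commutative operation, and the softplus parametrization of the standard deviation are all assumptions made for the example.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for layer sizes [in, hidden..., out] (illustrative, untrained)."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply an MLP with ReLU on hidden layers and a linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

def cnp_forward(enc, dec, x_obs, y_obs, x_tgt):
    """Predict a Gaussian mean and std at each target input."""
    # Step 1: r_i = h(x_i, y_i) for every observation, shape (n, d).
    r_i = mlp(enc, np.concatenate([x_obs, y_obs], axis=-1))
    # Step 2: commutative aggregation (here: the mean), shape (d,).
    r = r_i.mean(axis=0)
    # Step 3: phi_i = g(x_i, r) for every target, shape (m, 2).
    r_rep = np.broadcast_to(r, (x_tgt.shape[0], r.shape[0]))
    phi = mlp(dec, np.concatenate([x_tgt, r_rep], axis=-1))
    mu, raw_sigma = phi[:, :1], phi[:, 1:]
    # Assumed parametrization: softplus keeps the std strictly positive.
    sigma = 1e-3 + np.logaddexp(0.0, raw_sigma)
    return mu, sigma

rng = np.random.default_rng(0)
enc = init_mlp(rng, [2, 16, 8])        # encoder h: (x, y) -> R^8
dec = init_mlp(rng, [1 + 8, 16, 2])    # decoder g: (x, r) -> (mean, raw std)
x_obs = rng.uniform(-2.0, 2.0, (5, 1))
y_obs = np.sin(x_obs)
x_tgt = np.linspace(-2.0, 2.0, 7)[:, None]

mu, sigma = cnp_forward(enc, dec, x_obs, y_obs, x_tgt)
perm = rng.permutation(5)              # reorder the observations
mu_p, sigma_p = cnp_forward(enc, dec, x_obs[perm], y_obs[perm], x_tgt)
```

Because the aggregation is a mean, reordering the observations leaves the prediction unchanged, and the cost is one encoder pass per observation plus one decoder pass per target.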

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution <math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the number of observations used for conditioning.
This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.
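
One Monte Carlo sample of this training loss might look as follows. The sketch assumes a factored Gaussian output, draws a random subset rather than a prefix of the observations, and uses at least one conditioning point; <code>predict</code> stands for any model with the signature shown, and <code>dummy_predict</code> is a hypothetical stand-in used only to make the example runnable.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Mean negative log-density of y under independent N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                   + 0.5 * ((y - mu) / sigma) ** 2)

def cnp_loss_sample(predict, x, y, rng):
    """One Monte Carlo sample of the loss for the data (x, y) of one function f."""
    n = x.shape[0]
    N = rng.integers(1, n)           # number of conditioning points, 1 <= N < n
    idx = rng.permutation(n)[:N]     # randomly chosen subset of O
    # Score the model on ALL points, i.e. both observed and unobserved values.
    mu, sigma = predict(x[idx], y[idx], x)
    return gaussian_nll(y, mu, sigma)

def dummy_predict(x_ctx, y_ctx, x_tgt):
    """Hypothetical stand-in model: constant mean and std from the context."""
    m = x_tgt.shape[0]
    return np.full((m, 1), y_ctx.mean()), np.full((m, 1), y_ctx.std() + 0.5)

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 10)[:, None]
y = np.sin(x)
loss = cnp_loss_sample(dummy_predict, x, y, rng)
```

In a real training loop the gradient of this quantity with respect to the model parameters would be averaged over many sampled functions and subset sizes.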

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is used as a common baseline for GPs. They generated two different datasets that consisted of functions generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with different kernel parameters. At every training step they sampled a curve from the GP, selected a subset of <math display="inline">n</math> points as observations, and a subset of <math display="inline">t</math> points as target points. The observed points are encoded using a three-layer MLP encoder <math display="inline">h</math> with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder <math display="inline">g</math> consisting of a five-layer MLP.
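
The data generation for the fixed-kernel dataset can be sketched as below. The kernel form (squared exponential), its length scale, the input range and the numbers of points are assumptions chosen for illustration; the summary above does not state them.

```python
import numpy as np

def se_kernel(x1, x2, length_scale=0.4, sigma_f=1.0):
    """Squared-exponential kernel matrix (assumed kernel form and parameters)."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * d2 / length_scale ** 2)

def sample_task(rng, n_points=50, n_obs=5, n_tgt=10):
    """Sample one curve from a GP prior and split it into observations and targets."""
    x = np.sort(rng.uniform(-2.0, 2.0, n_points))
    K = se_kernel(x, x) + 1e-6 * np.eye(n_points)   # jitter for numerical stability
    y = np.linalg.cholesky(K) @ rng.standard_normal(n_points)
    perm = rng.permutation(n_points)
    obs, tgt = perm[:n_obs], perm[n_obs:n_obs + n_tgt]
    return (x[obs], y[obs]), (x[tgt], y[tgt])

(x_o, y_o), (x_t, y_t) = sample_task(np.random.default_rng(0))
```

A new curve is drawn at every training step, so the model never sees the same function twice and has to learn the statistics of the whole family.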

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|600px|center]]

They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the model's performance. Although the prediction generated by the GP is smoother than the CNP's prediction for both the mean and the variance, the model is able to learn to regress from a few context points for both the fixed and the switching kernels. As the number of context points grows, the accuracy of the model improves and its approximated uncertainty decreases. Crucially, the model learns to estimate its own uncertainty given the observations very accurately, while still providing a good approximation of the underlying function.
Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for CNPs only the dataset used for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|700px|center]]

They also tested CNP on the MNIST dataset and used the test set to evaluate its performance. As shown in the above figure, the model learns to make good predictions of the underlying digit even for a small number of context points. Crucially, when conditioned on only one non-informative context point, the model’s prediction corresponds to the average over all MNIST digits. As the number of context points increases, the predictions become more similar to the underlying ground truth. This demonstrates the model’s capacity to extract dataset-specific prior knowledge. It is worth mentioning that even with a complete set of observations the model does not achieve pixel-perfect reconstruction, as there is a bottleneck at the representation level.
Since this implementation of CNP returns factored outputs, the best prediction it can produce given limited context information is to average over all possible predictions that agree with the context. An alternative is to add latent variables to the model such that they can be sampled conditioned on the context to produce predictions with high probability in the data distribution.

An important aspect of the model is its ability to estimate the uncertainty of the prediction. As shown in the bottom row of the above figure, as more observations are added, the variance shifts from being almost uniformly spread over the digit positions to being localized around areas that are specific to the underlying digit, specifically its edges. Being able to model the uncertainty given some context can be helpful for many tasks. One example is active exploration, where the model has a choice over where to observe.
They tested this by comparing the predictions of CNP when the observations are chosen according to uncertainty (the pixel with the highest variance at each step) versus random pixels. This method is a very simple way of doing active exploration, but it already produces better prediction results than selecting the conditioning points at random.
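
The uncertainty-driven selection rule can be sketched as a single function. The name <code>next_pixel_by_uncertainty</code> and the array layout are hypothetical; the variance map would come from the model's prediction given the pixels observed so far.

```python
import numpy as np

def next_pixel_by_uncertainty(variance, observed_mask):
    """Return (row, col) of the unobserved pixel with the largest predicted variance."""
    # Already observed pixels are excluded by setting their score to -inf.
    masked = np.where(observed_mask, -np.inf, variance)
    row, col = np.unravel_index(np.argmax(masked), variance.shape)
    return int(row), int(col)

variance = np.array([[0.1, 0.9, 0.2],
                     [0.3, 0.4, 0.8]])
observed = np.zeros_like(variance, dtype=bool)
observed[0, 1] = True                      # the most uncertain pixel is already known
choice = next_pixel_by_uncertainty(variance, observed)
```

Each chosen pixel is added to the context, the model is queried again, and the procedure repeats; the random baseline simply replaces the argmax with a uniformly sampled unobserved pixel.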

== Experimental Result III: Image Completion for Faces ==

[[File:004.jpg|400px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture the complex shapes and colours of this dataset, with predictions conditioned on less than 10% of the pixels being already close to the ground truth. As before, given few context points the model averages over all possible faces, but as the number of context pairs increases the predictions capture image-specific details like face orientation and facial expression. Furthermore, as the number of context points increases, the variance is shifted towards the edges in the image.
An important aspect of CNPs, demonstrated in Figure 5, is their flexibility not only in the number of observations and targets they receive but also with regard to their input values. It is interesting to compare this property to GPs on one hand, and to trained generative models (van den Oord et al., 2016; Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on subsets that the model has not encountered during training. Consider conditioning the model on one half of the image, for example. This forces the model to predict pixel values not only according to some stationary smoothness property of the images, but also according to global spatial properties, e.g. symmetry and the relative location of different parts of faces. As seen in the first row of the figure, CNPs are able to capture those properties. A GP with a stationary kernel cannot capture this, and in the absence of observations would revert to its mean (the mean itself can be non-stationary, but usually this would not be enough to capture the interesting properties).
In addition, the model is flexible with regard to the target input values. This means, for example, that the model can be queried at resolutions it has not seen during training. A model that has only been trained using pixel coordinates of a specific resolution is used to predict, at test time, subpixel values for targets between the original coordinates. As shown in Figure 5, with one forward pass the model can be queried at different resolutions. While GPs also exhibit this type of flexibility, it is not the case for trained generative models, which can only predict values for the pixel coordinates on which they were trained. In this sense, CNPs capture the best of both worlds: they are flexible with regard to the conditioning and prediction task, and have the capacity to extract domain knowledge from a training set.
CNPs are compared quantitatively to two related models: kNNs and GPs. As shown in Table 4.2.3, CNPs outperform these baselines when the number of context points is small (empirically, when half of the image or less is provided as context). When the majority of the image is given as context, exact methods like GPs and kNN perform better. From the table we can also see that the order in which the context points are provided is less important for CNPs, since providing the context points in order from top to bottom still results in good performance. Both insights point to the fact that CNPs learn a data-specific ‘prior’ that will generate good samples even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples, and as such this dataset is particularly suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as the training set and the remainder as the testing set. They also augmented the dataset, which includes cropping the image from 32 × 32 to 28 × 28, applying small random translations and rotations to the inputs, and also increasing the number of classes by rotating every character by 90 degrees and defining that to be a new class. They generated the labels for an N-way classification task by choosing N random classes at each training step and arbitrarily assigning the labels 0, ..., N − 1 to each.

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture of the encoder <math display="inline">h</math> to include convolution layers, as mentioned in section 2 of the paper. In addition, they only aggregated over inputs of the same class by using the information provided by the input label. The aggregated class-specific representations are then concatenated to form the final representation. Given that both the size of the class-specific representations and the number of classes are constant, the size of the final representation is still constant and thus the <math display="inline">O(n + m)</math> runtime still holds.
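
The class-specific aggregation described above can be sketched as follows. The mean as aggregation operation and the zero vector used for a class with no examples are assumptions made for this example; <code>class_aggregate</code> is a hypothetical helper name.

```python
import numpy as np

def class_aggregate(r_i, labels, n_classes):
    """Aggregate encodings per class and concatenate into one vector of size n_classes * d."""
    d = r_i.shape[1]
    out = np.zeros((n_classes, d))
    for c in range(n_classes):
        members = r_i[labels == c]
        if len(members) > 0:          # classes without examples stay at zero (assumption)
            out[c] = members.mean(axis=0)
    return out.reshape(-1)

r_i = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])          # three encoded inputs with d = 2
labels = np.array([0, 1, 0])
r = class_aggregate(r_i, labels, n_classes=3)
```

The output length depends only on the number of classes and the encoding size, not on the number of inputs, which is why the final representation stays constant in size.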
The results of the classification are summarized in the following table. CNPs achieve higher accuracy than models that are significantly more complex (like MANN). While CNPs do not beat the state of the art for one-shot classification, their accuracy values are comparable. Crucially, they reach those values using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of <math display="inline">O(n + m)</math> at test time, as opposed to <math display="inline">O(nm)</math>.

== Conclusion ==

This paper introduced Conditional Neural Processes, a model that is both flexible at test time and has the capacity to extract prior knowledge from training data.

The authors demonstrated its ability to perform a variety of tasks including regression, classification and image completion. They compared CNPs to Gaussian Processes on one hand, and deep learning methods on the other, and also discussed the relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

conditional neural process

2018-11-20T23:39:40Z

S366chen: /* Experimental Result III: Image Completion for Faces */

== Introduction ==

To train a model effectively, deep neural networks typically require large datasets. To mitigate this data efficiency problem, learning in two phases is one approach: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned. Taking a probabilistic stance and specifying a distribution over functions (stochastic processes) is another approach -- Gaussian Processes being a commonly used example of this. Such Bayesian methods can be computationally expensive, however.

The authors of the paper propose a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs). CNPs can be trained on very few data points to make accurate predictions, while they also have the capacity to scale to complex functions and large datasets.

== Model ==
Consider a data set <math display="inline"> \{x_i, y_i\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, the routine is evaluated on a finite set of observations.

Let the training set (the observations) be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and the test set (the targets) be <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>, a set of unlabelled inputs.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> F : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(x)|O, T)</math>. Our task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg|700px|center]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As for stochastic processes, <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is generally enforced by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \ldots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, because <math display="inline">r = r_1 * r_2 * \ldots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated as each new observation arrives, this architecture supports streaming observations with minimal overhead.

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space <math display="inline">X</math> inherent in the distribution <math display="inline">P</math> given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>, where <math display="inline">N</math> is the number of observations used for conditioning.
This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the <math display="inline">Q_\theta</math> are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is used as a common baseline for GPs. They generated two different datasets that consisted of functions generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with different kernel parameters. At every training step they sampled a curve from the GP, selected a subset of <math display="inline">n</math> points as observations, and a subset of <math display="inline">t</math> points as target points. The observed points are encoded using a three-layer MLP encoder <math display="inline">h</math> with a 128-dimensional output representation. The representations are aggregated into a single representation <math display="inline">r = \frac{1}{n} \sum r_i</math>, which is concatenated to <math display="inline">x_t</math> and passed to a decoder <math display="inline">g</math> consisting of a five-layer MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg|600px|center]]

They compared the model to the predictions generated by a GP with the correct hyperparameters, which constitutes an upper bound on the model's performance. Although the prediction generated by the GP is smoother than the CNP's prediction for both the mean and the variance, the model is able to learn to regress from a few context points for both the fixed and the switching kernels. As the number of context points grows, the accuracy of the model improves and its approximated uncertainty decreases. Crucially, the model learns to estimate its own uncertainty given the observations very accurately, while still providing a good approximation of the underlying function.
Furthermore, the model achieves similarly good performance on the switching kernel task. This type of regression task is not trivial for GPs, whereas for CNPs only the dataset used for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg|700px|center]]

They also tested CNP on the MNIST dataset and used the test set to evaluate its performance. As shown in the above figure, the model learns to make good predictions of the underlying digit even for a small number of context points. Crucially, when conditioned on only one non-informative context point, the model’s prediction corresponds to the average over all MNIST digits. As the number of context points increases, the predictions become more similar to the underlying ground truth. This demonstrates the model’s capacity to extract dataset-specific prior knowledge. It is worth mentioning that even with a complete set of observations the model does not achieve pixel-perfect reconstruction, as there is a bottleneck at the representation level.
Since this implementation of CNP returns factored outputs, the best prediction it can produce given limited context information is to average over all possible predictions that agree with the context. An alternative is to add latent variables to the model such that they can be sampled conditioned on the context to produce predictions with high probability in the data distribution.

An important aspect of the model is its ability to estimate the uncertainty of the prediction. As shown in the bottom row of the above figure, as more observations are added, the variance shifts from being almost uniformly spread over the digit positions to being localized around areas that are specific to the underlying digit, specifically its edges. Being able to model the uncertainty given some context can be helpful for many tasks. One example is active exploration, where the model has a choice over where to observe.
They tested this by comparing the predictions of CNP when the observations are chosen according to uncertainty (the pixel with the highest variance at each step) versus random pixels. This method is a very simple way of doing active exploration, but it already produces better prediction results than selecting the conditioning points at random.

== Experimental Result III: Image Completion for Faces ==

[[File:004.jpg|700px|center]]

They also applied CNP to CelebA, a dataset of images of
celebrity faces, and reported performance obtained on the
test set.

As shown in the above figure, the model is able to capture the complex shapes and colours of this dataset, with predictions conditioned on less than 10% of the pixels being already close to the ground truth. As before, given few context points the model averages over all possible faces, but as the number of context pairs increases the predictions capture image-specific details like face orientation and facial expression. Furthermore, as the number of context points increases, the variance is shifted towards the edges in the image.
An important aspect of CNPs, demonstrated in Figure 5, is their flexibility not only in the number of observations and targets they receive but also with regard to their input values. It is interesting to compare this property to GPs on one hand, and to trained generative models (van den Oord et al., 2016; Gregor et al., 2015) on the other hand.
The first type of flexibility can be seen when conditioning on subsets that the model has not encountered during training. Consider conditioning the model on one half of the image, for example. This forces the model to predict pixel values not only according to some stationary smoothness property of the images, but also according to global spatial properties, e.g. symmetry and the relative location of different parts of faces. As seen in the first row of the figure, CNPs are able to capture those properties. A GP with a stationary kernel cannot capture this, and in the absence of observations would revert to its mean (the mean itself can be non-stationary, but usually this would not be enough to capture the interesting properties).
In addition, the model is flexible with regards to the target
input values. This means, e.g., we can query the model
at resolutions it has not seen during training. We take a
model that has only been trained using pixel coordinates of
a specific resolution, and predict at test time subpixel values
for targets between the original coordinates. As shown in
Figure 5, with one forward pass we can query the model at
different resolutions. While GPs also exhibit this type of
flexibility, it is not the case for trained generative models,
which can only predict values for the pixel coordinates on
which they were trained. In this sense, CNPs capture the best
of both worlds: they are flexible with regard to the conditioning
and prediction task, and have the capacity to extract domain
knowledge from a training set.
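
Querying at an unseen resolution amounts to passing a denser set of target coordinates to the same trained model. The sketch below only builds such coordinate sets; the normalisation of pixel positions to the range [0, 1] and the helper name <code>pixel_grid</code> are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def pixel_grid(height, width):
    """Return normalised (row, col) coordinates, one row per pixel.

    Assumption for this sketch: coordinates are scaled to [0, 1] so that
    grids of different resolutions cover the same image area.
    """
    rows = np.linspace(0.0, 1.0, height)
    cols = np.linspace(0.0, 1.0, width)
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([rr.ravel(), cc.ravel()], axis=1)  # shape (H * W, 2)

# Targets at a (hypothetical) training resolution and at twice that resolution.
train_targets = pixel_grid(32, 32)
dense_targets = pixel_grid(64, 64)
```

Because the decoder takes a coordinate as input, the denser grid can be used as the target set in a single forward pass.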
The authors compared CNPs quantitatively to two related models:
kNNs and GPs. As shown in Table 4.2.3, CNPs outperform
both when the number of context points is small (empirically
when half of the image or less is provided as context).
When the majority of the image is given as context, exact
methods like GPs and kNN will perform better. From the table
we can also see that the order in which the context points
are provided is less important for CNPs, since providing the
context points in order from top to bottom still results in
good performance. Both insights point to the fact that CNPs
learn a data-specific ‘prior’ that will generate good samples
even when the number of context points is very small.

== Experimental Result IV: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as their testing data set.
They also augmented the dataset; this included cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg|center]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time,
as opposed to O(nm).

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

conditional neural process

2018-11-20T23:27:55Z

S366chen: /* Experimental Result III: Classification */

conditional neural process

2018-11-20T23:27:46Z

S366chen: /* Experimental Result II: Image Completion for Digits */

conditional neural process

2018-11-19T17:44:25Z

S366chen: /* Conclusion */

== Introduction ==

Deep neural networks require large datasets to train effectively. One approach to mitigating this data-efficiency problem is to learn in two phases: the first phase learns the statistics
of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned.

In their work, the authors proposed a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them, that combine neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs).

== Model ==
Consider a data set <math display="inline"> \{(x_i, y_i)\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, this routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and the test set be <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> f : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T) \mid O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As for stochastic processes, <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is enforced by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) \mid O, T) = \prod _{x \in T} Q_\theta(f(x) \mid O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \cdots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \cdots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated in constant time when a new observation arrives, this architecture supports streaming observations with minimal overhead.
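
The encode, aggregate and decode steps of this architecture can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, the tanh nonlinearity, the random untrained weights, the choice of the mean as the commutative operation, and the softplus that keeps the predicted standard deviation positive are all assumptions made for the example, as are the helper names <code>mlp_init</code>, <code>mlp_apply</code>, <code>encode</code>, <code>decode</code> and <code>update_mean</code>.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(sizes):
    # Random (untrained) weights for a small MLP, e.g. sizes = [in, hidden, out].
    return [(rng.normal(0.0, 0.5, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp_apply(params, x):
    # tanh on hidden layers, linear output layer.
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

d = 8                                 # representation size (assumed)
h_params = mlp_init([2, 16, d])       # encoder h: (x, y) -> r_i in R^d
g_params = mlp_init([1 + d, 16, 2])   # decoder g: (x, r) -> (mean, raw scale)

def encode(xs, ys):
    r_i = mlp_apply(h_params, np.stack([xs, ys], axis=1))  # shape (n, d)
    return r_i.mean(axis=0)           # commutative aggregation, O(n)

def decode(r, x_targets):
    m = len(x_targets)
    inp = np.concatenate([x_targets[:, None], np.tile(r, (m, 1))], axis=1)
    out = mlp_apply(g_params, inp)    # one pass per target, O(m)
    mean = out[:, 0]
    sigma = 0.1 + 0.9 * np.log1p(np.exp(out[:, 1]))  # softplus keeps sigma > 0
    return mean, sigma

def update_mean(r, n, r_new):
    # Constant-time update of the mean of n representations with one more.
    return r + (r_new - r) / (n + 1)
```

Shuffling the observations leaves <code>encode</code> unchanged up to floating-point rounding, which is the permutation invariance in <math display="inline">O</math>, and <code>update_mean</code> shows why streaming observations add little overhead when the aggregation is a mean.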

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution
P given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed
and unobserved values. The loss is the negative conditional log probability of the targets; in practice, Monte Carlo
estimates of its gradient are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>, the size of the conditioning subset.
This approach shifts the burden of imposing prior knowledge
from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still, the authors
emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.
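
One Monte Carlo sample of this training objective can be sketched as below. The sketch assumes a factored Gaussian output and at least one context point; <code>predict</code> stands for any model mapping (context inputs, context outputs, target inputs) to a predicted mean and standard deviation. The function names are hypothetical and the code is not the authors' implementation.

```python
import numpy as np

def gaussian_nll(y, mean, sigma):
    # Summed negative log-likelihood of y under independent N(mean, sigma^2).
    return float(np.sum(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                        + (y - mean) ** 2 / (2.0 * sigma ** 2)))

def cnp_loss_sample(predict, xs, ys, rng):
    """One Monte Carlo sample of the loss for a single sampled function.

    xs, ys hold all n observations of the function. A random subset of
    size N is used as context, and the model is scored on ALL n points,
    so the targets include both observed and unobserved values.
    """
    n = len(xs)
    N = rng.integers(1, n)            # context size in 1..n-1 (assumption)
    idx = rng.permutation(n)[:N]      # randomly chosen subset of O
    mean, sigma = predict(xs[idx], ys[idx], xs)
    return gaussian_nll(ys, mean, sigma)
```

Averaging this quantity over many sampled functions and subset sizes, and differentiating with respect to the model parameters, gives the gradient estimate described above.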

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, selected
a subset of n points as observations, and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer
MLP.
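
The data generation for the fixed-kernel dataset can be sketched as follows. The summary does not give the exact kernel form or its parameters, so the squared-exponential form, the length scale, the input range and the subset sizes below are all assumptions for illustration; <code>sq_exp_kernel</code> and <code>sample_task</code> are hypothetical names.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=0.4, sigma_f=1.0):
    # Squared-exponential kernel (assumed form and parameters).
    diff = xa[:, None] - xb[None, :]
    return sigma_f ** 2 * np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_task(rng, num_points=50, n_context=5, n_target=20):
    """Sample one curve from the GP prior and split it into context/targets."""
    xs = np.sort(rng.uniform(-2.0, 2.0, num_points))
    # Small diagonal jitter keeps the Cholesky factorisation stable.
    K = sq_exp_kernel(xs, xs) + 1e-6 * np.eye(num_points)
    ys = np.linalg.cholesky(K) @ rng.standard_normal(num_points)
    perm = rng.permutation(num_points)
    ctx = perm[:n_context]
    tgt = perm[n_context:n_context + n_target]
    return (xs[ctx], ys[ctx]), (xs[tgt], ys[tgt])
```

Calling <code>sample_task</code> once per training step yields a fresh curve with its own observation and target subsets.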

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the model's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, we see the model learns
to estimate its own uncertainty given the observations very
accurately. Nonetheless it provides a good approximation
that increases in accuracy as the number of context points
increases.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for CNPs only the dataset used
for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, as there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.
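
The uncertainty-driven selection described above can be sketched as a greedy rule: observe next the pixel with the highest predicted variance among those not yet observed. The function name <code>next_observation</code> and the greedy one-pixel-at-a-time rule are assumptions for illustration; the summary does not specify how the authors implemented the selection.

```python
import numpy as np

def next_observation(variance, observed_mask):
    """Pick the unobserved pixel with the largest predicted variance.

    variance:      (H, W) array of per-pixel predictive variances.
    observed_mask: (H, W) boolean array, True where already observed.
    """
    # Observed pixels are excluded by giving them the lowest possible score.
    masked = np.where(observed_mask, -np.inf, variance)
    row, col = np.unravel_index(np.argmax(masked), variance.shape)
    return int(row), int(col)

variance = np.array([[0.1, 0.9],
                     [0.5, 0.3]])
observed = np.array([[False, True],
                     [False, False]])
print(next_observation(variance, observed))  # (1, 0)
```

In the example the highest-variance pixel is already observed, so the rule falls back to the next most uncertain one.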

== Experimental Result III: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as their testing data set.
They also augmented the dataset; this included cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg]]

Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time,
as opposed to O(nm).
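
The class-wise aggregation used in this section can be sketched as follows: representations are averaged within each class and the per-class results are concatenated. The mean as the aggregation, the zero vector for a class with no examples, and the function name <code>classwise_representation</code> are assumptions for illustration rather than details confirmed by the paper.

```python
import numpy as np

def classwise_representation(r_i, labels, num_classes):
    """Aggregate per-example representations within each class, then concatenate.

    r_i:    (n, d) array of encoder outputs.
    labels: (n,) integer labels in [0, num_classes).
    Returns a vector of fixed size num_classes * d, independent of n.
    """
    d = r_i.shape[1]
    parts = []
    for c in range(num_classes):
        members = r_i[labels == c]
        # Empty classes fall back to zeros (assumption for this sketch).
        parts.append(members.mean(axis=0) if len(members) else np.zeros(d))
    return np.concatenate(parts)

r_i = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = np.array([0, 1, 0])
print(classwise_representation(r_i, labels, num_classes=2))  # [3. 4. 3. 4.]
```

Because the output length depends only on the number of classes and the representation size, the decoder input stays constant in size as more examples are added.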

== Conclusion ==

In this paper, the authors introduced Conditional Neural Processes,
a model that is both flexible at test time and has the
capacity to extract prior knowledge from training data.

They demonstrated its ability to perform a variety of tasks
including regression, classification and image completion.
They compared CNPs to Gaussian Processes on one hand, and
deep learning methods on the other, and also discussed the
relation to meta-learning and few-shot learning.
It is important to note that the specific CNP implementations
described here are just simple proofs-of-concept and can
be substantially extended, e.g. by including more elaborate
architectures in line with modern deep learning advances.
To summarize, this work can be seen as a step towards learning
high-level abstractions, one of the grand challenges of
contemporary machine learning. Functions learned by most
conventional deep learning models are tied to a specific, constrained
statistical context at any stage of training. A trained
CNP is more general, in that it encapsulates the high-level
statistics of a family of functions. As such it constitutes a
high-level abstraction that can be reused for multiple tasks.
In future work they are going to explore how far these models can
help in tackling the many key machine learning problems
that seem to hinge on abstraction, such as transfer learning,
meta-learning, and data efficiency.

conditional neural process

2018-11-19T17:43:24Z

S366chen: /* Experimental Result III: Classification */

conditional neural process

2018-11-19T17:43:11Z

S366chen: /* Experimental Result III: Classification */

File:008.jpg

2018-11-19T17:42:45Z

S366chen:

conditional neural process

2018-11-19T17:42:34Z

S366chen: /* Experimental Result III: Classification */

== Introduction ==

Deep neural networks require large datasets to train effectively. One approach to mitigating this data-efficiency problem is to learn in two phases: the first phase learns the statistics
of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned.

In their work, the authors proposed a family of models that represent solutions to the supervised problem, and an end-to-end training approach to learning them, that combine neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs).

== Model ==
Consider a data set <math display="inline"> \{(x_i, y_i)\} </math> with evaluations <math display="inline">y_i = f(x_i) </math> for some unknown function <math display="inline">f</math>. Assume <math display="inline">g</math> is an approximating function of <math display="inline">f</math>. The aim is to minimize the loss between <math display="inline">f</math> and <math display="inline">g</math> on the entire space <math display="inline">X</math>. In practice, this routine is evaluated on a finite set of observations.

Let the training set be <math display="inline"> O = \{(x_i, y_i)\}_{i = 0}^{n-1}</math>, and the test set be <math display="inline"> T = \{x_i\}_{i = n}^{n + m - 1}</math>.

Let <math display="inline">P</math> be a probability distribution over functions <math display="inline"> f : X \to Y</math>, formally known as a stochastic process. Then <math display="inline">P</math> defines a joint distribution over the random variables <math display="inline"> \{f(x_i)\}_{i = 0}^{n + m - 1}</math>, and therefore a conditional distribution <math display="inline"> P(f(T) \mid O, T)</math>. The task is to predict the output values <math display="inline">f(x_i)</math> for <math display="inline"> x_i \in T</math>, given <math display="inline"> O</math>.

[[File:001.jpg]]

== Conditional Neural Process ==

Conditional Neural Process models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. CNPs parametrize distributions over <math display="inline">f(T)</math> given a distributed representation of <math display="inline">O</math> of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.

A CNP is a conditional stochastic process <math display="inline">Q_\theta</math> that defines distributions over <math display="inline">f(x_i)</math> for <math display="inline">x_i \in T</math>. As for stochastic processes, <math display="inline">Q_\theta</math> is assumed to be invariant to permutations of <math display="inline">O</math> and <math display="inline">T</math>. In this work, permutation invariance with respect to <math display="inline">T</math> is enforced by assuming a factored structure. That is, <math display="inline">Q_\theta(f(T) \mid O, T) = \prod _{x \in T} Q_\theta(f(x) \mid O, x)</math>.

In detail, the following architecture is used:

<math display="inline">r_i = h_\theta(x_i, y_i)</math> for any <math display="inline">(x_i, y_i) \in O</math>, where <math display="inline">h_\theta : X \times Y \to \mathbb{R} ^ d</math>

<math display="inline">r = r_1 * r_2 * \cdots * r_n</math>, where <math display="inline">*</math> is a commutative operation that takes elements in <math display="inline">\mathbb{R}^d</math> and maps them into a single element of <math display="inline">\mathbb{R} ^ d</math>

<math display="inline">\Phi_i = g_\theta(x_i, r)</math> for any <math display="inline">x_i \in T</math>, where <math display="inline">g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e</math> and <math display="inline">\Phi_i</math> are parameters for <math display="inline">Q_\theta</math>

Note that this architecture ensures permutation invariance and <math display="inline">O(n + m)</math> scaling for conditional prediction. Also, since <math display="inline">r = r_1 * r_2 * \cdots * r_n</math> can be computed in <math display="inline">O(n)</math> and updated in constant time when a new observation arrives, this architecture supports streaming observations with minimal overhead.

<math display="inline">Q_\theta</math> is trained by asking it to predict <math display="inline">O</math> conditioned on a randomly
chosen subset of <math display="inline">O</math>. This gives the model a signal of the uncertainty over the space X inherent in the distribution
P given a set of observations. Thus, the targets on which <math display="inline">Q_\theta</math> is scored include both the observed
and unobserved values. The loss is the negative conditional log probability of the targets; in practice, Monte Carlo
estimates of its gradient are taken by sampling <math display="inline">f</math> and <math display="inline">N</math>, the size of the conditioning subset.
This approach shifts the burden of imposing prior knowledge
from an analytic prior to empirical data. This has
the advantage of liberating a practitioner from having to
specify an analytic form for the prior, which is ultimately
intended to summarize their empirical experience. Still, the authors
emphasize that the <math display="inline">Q_\theta</math> are not necessarily a consistent set of
conditionals for all observation sets, and the training routine
does not guarantee that.

In summary,

1. A CNP is a conditional distribution over functions
trained to model the empirical conditional distributions
of functions <math display="inline">f \sim P</math>.

2. A CNP is permutation invariant in <math display="inline">O</math> and <math display="inline">T</math>.

3. A CNP is scalable, achieving a running time complexity
of <math display="inline">O(n + m)</math> for making <math display="inline">m</math> predictions with <math display="inline">n</math>
observations.

== Experimental Result I: Function Regression ==

The first example is a classical 1D regression task that is commonly used as a baseline for GPs.
They generated two different datasets that consisted of functions
generated from a GP with an exponential kernel. In the first dataset they used a kernel with fixed parameters, and in the second dataset the function switched at some random point on the real line between two functions, each sampled with
different kernel parameters. At every training step they sampled a curve from the GP, selected
a subset of n points as observations, and a subset of t points as target points. The observed points are encoded using a three-layer MLP encoder h with a 128-dimensional output representation. The representations are aggregated into a single representation
<math display="inline">r = \frac{1}{n} \sum r_i</math>,
which is concatenated to <math display="inline">x_t</math> and passed to a decoder g consisting of a five-layer
MLP.

Two examples of the regression results obtained for each
of the datasets are shown in the following figure.

[[File:007.jpg]]

They compared the model to the predictions generated by a GP with the correct
hyperparameters, which constitutes an upper bound on the model's
performance. Although the prediction generated by the GP
is smoother than the CNP's prediction both for the mean
and variance, the model is able to learn to regress from a few
context points for both the fixed kernels and switching kernels.
As the number of context points grows, the accuracy
of the model improves and the approximated uncertainty
of the model decreases. Crucially, we see the model learns
to estimate its own uncertainty given the observations very
accurately. Nonetheless it provides a good approximation
that increases in accuracy as the number of context points
increases.
Furthermore, the model achieves similarly good performance
on the switching kernel task. This type of regression task
is not trivial for GPs, whereas for CNPs only the dataset used
for training has to be changed.

== Experimental Result II: Image Completion for Digits ==

[[File:002.jpg]]

They also tested CNP on the MNIST dataset and used the test
set to evaluate its performance. As shown in the above figure, the
model learns to make good predictions of the underlying
digit even for a small number of context points. Crucially,
when conditioned only on one non-informative context point the model’s prediction corresponds
to the average over all MNIST digits. As the number
of context points increases the predictions become more
similar to the underlying ground truth. This demonstrates
the model’s capacity to extract dataset-specific prior knowledge.
It is worth mentioning that even with a complete set
of observations the model does not achieve pixel-perfect
reconstruction, as there is a bottleneck at the representation
level.
Since this implementation of CNP returns factored outputs,
the best prediction it can produce given limited context
information is to average over all possible predictions that
agree with the context. An alternative to this is to add
latent variables in the model such that they can be sampled
conditioned on the context to produce predictions with high
probability in the data distribution.

An important aspect of the model is its ability to estimate
the uncertainty of the prediction. As shown in the bottom
row of the above figure, as they added more observations, the variance
shifts from being almost uniformly spread over the digit
positions to being localized around areas that are specific
to the underlying digit, specifically its edges. Being able to
model the uncertainty given some context can be helpful for
many tasks. One example is active exploration, where the
model has a choice over where to observe.
They tested this by
comparing the predictions of CNP when the observations
are chosen according to uncertainty, versus random pixels. This method is a very simple way of doing active
exploration, but it already produces better prediction results
than selecting the conditioning points at random.

== Experimental Result III: Classification ==
Finally, they applied the model to one-shot classification using the Omniglot dataset. This dataset consists of 1,623 classes
of characters from 50 different alphabets. Each class has
only 20 examples and as such this dataset is particularly
suitable for few-shot learning algorithms. They used 1,200 randomly selected classes as
their training set and the remainder as their testing data set.
They also augmented the dataset; this included cropping
the image from 32 × 32 to 28 × 28, applying small random
translations and rotations to the inputs, and also increasing
the number of classes by rotating every character by 90
degrees and defining that to be a new class. They generated
the labels for an N-way classification task by choosing N
random classes at each training step and arbitrarily assigning
the labels 0, ..., N − 1 to each.

[[File:008.jpg]]
Given that the input points are images, they modified the architecture
of the encoder h to include convolution layers as
mentioned in section 2. In addition they only aggregated over
inputs of the same class by using the information provided
by the input label. The aggregated class-specific representations
are then concatenated to form the final representation.
Given that both the size of the class-specific representations
and the number of classes are constant, the size of the final
representation is still constant and thus the O(n + m)
runtime still holds.
The results of the classification are summarized in the following table.
CNPs achieve higher accuracy than models that are significantly
more complex (like MANN). While CNPs do not
beat the state of the art for one-shot classification, their accuracy
values are comparable. Crucially, they reached those values
using a significantly simpler architecture (three convolutional
layers for the encoder and a three-layer MLP for the
decoder) and with a lower runtime of O(n + m) at test time,
as opposed to O(nm).

conditional neural process

2018-11-19T17:37:26Z

S366chen: /* Experimental Result II: Image Completion for Digits */

conditional neural process

2018-11-19T17:36:24Z

S366chen: /* Experimental Result II: Image Completion for Digits */